RGIFE: a ranked guided iterative feature elimination heuristic for the identification of biomarkers.
ABSTRACT: Current -omics technologies are able to sense the state of a biological sample in a very wide variety of ways. Given the high dimensionality that typically characterises these data, relevant knowledge is often hidden and hard to identify. Machine learning methods, and particularly feature selection algorithms, have proven very effective over the years at identifying small but relevant subsets of variables from a variety of application domains, including -omics data. Many methods exist with varying trade-offs between the size of the identified variable subsets and the predictive power of such subsets. In this paper we focus on a heuristic for the identification of biomarkers called RGIFE: Rank Guided Iterative Feature Elimination. RGIFE is guided in its biomarker identification process by the information extracted from machine learning models and incorporates several mechanisms to ensure that it creates minimal and highly predictive feature sets. We compare RGIFE against five well-known feature selection algorithms using both synthetic and real (cancer-related transcriptomics) datasets. First, we assess the ability of the methods to identify relevant and highly predictive features. Then, using a prostate cancer dataset as a case study, we look at the biological relevance of the identified biomarkers. We propose RGIFE, a heuristic for the inference of reduced panels of biomarkers that obtains predictive performance similar to that of widely adopted feature selection methods while selecting significantly fewer features. Furthermore, focusing on the case study, we show the higher biological relevance of the biomarkers selected by our approach. The RGIFE source code is available at: http://ico2s.org/software/rgife.html
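The abstract does not spell out the elimination loop, but its core idea, drop the lowest-ranked features and keep the removal only if predictive performance survives, can be sketched as follows. This is a minimal illustration, not the actual RGIFE implementation: the function name, the block-removal fraction, and the toy scorer are assumptions, and RGIFE's backtracking and recovery mechanisms are omitted.

```python
# Minimal sketch of a rank-guided iterative feature elimination loop
# (illustrative only; RGIFE derives rankings from trained models and
# includes mechanisms, such as backtracking, that are not shown here).

def rank_guided_elimination(features, rank, score, block=0.25, tol=0.0):
    """Iteratively drop the lowest-ranked block of features, keeping
    each removal only if the predictive score does not degrade by
    more than `tol`."""
    current = list(features)
    best = score(current)
    while len(current) > 1:
        ordered = sorted(current, key=lambda f: rank[f], reverse=True)
        n_drop = max(1, int(len(ordered) * block))
        candidate = ordered[:-n_drop]      # drop the lowest-ranked block
        s = score(candidate)
        if s >= best - tol:                # accept the removal
            current, best = candidate, max(best, s)
        else:                              # reject and stop (no backtracking)
            break
    return current, best
```

With a toy scorer that only rewards keeping two informative features, the loop shrinks a five-feature set down to exactly those two.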
Project description: Investigations into novel biomarkers using omics techniques generate large amounts of data. Due to their size and numbers of attributes, these data are suitable for analysis with machine learning methods. A key component of typical machine learning pipelines for omics data is feature selection, which is used to reduce the raw high-dimensional data into a tractable number of features. Feature selection needs to balance the objective of using as few features as possible while maintaining high predictive power. This balance is crucial when the goal of data analysis is the identification of highly accurate but small panels of biomarkers with potential clinical utility. In this paper we propose a heuristic for the selection of very small feature subsets, via an iterative feature elimination process that is guided by rule-based machine learning, called RGIFE (Rule-guided Iterative Feature Elimination). We use this heuristic to identify putative biomarkers of osteoarthritis (OA), articular cartilage degradation and synovial inflammation, using both proteomic and transcriptomic datasets. Our RGIFE heuristic improved the classification accuracies achieved on all datasets relative to using no feature selection, and performed well in a comparison with other feature selection methods. Using this method the datasets were reduced to a smaller number of genes or proteins, including those known to be relevant to OA, cartilage degradation and joint inflammation. These results show the RGIFE feature reduction method to be suitable for analysing both proteomic and transcriptomic data. Methods that generate large 'omics' datasets are increasingly being used in the area of rheumatology. Feature reduction methods are advantageous for the analysis of omics data in this field, as the application of such techniques is likely to result in improvements in diagnosis, treatment and drug discovery.
Project description: Various feature selection algorithms have been proposed to identify cancer prognostic biomarkers. In recent years, however, their reproducibility has been criticized. The performance of feature selection algorithms has been shown to be affected by the datasets, underlying networks and evaluation metrics. One of the causes is the curse of dimensionality, which makes it hard to select features that generalize well on independent data. Even the integration of biological networks does not mitigate this issue, because the networks are large and many of their components are not relevant to the phenotype of interest. With the availability of multi-omics data, integrative approaches are being developed to build more robust predictive models. In this scenario, the higher data dimensionality creates greater challenges. We propose a phenotype relevant network-based feature selection (PRNFS) framework and demonstrate its advantages in lung cancer prognosis prediction. We constructed cancer-prognosis-relevant networks based on the epithelial-mesenchymal transition (EMT) and integrated them with different types of omics data for feature selection. With less than 2.5% of the total dimensionality, we obtained EMT prognostic signatures that achieved remarkable prediction performance (average AUC values >0.8), highly significant sample stratifications, and meaningful biological interpretations. In addition to finding EMT signatures at different omics data levels, we combined these single-omics signatures into multi-omics signatures, which improved sample stratification significantly. Both single- and multi-omics EMT signatures were tested on independent multi-omics lung cancer datasets, and significant sample stratifications were obtained.
Project description: The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures, associated with machine learning models, that efficiently predict a specified outcome. To this purpose, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML.
Project description: Selecting relevant features is a common task in most omics data analyses, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid methods are mainly available, namely the univariate (filter) and the multivariate (wrapper) approach. The stability of the selected lists of features is an often neglected but very important requirement: if the same features are selected in multiple independent iterations, they are more likely to be reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray datasets, we demonstrate that our technique provides equal or better classification performance and greater stability compared to other Random Forest-based feature selection methods.
Project description: BACKGROUND: Large-scale collaborative precision medicine initiatives (e.g., The Cancer Genome Atlas (TCGA)) are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), DNA methylation, miRNA, gene expression, and protein expression, offer tantalizing possibilities for realizing the promise and potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms as well as the discovery of novel biomarkers for different types of cancers. However, such analyses present a number of challenges, including the heterogeneity and high dimensionality of omics data. METHODS: We propose a novel framework for multi-omics data integration using multi-view feature selection. We introduce a novel multi-view feature selection algorithm, MRMR-mv, an adaptation of the well-known Min-Redundancy and Maximum-Relevance (MRMR) single-view feature selection algorithm to the multi-view setting. RESULTS: We report results of experiments using an ovarian cancer multi-omics dataset derived from the TCGA database on the task of predicting ovarian cancer survival. Our results suggest that multi-view models outperform both view-specific models (i.e., models trained and tested using a single type of omics data) and models based on two baseline data fusion methods. CONCLUSIONS: Our results demonstrate the potential of multi-view feature selection in integrative analyses and predictive modeling from multi-omics data.
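For readers unfamiliar with the underlying single-view criterion, the greedy mRMR step (pick the feature maximizing relevance to the target minus average redundancy with already-selected features) can be sketched as below. This is a hedged illustration only: it uses absolute Pearson correlation in place of mutual information, all names are hypothetical, and the multi-view bookkeeping that distinguishes MRMR-mv is not shown.

```python
# Single-view sketch of the greedy mRMR criterion, with |Pearson r|
# standing in for mutual information (an assumption, not the paper's
# exact formulation).

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def mrmr(features, target, k):
    """features: dict name -> list of values; target: list of outcomes.
    Greedily select k features maximizing relevance minus redundancy."""
    selected = []
    remaining = dict(features)
    while remaining and len(selected) < k:
        def criterion(name):
            relevance = abs(pearson(remaining[name], target))
            if not selected:
                return relevance
            redundancy = sum(abs(pearson(remaining[name], features[s]))
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=criterion)
        selected.append(best)
        del remaining[best]
    return selected
```

On a toy dataset the redundancy term penalizes an exact duplicate of the first pick, so a moderately relevant but complementary feature is chosen second.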
Project description: In the wake of recent advances in machine learning research, the study of pharmacogenomics using predictive algorithms serves as a new paradigmatic application. In this work, our goal was to explore an ensemble machine learning approach that aims to predict probable antidepressant treatment response and remission in major depressive disorder (MDD). To determine the status of antidepressant treatments, we established an ensemble predictive model with a feature selection algorithm, based on the analysis of genetic variants and clinical variables of 421 patients who were treated with selective serotonin reuptake inhibitors. We also compared our ensemble machine learning framework with other state-of-the-art models, including multi-layer feedforward neural networks (MFNNs), logistic regression, support vector machines, C4.5 decision trees, naïve Bayes, and random forests. Our data revealed that the ensemble predictive algorithm with feature selection (using fewer biomarkers) performed comparably to the other predictive algorithms (such as MFNNs and logistic regression) in capturing the complex relationship between biomarkers and the status of antidepressant treatments. Our study demonstrates that the ensemble machine learning framework may be a useful technique for creating bioinformatics tools that discriminate non-responders from responders prior to antidepressant treatment.
Project description: With the widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example for biomarker discovery, is becoming an increasingly important part of clinical medicine, because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select the features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open-source R package (SurvRank) available on CRAN.
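The idea of selecting features "by applying a weighted combination of CV runs" can be illustrated with a simple frequency-based aggregation across outer folds. This is a sketch only: SurvRank's actual weighting scheme is more elaborate, and `select_on_fold` here is a placeholder for any per-fold feature selector.

```python
# Sketch: keep the features that are selected in at least a threshold
# fraction of the outer cross-validation folds (a simplified stand-in
# for SurvRank's weighted combination of CV runs).
from collections import Counter

def nested_cv_select(folds, select_on_fold, threshold=0.5):
    """folds: list of training folds; select_on_fold: callable returning
    the features chosen on one fold."""
    counts = Counter()
    for fold in folds:
        counts.update(select_on_fold(fold))
    cutoff = threshold * len(folds)
    return sorted(f for f, c in counts.items() if c >= cutoff)
```

A feature picked on only one of four folds would fall below the default 0.5 threshold and be excluded from the final signature.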
Project description: Untargeted metabolomics is a powerful phenotyping tool for better understanding the biological mechanisms involved in human pathology development and for identifying early predictive biomarkers. This approach, based on multiple analytical platforms such as mass spectrometry (MS), chemometrics and bioinformatics, generates massive and complex data that need appropriate analyses to extract the biologically meaningful information. Despite the various tools available, it is still a challenge to handle such large and noisy datasets with a limited number of individuals without risking overfitting. Moreover, when the objective is the identification of early predictive markers of a clinical outcome, a few years before its occurrence, it becomes essential to use appropriate algorithms and workflows to be able to discover subtle effects among this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric-symbolic approaches for feature selection, with the objective of obtaining the combination of metabolites that produces the most effective and accurate predictive model. Relying first on numerical approaches, in particular machine learning methods (SVM-RFE, RF, RF-RFE) and univariate statistical analyses (ANOVA), a comparative study was performed on an original metabolomic dataset and reduced subsets. As the resampling method, LOOCV was applied to minimize the risk of overfitting. The best k features obtained with different importance scores from the combination of these approaches were compared, allowing variable stability to be determined using Formal Concept Analysis.
The results revealed the value of RF-Gini combined with ANOVA for feature selection, as these two complementary methods allowed the 48 best candidates for prediction to be selected. Using linear logistic regression on this reduced dataset enabled us to obtain the best performance, in terms of prediction accuracy and number of false positives, with a model including the 5 top variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for the identification of predictive biomarkers from untargeted metabolomics data.
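The exact scheme used to combine the RF-Gini and ANOVA scores is not detailed in the text. One simple way to combine two importance rankings, shown purely as an assumed illustration, is to keep only the features that both methods place in their top k:

```python
# Hypothetical combination of two feature-importance score dicts
# (e.g. RF-Gini importance and ANOVA F-scores): intersect the top-k
# lists of both methods. The real study's combination may differ.

def combined_top_k(gini_scores, anova_scores, k):
    """Keep features ranked in the top k by BOTH importance scores."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return sorted(top(gini_scores) & top(anova_scores))
```

Requiring agreement between two complementary scores is one way to favour features whose importance is stable across methods.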
Project description: BACKGROUND: Predicting progression from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is a major open issue in AD-related research. Neuropsychological assessment has proven useful in identifying MCI patients who are likely to convert to dementia. However, the large battery of neuropsychological tests (NPTs) performed in clinical practice and the limited number of training examples pose a challenge to machine learning when learning prognostic models. In this context, it is paramount to pursue approaches that effectively seek reduced sets of relevant features. Subsets of NPTs from which prognostic models can be learnt should not only be good predictors, but also stable, promoting generalizable and explainable models. METHODS: We propose a feature selection (FS) ensemble combining stability and predictability to choose the most relevant NPTs for prognostic prediction in AD. First, we combine the outcomes of multiple (filter and embedded) FS methods. Then, we use a wrapper-based approach optimizing both stability and predictability to compute the number of selected features. We use two large prospective studies (ADNI and the Portuguese Cognitive Complaints Cohort, CCC) to evaluate the approach and assess the predictive value of a large number of NPTs. RESULTS: The best subsets include approximately 30 and 20 features (from the original 79 and 40) for the ADNI and CCC data, respectively, yielding stability above 0.89 and 0.95, and AUC above 0.87 and 0.82. Most NPTs selected using the proposed feature selection ensemble have been identified in the literature as strong predictors of conversion from MCI to AD. CONCLUSIONS: The FS ensemble approach was able to 1) identify subsets of stable and relevant predictors from a consensus of multiple FS methods using baseline NPTs and 2) learn reliable prognostic models of conversion from MCI to AD using these subsets of features.
The machine learning models learnt from these features outperformed models trained without FS and achieved competitive results when compared to commonly used FS algorithms. Furthermore, because the selected features derive from a consensus of methods, they are more robust, while freeing users from having to choose the most appropriate FS method for their classification task.
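The two ingredients described above, a consensus over several FS methods and a stability measure, can each be sketched in a few lines. This is illustrative code only: the study's actual rank-aggregation and stability index may differ, and a Borda-style average and a Jaccard-based index stand in here as assumptions.

```python
# Sketches of (1) consensus ranking over multiple FS methods and
# (2) a Jaccard-based stability index over selected subsets.
from collections import defaultdict
from itertools import combinations

def consensus_rank(rankings):
    """rankings: list of ordered feature lists, one per FS method.
    Borda-style aggregation: lower average position = more relevant."""
    pos = defaultdict(list)
    for r in rankings:
        for i, f in enumerate(r):
            pos[f].append(i)
    return sorted(pos, key=lambda f: sum(pos[f]) / len(pos[f]))

def jaccard_stability(subsets):
    """Average pairwise Jaccard index of selected feature subsets
    (1.0 = identical selections across runs)."""
    pairs = list(combinations(subsets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

Features that every method ranks near the top float to the head of the consensus list, and a stability score near 1 indicates that repeated runs select nearly the same subset.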
Project description: BACKGROUND: We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large datasets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity. RESULTS: Microarray datasets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were previously selected. We found that we could iterate many times before a sustained drop in accuracy occurs, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remains essentially flat even as hundreds of top genes are removed. Our method identifies sets of genes that are highly predictive, even when composed of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed. CONCLUSIONS: Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size datasets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of "top genes" that should be used to build classifiers.
In our results, the best classifiers were formed using genes with limited univariate power, thus illustrating that deeper mining of features using multivariate techniques is important.
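The IFR loop described in the RESULTS section above, select a predictive subset, remove it from the pool, and repeat until accuracy drops, can be sketched as follows. This is a simplified illustration: the original work uses a sparse SVM as the per-iteration selector, which is abstracted here into a caller-supplied `select_and_score` function.

```python
# Sketch of the Iterative Feature Removal (IFR) loop: harvest a
# predictive subset, remove it from the candidate pool, and repeat
# until accuracy falls below a floor. `select_and_score` stands in
# for the sparse-SVM selection step of the original method.

def iterative_feature_removal(features, select_and_score, min_acc):
    """Return the list of equivalently predictive, disjoint subsets
    found before accuracy drops below min_acc."""
    pool = set(features)
    subsets = []
    while pool:
        chosen, acc = select_and_score(pool)
        if acc < min_acc or not chosen:
            break
        subsets.append(sorted(chosen))
        pool -= set(chosen)
    return subsets
```

Because each accepted subset is removed before the next iteration, the loop surfaces several disjoint, comparably predictive gene sets rather than a single "best" panel, which is exactly the phenomenon the abstract reports.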