Prognostic transcriptional association networks: a new supervised approach based on regression trees.
ABSTRACT: MOTIVATION: The application of information encoded in molecular networks for prognostic purposes is a crucial objective of systems biomedicine. This approach has not been widely investigated in the cardiovascular research area. Within this area, the prediction of clinical outcomes after suffering a heart attack would represent a significant step forward. We developed a new quantitative prediction-based method for this prognostic problem based on the discovery of clinically relevant transcriptional association networks. This method integrates regression trees and clinical class-specific networks, and can be applied to other clinical domains. RESULTS: Before analyzing our cardiovascular disease dataset, we tested the usefulness of our approach on a benchmark dataset with control and disease patients. We also compared it to several algorithms to infer transcriptional association networks and classification models. Comparative results provided evidence of the prediction power of our approach. Next, we discovered new models for predicting good and bad outcomes after myocardial infarction. Using blood-derived gene expression data, our models reported areas under the receiver operating characteristic curve above 0.70. Our model could also outperform different techniques based on co-expressed gene modules. We also predicted processes that may represent novel therapeutic targets for heart disease, such as the synthesis of leucine and isoleucine. AVAILABILITY: The SATuRNo software is freely available at http://www.lsi.us.es/isanepo/toolsSaturno/.
Project description:BACKGROUND: Clear-cell renal cell carcinoma (ccRCC) is the most prevalent, chemotherapy-resistant and lethal adult kidney cancer. There is a need for novel diagnostic and prognostic biomarkers for ccRCC, due to its heterogeneous molecular profiles and asymptomatic early stage. This study aims to develop classification models to distinguish early-stage from late-stage ccRCC based on gene expression profiles. We employed four supervised learning algorithms (J48, Random Forest, SMO and Naïve Bayes), with fast correlation-based feature selection to enrich model learning, to develop classification models trained on RNA-seq gene expression data obtained from The Cancer Genome Atlas. RESULTS: The models developed in the study were evaluated by 10-fold cross-validation and independent dataset testing. The Random Forest-based prediction model performed best, with a sensitivity of 89%, an accuracy of 77% and an area under the receiver operating characteristic curve of 0.8. CONCLUSIONS: We anticipate that the prioritized subset of 62 genes and the prediction models developed in this study will help experimental oncologists to expedite understanding of the molecular mechanisms of stage progression and the discovery of prognostic factors for ccRCC tumors.
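For reference, the sensitivity and accuracy figures quoted above can be computed from true and predicted stage labels as in the following minimal Python sketch. The function name `binary_metrics`, the toy labels, and the choice of which stage counts as the positive class are illustrative assumptions, not details taken from the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity and accuracy for a two-class prediction task.

    `positive` marks the class treated as the positive one; which stage
    plays that role in the study is an assumption here.
    """
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0:
        raise ValueError("no positive samples: sensitivity is undefined")
    return {
        "sensitivity": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

Reporting both values matters when classes are imbalanced, because accuracy alone can look acceptable even when few positive cases are detected.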
Project description:Current approaches to predicting a cardiovascular disease (CVD) event rely on conventional risk factors and cross-sectional data. In this study, we applied machine learning and deep learning models to 10-year CVD event prediction by using longitudinal electronic health record (EHR) and genetic data. Our study cohort included 109,490 individuals. In the first experiment, we extracted aggregated and longitudinal features from EHR. We applied logistic regression, random forests, gradient boosting trees, convolutional neural networks (CNN) and recurrent neural networks with long short-term memory (LSTM) units. In the second experiment, we applied a late-fusion approach to incorporate genetic features. We compared the performance with the approach currently used in routine clinical practice, the American College of Cardiology and American Heart Association (ACC/AHA) Pooled Cohort Risk Equation. Our results indicated that incorporating longitudinal features led to better event prediction. Combining genetic features through a late-fusion approach can further improve CVD prediction, underscoring the importance of integrating relevant genetic data whenever available.
Project description:BACKGROUND: Inference of protein interaction networks from various sources of data has become an important topic of both systems and computational biology. Here we present a supervised approach to identification of gene expression regulatory networks. RESULTS: The method is based on a kernel approach accompanied with genetic programming. As a data source, the method utilizes gene expression time series for prediction of interactions among regulatory proteins and their target genes. The performance of the method was verified using Saccharomyces cerevisiae cell cycle and DNA/RNA/protein biosynthesis gene expression data. The results were compared with independent data sources. Finally, a prediction of novel interactions within yeast gene expression circuits has been performed. CONCLUSION: Results show that our algorithm gives, in most cases, results identical with the independent experiments, when compared with the YEASTRACT database. In several cases our algorithm gives predictions of novel interactions which have not been reported.
Project description:BACKGROUND: Cardiovascular prognostic models guide treatment allocation and support clinical decisions. Whether there are valid models for Latin American and Caribbean (LAC) populations is unknown. OBJECTIVE: This study sought to identify and critically appraise cardiovascular prognostic models developed, tested, or recalibrated in LAC populations. METHODS: The systematic review followed the CHARMS (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies) framework (PROSPERO [International Prospective Register of Systematic Reviews]: CRD42018096553). Reports were included if they followed a prospective design and presented a multivariable prognostic model; reports were excluded if they studied symptomatic individuals or patients. The following databases were searched: EMBASE, MEDLINE, Scopus, SciELO, and LILACS. Risk of bias assessment was conducted with PROBAST (Prediction model Risk Of Bias ASsessment Tool). No quantitative summary was conducted due to large heterogeneity. RESULTS: From 2,506 search results, 8 studies (N = 130,482 participants) were included for qualitative synthesis. We could not identify any cardiovascular prognostic model developed for LAC populations; reviewed reports evaluated available models or conducted a recalibration analysis. Only 1 study included a Caribbean population (Puerto Rico); 3 studies were retrieved from Chile; 2 from Argentina, Brazil, Colombia, and Uruguay; and 1 from Mexico. Four studies included population-based samples, and the other 4 included people affiliated to a health facility (e.g., prevention clinics). Most studied participants were older than 50 years, and there were more women in 5 reports. The Framingham model was assessed 6 times, and the American College of Cardiology/American Heart Association pooled equation was assessed twice.
Across the prognostic models assessed, calibration varied widely from one population to another, showing great overestimation particularly in some subgroups (e.g., highest risk). Discrimination (e.g., C-statistic) was acceptable for most models; for Framingham it ranged from 0.66 to 0.76. The American College of Cardiology/American Heart Association pooled equation showed the best discrimination (0.78). The small number of outcome events was the most important methodological limitation of the identified studies. CONCLUSIONS: No cardiovascular prognostic models have been developed in LAC, hampering key evidence to inform public health and clinical practice. Validation studies need to address methodological issues.
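The discrimination values quoted above are C-statistics. As a minimal Python sketch of what that number measures for a binary outcome, the function below estimates the probability that a randomly chosen participant with an event received a higher predicted risk than a randomly chosen participant without one, counting ties as one half. The function name `c_statistic` and the toy data are hypothetical, and the sketch ignores censoring, which time-to-event versions of the statistic must handle.

```python
def c_statistic(scores, labels):
    """Concordance (C-statistic) for a binary outcome.

    scores: predicted risks; labels: 1 for event, 0 for no event.
    Tied scores in an event/non-event pair count as 0.5.
    """
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy predicted risks and observed outcomes (hypothetical values).
toy_risk = [0.10, 0.35, 0.40, 0.80, 0.65, 0.20]
toy_event = [0, 0, 1, 1, 0, 1]
toy_c = c_statistic(toy_risk, toy_event)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The statistic says nothing about calibration, which is why the two properties are reported separately.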
Project description:Essential gene prediction helps to identify the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is the development of a new ML pipeline to help in the annotation of essential genes of less explored disease-causing organisms for which minimal experimental data is available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm employing a Laplacian Support Vector Machine (LapSVM) to predict essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to the area under the ROC curve (auROC), has been proposed for selecting the best model when supervised performance metrics are difficult to calculate due to lack of data. The unsupervised feature selection followed by dimension reduction revealed a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both eukaryotes and prokaryotes, where it showed high accuracy even when the labeled dataset was very limited, the strategy was used to predict essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that is applicable to both prokaryotes and eukaryotes with limited labeled data.
The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
Project description:BACKGROUND: Hot spots are residues that contribute most of the binding free energy yet account for only a small portion of a protein interface. Experimental approaches to identify hot spots, such as alanine scanning mutagenesis, are expensive and time-consuming, while computational methods are emerging as effective alternatives. RESULTS: In this study, we propose a semi-supervised boosting SVM, called sbSVM, to computationally predict hot spots at protein-protein interfaces by combining protein sequence and structure features. Here, feature selection is performed using random forests to avoid over-fitting. Because positive samples are scarce, our approach iteratively samples useful unlabeled data to boost the performance of hot spot prediction. The performance of our method is evaluated on a dataset generated from the ASEdb database for cross-validation and a dataset from the BID database for independent testing. Furthermore, a balanced dataset with similar numbers of hot spots and non-hot spots (65 and 66 respectively), derived from the first training dataset, is used to further validate our method. All results show that our method yields good sensitivity, accuracy and F1 score compared with existing methods. CONCLUSION: Our method boosts prediction performance of hot spots by using unlabeled data to overcome the deficiency of available training data. Experimental results show that our approach is more effective than the traditional supervised algorithms and major existing hot spot prediction methods.
Project description:INTRODUCTION:Cardiovascular disease (CVD) is the leading cause of morbidity and mortality globally. With advances in early diagnosis and treatment of CVD and increasing life expectancy, more people are surviving initial CVD events. However, models for stratifying disease severity risk in patients with established CVD for effective secondary prevention strategies are inadequate. Multivariable prognostic models to stratify CVD risk may allow personalised treatment interventions. This review aims to systematically review the existing multivariable prognostic models for the recurrence of CVD or major adverse cardiovascular events in adults with established CVD diagnosis. METHODS AND ANALYSIS:Bibliographic databases (Ovid MEDLINE, EMBASE, PsycINFO and Web of Science) will be searched, from database inception to April 2020, using terms relating to the clinical area and prognosis. A hand search of the reference lists of included studies will also be done to identify additional published studies. No restrictions on language of publications will be applied. Eligible studies present multivariable models (derived or validated) of adults (aged 16 years and over) with an established diagnosis of CVD, reporting at least one of the components of the primary outcome of major adverse cardiovascular events (defined as either coronary heart disease, stroke, peripheral artery disease, heart failure or CVD-related mortality). Reviewing will be done by two reviewers independently using the pre-defined criteria. Data will be extracted for included full-text articles. Risk of bias will be assessed using the Prediction model study Risk Of Bias ASsessment Tool (PROBAST). Prognostic models will be summarised narratively. If a model is tested in multiple validation studies, the predictive performance will be summarised using a random-effects meta-analysis model to account for any between-study heterogeneity. ETHICS AND DISSEMINATION:Ethics approval is not required. 
The results of this study will be submitted to relevant conferences for presentation and a peer-reviewed journal for publication. PROSPERO REGISTRATION NUMBER:CRD42019149111.
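The protocol above plans a random-effects meta-analysis but does not name an estimator. As one common option, the following minimal Python sketch uses the DerSimonian-Laird method of moments, which is an assumption made here for illustration. The function name `dersimonian_laird` and the toy inputs are hypothetical, and the estimates are assumed to be already on a scale suitable for pooling (performance measures such as the C-statistic are often transformed, for example to the logit scale, before pooling).

```python
import math


def dersimonian_laird(estimates, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.

    estimates: per-study effect estimates; variances: their within-study
    variances. Returns (pooled estimate, standard error, tau^2).
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q measures between-study heterogeneity.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Two hypothetical validation studies with equal within-study variance.
toy_pooled, toy_se, toy_tau2 = dersimonian_laird([0.0, 1.0], [0.1, 0.1])
```

When the between-study variance estimate is zero, the result reduces to the fixed-effect inverse-variance pooled estimate.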
Project description:BACKGROUND:Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. RESULTS:In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene's full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation's appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. AVAILABILITY AND IMPLEMENTATION:The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. CONTACT:firstname.lastname@example.org.
SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
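To make the label propagation baseline mentioned above more concrete, the following minimal Python sketch spreads scores from known positive genes over an undirected network using a random walk with restart. Treating label propagation as a random walk with restart, the function name `propagate_labels`, the parameter values, and the toy network are all assumptions for illustration; the abstract does not specify the variant that was benchmarked.

```python
def propagate_labels(neighbors, seeds, alpha=0.5, iterations=100):
    """Random walk with restart on an undirected graph.

    neighbors: dict mapping each node to a list of adjacent nodes
               (symmetric, no isolated nodes assumed).
    seeds: set of nodes with a known positive label.
    alpha: probability of following an edge rather than restarting.
    Returns a dict of propagated scores that sum to 1.
    """
    nodes = list(neighbors)
    # Restart distribution: uniform over the seed nodes.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor passes on its score split evenly over its edges.
            incoming = sum(scores[m] / len(neighbors[m]) for m in neighbors[n])
            updated[n] = (1.0 - alpha) * restart[n] + alpha * incoming
        scores = updated
    return scores


# Hypothetical four-gene path network a - b - c - d with one seed gene.
toy_network = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_scores = propagate_labels(toy_network, {"a"})
```

Genes closer to the seed receive higher scores, so ranking unlabeled genes by score gives the prioritization. A supervised alternative would instead use each gene's row of the adjacency matrix as a feature vector for a classifier.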
Project description:Natriuretic peptides are recognized as important predictors of cardiovascular events in patients with heart failure, but less is known about their prognostic importance in patients with acute coronary syndrome. We sought to determine whether B-type natriuretic peptide (BNP) and N-terminal prohormone B-type natriuretic peptide (NT-proBNP) could enhance risk prediction of a broad range of cardiovascular outcomes in patients with acute coronary syndrome and type 2 diabetes mellitus. Patients with a recent acute coronary syndrome and type 2 diabetes mellitus were prospectively enrolled in the ELIXA trial (n=5525, follow-up time 26 months). Best risk models were constructed from relevant baseline variables with and without BNP/NT-proBNP. C statistics, Net Reclassification Index, and Integrated Discrimination Index were analyzed to estimate the value of adding BNP or NT-proBNP to best risk models. Overall, BNP and NT-proBNP were the most important predictors of all outcomes examined, irrespective of history of heart failure or any prior cardiovascular disease. BNP significantly improved C statistics when added to risk models for each outcome examined, the strongest increments being in death (0.77-0.82, P<0.001), cardiovascular death (0.77-0.83, P<0.001), and heart failure (0.84-0.87, P<0.001). BNP or NT-proBNP alone predicted death as well as all other variables combined (0.77 versus 0.77). In patients with a recent acute coronary syndrome and type 2 diabetes mellitus, BNP and NT-proBNP were powerful predictors of cardiovascular outcomes beyond heart failure and death, i.e., were also predictive of MI and stroke. Natriuretic peptides added as much predictive information about death as all other conventional variables combined. URL: http://www.clinicaltrials.gov. Unique identifier: NCT01147250.
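The Net Reclassification Index mentioned above quantifies how often adding a marker moves predicted risk in the right direction. The abstract does not say which variant was used, so the following minimal Python sketch assumes the continuous (category-free) form: the net proportion of events whose risk goes up plus the net proportion of non-events whose risk goes down. The function name `continuous_nri` and the toy risks are hypothetical.

```python
def continuous_nri(old_risk, new_risk, events):
    """Category-free Net Reclassification Index.

    old_risk / new_risk: predicted risks without and with the new marker.
    events: 1 if the outcome occurred, 0 otherwise.
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, y in zip(old_risk, new_risk, events):
        if y == 1:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    # Upward moves are correct for events, downward moves for non-events.
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n


# Hypothetical risks before and after adding a biomarker to a model.
toy_old = [0.2, 0.3, 0.5, 0.4, 0.1, 0.6]
toy_new = [0.3, 0.5, 0.4, 0.2, 0.1, 0.7]
toy_outcome = [1, 1, 1, 0, 0, 0]
toy_nri = continuous_nri(toy_old, toy_new, toy_outcome)
```

The index ranges from -2 to 2. It complements the change in C statistic, which can be small even when many individual predictions shift in the correct direction.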
Project description:Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such an nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed, resulting in promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of the SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks.
The accurate and balanced performance of the SNP-IN tool makes it readily applicable to studying the rewiring of large-scale protein-protein interaction networks, and it can be useful for functional annotation of disease-associated SNPs. The SNP-IN tool is freely accessible as a web-server at http://korkinlab.org/snpintool/.
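The weighted f-measure reported above is not defined in the abstract. A common definition, assumed here, is the mean of the per-class F1 scores weighted by the number of true samples in each class. The following minimal Python sketch illustrates that definition for a 3-class problem; the function name `weighted_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f_measure(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Weight each class by its share of the true labels.
        score += (n_cls / total) * f1
    return score


# Hypothetical 3-class labels: detrimental, neutral, beneficial.
toy_true = ["det", "det", "neu", "neu", "neu", "ben"]
toy_pred = ["det", "neu", "neu", "neu", "ben", "ben"]
toy_f = weighted_f_measure(toy_true, toy_pred)
```

Weighting by class support keeps rare classes from dominating the score, but it also means that poor performance on a small class can be masked. Per-class values are worth inspecting alongside the aggregate.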