Block Forests: random forests for blocks of clinical and omics covariate data.
ABSTRACT: BACKGROUND:In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to capture complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. RESULTS:We identified one variant termed "block forest" that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degree of improvement over random survival forest varied strongly across data sets. Moreover, mandatorily considering all clinical covariates improved the performance. 
This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. CONCLUSIONS:The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.
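As a rough illustration of what "randomizing the block choice in the split point selection" could look like, the following Python sketch draws candidate covariates block by block for a single split. It is not the authors' implementation: the function name `sample_block_candidates`, the inclusion probability `block_prob`, and the per-block `mtry` values are all hypothetical, chosen only to make the idea concrete.

```python
import random


def sample_block_candidates(blocks, mtry, block_prob=0.5, rng=None):
    """Draw split candidates for one node, one block at a time.

    blocks     -- dict: block name -> list of covariate names
    mtry       -- dict: block name -> number of covariates to draw
    block_prob -- assumed probability that a block takes part in a split
    """
    rng = rng or random.Random()
    # Randomised block choice: each block is included independently.
    chosen = [name for name in blocks if rng.random() < block_prob]
    if not chosen:
        # Guarantee at least one block so that a split is always possible.
        chosen = [rng.choice(list(blocks))]
    candidates = {}
    for name in chosen:
        k = min(mtry[name], len(blocks[name]))
        # Sample covariates without replacement inside the block.
        candidates[name] = rng.sample(blocks[name], k)
    return candidates


if __name__ == "__main__":
    # Hypothetical toy blocks: clinical covariates plus two omics types.
    blocks = {
        "clinical": ["age", "stage"],
        "rna": ["g1", "g2", "g3", "g4"],
        "methylation": ["m1", "m2", "m3"],
    }
    mtry = {"clinical": 2, "rna": 2, "methylation": 1}
    print(sample_block_candidates(blocks, mtry, rng=random.Random(1)))
```

A tree-growing routine would then search for the best split point only among the returned covariates. Setting the inclusion probability of the clinical block to 1 would correspond to always considering the clinical covariates, which is one way to read the "mandatory" option mentioned in the abstract.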
Project description:Background:Both histopathological image features and genomics data have been associated with the survival outcome of cancer patients. However, integrating features of histopathological images, genomics and other omics to improve prognosis prediction has not been reported in head and neck squamous cell carcinoma (HNSCC). Methods:A dataset of 216 HNSCC patients was derived from The Cancer Genome Atlas (TCGA) with information on clinical characteristics, genetic mutation, RNA sequencing, protein expression and histopathological images. Patients were randomly assigned to training (n = 108) or validation (n = 108) sets. We extracted 593 quantitative image features and used the random forest algorithm with 10-fold cross-validation to build prognostic models for overall survival (OS) in the training set, then compared the area under the time-dependent receiver operating characteristic curve (AUC) in the validation set. Results:In the validation set, histopathological image features had significant predictive value for OS (5-year AUC = 0.784). The histopathology + omics models showed better predictive performance than genomics, transcriptomics or proteomics alone. Moreover, the multi-omics model incorporating image features, genomics, transcriptomics and proteomics reached the maximal 1-, 3-, and 5-year AUC of 0.871, 0.908, and 0.929, with the most significant survival difference (HR = 10.66, 95%CI: 5.06-26.8, p < 0.001). Decision curve analysis also revealed a better net benefit of the multi-omics model. Conclusion:Histopathological images could provide complementary features to improve prognostic performance for HNSCC patients. The integrative model of histopathological image features and omics data might serve as an effective tool for survival prediction and risk stratification in clinical practice.
Project description:Glioblastoma multiforme (GBM) has been recognized as the most lethal type of malignant brain tumor. Despite efforts of the medical and research community, patients' survival remains extremely low. Multi-omic profiles (including DNA sequence, methylation and gene expression) provide rich information about the tumor. These profiles are likely to reveal processes that may be predictive of patient survival. However, the integration of multi-omic profiles, which are high dimensional and heterogeneous in nature, poses great challenges. The goal of this work was to develop models for prediction of survival of GBM patients that can integrate clinical information and multi-omic profiles, using multi-layered Bayesian regressions. We apply the methodology to data from GBM patients from The Cancer Genome Atlas (TCGA, n = 501) to evaluate whether integrating multi-omic profiles (SNP-genotypes, methylation, copy number variants and gene expression) with clinical information (demographics as well as treatments) leads to an improved ability to predict patient survival. The proposed Bayesian models were used to estimate the proportion of variance explained by clinical covariates and omics and to evaluate prediction accuracy in cross-validation (using the area under the Receiver Operating Characteristic curve, AUC). Among clinical and demographic covariates, age (AUC = 0.664) and the use of temozolomide (AUC = 0.606) were the most predictive of survival. Among omics, methylation (AUC = 0.623) and gene expression (AUC = 0.593) were more predictive than either SNP (AUC = 0.539) or CNV (AUC = 0.547). While there was a clear association between age and methylation, the integration of age, the use of temozolomide, and either gene expression or methylation led to a substantial increase in AUC in cross-validation (AUC = 0.718). 
Finally, among the genes whose methylation was higher in aging brains, we observed a higher enrichment of these genes being also differentially methylated in cancer.
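The AUC values reported above can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal Python sketch of that pairwise definition follows. The function name `auc` and the toy scores are hypothetical, and the sketch says nothing about how the study itself computed AUC within cross-validation.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- predicted scores, higher means "more likely positive"
    labels -- 1 for positive cases, 0 for negative cases
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: three of the four positive/negative pairs are ordered correctly.
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ordering, which puts the reported values for SNP (0.539) and CNV (0.547) only slightly above chance.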
Project description:Predicting the prognosis of pancreatic cancer is important because of the very low survival rates of patients with this particular cancer. Although several studies have used microRNA and gene expression profiles and clinical data, as well as images of tissues and cells, to predict cancer survival and recurrence, the accuracies of these approaches in the prediction of high-risk pancreatic adenocarcinoma (PAAD) still need to be improved. Accordingly, in this study, we proposed two biological features based on multi-omics datasets to predict survival and recurrence among patients with PAAD. First, the clonal expansion of cancer cells with somatic mutations was used to predict prognosis. Using whole-exome sequencing data from 134 patients with PAAD from The Cancer Genome Atlas (TCGA), we found five candidate genes that were mutated in the early stages of tumorigenesis with high cellular prevalence (CP). CDKN2A, TP53, TTN, KCNJ18, and KRAS had the highest CP values among the patients with PAAD, and survival and recurrence rates were significantly different between the patients harboring mutations in these candidate genes and those harboring mutations in other genes (p = 2.39E-03, p = 8.47E-04, respectively). Second, we generated an autoencoder to integrate the RNA sequencing, microRNA sequencing, and DNA methylation data from 134 patients with PAAD from TCGA. The autoencoder robustly reduced the dimensions of these multi-omics data, and the K-means clustering method was then used to cluster the patients into two subgroups. The subgroups of patients had significant differences in survival and recurrence (p = 1.41E-03, p = 4.43E-04, respectively). Finally, we developed a prediction model for prognosis using these two biological features and clinical data. 
When support vector machines, random forest, logistic regression, and L2 regularized logistic regression were used as prediction models, logistic regression analysis generally revealed the best performance for both disease-free survival (DFS) and overall survival (OS) (accuracy [ACC] = 0.762 and area under the curve [AUC] = 0.795 for DFS; ACC = 0.776 and AUC = 0.769 for OS). Thus, we could classify patients with a high probability of recurrence and at a high risk of poor outcomes. Our study provides insights into new personalized therapies on the basis of mutation status and multi-omics data.
Project description:BACKGROUND:The analysis of integrated multi-omics data enables the identification of disease-related biomarkers that cannot be identified from a single omics profile. Although protein-level data reflects the cellular status of cancer tissue more directly than gene-level data, past studies have mainly focused on multi-omics integration using gene-level data as opposed to protein-level data. However, the use of protein-level data (such as mass spectrometry) in multi-omics integration has some limitations. For example, the correlation between the characteristics of gene-level data (such as mRNA) and protein-level data is weak, and it is difficult to detect low-abundance signaling proteins that are used to target cancer. The reverse phase protein array (RPPA) is a highly sensitive antibody-based quantification method for signaling proteins. However, the number of protein features in RPPA data is extremely low compared to the number of gene features in gene-level data. In this study, we present a new method for integrating RPPA profiles with RNA-Seq and DNA methylation profiles for survival prediction based on the integrative directed random walk (iDRW) framework proposed in our previous study. In the iDRW framework, each omics profile is merged into a single pathway profile that reflects the topological information of the pathway. In order to address the sparsity of RPPA profiles, we employ the random walk with restart (RWR) approach on the pathway network. RESULTS:Our model was validated using survival prediction analysis for a breast cancer dataset from The Cancer Genome Atlas. Our proposed model exhibited improved performance compared with other methods that utilize pathway information and also out-performed models that did not include the RPPA data utilized in our study. The risk pathways identified for breast cancer in this study were closely related to well-known breast cancer risk pathways. 
CONCLUSIONS:Our results indicated that RPPA data is useful for survival prediction for breast cancer patients under our framework. We also observed that iDRW effectively integrates RNA-Seq, DNA methylation, and RPPA profiles, while variation in the composition of the omics data can affect both prediction performance and risk pathway identification. These results suggest that omics data composition is a critical parameter for iDRW.
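For readers unfamiliar with random walk with restart (RWR), the following Python sketch shows the generic iteration on a small undirected graph. It is not the iDRW implementation: the function name, the restart probability of 0.3, the convergence tolerance, and the toy network are all assumptions made for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Generic RWR: p <- (1 - restart) * W p + restart * p0.

    adj   -- square adjacency matrix as a list of lists
    seeds -- non-negative seed weights, one per node (not all zero)
    """
    n = len(adj)
    # Column-normalise so that W[i][j] is the probability of stepping j -> i.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    total = float(sum(seeds))
    p0 = [s / total for s in seeds]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2 with the walk restarting at node 0.
    adj = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
    print(random_walk_with_restart(adj, seeds=[1, 0, 0]))
```

The stationary scores decay with distance from the seed node, which is how signal from a small number of measured nodes can be propagated to neighbouring nodes in a sparse profile.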
Project description:Breast cancer (BC) is the second most common type of cancer and a major cause of death for women. Commonly, BC patients are assigned to risk groups based on the combination of prognostic and prediction factors (eg, patient age, tumor size, tumor grade, hormone receptor status, etc). Although this approach is able to identify risk groups with different prognosis, patients are highly heterogeneous in their response to treatments. To improve the prediction of BC patients, we extended clinical models (including prognostic and prediction factors with whole-omic data) to integrate omics profiles for gene expression and copy number variants (CNVs). We describe a modeling framework that is able to incorporate clinical risk factors, high-dimensional omics profiles, and interactions between omics and non-omic factors (eg, treatment). We used the proposed modeling framework and data from METABRIC (Molecular Taxonomy of Breast Cancer Consortium) to assess the impact on the accuracy of BC patient survival predictions when omics and omic-by-treatment interactions are being considered. Our analysis shows that omics and omic-by-treatment interactions explain a sizable fraction of the variance on survival time that is not explained by commonly used clinical covariates. The sizable interaction effects observed, together with the increase in prediction accuracy, suggest that whole-omic profiles could be used to improve prognosis prediction among BC patients.
Project description:BACKGROUND:Breast cancer is the most prevalent and among the most deadly cancers in females. Patients with breast cancer have highly variable survival lengths, indicating a need to identify prognostic biomarkers for personalized diagnosis and treatment. With the development of new technologies such as next-generation sequencing, multi-omics information is becoming available for a more thorough evaluation of a patient's condition. In this study, we aim to improve breast cancer overall survival prediction by integrating multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)). METHODS:Motivated by multi-view learning, we propose a novel strategy to integrate multi-omics data for breast cancer survival prediction by applying complementary and consensus principles. The complementary principle assumes each -omics data contains modality-unique information. To preserve such information, we develop a concatenation autoencoder (ConcatAE) that concatenates the hidden features learned from each modality for integration. The consensus principle assumes that the disagreements among modalities upper bound the model errors. To remove the noise and discrepancies among modalities, we develop a cross-modality autoencoder (CrossAE) to maximize the agreement among modalities to achieve a modality-invariant representation. We first validate the effectiveness of our proposed models on the MNIST simulated data. We then apply these models to the TCGA breast cancer multi-omics data for overall survival prediction. RESULTS:For breast cancer overall survival prediction, the integration of DNA methylation and miRNA expression achieves the best overall performance of 0.641 ± 0.031 with ConcatAE, and 0.63 ± 0.081 with CrossAE. Both strategies outperform baseline single-modality models using only DNA methylation (0.583 ± 0.058) or miRNA expression (0.616 ± 0.057). 
CONCLUSIONS:In conclusion, we achieve improved overall survival prediction performance by utilizing either the complementary or consensus information among multi-omics data. The proposed ConcatAE and CrossAE models can inspire future deep representation-based multi-omics integration techniques. We believe these novel multi-omics integration models can benefit the personalized diagnosis and treatment of breast cancer patients.
Project description:BACKGROUND:Colon cancer is common worldwide and is the leading cause of cancer-related death. Multiple levels of omics data are available due to the development of sequencing technologies. In this study, we proposed an integrative prognostic model for colon cancer based on the integration of clinical and multi-omics data. METHODS:In total, 344 patients were included in this study. Clinical, gene expression, DNA methylation and miRNA expression data were retrieved from The Cancer Genome Atlas (TCGA). To accommodate the high dimensionality of omics data, unsupervised clustering was used as dimension reduction method. The bias-corrected Harrell's concordance index was used to verify which clustering result provided the best prognostic performance. Finally, we proposed a prognostic prediction model based on the integration of clinical data and multi-omics data. Uno's concordance index with cross-validation was used to compare the discriminative performance of the prognostic model constructed with different covariates. RESULTS:Combinations of clinical and multi-omics data can improve prognostic performance, as shown by the increase of the bias-corrected Harrell's concordance of the prognostic model from 0.7424 (clinical features only) to 0.7604 (clinical features and three types of omics features). Additionally, 2-year, 3-year and 5-year Uno's concordance statistics increased from 0.7329, 0.7043, and 0.7002 (clinical features only) to 0.7639, 0.7474 and 0.7597 (clinical features and three types of omics features), respectively. CONCLUSION:In conclusion, this study successfully combined clinical and multi-omics data for better prediction of colon cancer prognosis.
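The concordance statistics quoted above measure how often a model ranks patients in the same order as their observed survival. The Python sketch below implements the plain Harrell's concordance index for right-censored data. It is only a minimal illustration: it does not include the bias correction or Uno's weighting used in the study, and the function name and toy data are hypothetical.

```python
def harrell_c_index(times, events, risk):
    """Plain Harrell's C for right-censored survival data.

    times  -- observed follow-up times
    events -- 1 if the event was observed, 0 if censored
    risk   -- predicted risk scores, higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher risk failed earlier
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the third patient is censored at time 6.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risk = [0.9, 0.4, 0.5, 0.1]
    print(harrell_c_index(times, events, risk))  # 0.8
```

In the toy example, five pairs are comparable and four of them are ordered correctly, giving 0.8. A value of 0.5 indicates random ordering, so the reported increase from 0.7424 to 0.7604 reflects a modest gain in ranking accuracy.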
Project description:More than 300 million people worldwide experience depression; annually, ~800,000 people die by suicide. Unfortunately, conventional interview-based diagnosis is insufficient to accurately predict a psychiatric status. We developed machine learning models to predict depression and suicide risk using blood methylome and transcriptome data from 56 suicide attempters (SAs), 39 patients with major depressive disorder (MDD), and 87 healthy controls. Our random forest classifiers showed accuracies of 92.6% in distinguishing SAs from MDD patients, 87.3% in distinguishing MDD patients from controls, and 86.7% in distinguishing SAs from controls. We also developed regression models for predicting psychiatric scales with R2 values of 0.961 and 0.943 for Hamilton Rating Scale for Depression-17 and Scale for Suicide Ideation, respectively. Multi-omics data were used to construct psychiatric status prediction models for improved mental health treatment.
Project description:Mortality attributed to lung cancer accounts for a large fraction of cancer deaths worldwide. With increasing mortality figures, the accurate prediction of prognosis has become essential. In recent years, multi-omics analysis has emerged as a useful survival prediction tool. However, the methodology relevant to multi-omics analysis has not yet been fully established and further improvements are required for clinical applications. In this study, we developed a novel method to accurately predict the survival of patients with lung cancer using multi-omics data. With unsupervised learning techniques, survival-associated subtypes in non-small cell lung cancer were first detected using the multi-omics datasets from six categories in The Cancer Genome Atlas (TCGA). The new subtypes, referred to as integration survival subtypes, clearly divided patients into longer and shorter-surviving groups (log-rank test: p = 0.003) and we confirmed that this is independent of histopathological classification (Chi-square test of independence: p = 0.94). Next, an attempt was made to detect the integration survival subtypes using only one categorical dataset. Our machine learning model that was only trained on the reverse phase protein array (RPPA) could accurately predict the integration survival subtypes (AUC = 0.99). The predicted subtypes could also distinguish between high and low risk patients (log-rank test: p = 0.012). Overall, this study explores novel potentials of multi-omics analysis to accurately predict the prognosis of patients with lung cancer.
Project description:Recent technological advances and international efforts, such as The Cancer Genome Atlas (TCGA), have made available several pan-cancer datasets encompassing multiple omics layers with detailed clinical information in large collections of samples. The need has thus arisen for the development of computational methods aimed at improving cancer subtyping and biomarker identification from multi-modal data. Here we apply the Integrative Network Fusion (INF) pipeline, which combines multiple omics layers exploiting Similarity Network Fusion (SNF) within a machine learning predictive framework. INF includes a feature ranking scheme (rSNF) on SNF-integrated features, used by a classifier over juxtaposed multi-omics features (juXT). In particular, we show instances of INF implementing Random Forest (RF) and linear Support Vector Machine (LSVM) as the classifier, and two baseline RF and LSVM models are also trained on juXT. A compact RF model, called rSNFi, trained on the intersection of top-ranked biomarkers from the two approaches juXT and rSNF is finally derived. All the classifiers are run in a 10x5-fold cross-validation schema to warrant reproducibility, following the guidelines for an unbiased Data Analysis Plan by the US FDA-led initiatives MAQC/SEQC. INF is demonstrated on four classification tasks on three multi-modal TCGA oncogenomics datasets. Gene expression, protein expression and copy number variants are used to predict estrogen receptor status (BRCA-ER, N = 381) and breast invasive carcinoma subtypes (BRCA-subtypes, N = 305), while gene expression, miRNA expression and methylation data are used as predictor layers for acute myeloid leukemia and renal clear cell carcinoma survival (AML-OS, N = 157; KIRC-OS, N = 181). In test, INF achieved similar Matthews Correlation Coefficient (MCC) values and 97% to 83% smaller feature sizes (FS), compared with juXT for BRCA-ER (MCC: 0.83 vs. 0.80; FS: 56 vs. 1801) and BRCA-subtypes (0.84 vs. 0.80; 302 vs. 1801), improving KIRC-OS performance (0.38 vs. 0.31; 111 vs. 2319). INF predictions are generally more accurate in test than one-dimensional omics models, with smaller signatures too, where transcriptomics consistently play the leading role. Overall, the INF framework effectively integrates multiple data levels in oncogenomics classification tasks, improving over the performance of single layers alone and naive juxtaposition, and provides compact signature sizes.
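The Matthews Correlation Coefficient used to score the classifiers above can be computed directly from a confusion matrix. The Python sketch below covers the binary case only (the BRCA-subtypes task would need the multiclass generalisation, which is not shown). The function name `mcc` and the toy labels are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Binary Matthews correlation coefficient for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy labels: two true positives, two true negatives, one error of each kind.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]
    print(round(mcc(y_true, y_pred), 3))  # 0.333
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, so it remains informative on imbalanced classes where plain accuracy can be misleading.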