Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis.
ABSTRACT: BACKGROUND:Breast cancer is the most prevalent and among the most deadly cancers in females. Patients with breast cancer have highly variable survival lengths, indicating a need to identify prognostic biomarkers for personalized diagnosis and treatment. With the development of new technologies such as next-generation sequencing, multi-omics information is becoming available for a more thorough evaluation of a patient's condition. In this study, we aim to improve breast cancer overall survival prediction by integrating multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)). METHODS:Motivated by multi-view learning, we propose a novel strategy to integrate multi-omics data for breast cancer survival prediction by applying complementary and consensus principles. The complementary principle assumes that each omics data type contains modality-unique information. To preserve such information, we develop a concatenation autoencoder (ConcatAE) that concatenates the hidden features learned from each modality for integration. The consensus principle assumes that the disagreements among modalities upper-bound the model errors. To remove noise and discrepancies among modalities, we develop a cross-modality autoencoder (CrossAE) that maximizes the agreement among modalities to achieve a modality-invariant representation. We first validate the effectiveness of our proposed models on the MNIST simulated data. We then apply these models to the TCGA breast cancer multi-omics data for overall survival prediction. RESULTS:For breast cancer overall survival prediction, the integration of DNA methylation and miRNA expression achieves the best overall performance of 0.641 ± 0.031 with ConcatAE, and 0.63 ± 0.081 with CrossAE. Both strategies outperform baseline single-modality models using only DNA methylation (0.583 ± 0.058) or miRNA expression (0.616 ± 0.057).
CONCLUSIONS:In conclusion, we achieve improved overall survival prediction performance by utilizing either the complementary or consensus information among multi-omics data. The proposed ConcatAE and CrossAE models can inspire future deep representation-based multi-omics integration techniques. We believe these novel multi-omics integration models can benefit the personalized diagnosis and treatment of breast cancer patients.
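The two integration principles described in this abstract can be illustrated with a small sketch. It assumes the per-modality hidden features have already been produced by some encoder. The function names, the mean-squared disagreement measure, and the averaging used for the consensus representation are illustrative assumptions, not the authors' ConcatAE or CrossAE implementation.

```python
from typing import Dict, List

Hidden = Dict[str, List[float]]  # modality name -> hidden feature vector (hypothetical layout)


def concat_integration(hidden: Hidden) -> List[float]:
    """Complementary principle: keep every modality's hidden features side by side."""
    fused: List[float] = []
    for name in sorted(hidden):  # fixed order so the fused layout is reproducible
        fused.extend(hidden[name])
    return fused


def disagreement(hidden: Hidden) -> float:
    """Consensus principle: average mean-squared difference over all modality pairs.

    Assumption: every modality is encoded to the same dimensionality.
    """
    names = sorted(hidden)
    total, pairs = 0.0, 0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = hidden[names[i]], hidden[names[j]]
            total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
            pairs += 1
    return total / pairs if pairs else 0.0


def consensus_integration(hidden: Hidden) -> List[float]:
    """One simple modality-invariant summary: the element-wise mean across modalities."""
    names = sorted(hidden)
    dim = len(hidden[names[0]])
    return [sum(hidden[n][k] for n in names) / len(names) for k in range(dim)]


if __name__ == "__main__":
    h = {"methylation": [1.0, 2.0], "mirna": [3.0, 4.0]}
    print(concat_integration(h))
    print(disagreement(h))
    print(consensus_integration(h))
```

In a trained model, a disagreement term of this kind would typically be minimized during training, whereas the concatenation route leaves the modality-specific features untouched and lets the downstream survival model weigh them.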
Project description:Breast cancer is the most prevalent and among the most deadly cancers in females. Patients with breast cancer have highly variable survival rates, indicating a need to identify prognostic biomarkers. Integrating multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)) is likely to improve the accuracy of patient survival predictions compared with prediction using single-modality data. Therefore, we propose to develop a machine learning pipeline using decision-level integration of multi-omics tumor data from The Cancer Genome Atlas (TCGA) to predict the overall survival of breast cancer patients. With multi-omics data consisting of gene expression, methylation, miRNA expression, and CNVs, the top-performing model predicted survival with an accuracy of 85% and an area under the curve (AUC) of 87%. Furthermore, the model was able to identify which modalities contributed most to prediction performance, identifying methylation, miRNA, and gene expression as the best integrated classification combination. Our method not only recapitulated several breast cancer-specific prognostic biomarkers that were previously reported in the literature but also yielded several novel biomarkers. Further analysis of these biomarkers could lend insight into the molecular mechanisms that lead to poor survival.
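Decision-level integration, as opposed to the feature-level integration in the first abstract, combines the outputs of per-modality models instead of their features. The abstract does not say how the decisions are combined, so the weighted average below is only one plausible sketch; the function name, the dictionary layout, and the weights are assumptions.

```python
from typing import Dict, List, Optional


def decision_level_fusion(
    modality_probs: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Average per-patient probabilities predicted by one model per modality.

    modality_probs maps a modality name to that model's predicted probability
    for each patient (same patient order in every list). weights is optional
    and defaults to equal weighting (an assumption, not the paper's scheme).
    """
    names = sorted(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    n_patients = len(modality_probs[names[0]])
    return [
        sum(weights[name] * modality_probs[name][i] for name in names) / total_weight
        for i in range(n_patients)
    ]


if __name__ == "__main__":
    probs = {"expression": [0.9, 0.2], "methylation": [0.7, 0.4]}
    print(decision_level_fusion(probs))
```

Dropping a modality from the dictionary and re-scoring the fused output is one simple way to probe which combinations of modalities contribute most.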
Project description:High-risk neuroblastoma is a very aggressive disease, with excessive tumor growth and poor outcomes. Proper stratification of high-risk patients by prognostic outcome is important for treatment. However, survival stratification for high-risk neuroblastoma is still lacking. To fill this gap, we adopt a deep learning algorithm, Autoencoder, to integrate multi-omics data, and combine it with K-means clustering to identify two subtypes with significant survival differences. In a comparison with PCA, iCluster, and DGscore for classification based on multi-omics data integration, the Autoencoder-based classification outperforms the alternative approaches. Furthermore, we validated the classification in two independent datasets by training machine-learning classification models, and confirmed its robustness. Functional analysis revealed that MYCN amplification occurred more frequently in the ultra-high-risk subtype, in accordance with the overexpression of MYC/MYCN targets in this subtype. In summary, prognostic subtypes identified by deep learning-based multi-omics integration could not only improve our understanding of the molecular mechanisms, but also help clinicians make decisions.
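The clustering step described above, K-means applied to autoencoder latent features to obtain two subtypes, can be sketched as follows. The latent vectors are assumed to be given. The deterministic initialization from the first k points and the toy data are illustrative choices and are not taken from the study.

```python
from typing import List, Optional, Tuple

Point = List[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int = 2, n_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Plain Lloyd's algorithm on latent features.

    Assumption: centroids are initialized from the first k points so the
    result is deterministic; practical pipelines usually use random restarts.
    """
    centroids = [list(p) for p in points[:k]]
    labels: Optional[List[int]] = None
    for _ in range(n_iter):
        # Assignment step: each patient goes to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels or [], centroids


if __name__ == "__main__":
    latent = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]  # toy latent features
    subtype, centers = kmeans(latent, k=2)
    print(subtype, centers)
```

The resulting labels would then be compared with a survival test to check whether the two clusters differ in outcome.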
Project description:BACKGROUND:The analysis of integrated multi-omics data enables the identification of disease-related biomarkers that cannot be identified from a single omics profile. Although protein-level data reflects the cellular status of cancer tissue more directly than gene-level data, past studies have mainly focused on multi-omics integration using gene-level data as opposed to protein-level data. However, the use of protein-level data (such as mass spectrometry) in multi-omics integration has some limitations. For example, the correlation between the characteristics of gene-level data (such as mRNA) and protein-level data is weak, and it is difficult to detect low-abundance signaling proteins that are used to target cancer. The reverse phase protein array (RPPA) is a highly sensitive antibody-based quantification method for signaling proteins. However, the number of protein features in RPPA data is extremely low compared to the number of gene features in gene-level data. In this study, we present a new method for integrating RPPA profiles with RNA-Seq and DNA methylation profiles for survival prediction based on the integrative directed random walk (iDRW) framework proposed in our previous study. In the iDRW framework, each omics profile is merged into a single pathway profile that reflects the topological information of the pathway. In order to address the sparsity of RPPA profiles, we employ the random walk with restart (RWR) approach on the pathway network. RESULTS:Our model was validated using survival prediction analysis for a breast cancer dataset from The Cancer Genome Atlas. Our proposed model exhibited improved performance compared with other methods that utilize pathway information and also out-performed models that did not include the RPPA data utilized in our study. The risk pathways identified for breast cancer in this study were closely related to well-known breast cancer risk pathways. 
CONCLUSIONS:Our results indicated that RPPA data is useful for survival prediction for breast cancer patients under our framework. We also observed that iDRW effectively integrates RNA-Seq, DNA methylation, and RPPA profiles, while variation in the composition of the omics data can affect both prediction performance and risk pathway identification. These results suggest that omics data composition is a critical parameter for iDRW.
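As background for the random walk with restart (RWR) step mentioned in this abstract, the following is a minimal sketch of the standard RWR iteration p ← (1 − r)·W·p + r·p0 on a small undirected graph. The adjacency matrix, the seed vector, and the restart probability of 0.7 are illustrative assumptions; how the iDRW framework constructs the pathway network and seeds the walk is not specified here.

```python
from typing import List


def random_walk_with_restart(
    adj: List[List[float]],
    seeds: List[float],
    restart: float = 0.7,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> List[float]:
    """Iterate p <- (1 - restart) * W p + restart * p0 until the L1 change is below tol.

    adj[i][j] is the weight of the edge between nodes j and i; W is its
    column-normalized version, so column j holds the transition
    probabilities out of node j.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(seeds)
    p0 = [s / total for s in seeds]  # restart distribution over seed nodes
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2 with the walk seeded at node 0.
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(random_walk_with_restart(path, seeds=[1, 0, 0]))
```

The stationary scores spread the seed signal to neighboring nodes, which is one way a small set of measured features can contribute information to unmeasured nodes of a network.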
Project description:Mortality attributed to lung cancer accounts for a large fraction of cancer deaths worldwide. With increasing mortality figures, the accurate prediction of prognosis has become essential. In recent years, multi-omics analysis has emerged as a useful survival prediction tool. However, the methodology relevant to multi-omics analysis has not yet been fully established and further improvements are required for clinical applications. In this study, we developed a novel method to accurately predict the survival of patients with lung cancer using multi-omics data. With unsupervised learning techniques, survival-associated subtypes in non-small cell lung cancer were first detected using the multi-omics datasets from six categories in The Cancer Genome Atlas (TCGA). The new subtypes, referred to as integration survival subtypes, clearly divided patients into longer and shorter-surviving groups (log-rank test: p = 0.003) and we confirmed that this is independent of histopathological classification (Chi-square test of independence: p = 0.94). Next, an attempt was made to detect the integration survival subtypes using only one categorical dataset. Our machine learning model that was only trained on the reverse phase protein array (RPPA) could accurately predict the integration survival subtypes (AUC = 0.99). The predicted subtypes could also distinguish between high and low risk patients (log-rank test: p = 0.012). Overall, this study explores novel potentials of multi-omics analysis to accurately predict the prognosis of patients with lung cancer.
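The log-rank test quoted in this abstract compares survival between two groups. A minimal sketch of the standard two-group statistic is given below, using only the standard library. The toy survival times are illustrative and not data from the study, and a maintained statistics package would be preferable for real analyses.

```python
import math
from typing import List, Tuple


def logrank_test(
    times_a: List[float], events_a: List[int],
    times_b: List[float], events_b: List[int],
) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    events_* use 1 for an observed death and 0 for a censored observation.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk in B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    short = [1, 2, 3, 4, 5]
    long_ = [10, 11, 12, 13, 14]
    print(logrank_test(short, [1] * 5, long_, [1] * 5))
```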
Project description:Predicting the prognosis of pancreatic cancer is important because of the very low survival rates of patients with this particular cancer. Although several studies have used microRNA and gene expression profiles and clinical data, as well as images of tissues and cells, to predict cancer survival and recurrence, the accuracies of these approaches in the prediction of high-risk pancreatic adenocarcinoma (PAAD) still need to be improved. Accordingly, in this study, we proposed two biological features based on multi-omics datasets to predict survival and recurrence among patients with PAAD. First, the clonal expansion of cancer cells with somatic mutations was used to predict prognosis. Using whole-exome sequencing data from 134 patients with PAAD from The Cancer Genome Atlas (TCGA), we found five candidate genes that were mutated in the early stages of tumorigenesis with high cellular prevalence (CP). CDKN2A, TP53, TTN, KCNJ18, and KRAS had the highest CP values among the patients with PAAD, and survival and recurrence rates were significantly different between the patients harboring mutations in these candidate genes and those harboring mutations in other genes (p = 2.39E-03, p = 8.47E-04, respectively). Second, we generated an autoencoder to integrate the RNA sequencing, microRNA sequencing, and DNA methylation data from 134 patients with PAAD from TCGA. The autoencoder robustly reduced the dimensions of these multi-omics data, and the K-means clustering method was then used to cluster the patients into two subgroups. The subgroups of patients had significant differences in survival and recurrence (p = 1.41E-03, p = 4.43E-04, respectively). Finally, we developed a prediction model for prognosis using these two biological features and clinical data. 
When support vector machines, random forest, logistic regression, and L2 regularized logistic regression were used as prediction models, logistic regression analysis generally revealed the best performance for both disease-free survival (DFS) and overall survival (OS) (accuracy [ACC] = 0.762 and area under the curve [AUC] = 0.795 for DFS; ACC = 0.776 and AUC = 0.769 for OS). Thus, we could classify patients with a high probability of recurrence and at a high risk of poor outcomes. Our study provides insights into new personalized therapies on the basis of mutation status and multi-omics data.
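The accuracy (ACC) and area under the curve (AUC) figures reported above can be computed from predicted labels and scores as sketched below. The AUC here uses the rank-based (Mann-Whitney) formulation, and the 0.5 decision threshold and toy labels are assumptions made for illustration.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of patients whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a positive case is scored above a negative one.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [0, 0, 1, 1]                 # toy labels, e.g. 1 = poor outcome
    s = [0.1, 0.4, 0.35, 0.8]        # toy predicted risk scores
    labels = [1 if v >= 0.5 else 0 for v in s]  # assumed 0.5 threshold
    print(accuracy(y, labels), roc_auc(y, s))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two numbers can differ for the same model.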
Project description:BACKGROUND:High-throughput methodologies such as microarrays and next-generation sequencing are routinely used in cancer research, generating complex data at different omics layers. The effective integration of omics data could provide a broader insight into the mechanisms of cancer biology, helping researchers and clinicians to develop personalized therapies. RESULTS:In the context of CAMDA 2017 Neuroblastoma Data Integration challenge, we explore the use of Integrative Network Fusion (INF), a bioinformatics framework combining a similarity network fusion with machine learning for the integration of multiple omics data. We apply the INF framework for the prediction of neuroblastoma patient outcome, integrating RNA-Seq, microarray and array comparative genomic hybridization data. We additionally explore the use of autoencoders as a method to integrate microarray expression and copy number data. CONCLUSIONS:The INF method is effective for the integration of multiple data sources providing compact feature signatures for patient classification with performances comparable to other methods. Latent space representation of the integrated data provided by the autoencoder approach gives promising results, both by improving classification on survival endpoints and by providing means to discover two groups of patients characterized by distinct overall survival (OS) curves. REVIEWERS:This article was reviewed by Djork-Arné Clevert and Tieliu Shi.
Project description:BACKGROUND:Recent years have witnessed an increasing interest in multi-omics data, because these data allow for a better understanding of complex diseases such as cancer on a molecular system level. In addition, multi-omics data increase the chance of robustly identifying molecular patient sub-groups and hence open the door towards better personalized treatment of diseases. Several methods have been proposed for unsupervised clustering of multi-omics data. However, a number of challenges remain, such as the magnitude of features and the large difference in dimensionality across different omics data sources. RESULTS:We propose a multi-modal sparse denoising autoencoder framework coupled with sparse non-negative matrix factorization to robustly cluster patients based on multi-omics data. The proposed model specifically leverages pathway information to effectively reduce the dimensionality of omics data into a pathway- and patient-specific score profile. As a consequence, our method allows us to understand which pathway is a feature of which particular patient cluster. Moreover, recently proposed machine learning techniques allow us to disentangle the specific impact of each individual omics feature on a pathway score. We applied our method to cluster patients in several cancer datasets using gene expression, miRNA expression, DNA methylation and CNVs, demonstrating the possibility of obtaining biologically plausible disease subtypes characterized by specific molecular features. Comparison against several competing methods showed a competitive clustering performance. In addition, post-hoc analysis of somatic mutations and clinical data provided supporting evidence and interpretation of the identified clusters. CONCLUSIONS:Our suggested multi-modal sparse denoising autoencoder approach allows for an effective and interpretable integration of multi-omics data on the pathway level while addressing the high-dimensional character of omics data.
Patient specific pathway score profiles derived from our model allow for a robust identification of disease subgroups.
Project description:With the high prevalence of breast cancer, it is urgent to find out the intrinsic difference between various subtypes, so as to infer the underlying mechanisms. Given the available multi-omics data, their proper integration can improve the accuracy of breast cancer subtype recognition. In this study, DeepMO, a model using deep neural networks based on multi-omics data, was employed for classifying breast cancer subtypes. Three types of omics data including mRNA data, DNA methylation data, and copy number variation (CNV) data were collected from The Cancer Genome Atlas (TCGA). After data preprocessing and feature selection, each type of omics data was input into the deep neural network, which consists of an encoding subnetwork and a classification subnetwork. The results of DeepMO based on multi-omics on binary classification are better than other methods in terms of accuracy and area under the curve (AUC). Moreover, compared with other methods using single omics data and multi-omics data, DeepMO also had a higher prediction accuracy on multi-classification. We also validated the effect of feature selection on DeepMO. Finally, we analyzed the enrichment gene ontology (GO) terms and biological pathways of these significant genes, which were discovered during the feature selection process. We believe that the proposed model is useful for multi-omics data analysis.
Project description:International initiatives such as the Molecular Taxonomy of Breast Cancer International Consortium are collecting multiple data sets at different genome scales with the aim of identifying novel cancer biomarkers and predicting patient survival. To analyze such data, several machine learning, bioinformatics, and statistical methods have been applied, among them neural networks such as autoencoders. Although these models provide a good statistical learning framework to analyze multi-omic and/or clinical data, there is a distinct lack of work on how to integrate diverse patient data and identify the optimal design best suited to the available data. In this paper, we investigate several autoencoder architectures that integrate a variety of cancer patient data types (e.g., multi-omics and clinical data). We perform extensive analyses of these approaches and provide a clear methodological and computational framework for designing systems that enable clinicians to investigate cancer traits and translate the results into clinical applications. We demonstrate how these networks can be designed, built, and, in particular, applied to tasks of integrative analyses of heterogeneous breast cancer data. The results show that these approaches yield relevant data representations that, in turn, lead to accurate and stable diagnosis.
Project description:Cancer subtypes can improve our understanding of cancer, and suggest more precise treatment for patients. Multi-omics molecular data can characterize cancers at different levels. Up to now, many computational methods that integrate multi-omics data for cancer subtyping have been proposed. However, there are no consistent criteria to evaluate the integration methods due to the lack of gold standards (e.g., the number of subtypes in a specific cancer). Since comprehensive evaluation and comparison between different methods serves as a useful tool or guideline for users to select an optimal method for their own purpose, we develop a scalable platform, CEPICS, for comprehensively evaluating and comparing multi-omics data integration methods in cancer subtyping. Given a user-specified maximum number of subtypes, k-max, CEPICS provides (1) cancer subtyping results using up to five built-in state-of-the-art integration methods under the number of subtypes from two to k-max, (2) a report including the evaluation of each user-selected method and comparisons across them using clustering performance metrics and clinical survival analysis, and (3) an overall analysis of subtyping results by different methods representing a robust cancer subtype prediction for samples. Furthermore, users can upload subtyping results of their own methods to compare with the built-in methods. CEPICS is implemented as an R package and is freely available at https://github.com/GaoLabXDU/CEPICS.