Multi-view singular value decomposition for disease subtyping and genetic associations.
ABSTRACT: Accurate classification of patients with a complex disease into subtypes has important implications for medicine and healthcare. Using more homogeneous disease subtypes in genetic association analysis will facilitate the detection of new genetic variants that are not detectable using the non-differentiated disease phenotype. Subtype differentiation can also improve diagnostic classification, which can in turn inform clinical decision making and treatment matching. Currently, the most sophisticated methods for disease subtyping perform cluster analysis using patients' clinical features. Without guidance from genetic information, the resultant subtypes are likely to be suboptimal and efforts at genetic association may fail. We propose a multi-view matrix decomposition approach that integrates clinical features with genetic markers to detect confirmatory evidence for a disease subtype. This approach groups patients into clusters that are consistent between the clinical and genetic dimensions of data; it simultaneously identifies the clinical features that define the subtype and the genotypes associated with the subtype. A simulation study validated the proposed approach, showing that it identified hypothesized subtypes and associated features. In comparison to the latest biclustering and multi-view data analytics using real-life disease data, the proposed approach identified clinical subtypes of a disease that differed from each other more significantly in the genetic markers, thus demonstrating the superior performance of the proposed approach. The proposed algorithm is an effective and superior alternative to the disease subtyping methods employed to date. Integration of phenotypic features with genetic markers in the subtyping analysis is a promising approach to identify concurrently disease subtypes and their genetic associations.
Project description:Genetic association analysis of complex diseases has been limited by heterogeneity in their clinical manifestations and genetic etiology. Research has made it possible to differentiate homogeneous subtypes of the disease phenotype. Currently, the most sophisticated subtyping methods perform unsupervised cluster analysis using only clinical features of a disorder, resulting in subtypes for which genetic association may be limited. In this study, we seek to derive a novel multiview data analytic method that integrates two views of the data: the clinical features and the genetic markers of the same set of patients. Our method is based on multiobjective programming that is capable of clinically categorizing a disease phenotype so as to discover genetically different subtypes. We optimize two objectives jointly: 1) in cluster analysis, the derived clusters should differ significantly in clinical features; 2) these clusters should be well separated by classifiers constructed from genetic markers. Extensive computational experiments with two substance-use disorders using two populations show that the proposed algorithm is superior to existing subtyping methods.
Project description:Complex diseases are caused by a combination of genetic and environmental factors, creating a difficult challenge for diagnosis and defining subtypes. This review article describes how distinct disease subtypes can be identified through integration and analysis of clinical and multi-omics data. A broad shift toward molecular subtyping of disease using genetic and omics data has yielded successful results in cancer and other complex diseases. To determine molecular subtypes, patients are first classified by applying clustering methods to different types of omics data, then these results are integrated with clinical data to characterize distinct disease subtypes. An example of this molecular-data-first approach is in research on Autism Spectrum Disorder (ASD), a spectrum of social communication disorders marked by tremendous etiological and phenotypic heterogeneity. In the case of ASD, omics data such as exome sequences and gene and protein expression data are combined with clinical data such as psychometric testing and imaging to enable subtype identification. Novel ASD subtypes have been proposed, such as CHD8, using this molecular subtyping approach. Broader use of molecular subtyping in complex disease research is impeded by data heterogeneity, diversity of standards, and ineffective analysis tools. The future of molecular subtyping for ASD and other complex diseases calls for an integrated resource to identify disease mechanisms, classify new patients, and inform effective treatment options. This in turn will empower and accelerate precision medicine and personalized healthcare.
Project description:PURPOSE:Breast cancer is a heterogeneous disease, and although advances in molecular subtyping have been achieved in recent years, most subtyping strategies target individual genes independent of one another and primarily concentrate on proliferative markers. The contributions of biological processes and immune patterns have been neglected in breast cancer subtype stratification. METHODS:We performed a gene set variation analysis to simplify the information on biological processes using hallmark terms and to decompose immune cell data using the immune cell gene terms on 985 breast invasive ductal/lobular carcinoma RNAseq samples in the TCGA database. RESULTS:The samples were gathered into three clusters following implementation of the t-SNE and DBSCAN algorithms and were categorized as 'hallmark-tsne' subtypes. Here, we identified a high-risk luminal A dominant breast cancer subtype (C3) that displayed increased motility, cancer stem cell-like features, a higher expression of hormone/luminal-related genes, a lower expression of proliferation-related genes and immune dysfunction. With regard to immune dysfunction, we observed that the motility-increased C3 subtype exhibited high granulocyte colony stimulating factor (G-CSF) expression accompanied by neutrophil aggregation. Cancer cells that produce high levels of G-CSF can stimulate neutrophils to form neutrophil extracellular traps, which promote cancer cell migration. This finding sheds light on one potential explanation for why the C3 subtype correlates with poor prognosis. CONCLUSIONS:The hallmark-tsne subtypes confirmed again that even the luminal A subtype is heterogeneous and can be further subdivided. The biological processes and immune heterogeneity of breast cancer must be understood to facilitate the improvement of clinical treatments.
Project description:BACKGROUND:Molecular subtyping of triple-negative breast cancers (TNBCs) via gene expression profiling is essential for understanding the molecular essence of this heterogeneous disease and for guiding individualized treatment. We aim to devise a clinically practical method based on immunohistochemistry (IHC) for the molecular subtyping of TNBCs. MATERIALS AND METHODS:By analyzing the RNA sequencing data on TNBCs from Fudan University Shanghai Cancer Center (FUSCC) (n = 360) and The Cancer Genome Atlas data set (n = 158), we determined markers that can identify specific molecular subtypes. We performed immunohistochemical staining on tumor sections of 210 TNBCs from FUSCC, established an IHC-based classifier, and applied it to another two cohorts (n = 183 and 214). RESULTS:We selected androgen receptor (AR), CD8, FOXC1, and DCLK1 as immunohistochemical markers and classified TNBCs into five subtypes based on the staining results: (a) IHC-based luminal androgen receptor (IHC-LAR; AR-positive [+]), (b) IHC-based immunomodulatory (IHC-IM; AR-negative [-], CD8+), (c) IHC-based basal-like immune-suppressed (IHC-BLIS; AR-, CD8-, FOXC1+), (d) IHC-based mesenchymal (IHC-MES; AR-, CD8-, FOXC1-, DCLK1+), and (e) IHC-based unclassifiable (AR-, CD8-, FOXC1-, DCLK1-). The κ statistic indicated substantial agreement between the IHC-based classification and mRNA-based classification. Multivariate survival analysis suggested that our IHC-based classification was an independent prognostic factor for relapse-free survival. Transcriptomic data and pathological observations implied potential treatment strategies for different subtypes. The IHC-LAR subtype showed relative activation of HER2 pathway. The IHC-IM subtype tended to exhibit an immune-inflamed phenotype characterized by the infiltration of CD8+ T cells into tumor parenchyma. The IHC-BLIS subtype showed high expression of a VEGF signature.
The IHC-MES subtype displayed activation of JAK/STAT3 signaling pathway. CONCLUSION:We developed an IHC-based approach to classify TNBCs into molecular subtypes. This IHC-based classification can provide additional information for prognostic evaluation. It allows for subgrouping of TNBC patients in clinical trials and evaluating the efficacy of targeted therapies within certain subtypes. IMPLICATIONS FOR PRACTICE:An immunohistochemistry (IHC)-based classification approach was developed for triple-negative breast cancer (TNBC), which exhibited substantial agreement with the mRNA expression-based classification. This IHC-based classification (a) allows for subgrouping of TNBC patients in large clinical trials and evaluating the efficacy of targeted therapies within certain subtypes, (b) will contribute to the practical application of subtype-specific treatment for patients with TNBC, and (c) can provide additional information beyond traditional prognostic factors in relapse prediction.
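The five-way IHC rule stated in the RESULTS above reads as a sequential decision cascade, which can be sketched as follows. This is a minimal illustration, not the authors' classifier: the function name `classify_tnbc` is hypothetical, and each marker is assumed to have already been scored as positive or negative upstream, since the abstract does not give staining cutoffs.

```python
def classify_tnbc(ar: bool, cd8: bool, foxc1: bool, dclk1: bool) -> str:
    """Assign an IHC-based TNBC subtype from four marker calls.

    Each argument is True when the marker was scored positive. How a
    stain is called positive is an assumption left to the caller; the
    abstract does not state the cutoffs.
    """
    if ar:        # AR+ regardless of the other markers
        return "IHC-LAR"
    if cd8:       # AR-, CD8+
        return "IHC-IM"
    if foxc1:     # AR-, CD8-, FOXC1+
        return "IHC-BLIS"
    if dclk1:     # AR-, CD8-, FOXC1-, DCLK1+
        return "IHC-MES"
    return "IHC-based unclassifiable"  # all four markers negative


if __name__ == "__main__":
    # Hypothetical marker calls for three tumors, for illustration only.
    for calls in [(True, False, False, False),
                  (False, False, True, True),
                  (False, False, False, False)]:
        print(calls, "->", classify_tnbc(*calls))
```

Because the markers are checked in a fixed order, a tumor that is positive for several markers is assigned by the first positive one, which matches the nested definitions (a) through (e).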
Project description:The distinct molecular subtypes of lung cancer are defined by monogenic biomarkers, such as EGFR, KRAS, and ALK rearrangement. Tumor mutation burden (TMB), one measure of genomic instability, is a potential biomarker for response to immunotherapy. Molecular subtyping based on TMB has not been well characterized in lung adenocarcinomas in the Chinese population. Here we performed molecular subtyping based on TMB with the published whole exome sequencing data of 101 lung adenocarcinomas and compared the different features of the classified subtypes, including clinical features, somatic driver genes, and mutational signatures. We found that patients with lower TMB have a longer disease-free survival, and higher TMB is associated with smoking and aging. Analysis of somatic driver genes and mutational signatures demonstrates a significant association between somatic RYR2 mutations and the subtype with higher TMB. Molecular subtyping based on TMB is a potential prognostic marker for lung adenocarcinoma. Signature 4 and the mutation of RYR2 are highlighted in the TMB-High group. The mutation of RYR2 is a significant biomarker associated with high TMB in lung adenocarcinoma.
Project description:PURPOSE:Molecular subtyping for pancreatic cancer has made substantial progress in recent years, facilitating the optimization of existing therapeutic approaches to improve clinical outcomes in pancreatic cancer. With advances in treatment combinations and choices, it is becoming increasingly important to determine ways to place patients on the best therapies upfront. Although various molecular subtyping systems for pancreatic cancer have been proposed, consensus regarding proposed subtypes, as well as their relative clinical utility, remains largely unknown and presents a natural barrier to wider clinical adoption. EXPERIMENTAL DESIGN:We assess three major subtype classification schemas in the context of results from two clinical trials and by meta-analysis of publicly available expression data to assess statistical criteria of subtype robustness and overall clinical relevance. We then developed a single-sample classifier (SSC) using penalized logistic regression based on the most robust and replicable schema. RESULTS:We demonstrate that a tumor-intrinsic two-subtype schema is most robust, replicable, and clinically relevant. We developed Purity Independent Subtyping of Tumors (PurIST), a SSC with robust and highly replicable performance on a wide range of platforms and sample types. We show that PurIST subtypes have meaningful associations with patient prognosis and have significant implications for treatment response to FOLFIRINOX. CONCLUSIONS:The flexibility and utility of PurIST on low-input samples such as tumor biopsies allows it to be used at the time of diagnosis to facilitate the choice of effective therapies for patients with pancreatic ductal adenocarcinoma and should be considered in the context of future clinical trials.
Project description:Classification of ovarian cancer by morphologic features has a limited effect on serous ovarian cancer (SOC) treatment and prognosis. Here, we proposed a new system for SOC subtyping based on the molecular categories from the Cancer Genome Atlas project. We analyzed the DNA methylation, protein, microRNA, and gene expression of 1203 samples from 599 serous ovarian cancer patients. These samples were divided into nine subtypes based on RNA-seq data, and each subtype was found to be associated with the activation and/or suppression of the following four biological processes: immunoactivity, hormone metabolic, mesenchymal development and the MAPK signaling pathway. We also identified four DNA methylation, two protein expression, six microRNA sequencing and four pathway subtypes. By integrating the subtyping results across different omics platforms, we found that most RNA-seq subtypes overlapped with one or two subtypes from other omics data. Our study sheds light on the molecular mechanisms of SOC and provides a new perspective for the more accurate stratification of its subtypes.
Project description:MOTIVATION:Recent technology developments have made it possible to generate various kinds of omics data, which provides opportunities to better solve problems such as disease subtyping or disease mapping using more comprehensive omics data jointly. Among many developed data-integration methods, the similarity network fusion (SNF) method has shown a great potential to identify new disease subtypes through separating similar subjects using multi-omics data. SNF effectively fuses similarity networks with pairwise patient similarity measures from different types of omics data into one fused network using both shared and complementary information across multiple types of omics data. RESULTS:In this article, we proposed an association-signal-annotation boosted similarity network fusion (ab-SNF) method, adding feature-level association signal annotations as weights aiming to up-weight signal features and down-weight noise features when constructing subject similarity networks to boost the performance in disease subtyping. In various simulation studies, the proposed ab-SNF outperforms the original SNF approach without weights. Most importantly, the improvement in the subtyping performance due to association-signal-annotation weights is amplified in the integration process. Applications to somatic mutation data, DNA methylation data and gene expression data of three cancer types from The Cancer Genome Atlas project suggest that the proposed ab-SNF method consistently identifies new subtypes in each cancer that more accurately predict patient survival and are more biologically meaningful. AVAILABILITY AND IMPLEMENTATION:The R package abSNF is freely available for downloading from https://github.com/pfruan/abSNF. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
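The core idea described above, up-weighting signal features and down-weighting noise features when measuring how similar two subjects are, can be illustrated with a weighted Euclidean distance. This is only a sketch under stated assumptions: the function names are hypothetical, the normalization of weights to sum to one is an illustrative choice, and the exact weighting scheme and similarity kernel used by the abSNF package are not given in the abstract.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two subjects' feature vectors.

    `weights` holds one non-negative association-signal weight per
    feature (assumed to be supplied by the caller). Weights are
    normalized to sum to 1, so only their relative sizes matter.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return math.sqrt(
        sum((w / total) * (a - b) ** 2 for a, b, w in zip(x, y, weights))
    )


def distance_matrix(subjects, weights):
    """Pairwise weighted distances for a list of subjects."""
    n = len(subjects)
    return [
        [weighted_distance(subjects[i], subjects[j], weights) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two hypothetical subjects that differ mostly in the second feature.
    a, b = [0.0, 0.0], [3.0, 100.0]
    print(weighted_distance(a, b, [1.0, 1.0]))  # both features count equally
    print(weighted_distance(a, b, [1.0, 0.0]))  # second feature ignored as noise
```

In a fusion pipeline, a distance matrix of this kind would be converted to a similarity network per data type before the networks are fused; that step is omitted here.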
Project description:Purpose: Molecular subtyping for pancreatic cancer has made substantial progress in recent years, facilitating the optimization of existing therapeutic approaches to improve clinical outcomes in pancreatic cancer. With advances in treatment combinations and choices, it is becoming increasingly important to determine ways to place patients on the best therapies upfront. Although various molecular subtyping systems for pancreatic cancer have been proposed, consensus regarding proposed subtypes, as well as their relative clinical utility, remains largely unknown and presents a natural barrier to wider clinical adoption. Methods: We assess three major subtype classification schemas in the context of results from two clinical trials and by meta-analysis of publicly available expression data to assess statistical criteria of subtype robustness and overall clinical relevance. We then developed a single-sample classifier (SSC) using penalized logistic regression based on the most robust and replicable schema. Results: We demonstrate that a tumor-intrinsic two-subtype schema is most robust, replicable, and clinically relevant. We developed Purity Independent Subtyping of Tumors (PurIST), a SSC with robust and highly replicable performance on a wide range of platforms and sample types. We show that PurIST subtypes have meaningful associations with patient prognosis and have significant implications for treatment response to FOLFIRINOX. Conclusions: The flexibility and utility of PurIST on low-input samples such as tumor biopsies allows it to be used at the time of diagnosis to facilitate the choice of effective therapies for patients with pancreatic ductal adenocarcinoma and should be considered in the context of future clinical trials. Overall design: Analysis of gene expression in pancreatic adenocarcinoma (PDAC).
For primary PDAC tumors, data include 47 flash frozen (FF), 5 formalin-fixed and paraffin embedded (FFPE) and 45 fine-needle aspiration (FNA) samples; for patient-derived xenografts (PDX), data include 18 FF, 7 FFPE and 3 FNA samples.
Project description:Genetically diverse pathogens (such as Human Immunodeficiency virus type 1, HIV-1) are frequently stratified into phylogenetically or immunologically defined subtypes for classification purposes. Computational identification of such subtypes is helpful in surveillance, epidemiological analysis and detection of novel variants, e.g., circulating recombinant forms in HIV-1. A number of conceptually and technically different techniques have been proposed for determining the subtype of a query sequence, but there is not a universally optimal approach. We present a model-based phylogenetic method for automatically subtyping an HIV-1 (or other viral or bacterial) sequence, mapping the location of breakpoints and assigning parental sequences in recombinant strains as well as computing confidence levels for the inferred quantities. Our Subtype Classification Using Evolutionary ALgorithms (SCUEAL) procedure is shown to perform very well in a variety of simulation scenarios, runs in parallel when multiple sequences are being screened, and matches or exceeds the performance of existing approaches on typical empirical cases. We applied SCUEAL to all available polymerase (pol) sequences from two large databases, the Stanford Drug Resistance database and the UK HIV Drug Resistance Database. Comparing with subtypes which had previously been assigned revealed that a minor but substantial (approximately 5%) fraction of pure subtype sequences may in fact be within- or inter-subtype recombinants. A free implementation of SCUEAL is provided as a module for the HyPhy package and the Datamonkey web server. Our method is especially useful when an accurate automatic classification of an unknown strain is desired, and is positioned to complement and extend faster but less accurate methods. 
Given the increasingly frequent use of HIV subtype information in studies focusing on the effect of subtype on treatment, clinical outcome, pathogenicity and vaccine design, the importance of accurate, robust and extensible subtyping procedures is clear.