Derivation of molecular signatures for breast cancer recurrence prediction using a two-way validation approach.
ABSTRACT: Previous studies have demonstrated the potential value of gene expression signatures in assessing the risk of post-surgical breast cancer recurrence; however, many of these predictive models have been derived using simple computational algorithms and validated internally or using one-way validation on a single dataset. We have recently developed a new feature selection algorithm that overcomes some limitations inherent to high-dimensional data analysis. In this study, we applied this algorithm to two publicly available gene expression datasets obtained from over 400 patients with breast cancer to investigate whether we could derive more accurate prognostic signatures and reveal common predictive factors across independent datasets. We compared the performance of three advanced computational algorithms using a robust two-way validation method, in which one dataset was used to train a prediction model that was then blindly tested on the other dataset. The experiment was then repeated in the reverse direction. Analyses identified prognostic signatures that, while comprising only 10-13 genes, significantly outperformed previously reported signatures for breast cancer evaluation. The cross-validation approach revealed CEGP1 and PRAME as major candidates for breast cancer biomarker development.
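The two-way validation scheme described above can be sketched as follows. This is a minimal illustration only: the nearest-centroid classifier, the function names, and the toy data are assumptions standing in for the study's actual algorithms and datasets, which the abstract does not specify.

```python
from statistics import mean

def fit_centroids(samples, labels):
    """Nearest-centroid stand-in for a real classifier: one mean vector per class."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*rows)] for y, rows in by_class.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

def accuracy(centroids, samples, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)

def two_way_validation(dataset_a, dataset_b):
    """Train on A and test on B, then train on B and test on A."""
    (xa, ya), (xb, yb) = dataset_a, dataset_b
    return {
        "A_to_B": accuracy(fit_centroids(xa, ya), xb, yb),
        "B_to_A": accuracy(fit_centroids(xb, yb), xa, ya),
    }

# Toy data (hypothetical): two "genes" per sample, 0 = no recurrence, 1 = recurrence.
dataset_a = ([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.8]], [0, 0, 1, 1])
dataset_b = ([[0.8, 0.3], [1.1, 0.0], [0.0, 0.9], [0.3, 1.2]], [0, 0, 1, 1])
result = two_way_validation(dataset_a, dataset_b)
print(result)
```

The key point is that neither dataset contributes to the model that is tested on it, and that a signature is only considered robust if it performs well in both directions.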
Project description: We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. Conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution under all conditions, because the class-specific gene signatures change with the condition. A trained classifier may therefore work well under one condition but not under another. To address this problem with current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that the signature is unique to its associated class under any condition, and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73%, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from an Arabic population to assign a PAM50 classification to each tumor based on its gene expression profile. A novel algorithm, CrossLink, for cross-condition prediction of cancer classes was proposed. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.
Project description: BACKGROUND: Novel prognostic markers are needed so that newly diagnosed breast cancer patients do not undergo unnecessary therapy. Various studies based on microarray gene expression datasets have generated gene signatures to predict prognosis outcomes while ignoring the large amount of information contained in established clinical markers. Moreover, small sample sizes in individual microarray datasets remain a bottleneck in generating robust gene signatures, so the resulting signatures show limited predictive power. The aim of this study is to achieve high classification accuracy for the good prognosis group and then achieve high classification accuracy for the poor prognosis group. METHODS: We propose a novel algorithm called the IPRE (integrated prognosis risk estimation) algorithm. We used integrated microarray datasets from multiple studies to increase the sample size (~2,700 samples). The IPRE algorithm consists of a virtual chromosome for the extraction of the prognostic gene signature, which has 79 genes, and a multivariate logistic regression model that incorporates clinical data along with expression data to generate the risk score formula that accurately categorizes breast cancer patients into two prognosis groups. RESULTS: The evaluation on two testing datasets showed that the IPRE algorithm achieved high classification accuracies of 82% and 87%, which were far higher than those of any existing algorithm.
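A logistic risk score that combines an expression-derived value with clinical variables, as the METHODS above describe in general terms, might look like the sketch below. The feature names, coefficients, intercept, and 0.5 cutoff are all hypothetical; the abstract does not give the actual IPRE formula, and this is not a reconstruction of it.

```python
import math

def risk_score(intercept, coefficients, features):
    """Logistic model: estimated probability of poor prognosis for one patient."""
    linear = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))

def prognosis_group(score, cutoff=0.5):
    """Split patients into two prognosis groups at an assumed cutoff."""
    return "poor" if score >= cutoff else "good"

# Hypothetical coefficients for illustration; not taken from the study.
coefficients = {"signature_score": 1.5, "tumor_size_cm": 0.4, "node_positive": 0.8}
intercept = -2.0

patient_low = {"signature_score": 0.0, "tumor_size_cm": 1.0, "node_positive": 0}
patient_high = {"signature_score": 1.0, "tumor_size_cm": 3.0, "node_positive": 1}

for patient in (patient_low, patient_high):
    score = risk_score(intercept, coefficients, patient)
    print(round(score, 3), prognosis_group(score))
```

In practice the coefficients would be estimated from the training data by fitting the logistic regression, and the cutoff would be chosen on training data rather than fixed in advance.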
Project description: BACKGROUND: Neuroblastoma is the most common pediatric solid tumor of the sympathetic nervous system. Development of improved predictive tools for patient stratification is a crucial requirement for neuroblastoma therapy. Several studies have used gene expression-based signatures to stratify neuroblastoma patients and demonstrated a clear advantage of adding genomic analysis to risk assessment. There is little overlap among signatures, and merging their prognostic potential would be advantageous. Here, we describe a new strategy to merge published neuroblastoma-related gene signatures into a single, highly accurate Multi-Signature Ensemble (MuSE) classifier of neuroblastoma (NB) patient outcome. METHODS: Gene expression profiles of 182 neuroblastoma tumors, subdivided into three independent datasets, were used in the various phases of development and validation of the NB-MuSE-classifier. Thirty-three signatures were evaluated for patient outcome prediction using 22 classification algorithms each, generating 726 classifiers and prediction results. The best-performing algorithm for each signature was selected and validated on an independent dataset, and the 20 signatures performing with an accuracy >= 80% were retained. RESULTS: We combined the 20 predictions associated with the corresponding signatures into a single outcome predictor by selecting the best-performing algorithm. The best performance was obtained by the Decision Table algorithm, which produced the NB-MuSE-classifier, characterized by an external validation accuracy of 94%. Kaplan-Meier curves and the log-rank test demonstrated that patients with good and poor outcome predictions by the NB-MuSE-classifier have significantly different survival (p < 0.0001).
Survival curves constructed on subgroups of patients divided on the basis of known prognostic markers suggested an excellent stratification of localized and stage 4s tumors, but more data are needed to prove this point. CONCLUSIONS: The NB-MuSE-classifier is based on an ensemble approach that merges twenty heterogeneous, neuroblastoma-related gene signatures to blend their discriminating power, rather than their numeric values, into a single, highly accurate patient outcome predictor. The novelty of our approach lies in the way the gene expression signatures are integrated, by optimally associating them with a single paradigm ultimately integrated into a single classifier. This model can be exported to other types of cancer and to diseases for which dedicated databases exist.
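The signature-selection step in the METHODS (33 signatures × 22 algorithms = 726 classifiers, best algorithm per signature, retention at accuracy >= 80% on an independent dataset) can be illustrated with a small sketch. The function name, table layout, signature names, and accuracy values below are hypothetical; only the selection rule follows the abstract.

```python
def select_signatures(training_accuracy, validation_accuracy, threshold=0.80):
    """Pick the best algorithm per signature, then keep signatures that validate.

    training_accuracy:   {signature: {algorithm: accuracy in the development phase}}
    validation_accuracy: {(signature, algorithm): accuracy on an independent dataset}
    """
    retained = {}
    for signature, per_algorithm in training_accuracy.items():
        best_algorithm = max(per_algorithm, key=per_algorithm.get)
        if validation_accuracy[(signature, best_algorithm)] >= threshold:
            retained[signature] = best_algorithm
    return retained

# Number of signature/algorithm combinations reported in the abstract.
n_classifiers = 33 * 22

# Toy accuracy tables with made-up values.
training_accuracy = {
    "sig_A": {"svm": 0.85, "tree": 0.78},
    "sig_B": {"svm": 0.70, "tree": 0.74},
    "sig_C": {"svm": 0.81, "tree": 0.90},
}
validation_accuracy = {
    ("sig_A", "svm"): 0.83,
    ("sig_B", "tree"): 0.72,
    ("sig_C", "tree"): 0.80,
}
retained = select_signatures(training_accuracy, validation_accuracy)
print(n_classifiers, retained)
```

The retained per-signature predictions would then serve as inputs to a second-level classifier (the Decision Table algorithm in the study), which this sketch does not attempt to reproduce.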
Project description: BACKGROUND: Michiels et al. (Lancet 2005; 365: 488-92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The genes most frequently identified in the separate resamplings were put forward as a 'gold standard'. On a higher level, breast cancer datasets collected by different institutions can be considered as resamplings from the underlying breast cancer population. The limited overlap between published prognostic signatures confirms the trend of signature instability identified by the resampling strategy. Six breast cancer datasets, totaling 947 samples, all measured on the Affymetrix platform, are currently available. This provides a unique opportunity to employ a substantial dataset to investigate the effects of pooling datasets on classifier accuracy, signature stability and enrichment of functional categories. RESULTS: We show that the resampling strategy produces a suboptimal ranking of genes, which cannot be considered a 'gold standard'. When pooling breast cancer datasets, we observed a synergetic effect on the classification performance in 73% of the cases. We also observed a significant positive correlation between the number of datasets that are pooled, the validation performance, the number of genes selected, and the enrichment of specific functional categories. In addition, we evaluated the support for five explanations that have been postulated for the limited overlap of signatures. CONCLUSION: The limited overlap of current signature genes can be attributed to small sample size. Pooling datasets results in more accurate classification and a convergence of signature genes. We therefore advocate the analysis of new data within the context of a compendium, rather than analysis in isolation.
Project description: Lung cancer has one of the highest mortality rates of malignant neoplasms. Lung adenocarcinoma (LUAD) is one of the most common types of lung cancer. DNA methylation is more stable than gene expression and could be used as a biomarker for early tumor diagnosis. This study aimed to screen potential DNA methylation signatures to facilitate the diagnosis and prognosis of LUAD, and to integrate gene expression and DNA methylation data of LUAD to identify functional epigenetic modules. We systematically integrated gene expression and DNA methylation data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO); bioinformatic models and algorithms were implemented to identify signatures and functional modules for LUAD. Three promising diagnostic and five potential prognostic signatures for LUAD were screened by rigorous filtration, and our tumor-normal classifier and prognostic model were validated in two separate datasets. Additionally, we identified functional epigenetic modules in the TCGA LUAD dataset and the independent GEO validation dataset. Interestingly, the MUC1 module was identified in both datasets. The potential biomarkers for the diagnosis and prognosis of LUAD are expected to be further verified in clinical practice to aid in the diagnosis and treatment of LUAD.
Project description: The emerging role of the cancer cell-immune cell interface in shaping tumorigenesis/anticancer immunotherapy has increased the need to identify prognostic biomarkers. Hence, our primary aim was to identify the immunogenic cell death (ICD)-derived metagene signatures in breast, lung and ovarian cancer that associate with improved patient survival. To this end, we analyzed the prognostic impact of differential gene expression of 33 pre-clinically validated ICD parameters through a large-scale meta-analysis involving 3,983 patients ('discovery' dataset) across lung (1,432), breast (1,115) and ovarian (1,436) malignancies. The main results were also substantiated in 'validation' datasets consisting of 818 patients with the same cancer types (i.e. 285 breast/274 lung/259 ovarian). The ICD-associated parameters exhibited a highly clustered and largely cancer type-specific prognostic impact. Interestingly, we delineated ICD-derived consensus-metagene signatures that exhibited a positive prognostic impact that was either cancer type-independent or cancer type-specific. Importantly, most of these ICD-derived consensus-metagenes acted as attractor-metagenes and thereby 'attracted' highly co-expressing sets of genes or convergent-metagenes. These convergent-metagenes also exhibited positive prognostic impact in the respective cancer types. Remarkably, we found that the cancer type-independent consensus-metagene acted as an 'attractor' for cancer-specific convergent-metagenes. This reaffirms that the immunological prognostic landscape of cancer tends to segregate between cancer-independent and cancer type-specific gene signatures. Moreover, this prognostic landscape was largely dominated by the classical T cell activity/infiltration/function-related biomarkers. Interestingly, each cancer type tended to associate with biomarkers representing a specific T cell activity or function rather than pan-T cell biomarkers.
Thus, our analysis confirms that ICD can serve as a platform for discovery of novel prognostic metagenes.
Project description: BACKGROUND: Robust transcriptional signatures in cancer can be identified by data similarity-driven meta-analysis of gene expression profiles. An unbiased data integration and interrogation strategy has not previously been available. METHODS AND FINDINGS: We implemented and performed a large meta-analysis of breast cancer gene expression profiles from 223 datasets containing 10,581 human breast cancer samples using a novel data similarity-based approach (iterative EXALT). Cancer gene expression signatures extracted from individual datasets were clustered by data similarity and consolidated into a meta-signature with a recurrent and concordant gene expression pattern. A retrospective survival analysis was performed to evaluate the predictive power of a novel meta-signature deduced from transcriptional profiling studies of human breast cancer. Validation cohorts consisting of 6,011 breast cancer patients from 21 different breast cancer datasets and 1,110 patients with other malignancies (lung and prostate cancer) were used to test the robustness of our findings. During the iterative EXALT analysis, 633 signatures were grouped by their data similarity and formed 121 signature clusters. From the 121 signature clusters, we identified a unique meta-signature (BRmet50) based on a cluster of 11 signatures sharing a phenotype related to highly aggressive breast cancer. In patients with breast cancer, there was a significant association between BRmet50 and disease outcome, and the prognostic power of BRmet50 was independent of common clinical and pathologic covariates. Furthermore, the prognostic value of BRmet50 was not specific to breast cancer, as it also predicted survival in prostate and lung cancers. CONCLUSIONS: We have established and implemented a novel data similarity-driven meta-analysis strategy.
Using this approach, we identified a transcriptional meta-signature (BRmet50) in breast cancer, and the prognostic performance of BRmet50 was robust and applicable across a wide range of cancer-patient populations.
Project description: Background: Hepatocellular carcinoma (HCC) is one of the most common malignant liver tumors worldwide. However, there have been no systematic studies to establish glycolysis-related gene pair (GRGP) signatures for patients with HCC. Therefore, this study aimed to establish novel GRGP signatures to better predict the prognosis of HCC. Methods: Based on data from the Gene Expression Omnibus, The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium databases, glycolysis-related mRNAs were used to construct GRGPs. Cox regression was applied to establish a seventeen-GRGP signature in the TCGA dataset, which was verified in two validation datasets (European and American, and Asian). Results: Seventeen prognostic GRGPs (HMMR_PFKFB1, CHST1_GYS2, MERTK_GYS2, GPC1_GYS2, LDHA_GOT2, IDUA_GNPDA1, IDUA_ME2, IDUA_G6PD, IDUA_GPC1, MPI_GPC1, SDC2_LDHA, PRPS1_PLOD2, GALK1_IER3, MET_PLOD2, GUSB_IGFBP3, IL13RA1_IGFBP3 and CYB5A_IGFBP3) were identified as significant prognostic factors for patients with HCC in the TCGA dataset, and together constituted a GRGP signature. Patients with HCC were classified into a low-risk group and a high-risk group based on the GRGP signature. The GRGP signature was a significant independent prognostic indicator for patients with HCC in TCGA (log-rank P = 2.898e-14). Consistent with the TCGA dataset, patients in the low-risk group had a longer OS in the two validation datasets (European and American: P = 1.143e-02; Asian: P = 6.342e-08). Additionally, the GRGP signature was also validated as a significant independent prognostic indicator in the two validation datasets. Conclusion: The seventeen GRGPs and their signature might be molecular biomarkers and therapeutic targets for patients with HCC.
Project description: BACKGROUND: Highly parallel analysis of gene expression has recently been used to identify gene sets or 'signatures' to improve patient diagnosis and risk stratification. Once a signature is generated, traditional statistical testing is used to evaluate its prognostic performance. However, due to the dimensionality of microarrays, this can lead to false interpretation of these signatures. PRINCIPAL FINDINGS: A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. We found that a signature consisting of randomly selected genes has, on average, a 10% chance of reaching significance when assessed in a single dataset, but this can range from 1% to ~40% depending on the dataset in question. Increasing the number of validation datasets markedly reduces this number. CONCLUSIONS: We have shown that an arbitrary cut-off value is not suitable for evaluating signature significance in this type of research; the cut-off should instead be defined for each dataset separately. Our method can be used to establish and evaluate the performance of any derived gene signature in a dataset by comparing its performance to that of thousands of randomly generated signatures. It will be of most interest for cases where few data are available and testing in multiple datasets is limited.
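The idea of benchmarking a signature against randomly chosen signatures can be sketched as follows. The scoring rule (mean expression over the signature genes), the function names, and the synthetic data are assumptions made for illustration; the abstract specifies only that random signatures are evaluated by ROC AUC.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def signature_auc(expression, labels, genes):
    """Score each patient by mean expression over the signature genes, then take the AUC."""
    scores = [sum(row[g] for g in genes) / len(genes) for row in expression]
    return auc(scores, labels)

def random_signature_benchmark(expression, labels, candidate, n_random=1000, seed=0):
    """Return the candidate's AUC and the fraction of same-size random signatures
    whose AUC matches or beats it."""
    rng = random.Random(seed)
    n_genes = len(expression[0])
    observed = signature_auc(expression, labels, candidate)
    null = [
        signature_auc(expression, labels, rng.sample(range(n_genes), len(candidate)))
        for _ in range(n_random)
    ]
    return observed, sum(a >= observed for a in null) / n_random

# Synthetic data (hypothetical): 40 patients, 50 genes, only gene 0 is informative.
data_rng = random.Random(42)
labels = [0] * 20 + [1] * 20
expression = []
for y in labels:
    row = [data_rng.random() for _ in range(50)]
    row[0] = 2 * y + data_rng.random()  # gene 0 separates the two classes completely
    expression.append(row)

observed, fraction = random_signature_benchmark(expression, labels, candidate=[0])
print(observed, fraction)
```

The returned fraction plays the role of a dataset-specific empirical threshold: a candidate signature is only convincing if few random signatures of the same size do as well in the same dataset.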
Project description: BACKGROUND: Following visible successes on a wide range of predictive tasks, machine learning techniques are attracting substantial interest from medical researchers and clinicians. We address the need for capacity development in this area by providing a conceptual introduction to machine learning alongside a practical guide to developing and evaluating predictive algorithms using freely available open-source software and public domain data. METHODS: We demonstrate the use of machine learning techniques by developing three predictive models for cancer diagnosis using descriptions of nuclei sampled from breast masses. These algorithms include regularized General Linear Model regression (GLMs), Support Vector Machines (SVMs) with a radial basis function kernel, and single-layer Artificial Neural Networks. The publicly available dataset describing the breast mass samples (N=683) was randomly split into evaluation (n=456) and validation (n=227) samples. We trained algorithms on data from the evaluation sample before they were used to predict the diagnostic outcome in the validation dataset. We compared the predictions made on the validation dataset with the real-world diagnostic decisions to calculate the accuracy, sensitivity, and specificity of the three models. We explored the use of averaging and voting ensembles to improve predictive performance. We provide a step-by-step guide to developing algorithms using the open-source R statistical programming environment. RESULTS: The trained algorithms were able to classify cell nuclei with high accuracy (.94-.96), sensitivity (.97-.99), and specificity (.85-.94). Maximum accuracy (.96) and area under the curve (.97) were achieved using the SVM algorithm. Prediction performance increased marginally (accuracy = .97, sensitivity = .99, specificity = .95) when algorithms were arranged into a voting ensemble.
CONCLUSIONS: We use a straightforward example to demonstrate the theory and practice of machine learning for clinicians and medical researchers. The principles which we demonstrate here can be readily applied to other complex tasks including natural language processing and image recognition.
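A voting ensemble and the three reported metrics can be illustrated with a short sketch. It is written in Python rather than the R environment used by the authors, and the prediction vectors are made-up toy values, not outputs of the study's GLM, SVM, or neural network models.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_model)]

def confusion_metrics(predicted, actual, positive=1):
    """Accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    tn = sum(p != positive and a != positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy predictions (hypothetical): 1 = malignant, 0 = benign.
actual = [1, 1, 1, 0, 0, 0]
glm = [1, 1, 0, 0, 0, 1]
svm = [1, 0, 1, 0, 0, 0]
ann = [1, 1, 1, 0, 1, 0]

ensemble = majority_vote([glm, svm, ann])
metrics = confusion_metrics(ensemble, actual)
print(ensemble, metrics)
```

With an odd number of models and binary labels there are no tied votes; in this toy case each model makes different errors, so the ensemble corrects all of them, which is the mechanism by which voting can improve on the individual models.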