Breast Cancer Gene Expression Data from Frankfurt Series
ABSTRACT: Pooling microarray datasets seems to be a reasonable way to increase sample size when studying a heterogeneous disease such as breast cancer. Different methods for adapting datasets to one another have been used in the literature. We analyzed the influence of these strategies using a pool of 3,030 Affymetrix U133A microarrays from breast cancer samples. We present data on the resulting concordance with biochemical assays of well-known parameters and highlight critical pitfalls. We further propose a method for inferring cutoff values directly from the data without prior knowledge of the true result. The cutoffs derived by this method displayed high specificity and sensitivity. Markers with a bimodal distribution, such as ER, PgR, and HER2, discriminate different biological subtypes of disease with distinct clinical courses. In contrast, markers with a continuous distribution, such as the proliferation marker Ki67, instead describe the composition of the mixture of cells in the tumor. Fresh frozen surgical biopsy samples from consecutive patients were analyzed on Affymetrix HGU133A.
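The abstract does not say how the cutoffs are inferred, so the following is only a minimal sketch of one common way to derive a cutoff from a bimodal marker without reference labels. It uses Otsu's criterion (the split that maximizes between-class variance), which is an assumption and not necessarily the authors' method. The function name `bimodal_cutoff` and the simulated expression values are hypothetical.

```python
import random


def bimodal_cutoff(values):
    """Return the split point that maximizes between-class variance.

    This is Otsu's criterion applied to a 1-D sample. It is used here as an
    illustrative stand-in, not as the method of the original study.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    best_score, best_cut = -1.0, None
    left_sum = 0.0
    for i in range(1, n):  # candidate split: xs[:i] versus xs[i:]
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no cutoff can separate two equal values
        w0, w1 = i / n, (n - i) / n
        mu0 = left_sum / i
        mu1 = (total - left_sum) / (n - i)
        score = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if score > best_score:
            best_score = score
            best_cut = (xs[i - 1] + xs[i]) / 2  # midpoint between neighbors
    if best_cut is None:
        raise ValueError("all values are identical")
    return best_cut


# Simulated log2 expression of a bimodal marker (hypothetical data):
# 300 "negative" samples around 8 and 700 "positive" samples around 12.
rng = random.Random(0)
low = [rng.gauss(8.0, 0.5) for _ in range(300)]
high = [rng.gauss(12.0, 0.5) for _ in range(700)]
cutoff = bimodal_cutoff(low + high)
print(f"inferred cutoff: {cutoff:.2f}")
```

A cutoff obtained this way is only meaningful for markers that are actually bimodal. For a continuously distributed marker such as Ki67, the split point would be arbitrary.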
Project description: Gene expression profiling of surgical biopsies from 74 breast cancer patients of different subtypes from the Hamburg dataset. Fresh frozen pre-treatment surgical biopsy samples from breast cancer patients were analyzed on Affymetrix HGU133A.
Project description: Background: Estrogen is a chemical messenger that influences many breast cancers because it helps cells to grow and divide. These cancers are often known as estrogen-responsive cancers, in which estrogen receptor occupies the surface of the cells. The successful treatment of breast cancers requires understanding gene expression, identifying tumor markers, acquiring knowledge of cellular pathways, etc. In this paper we introduce our proposed triclustering algorithm δ-TRIMAX, which aims to find genes that are coexpressed over a subset of samples across a subset of time points. Here we introduce a novel mean-squared residue for such a 3D dataset. Our proposed algorithm yields triclusters that have a mean-squared residue score below a threshold δ. Results: We applied our algorithm to one simulated dataset and one real-life dataset. The real-life dataset is a time-series dataset from an estrogen-induced breast cancer cell line. To establish the biological significance of genes belonging to the resultant triclusters, we performed gene ontology, KEGG pathway and transcription factor binding site (TFBS) enrichment analysis. Additionally, we represent each resultant tricluster by computing its eigengene and verify whether its eigengene is also differentially expressed at early, middle and late estrogen-responsive stages. We also identified hub-genes for each resultant tricluster and verified whether the hub-genes are associated with breast cancer. Through our analysis, CCL2, CD47, NFIB, BRD4, HPGD, CSNK1E, NPC1L1, PTEN, PTPN2 and ADAM9 were identified as hub-genes that are already known to be associated with breast cancer. The other genes identified as hub-genes might be associated with breast cancer or estrogen-responsive elements. The TFBS enrichment analysis also reveals that transcription factor POU2F1 binds to the promoter region of ESR1, which encodes estrogen receptor α. Transcription factor E2F1 binds to the promoter regions of the coexpressed genes MCM7, ANAPC1 and WEE1. Conclusions: Thus our integrative approach provides insights into breast cancer prognosis.
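The abstract mentions a mean-squared residue for a 3D (gene × sample × time) dataset without giving the formula. The sketch below assumes the natural three-way extension of the Cheng and Church biclustering residue, r_ijk = a_ijk - a_iJK - a_IjK - a_IJk + 2·a_IJK, where capital indices denote means over the tricluster. This form is an assumption, and the function name `mean_squared_residue` and the toy data are hypothetical.

```python
def mean_squared_residue(cube, genes, samples, times):
    """Mean-squared residue of the tricluster (genes, samples, times).

    `cube[i][j][k]` is the expression of gene i in sample j at time point k.
    The residue formula is an assumed 3-D extension of the Cheng-Church
    residue and may differ in detail from the published algorithm.
    """
    I, J, K = list(genes), list(samples), list(times)

    def mean(vals):
        vals = list(vals)
        return sum(vals) / len(vals)

    # Means over two of the three dimensions, and the overall mean.
    a_iJK = {i: mean(cube[i][j][k] for j in J for k in K) for i in I}
    a_IjK = {j: mean(cube[i][j][k] for i in I for k in K) for j in J}
    a_IJk = {k: mean(cube[i][j][k] for i in I for j in J) for k in K}
    a_IJK = mean(cube[i][j][k] for i in I for j in J for k in K)

    total = 0.0
    for i in I:
        for j in J:
            for k in K:
                r = cube[i][j][k] - a_iJK[i] - a_IjK[j] - a_IJk[k] + 2 * a_IJK
                total += r * r
    return total / (len(I) * len(J) * len(K))


# Toy data: a perfectly additive pattern (gene + sample + time effects)
# is perfectly coherent, so its residue should be (numerically) zero.
gene_fx, sample_fx, time_fx = [0.0, 1.5, 3.0], [0.2, 0.4], [1.0, 2.0, 4.0]
cube = [[[g + s + t for t in time_fx] for s in sample_fx] for g in gene_fx]
additive_msr = mean_squared_residue(cube, range(3), range(2), range(3))

# Perturbing one entry breaks the coherence and raises the score.
cube[1][1][1] += 2.0
perturbed_msr = mean_squared_residue(cube, range(3), range(2), range(3))
print(additive_msr, perturbed_msr)
```

Under this definition, a candidate tricluster would be accepted when its score falls below the threshold δ named in the abstract.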
Project description: Background: While global breast cancer gene expression data sets have considerable commonality in terms of their data content, the populations that they represent and the data collection methods utilized can be quite disparate. We sought to assess the extent and consequence of these systematic differences with respect to identifying clinically significant prognostic groups. Methods: We ascertained how effectively unsupervised clustering employing randomly generated sets of genes could segregate tumors into prognostic groups using four well-characterized breast cancer data sets. Results: Using a common set of 5,000 randomly generated lists (70 genes/list), the percentage of clusters with significant differences in metastasis latencies (HR p-value < 0.01) was 62%, 15%, 21% and 0% in the NKI2 (Netherlands Cancer Institute), Wang, TRANSBIG and KJX64/KJ125 data sets, respectively. Among ER-positive tumors, the percentages were 38%, 11%, 4% and 0%, respectively. Few random lists were predictive among ER-negative tumors in any data set. Clustering was associated with ER status and, after globally adjusting for the effects of ER-alpha gene expression, the percentages were 25%, 33%, 1% and 0%, respectively. The impact of adjusting for ER status depended on the extent of confounding between ER-alpha gene expression and markers of proliferation. Conclusion: A statistically significant association between a given gene list and prognosis is highly likely to be found in the NKI2 data set due to its large sample size and the interrelationship between ER-alpha expression and markers of proliferation. In most respects, the TRANSBIG data set generated similar outcomes to the NKI2 data set, although its smaller sample size led to fewer statistically significant results.
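The random-list design described above can be sketched as follows. This is only an illustration of drawing one reproducible, shared set of random gene lists and summarizing the share of lists below a significance threshold. The clustering and survival analysis that would produce the p-values are not shown, and the function names, the seed, and the placeholder gene identifiers are all hypothetical.

```python
import random


def make_random_gene_lists(gene_ids, n_lists=5000, list_size=70, seed=0):
    """Draw `n_lists` random gene lists, each of `list_size` distinct genes.

    Sorting the pool and fixing the seed makes the same lists reusable
    across every data set being compared.
    """
    rng = random.Random(seed)
    pool = sorted(gene_ids)
    return [rng.sample(pool, list_size) for _ in range(n_lists)]


def percent_significant(p_values, alpha=0.01):
    """Percentage of lists whose p-value falls below `alpha`."""
    p_values = list(p_values)
    return 100.0 * sum(p < alpha for p in p_values) / len(p_values)


# Placeholder identifiers standing in for genes shared by all data sets.
genes = [f"GENE{i:05d}" for i in range(12000)]
gene_lists = make_random_gene_lists(genes)
print(len(gene_lists), len(gene_lists[0]))
```

In such a design, each list would be used to cluster the tumors of one data set, and a hazard-ratio p-value per list would then be passed to `percent_significant`.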
Project description: Introduction: The potential of applying data analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with the double objective of discovering patterns characteristic of cases with good or poor outcome and using them for accurate and justifiable predictions, and of deriving novel information about the role of genes, the existence of special classes of cases, and other factors. Method: Data were analyzed using the combinatorics and optimization-based method of LAD, recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines. Results: LAD identified a subset of 17 of the 25,000 genes capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods, and has similar or better accuracy than those reported in other studies. Out of the 17 genes identified by LAD, three (respectively, five) were shown to play a significant role in determining poor (respectively, good) prognosis. Two new classes of patients (described by similar sets of covering patterns, gene expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and the test sets of van 't Veer have differing characteristics. Conclusion: The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data (that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient). Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.
Project description: Methods to model dynamic changes in gene expression at a genome-wide level are not currently sufficient for large (temporally rich or single-cell) datasets. Variational autoencoders offer means to characterize large datasets and have been used effectively to characterize features of single-cell datasets. Here we extend these methods for use with gene expression time series data. We present RVAgene: a recurrent variational autoencoder to model gene expression dynamics. RVAgene learns to accurately and efficiently reconstruct temporal gene profiles. It also learns a low-dimensional representation of the data via a recurrent encoder network that can be used for biological feature discovery, and from which we can generate new gene expression data by sampling the latent space. We test RVAgene on simulated and real biological datasets, including embryonic stem cell differentiation and kidney injury response dynamics. In all cases, RVAgene accurately reconstructed complex gene expression temporal profiles. Via cross-validation, we show that a low-error latent space representation can be learnt using only a fraction of the data. Through clustering and gene ontology term enrichment analysis on the latent space, we demonstrate the potential of RVAgene for unsupervised discovery. In particular, RVAgene identifies new programs of shared gene regulation of Lox family genes in response to kidney injury. RVAgene is available in Python on GitHub: https://github.com/maclean-lab/RVAgene; Zenodo archive: http://doi.org/10.5281/zenodo.4271097. Supplementary data are available at Bioinformatics online.
Project description: Background: One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies. Results: We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in the regions. By inferring CNV regions across many datasets we were able to identify 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments. Conclusions: To our knowledge, this is the first effort to systematically identify and validate drivers for expression-based CNV regions in breast cancer. The framework, in which wavelet analysis of expression-based copy number alteration is coupled with gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and other novel types of cancer data such as next-generation sequencing based expression (RNA-Seq) as well as CNV data.