Project description: Methylation is closely involved in the development of various carcinomas. However, few datasets are available for small cell lung carcinoma (SCLC) due to the scarcity of fresh tumor samples. The aim of this study is to investigate the comprehensive genome-wide methylation profile of SCLC to predict the prognosis after surgical treatment. We investigated sites with high DNA methylation and low gene expression using 25 SCLC tumor tissues. First, we selected the most differentially methylated CpG sites across the tumor tissues. Following hierarchical clustering (HC) and non-negative matrix factorization (NMF), gene ontology analysis was performed using DAVID software. Clustering of SCLC tumors led to the identification of a CpG island methylator phenotype (CIMP) of SCLC, and showed that CIMP-high tumors had a significantly poorer prognosis (p = 0.001). Multivariate analysis revealed that postoperative chemotherapy, low neuroendocrine expression and the CIMP-low state were significant favorable prognostic factors. Association analyses of methylation and gene expression provided 46 genes with significant correlation. Ontology analysis of these genes showed that genes involved in the extrinsic apoptosis pathway, including TNFRSF1A, TNFRSF10A and TRADD, were suppressed in CIMP-high tumors, which had a poorer prognosis. Comprehensive DNA methylation profiling identified two distinct subgroups, suggesting that a CIMP of SCLC could be a useful marker for determining treatment. Delineation of this phenotype may also be useful for the development of novel apoptosis-related chemotherapeutic agents for the treatment of an aggressive subtype of SCLC.
Project description: Clustering is a common methodology for the analysis of array data, and many research laboratories are generating array data with repeated measurements. We evaluated several clustering algorithms that incorporate repeated measurements, and show that algorithms that take advantage of repeated measurements yield more accurate and more stable clusters. In particular, we show that the infinite mixture model-based approach with a built-in error model produces superior results.
Project description: Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
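The conjugate prior mentioned above can be illustrated with a minimal sketch of the standard normal-gamma posterior update for one-dimensional Gaussian data. This is not the GBHC implementation; the class name `NormalGamma`, the function `posterior`, and the hyperparameter names are assumptions chosen for illustration, and only the textbook update equations are shown.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class NormalGamma:
    """Hypothetical container for normal-gamma hyperparameters."""
    mu: float     # prior guess for the component mean
    kappa: float  # strength (pseudo-count) of that guess
    alpha: float  # shape of the Gamma prior on the precision
    beta: float   # rate of the Gamma prior on the precision


def posterior(prior: NormalGamma, xs: list[float]) -> NormalGamma:
    """Textbook conjugate update after observing the values in xs."""
    n = len(xs)
    if n == 0:
        return prior
    xbar = fmean(xs)
    # Sum of squared deviations from the sample mean.
    ss = sum((x - xbar) ** 2 for x in xs)
    kappa_n = prior.kappa + n
    mu_n = (prior.kappa * prior.mu + n * xbar) / kappa_n
    alpha_n = prior.alpha + n / 2
    beta_n = (
        prior.beta
        + 0.5 * ss
        + prior.kappa * n * (xbar - prior.mu) ** 2 / (2 * kappa_n)
    )
    return NormalGamma(mu_n, kappa_n, alpha_n, beta_n)
```

Because the posterior has the same form as the prior, the marginal likelihood of a candidate cluster has a closed form, which is what makes Bayesian model selection between merge hypotheses tractable in this family of methods.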
Project description: Background: A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus the Pearson's correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species. Results: In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency. Conclusion: Functional information of annotated genes available from various GO databases mined using ontology tools can be used to systematically judge the results of an unsupervised clustering algorithm as applied to a gene expression data set in clustering genes. This information could be used to select the right algorithm from a class of clustering algorithms for the given data set.
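The abstract does not give a formula for the BHI, so the following is only a sketch of one common formulation: for each cluster, take the fraction of annotated gene pairs that share at least one functional class, then average over clusters. The function name, the input shapes, and the choice to skip clusters with fewer than two annotated genes are all assumptions made for illustration.

```python
from itertools import combinations


def biological_homogeneity_index(
    clusters: list[list[str]],
    annotations: dict[str, set[str]],
) -> float:
    """Average, over clusters, of the share of annotated gene pairs
    that have at least one functional class in common."""
    scores = []
    for genes in clusters:
        # Only genes with at least one known functional class are compared.
        annotated = [g for g in genes if annotations.get(g)]
        pairs = list(combinations(annotated, 2))
        if not pairs:
            # Assumption: clusters with fewer than two annotated genes
            # are left out of the average rather than scored as zero.
            continue
        shared = sum(1 for a, b in pairs if annotations[a] & annotations[b])
        scores.append(shared / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up gene and class names.
example_clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
example_annotations = {
    "g1": {"cell cycle"},
    "g2": {"cell cycle"},
    "g3": {"transport"},
    "g4": {"apoptosis"},
    "g5": {"apoptosis"},
}
bhi = biological_homogeneity_index(example_clusters, example_annotations)
```

A value near 1 means that genes placed together tend to share a functional class; comparing this value across algorithms on the same data set is the use the abstract describes.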
Project description: Many studies have established gene expression-based prognostic signatures for lung cancer. All of these signatures were built from training data sets by learning the correlation of gene expression with the patients' survival time. They require all new sample data to be normalized to the training data, ultimately resulting in common problems of low reproducibility and impracticality. To overcome these problems, we propose a new signature model which does not involve data training. We hypothesize that the imbalance of two opposing effects in lung cancer cells, represented by Yin and Yang genes, determines a patient's prognosis. We selected the Yin and Yang genes by comparing expression data from normal lung and lung cancer tissue samples using both unsupervised clustering and pathways analyses. We calculated the Yin and Yang gene expression mean ratio (YMR) as patient risk scores. Thirty-one Yin and thirty-two Yang genes were identified and selected for the signature development. In normal lung tissues, the YMR is less than 1.0; in lung cancer cases, the YMR is greater than 1.0. The YMR was tested for lung cancer prognosis prediction in four independent data sets and it significantly stratified patients into high- and low-risk survival groups (p = 0.02, HR = 2.72; p = 0.01, HR = 2.70; p = 0.007, HR = 2.73; p = 0.005, HR = 2.63). It also showed prediction of the chemotherapy outcomes for stage II & III. In multivariate analysis, the YMR risk factor was more successful at predicting clinical outcomes than other commonly used clinical factors, with the exception of tumor stage. The YMR can be measured in an individual patient in the clinic independent of gene expression platform. This study provided a novel insight into the biology of lung cancer and shed light on the clinical applicability.
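As a rough sketch of how a per-patient YMR might be computed, the snippet below divides the mean expression of the Yin genes by the mean expression of the Yang genes. The abstract does not state which group is the numerator, which mean is used, or how expression is scaled, so the Yin-over-Yang orientation, the arithmetic mean, the requirement of positive linear-scale values, and all gene names here are assumptions.

```python
from statistics import fmean


def yin_yang_mean_ratio(
    expression: dict[str, float],
    yin_genes: list[str],
    yang_genes: list[str],
) -> float:
    """Mean expression of Yin genes over mean expression of Yang genes
    for a single sample (assumed orientation and scale, see lead-in)."""
    yin = [expression[g] for g in yin_genes if g in expression]
    yang = [expression[g] for g in yang_genes if g in expression]
    if not yin or not yang:
        raise ValueError("sample lacks measurements for one gene group")
    yang_mean = fmean(yang)
    if yang_mean <= 0:
        raise ValueError("Yang mean must be positive to form a ratio")
    return fmean(yin) / yang_mean


# Hypothetical sample with placeholder gene names and values.
sample = {"YIN_A": 4.0, "YIN_B": 6.0, "YANG_A": 2.0, "YANG_B": 2.0}
ymr = yin_yang_mean_ratio(sample, ["YIN_A", "YIN_B"], ["YANG_A", "YANG_B"])
```

Because the score is a ratio computed within one sample, it needs no normalization against a training cohort, which matches the motivation given in the abstract.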
Project description: Background: Lung adenocarcinoma (LUAD) is a subtype of lung cancer with high morbidity and mortality. While genotyping is an important determinant for the prognosis of LUAD patients, there is a paucity of studies on gene set-based expression (GSE) typing for LUAD. This current study used GSE methodology to perform gene typing of LUAD patients. Methods: Clinical and genomic information for the LUAD patients was downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. Patients with LUAD were clustered into different molecular subtypes depending on the clinical and gene set expression characteristics. The survival rate and silhouette widths were compared between each molecular subtype. Differences in survival rate between gene sets were analyzed using Kaplan-Meier survival curves. Cox regression and Lasso regression were used to establish the prognostic gene set model based on the TCGA database, and the results were validated using the GEO dataset. Results: A total of 10 hub genes were finally identified and clustered into 3 subtypes with a mean contour width of 0.96. There were significant differences in survival rates among the 3 subtypes (P<0.05). Gene Ontology (GO) analysis indicated that the related biological processes (BP) were mainly involved in regulation of cell cycle, mitotic cell cycle phase transition, and proteasome-mediated ubiquitin-dependent protein catabolic process. The cellular components (CC) were related to the spindle, chromosomal region, and midbody. Molecular function (MF) mainly focused on ubiquitin-like protein ligase binding, translation regulator activity, and oxidation activity. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis showed that the main pathways included the Epstein-Barr virus infection pathway of neurogeneration, the p53 signaling pathway, and the proteome pathways. In addition, the protein-protein interaction network was analyzed using the STRING and Cytoscape software, and the top 9 hub genes identified were KIF2C, DLGAP5, KIF20A, PSMC1, PSMD1, PSMB7, SNAI2, FGF13, and BMP2. Conclusions: Patients with LUAD can be clustered into three subtypes based on the expression of gene sets. These findings contribute to understanding the pathogenesis and molecular mechanisms in LUAD, and may lead to potential individualized pharmacogenetic therapy for patients with LUAD.
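The silhouette widths used above to compare subtypes can be sketched as follows, assuming Euclidean distance and the usual definition s = (b - a) / max(a, b), where a is a sample's mean distance to its own cluster and b is its smallest mean distance to any other cluster. The function name and the toy data are illustrative only; the study's actual distance measure and software are not stated in the abstract.

```python
from math import dist
from statistics import fmean


def mean_silhouette_width(
    points: list[tuple[float, ...]],
    labels: list[int],
) -> float:
    """Average silhouette width over all samples (Euclidean distance)."""
    clusters: dict[int, list[tuple[float, ...]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster;
        # the point's zero distance to itself does not change the sum.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            fmean([dist(point, q) for q in other])
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return fmean(widths)


# Two well-separated toy clusters on a line.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
width = mean_silhouette_width(toy_points, toy_labels)
```

Values close to 1 indicate compact, well-separated clusters, which is how a mean width such as the reported 0.96 is usually read.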
Project description: Introduction: The potential of applying data analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with the double objective of discovering patterns characteristic for cases with good or poor outcome, using them for accurate and justifiable predictions; and deriving novel information about the role of genes, the existence of special classes of cases, and other factors. Method: Data were analyzed using the combinatorics and optimization-based method of LAD, recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines. Results: LAD identified a subset of 17 of the 25,000 genes, capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods, and has similar or better accuracy than those reported in other studies. Out of the 17 genes identified by LAD, three (respectively, five) were shown to play a significant role in determining poor (respectively, good) prognosis. Two new classes of patients (described by similar sets of covering patterns, gene expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and the test sets of van 't Veer have differing characteristics. Conclusion: The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data (that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient). Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.
Project description: Background: DNA microarrays, which determine the expression levels of tens of thousands of genes from a sample, are an important research tool. However, the volume of data they produce can be an obstacle to interpretation of the results. Clustering the genes on the basis of similarity of their expression profiles can simplify the data, and potentially provides an important source of biological inference, but these methods have not been tested systematically on datasets from complex human tissues. In this paper, four clustering methods, CRC, k-means, ISA and memISA, are applied to three brain expression datasets. The results are compared on speed, gene coverage and GO enrichment. The effects of combining the clusters produced by each method are also assessed. Results: k-means outperforms the other methods, with 100% gene coverage and GO enrichments only slightly exceeded by memISA and ISA. Those two methods produce greater GO enrichments on the datasets used, but at the cost of much lower gene coverage, fewer clusters produced, and lower speed. The clusters they find are largely different to those produced by k-means. Combining clusters produced by k-means and memISA or ISA leads to increased GO enrichment and number of clusters produced (compared to k-means alone), without negatively impacting gene coverage. memISA can also find potentially disease-related clusters. In two independent dorsolateral prefrontal cortex datasets, it finds three overlapping clusters that are either enriched for genes associated with schizophrenia, genes differentially expressed in schizophrenia, or both. Two of these clusters are enriched for genes of the MAP kinase pathway, suggesting a possible role for this pathway in the aetiology of schizophrenia. Conclusion: Considered alone, k-means clustering is the most effective of the four methods on typical microarray brain expression datasets. However, memISA and ISA can add extra high-quality clusters to the set produced by k-means, so combining these three methods is the method of choice.