Meta-analysis of gene expression signatures reveals hidden links among diverse biological processes in Arabidopsis.
ABSTRACT: The model plant Arabidopsis has been well-studied using high-throughput genomics technologies, which usually generate lists of differentially expressed genes under various conditions. Our group recently collected 1065 gene lists from 397 gene expression studies as a knowledgebase for pathway analysis. Here we systematically analyzed these gene lists by computing overlaps in all-vs.-all comparisons. We identified 16,261 statistically significant overlaps, represented by an undirected network in which nodes correspond to gene lists and edges indicate significant overlaps. The network highlights the correlation across the gene expression signatures of the diverse biological processes. We also partitioned the main network into 20 sub-networks, representing groups of highly similar expression signatures. These are common sets of genes that were co-regulated under different treatments or conditions and are often related to specific biological themes. Overall, our result suggests that diverse gene expression signatures are highly interconnected in a modular fashion.
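The all-vs.-all comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a one-sided hypergeometric test for each pair of gene lists and a Bonferroni correction, neither of which is stated in the abstract, and the names `hypergeom_sf`, `overlap_network`, `universe_size` and `alpha` are hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from a universe of M genes, n of which are marked."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def overlap_network(gene_lists, universe_size, alpha=0.05):
    """Return edges (list_a, list_b, overlap_size, p_value) for significant overlaps.

    gene_lists maps a list name to a set of gene identifiers.
    """
    pairs = list(combinations(sorted(gene_lists), 2))
    edges = []
    for a, b in pairs:
        set_a, set_b = gene_lists[a], gene_lists[b]
        k = len(set_a & set_b)
        if k == 0:
            continue
        p = hypergeom_sf(k, universe_size, len(set_a), len(set_b))
        # Assumption: Bonferroni correction over all pairwise comparisons.
        if p * len(pairs) <= alpha:
            edges.append((a, b, k, p))
    return edges


# Toy gene lists (integers stand in for gene identifiers).
toy_lists = {
    "cold": set(range(0, 20)),
    "drought": set(range(5, 25)),
    "light": set(range(100, 120)),
}
toy_edges = overlap_network(toy_lists, universe_size=1000)
```

Each returned edge corresponds to one edge of the undirected network, with the gene lists as nodes.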
Project description:Cells must respond to various perturbations using their limited available gene repertoires. In order to study how cells coordinate various responses, we conducted a comprehensive comparison of 1,186 gene expression signatures (gene lists) associated with various genetic and chemical perturbations. We identified 7,419 statistically significant overlaps between various published gene lists. Most (80%) of the overlaps can be represented by a highly connected network, a "molecular signature map," that highlights the correlation of various expression signatures. By dissecting this network, we identified sub-networks that define clusters of gene sets related to common biological processes (cell cycle, immune response, etc.). Examination of these sub-networks has confirmed relationships among various pathways and also generated new hypotheses. For example, our result suggests that glutamine deficiency might suppress cellular growth by inhibiting the MYC pathway. Interestingly, we also observed 1,369 significant overlaps between a set of genes upregulated by factor X and a set of genes downregulated by factor Y, suggesting a repressive interaction between X and Y factors. Our results suggest that molecular-level responses to diverse chemical and genetic perturbations are heavily interconnected in a modular fashion. Also, shared molecular pathways can be identified by comparing newly defined gene expression signatures with databases of previously published gene expression signatures.
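The statement that most overlaps fall into one highly connected network can be checked with a connected-component search. The sketch below assumes the significant overlaps are already available as pairs of gene-list names; the function name and the toy edge list are hypothetical.

```python
from collections import defaultdict, deque


def largest_component_edge_fraction(edges):
    """Return (nodes of the largest connected component, fraction of edges inside it)."""
    if not edges:
        return set(), 0.0
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component

    # Both endpoints of an edge share a component, so checking one is enough.
    inside = sum(1 for a, _ in edges if a in best)
    return best, inside / len(edges)


toy_overlaps = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("X", "Y")]
main_network, fraction = largest_component_edge_fraction(toy_overlaps)
```

In this toy input, four of the five overlaps belong to the main network.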
Project description:Gene expression technology has become a routine application in many laboratories and has provided large amounts of gene expression signatures that have been identified in a variety of cancer types. Interpretation of gene expression signatures would profit from the availability of a procedure capable of assigning differentially regulated genes or entire gene signatures to defined cancer signaling pathways. Here we describe a graph-based approach that identifies cancer signaling pathways from published gene expression signatures. Published gene expression signatures are collected in a database (PubLiME: Published Lists of Microarray Experiments) enabled for cross-platform gene annotation. Significant co-occurrence modules composed of up to 10 genes in different gene expression signatures are identified. Significantly co-occurring genes are linked by an edge in an undirected graph. Edge-betweenness and k-clique clustering combined with graph modularity as a quality measure are used to identify communities in the resulting graph. The identified communities consist of cell cycle, apoptosis, phosphorylation cascade, extra cellular matrix, interferon and immune response regulators as well as communities of unknown function. The genes constituting different communities are characterized by common genomic features and strongly enriched cis-regulatory modules in their upstream regulatory regions that are consistent with pathway assignment of those genes.
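As a rough illustration of the graph construction and of modularity as a quality measure, the sketch below links two genes when they co-occur in at least `min_count` signatures and then scores a given partition with the standard modularity formula for unweighted graphs. The count threshold is a stand-in for the significance test on co-occurrence modules, which the abstract does not detail. The edge-betweenness and k-clique clustering steps are not reproduced. All function names and gene labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(signatures, min_count=2):
    """Link two genes if they appear together in at least min_count signatures."""
    counts = Counter()
    for genes in signatures:
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_count}


def modularity(edges, communities):
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2.

    L_c is the number of edges inside community c, d_c the summed degree of
    its nodes, and m the total number of edges.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    membership = {g: i for i, comm in enumerate(communities) for g in comm}
    internal = Counter()
    degree_sum = Counter()
    for a, b in edges:
        degree_sum[membership[a]] += 1
        degree_sum[membership[b]] += 1
        if membership[a] == membership[b]:
            internal[membership[a]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Toy signatures with placeholder gene labels.
toy_signatures = [
    ["cc1", "cc2", "cc3"],
    ["cc1", "cc2", "cc3", "if1"],
    ["if1", "if2", "if3"],
    ["if1", "if2", "if3", "cc2"],
]
toy_graph = cooccurrence_edges(toy_signatures)
q_split = modularity(toy_graph, [{"cc1", "cc2", "cc3"}, {"if1", "if2", "if3"}])
q_single = modularity(toy_graph, [{"cc1", "cc2", "cc3", "if1", "if2", "if3"}])
```

A partition that separates the two triangles scores higher than one that lumps all genes together, which is the sense in which modularity guides the choice of communities.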
Project description:All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies. The newly updated DAVID Bioinformatics Resources consists of the DAVID Knowledgebase and five integrated, web-based functional annotation tool suites: the DAVID Gene Functional Classification Tool, the DAVID Functional Annotation Tool, the DAVID Gene ID Conversion Tool, the DAVID Gene Name Viewer and the DAVID NIAID Pathogen Genome Browser. The expanded DAVID Knowledgebase now integrates almost all major and well-known public bioinformatics resources centralized by the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of diverse gene/protein identifiers and annotation terms from a variety of public bioinformatics databases. For any uploaded gene list, the DAVID Resources now provides not only the typical gene-term enrichment analysis, but also new tools and functions that allow users to condense large gene lists into gene functional groups, convert between gene/protein identifiers, visualize many-genes-to-many-terms relationships, cluster redundant and heterogeneous terms into groups, search for interesting and related genes or terms, dynamically view genes from their lists on bio-pathways and more. With DAVID (http://david.niaid.nih.gov), investigators gain more power to interpret the biological mechanisms associated with large gene lists.
Project description:Studying plants using high-throughput genomics technologies is becoming routine, but interpretation of genome-wide expression data in terms of biological pathways remains a challenge, partly due to the lack of pathway databases. To create a knowledgebase for plant pathway analysis, we collected 1683 lists of differentially expressed genes from 397 gene-expression studies, which constitute a molecular signature database of various genetic and environmental perturbations of Arabidopsis. In addition, we extracted 1909 gene sets from various sources such as Gene Ontology, KEGG, AraCyc, Plant Ontology, predicted target genes of microRNAs and transcription factors, and computational gene clusters defined by meta-analysis. With this knowledgebase, we applied Gene Set Enrichment Analysis to an expression profile of cold acclimation and identified expected functional categories and pathways. Our results suggest that the AraPath database can be used to generate specific, testable hypotheses regarding plant molecular pathways from gene expression data. Availability: http://bioinformatics.sdstate.edu/arapath/.
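For readers unfamiliar with Gene Set Enrichment Analysis, the core running-sum statistic can be sketched as below. This is the unweighted variant only (every hit contributes equally), without the permutation-based significance step; the function name and toy ranking are hypothetical and nothing here is specific to AraPath.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up by 1/n_hit at each gene in the set
    and down by 1/n_miss otherwise; return the largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking from most up-regulated to most down-regulated gene.
toy_ranking = ["a", "b", "c", "d", "e", "f"]
es_top = enrichment_score(toy_ranking, {"a", "b"})
es_bottom = enrichment_score(toy_ranking, {"e", "f"})
```

A set concentrated at the top of the ranking gives a positive score and a set concentrated at the bottom gives a negative one.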
Project description:BACKGROUND: DNA microarray technology has had a great impact on muscle research, and microarray gene expression data have been widely used to identify gene signatures characteristic of the studied conditions. With the rapid accumulation of muscle microarray data, it is of great interest to understand how to compare and combine data across multiple studies. Meta-analysis of transcriptome data is a valuable method for doing so. It makes it possible to highlight conserved gene signatures across multiple independent studies. However, it is hampered by the diversity of the available data: different microarray platforms, different gene nomenclature, different species studied, etc. DESCRIPTION: We have developed a system dedicated to muscle transcriptome data. This system comprises a collection of microarray data as well as a query tool. The latter allows the user to extract similar clusters of co-expressed genes from the database, using an input gene list. Common and relevant gene signatures can thus be searched more easily. The dedicated database consists of a large compendium of public data (more than 500 data sets) related to muscle (skeletal and heart). These studies cover seven animal species, both invertebrates (Drosophila melanogaster, Caenorhabditis elegans) and vertebrates (Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris, Gallus gallus). After a renormalization step, clusters of co-expressed genes were identified in each dataset. The lists of co-expressed genes were annotated using a unified re-annotation procedure. These gene lists were compared to find significant overlaps between studies. CONCLUSIONS: Applied to this large compendium of data sets, meta-analyses demonstrated that conserved patterns between species could be identified. Focusing on a specific pathology (Duchenne Muscular Dystrophy), we validated results across independent studies and revealed robust biomarkers and new pathways of interest. The meta-analyses performed with MADMuscle show the usefulness of this approach. Our method can be applied to all public transcriptome data.
Project description:Genetic, proteomic, disease and pharmacological studies have generated rich data on protein interaction, disease regulation and drug activities that are useful for systems-level study of biological, disease and drug therapeutic processes. These studies are facilitated by established and emerging computational methods. More recently, network descriptors developed in other disciplines have been increasingly used for studying protein-protein, gene regulation, metabolic and disease networks. Coverage of these useful network features in public web servers is inadequate. We therefore introduced up to 313 literature-reported network descriptors in the PROFEAT web server, for describing the topological, connectivity and complexity characteristics of undirected unweighted (uniform binding constants and molecular levels), undirected edge-weighted (varying binding constants), undirected node-weighted (varying molecular levels), undirected edge-node-weighted (varying binding constants and molecular levels) and directed unweighted (oriented process) networks. The usefulness of the PROFEAT computed network descriptors is illustrated by their literature-reported applications in studying protein-protein, gene regulatory, gene co-expression, protein-drug and metabolic networks. PROFEAT is accessible free of charge at http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi.
Project description:The guilt by association (GBA) algorithm has been widely used to statistically predict gene functions, and a network-based approach increases the confidence and veracity of identifying molecular signatures for diseases. This work proposed a network-based GBA method, integrating the GBA algorithm with a network, to identify seed gene functions for progressive diabetic neuropathy (PDN). The prediction of seed gene functions comprised three steps: i) preparing gene lists and sets; ii) constructing a co-expression matrix (CEM) on gene lists by the Spearman correlation coefficient (SCC) method; and iii) predicting gene functions by the GBA algorithm. Ultimately, seed gene functions were selected according to the area under the receiver operating characteristics curve (AUC) index. A total of 79 differentially expressed genes (DEGs) and 40 background gene ontology (GO) terms were regarded as gene lists and sets for the subsequent analyses, respectively. The results obtained from the network-based GBA approach showed that 27.5% of all gene sets had good classification performance with AUC >0.5. Most significantly, 3 gene sets with AUC >0.6 were denoted as seed gene functions for PDN, including binding, molecular function and regulation of the metabolic process. In summary, we predicted 3 seed gene functions for PDN compared with non-progressors utilizing the network-based GBA algorithm. The findings provide insights into the pathological and molecular mechanisms underlying PDN.
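The three steps can be sketched roughly as follows. This is a simplified reading and not the authors' implementation: each gene is scored by its mean Spearman correlation with the other members of a gene set, and the AUC measures how well that score separates members from non-members. The scoring rule, the function names and the toy expression values are all assumptions.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def gba_auc(expression, gene_set):
    """Score genes by mean correlation to the other set members, then compute the AUC."""
    scores, labels = [], []
    for gene in sorted(expression):
        others = [m for m in gene_set if m != gene and m in expression]
        total = sum(spearman(expression[gene], expression[m]) for m in others)
        scores.append(total / len(others) if others else 0.0)
        labels.append(gene in gene_set)
    return auc(scores, labels)


# Toy expression profiles (gene -> values across five samples).
toy_expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [1, 3, 2, 5, 4],
    "g4": [5, 4, 3, 2, 1],
    "g5": [5, 3, 4, 1, 2],
    "g6": [3, 1, 5, 2, 4],
}
toy_auc = gba_auc(toy_expression, {"g1", "g2", "g3"})
```

An AUC near 0.5 means co-expression carries no information about membership in the gene set, which is why the abstract uses 0.5 and 0.6 as thresholds.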
Project description:BACKGROUND: Gene expression connectivity mapping has gained much popularity in recent years, with a number of successful applications in biomedical research testifying to its utility and promise. A major application of connectivity mapping is the identification of small molecule compounds capable of inhibiting a disease state. In this study, we are additionally interested in small molecule compounds that may enhance a disease state or increase the risk of developing that disease. Using breast cancer as a case study, we aim to develop and test a methodology for identifying commonly prescribed drugs that may have a suppressing or inducing effect on the target disease (breast cancer). RESULTS: We obtained from public data repositories a collection of breast cancer gene expression datasets with over 7000 patients. An integrated meta-analysis approach to gene expression connectivity mapping was developed, which involved unified processing and normalization of raw gene expression data, systematic removal of batch effects, and multiple runs of balanced sampling for differential expression analysis. Stringently selected differentially expressed genes were used to construct multiple non-joint gene signatures representing the same biological state. Remarkably, these non-joint gene signatures retrieved from connectivity mapping separate lists of candidate drugs with significant overlaps, providing high confidence in their predicted effects on breast cancers. Of particular note, among the top 26 compounds identified as inversely connected to the breast cancer gene signatures, 14 are known anti-cancer drugs. CONCLUSIONS: A few candidate drugs with potential to enhance breast cancer or increase the risk of the disease were also identified; further investigation on a large population is required to firmly establish their effects on breast cancer risks. This work thus provides a novel approach and an applicable example for identifying medications with potential to alter cancer risks through gene expression connectivity mapping.
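The idea of a drug being "inversely connected" to a disease signature can be illustrated with a simplified signed-rank connection score. This is a sketch under assumptions, not necessarily the statistic used in the study: genes in a drug's reference profile are ranked by absolute log fold change, the rank keeps the sign of the change, and the score is the normalized sum of rank times query sign. The function names and toy profiles are hypothetical.

```python
def signed_ranks(profile):
    """Rank genes by |log fold change| (largest gets rank n) and keep the sign."""
    ordered = sorted(profile, key=lambda g: abs(profile[g]))
    result = {}
    for i, gene in enumerate(ordered):
        value = profile[gene]
        sign = 1 if value > 0 else -1 if value < 0 else 0
        result[gene] = (i + 1) * sign
    return result


def connection_score(query, profile):
    """Score in [-1, 1]; query maps gene -> +1 (up in disease) or -1 (down)."""
    ranked = signed_ranks(profile)
    n = len(ranked)
    matched = [g for g in query if g in ranked]
    if not matched:
        return 0.0
    raw = sum(ranked[g] * query[g] for g in matched)
    # Largest possible |raw|: the matched genes occupy the top ranks with agreeing signs.
    max_raw = sum(n - i for i in range(len(matched)))
    return raw / max_raw


# Toy data: a disease signature and two drug-induced expression profiles.
disease_signature = {"g1": 1, "g2": -1}
mimicking_drug = {"g1": 3.0, "g2": -2.0, "g3": 0.5, "g4": -0.1}
reversing_drug = {gene: -value for gene, value in mimicking_drug.items()}

score_mimic = connection_score(disease_signature, mimicking_drug)
score_reverse = connection_score(disease_signature, reversing_drug)
```

Under this convention a strongly negative score flags a candidate suppressor of the disease state, and a strongly positive score flags a compound that may enhance it.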
Project description:The Reactome Knowledgebase (www.reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations (an extended version of a classic metabolic map) in a single consistent data model. Reactome functions both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data such as gene expression pattern surveys or somatic mutation catalogues from tumour cells. Over the last two years we redeveloped major components of the Reactome web interface to improve usability, responsiveness and data visualization. A new pathway diagram viewer provides a faster, clearer interface and smooth zooming from the entire reaction network to the details of individual reactions. Tool performance for analysis of user datasets has been substantially improved, now generating detailed results for genome-wide expression datasets within seconds. The analysis module can now be accessed through a RESTful interface, facilitating its inclusion in third party applications. A new overview module allows the visualization of analysis results on a genome-wide Reactome pathway hierarchy using a single screen page. The search interface now provides auto-completion as well as a faceted search to narrow result lists efficiently.
Project description:Several methods were developed to mine gene-gene relationships from expression data. Examples include correlation and mutual information methods for coexpression analysis, clustering and undirected graphical models for functional assignments, and directed graphical models for pathway reconstruction. Using an encoding for gene expression data, followed by deep neural networks analysis, we present a framework that can successfully address all of these diverse tasks. We show that our method, convolutional neural network for coexpression (CNNC), improves upon prior methods in tasks ranging from predicting transcription factor targets to identifying disease-related genes to causality inference. CNNC's encoding provides insights about some of the decisions it makes and their biological basis. CNNC is flexible and can easily be extended to integrate additional types of genomics data, leading to further improvements in its performance.
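The abstract does not spell out the encoding, so the sketch below rests on an assumption: that each gene pair is summarized as a normalized two-dimensional histogram of the two genes' expression values across samples, giving a fixed-size matrix that a convolutional network could take as input. The function name, the `bins` parameter and the toy values are hypothetical, and the neural network itself is not shown.

```python
def pair_histogram(x, y, bins=8):
    """Normalized 2D histogram of paired expression values (rows: gene x, columns: gene y)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")

    def bin_index(value, low, high):
        if high == low:
            return 0
        index = int((value - low) / (high - low) * bins)
        return min(index, bins - 1)  # the maximum value falls in the last bin

    low_x, high_x = min(x), max(x)
    low_y, high_y = min(y), max(y)
    grid = [[0.0] * bins for _ in range(bins)]
    for a, b in zip(x, y):
        grid[bin_index(a, low_x, high_x)][bin_index(b, low_y, high_y)] += 1
    total = len(x)
    return [[count / total for count in row] for row in grid]


# Toy example: two perfectly co-expressed genes fill the diagonal.
toy_grid = pair_histogram([0, 1, 2, 3], [0, 1, 2, 3], bins=4)
```

Because the output shape depends only on `bins`, every gene pair yields an input of the same size regardless of how many samples were measured.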