Mining metabolic pathways through gene expression.
ABSTRACT: MOTIVATION: An observed metabolic response is the result of the coordinated activation of, and interaction between, multiple genetic pathways. However, because of the complex structure of metabolism, it is not fully understood which pathways are required to produce an observed metabolic response. In this article, we propose an approach that can identify the genetic pathways which dictate the response of a metabolic network to specific experimental conditions. RESULTS: Our approach is a combination of probabilistic models for pathway ranking, clustering and classification. First, we use a non-parametric pathway extraction method to identify the most highly correlated paths through the metabolic network. We then extract the defining structure within these top-ranked pathways using both Markov clustering and classification algorithms. Furthermore, we define detailed node and edge annotations, which enable us to track each pathway with respect to its genetic dependencies and also allow for an analysis of the interacting reactions, compounds and KEGG sub-networks. We show that our approach identifies biologically meaningful pathways within two microarray expression datasets using entire KEGG metabolic networks. AVAILABILITY AND IMPLEMENTATION: An R package containing a full implementation of our proposed method is currently available from http://www.bic.kyoto-u.ac.jp/pathway/timhancock.
Project description:MOTIVATION: Metabolic pathway analysis is crucial not only in metabolic engineering but also in rational drug design. However, the biosynthetic/biodegradation pathways are known only for a small portion of metabolites, and a vast number of pathways remain uncharacterized. Therefore, an important challenge in metabolomics is the de novo reconstruction of potential reaction networks on a metabolome-scale. RESULTS: In this article, we develop a novel method to predict the multistep reaction sequences for de novo reconstruction of metabolic pathways in the reaction-filling framework. We propose a supervised approach to learn what we refer to as 'multistep reaction sequence likeness', i.e. whether the two compounds of a compound-compound pair can possibly be converted into each other by a sequence of enzymatic reactions. In the algorithm, we propose a recursive procedure of using step-specific classifiers to predict the intermediate compounds in the multistep reaction sequences, based on chemical substructure fingerprints/descriptors of compounds. We further demonstrate the usefulness of our proposed method on the prediction of enzymatic reaction networks from a metabolome-scale compound set and discuss characteristic features of the extracted chemical substructure transformation patterns in multistep reaction sequences. Our comprehensively predicted reaction networks help to fill the metabolic gap and to infer new reaction sequences in metabolic pathways. AVAILABILITY AND IMPLEMENTATION: Materials are available for free at http://web.kuicr.kyoto-u.ac.jp/supp/kot/ismb2014/
Project description:With the continued proliferation of high-throughput biological experiments, there is a pressing need for tools to integrate the data produced in ways that produce biologically meaningful conclusions. Many microarray studies have analysed transcriptomic data from a pathway perspective, for instance by testing for KEGG pathway enrichment in sets of upregulated genes. However, the increasing availability of species-specific metabolic models provides the opportunity to analyse these data in a more objective, system-wide manner. Here we introduce ambient (Active Modules for Bipartite Networks), a simulated annealing approach to the discovery of metabolic subnetworks (modules) that are significantly affected by a given genetic or environmental change. The metabolic modules returned by ambient are connected parts of the bipartite network that change coherently between conditions, providing a more detailed view of metabolic changes than standard approaches based on pathway enrichment. ambient is an effective and flexible tool for the analysis of high-throughput data in a metabolic context. The same approach can be applied to any system in which reactions (or metabolites) can be assigned a score based on some biological observation, without the limitation of predefined pathways. A Python implementation of ambient is available at http://www.theosysbio.bio.ic.ac.uk/ambient.
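The idea of searching for a connected, high-scoring module by simulated annealing can be illustrated with a minimal sketch. This is not ambient's implementation: the toy bipartite network, the node scores, the additive module score, the single-node toggle move and the geometric cooling schedule are all assumptions made for illustration, and the names `is_connected` and `anneal_module` are hypothetical.

```python
import math
import random


def is_connected(nodes, adjacency):
    """Return True if `nodes` is non-empty and induces a connected subgraph."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes


def anneal_module(adjacency, scores, seed_node, steps=2000,
                  t_start=1.0, t_end=0.01, rng=None):
    """Search for a connected node set with a high summed score (assumed objective)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    module = {seed_node}
    current = scores[seed_node]
    best, best_score = set(module), current
    all_nodes = sorted(adjacency)
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        node = rng.choice(all_nodes)
        candidate = module ^ {node}  # toggle membership of one node
        if not is_connected(candidate, adjacency):
            continue  # only connected modules are considered
        new_score = sum(scores[n] for n in candidate)
        delta = new_score - current
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            module, current = candidate, new_score
            if current > best_score:
                best, best_score = set(module), current
    return best, best_score


# Hypothetical toy network: reactions (R*) alternate with metabolites (M*).
adjacency = {
    "R1": {"M1"},
    "M1": {"R1", "R2"},
    "R2": {"M1", "M2"},
    "M2": {"R2", "R3"},
    "R3": {"M2"},
}
# Hypothetical scores: positive for nodes that change between conditions.
scores = {"R1": 2.0, "M1": -0.5, "R2": 1.5, "M2": -0.5, "R3": -3.0}

module, module_score = anneal_module(adjacency, scores, seed_node="R1")
print(sorted(module), module_score)
```

Because only connected candidates are accepted, the returned module is always a connected part of the network, which mirrors the property described above; the scoring function itself would need to be replaced by whatever statistic suits the data.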
Project description:SUMMARY: PathwayConnector is a web-tool that facilitates the construction of complementary pathway-to-pathway networks and their subnetworks, based on a reference pathway network derived from the rich pathway-mapping information available in either the KEGG or the Reactome database. Specifically, for a given set of pathways, PathwayConnector (i) finds all the direct connections between them, (ii) adds a minimum set of complementary pathways required to achieve connectivity between the pathways, leading to informative fully connected networks and (iii) provides a series of clustering methods for the further grouping of pathways into sub-clusters. The proposed web-tool is a simple yet informative tool towards identifying connected groups of pathways that are significantly related to specific diseases. AVAILABILITY AND IMPLEMENTATION: http://bioinformatics.cing.ac.cy/PathwayConnector. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
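The first two steps described above (finding direct connections, then adding complementary pathways to reach connectivity) can be sketched as follows. This is an illustrative heuristic, not PathwayConnector's algorithm: it links each query pathway to the growing network by a breadth-first shortest path, which does not guarantee a minimum set of added pathways (the exact problem is a Steiner tree problem, which is NP-hard). The reference network, the pathway names `P1` to `P5` and the function names are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns the node list from source to target, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def connect_pathways(adjacency, query):
    """Return (direct edges among the query, complementary pathways added)."""
    query = list(query)
    # Step (i): direct connections between query pathways.
    direct = {
        (a, b)
        for i, a in enumerate(query)
        for b in query[i + 1:]
        if b in adjacency.get(a, ())
    }
    # Step (ii), greedy version: attach each pathway by its shortest route.
    connected = {query[0]}
    complementary = set()
    for target in query[1:]:
        best = None
        for source in sorted(connected):
            path = shortest_path(adjacency, source, target)
            if path and (best is None or len(path) < len(best)):
                best = path
        if best is None:
            continue  # target is unreachable in the reference network
        complementary.update(n for n in best if n not in query)
        connected.update(best)
    return direct, complementary


# Hypothetical undirected reference network of pathways.
reference = {
    "P1": {"P2", "P4"},
    "P2": {"P1", "P3"},
    "P3": {"P2"},
    "P4": {"P1", "P5"},
    "P5": {"P4"},
}

direct, complementary = connect_pathways(reference, ["P1", "P2", "P5"])
print(direct)         # {('P1', 'P2')}
print(complementary)  # {'P4'}
```

In this toy case P1 and P2 are directly connected, while P5 can only be reached through P4, so P4 is the single complementary pathway added.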
Project description:Combining genetic inheritance information, for both molecular profiles and complex traits, is a promising strategy not only for detecting quantitative trait loci (QTLs) for complex traits but for understanding which genes, pathways, and biological processes are also under the influence of a given QTL. As a primary step in determining the feasibility of such an approach in humans, we present the largest survey to date, to our knowledge, of the heritability of gene-expression traits in segregating human populations. In particular, we measured expression for 23,499 genes in lymphoblastoid cell lines for members of 15 Centre d'Etude du Polymorphisme Humain (CEPH) families. Of the total set of genes, 2,340 were found to be expressed, of which 31% had significant heritability when a false-discovery rate of 0.05 was used. QTLs were detected for 33 genes on the basis of at least one P value <.000005. Of these, 13 genes possessed a QTL within 5 Mb of their physical location. Hierarchical clustering was performed on the basis of both Pearson correlation of gene expression and genetic correlation. Both reflected biologically relevant activity taking place in the lymphoblastoid cell lines, with greater coherency represented in Kyoto Encyclopedia of Genes and Genomes database (KEGG) pathways than in Gene Ontology database pathways. However, more pathway coherence was observed in KEGG pathways when clustering was based on genetic correlation than when clustering was based on Pearson correlation. As more expression data in segregating populations are generated, viewing clusters or networks based on genetic correlation measures and shared QTLs will offer potentially novel insights into the relationship among genes that may underlie complex traits.
Project description:MOTIVATION: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases, like KEGG. However, only a small fraction of genes is annotated with pathway information up to now. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database. RESULTS: We present a classification model, which for a specific gene of interest can predict the mapping to a KEGG pathway, based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to different pathways at the same time. The classification method produces a scoring of all possible mapping positions of the gene in the KEGG hierarchy. Evaluations of our model, which is a combination of a SVM and ranking perceptron approach, show a high prediction performance. Moreover, for signaling pathways we reveal that it is even possible to forecast accurately the membership to individual pathway components. AVAILABILITY: The R package gene2pathway is a supplement to this article.
Project description:BACKGROUND: In eukaryotes, the cell is divided into several compartments enclosed by unitary membranes. Such compartmentalization is critical for cells to confine different pathways to different subcellular regions. The summary and classification of subcellular localizations of metabolic pathways are the first steps towards understanding their roles in spatial differentiation and the specialization of metabolic pathways in different organisms. RESULTS: Integrating the subcellular localization of enzymes and their pathways from UniProt Knowledgebase and KEGG pathway databases, we present the first database for subcellular localization of 43014 pathways from 80676 UniProt entries and their pathway annotations from UniProt and KEGG pathway databases. To extract pathway localization across organisms, we defined 889 superpathways as clusters of basic pathways with the same pathway annotations from different organisms. Over eighty-eight percent of superpathways in the Swiss-Prot dataset occur in cytoplasm and mitochondria, and over seventy percent of UniProt superpathways have multiple localization annotations. We summarized four common reasons for the multiple localization of superpathways. Based on this database, we also discovered 88 potential transport systems between different steps of multiply localized pathways and 45 duplicated genes from 17 pathways, occurring in parallel in several locations in humans. CONCLUSIONS: PathLocdb is a free web-accessible database that enables biochemical researchers to quickly access summarized subcellular localization of pathways from UniProt and KEGG pathway databases. As the first effort to systematically integrate pathway localization, this database is very useful in discovering the variation of localization of pathways between organisms and also cross-talk between different organelles within a pathway. The PathLocdb database is available at http://pathloc.cbi.pku.edu.cn.
Project description:The quantification of experimentally-induced alterations in biological pathways remains a major challenge in systems biology. One example of this is the quantitative characterization of alterations in defined, established metabolic pathways from complex metabolomic data. At present, the disruption of a given metabolic pathway is inferred from metabolomic data by observing an alteration in the level of one or more individual metabolites present within that pathway. Not only is this approach open to subjectivity, as metabolites participate in multiple pathways, but it also ignores useful information available through the pairwise correlations between metabolites. This extra information may be incorporated using a higher-level approach that looks for alterations between a pair of correlation networks. In this way experimentally-induced alterations in metabolic pathways can be quantitatively defined by characterizing group differences in metabolite clustering. Taking this approach increases the objectivity of interpreting alterations in metabolic pathways from metabolomic data. We present and justify a new technique for comparing pairs of networks; in our case these networks are based on the same set of nodes and there are two distinct types of weighted edges. The algorithm is based on the Generalized Singular Value Decomposition (GSVD), which may be regarded as an extension of Principal Component Analysis to the case of two data sets. We show how the GSVD can be interpreted as a technique for reordering the two networks in order to reveal clusters that are exclusive to one of them. Here we apply this algorithm to a new set of metabolomic data from the prefrontal cortex (PFC) of a translational model relevant to schizophrenia, rats treated subchronically with the N-methyl-D-Aspartic acid (NMDA) receptor antagonist phencyclidine (PCP).
This provides us with a means to quantify which predefined metabolic pathways (Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolite pathway database) were altered in the PFC of PCP-treated rats. Several significant changes were discovered, notably: 1) neuroactive ligands active at glutamate and GABA receptors are disrupted in the PFC of PCP-treated animals; 2) glutamate dysfunction in these animals is not limited to compromised glutamatergic neurotransmission but also involves the disruption of metabolic pathways linked to glutamate; and 3) a specific series of purine reactions connecting xanthine, hypoxanthine, inosine, IMP and adenylosuccinate is also disrupted in the PFC of PCP-treated animals. Network reordering via the GSVD provides a means to discover statistically validated differences in clustering between a pair of networks. In practice this analytical approach, when applied to metabolomic data, allows us to quantify the alterations in metabolic pathways between two experimental groups. With this new computational technique we identified metabolic pathway alterations that are consistent with known results. Furthermore, we discovered disruption in a novel series of purine reactions that may contribute to the PFC dysfunction and cognitive deficits seen in schizophrenia.
Project description:We introduce Pathway-Informed Classification System (PICS) for classifying cancers based on tumor sample gene expression levels. PICS is a computational method capable of expeditiously elucidating both known and novel biological pathway involvement specific to various cancers and uses that learned pathway information to separate patients into distinct classes. The method clearly separates a pan-cancer dataset by tissue of origin and also sub-classifies individual cancer datasets into distinct survival classes. Gene expression values are collapsed into pathway scores that reveal which biological activities are most useful for clustering cancer cohorts into subtypes. Variants of the method allow it to be used on datasets that do and do not contain noncancerous samples. Activity levels of all types of pathways, broadly grouped into metabolic, cellular processes and signaling, and immune system, are useful for separating the pan-cancer cohort. In the clustering of specific cancer types, certain pathway types become more valuable depending on the site being studied. For lung cancer, signaling pathways dominate; for pancreatic cancer, signaling and metabolic pathways dominate; and for melanoma, immune system pathways are the most useful. This work suggests the utility of pathway-level genomic analysis and points in the direction of using pathway classification for predicting the efficacy and side effects of drugs and radiation.
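Collapsing gene expression values into pathway scores can be illustrated with a short sketch. The abstract does not specify how PICS computes its scores, so the approach below (standardise each gene across samples, then average the z-scores of each pathway's member genes) is an assumption chosen for simplicity; the gene names, pathway names and function names are hypothetical.

```python
from statistics import mean, pstdev


def zscore_genes(expression):
    """Standardise each gene across samples: {gene: [values]} -> {gene: [z-scores]}."""
    standardised = {}
    for gene, values in expression.items():
        mu, sigma = mean(values), pstdev(values)
        # Genes with no variation get a z-score of zero in every sample.
        standardised[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return standardised


def pathway_scores(expression, pathways):
    """Average the z-scores of each pathway's measured member genes, per sample."""
    z = zscore_genes(expression)
    n_samples = len(next(iter(expression.values())))
    scores = {}
    for name, genes in pathways.items():
        members = [z[g] for g in genes if g in z]  # skip unmeasured genes
        if members:
            scores[name] = [mean(m[i] for m in members) for i in range(n_samples)]
    return scores


# Hypothetical data: three genes measured in two samples.
expression = {"g1": [1.0, 3.0], "g2": [2.0, 6.0], "g3": [5.0, 1.0]}
pathways = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g_missing"]}

result = pathway_scores(expression, pathways)
print(result)  # {'pathway_A': [-1.0, 1.0], 'pathway_B': [1.0, -1.0]}
```

The resulting sample-by-pathway matrix is much smaller than the gene-level matrix and could then be passed to any clustering method to group samples.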
Project description:Microarray databases are a large source of genetic data, which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied in order to classify different types of cancer or distinguish between cancerous and non-cancerous tissue. However, microarray datasets are high-dimensional and have high levels of noise, and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA), which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher-dimensional space onto a lower-dimensional one. We have proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database. Once the manifold has been constructed, the raw microarray data is projected onto it and clustering and classification can take place. In contrast to earlier fusion-based methods, the prior knowledge from the KEGG databases is not used in, and does not bias, the classification process; it merely acts as an aid to find the best space in which to search the data. In our experiments we have found that using our new manifold method gives better classification results than using either PCA or conventional Isomap.
Project description:MOTIVATION: How to find motifs from genome-scale functional sequences, such as all the promoters in a genome, is a challenging problem. Word-based methods count the occurrences of oligomers to detect excessively represented ones. This approach is known to be fast and accurate compared with other methods. However, two problems have hampered the application of such methods to large-scale data. One is the computational cost necessary for clustering similar oligomers, and the other is the bias in the frequency of fixed-length oligomers, which complicates the detection of significant words. RESULTS: We introduce a method that uses a DNA Gray code and equiprobable oligomers, which solve the clustering problem and the oligomer bias, respectively. Our method can analyze 18 000 sequences of ~1 kbp each in 30 s. We also show that the accuracy of our method is superior to that of a leading method, especially for large-scale data and small fractions of motif-containing sequences. AVAILABILITY: The online and stand-alone versions of the application, named Hegma, are available at our website: http://www.genome.ist.i.kyoto-u.ac.jp/~ichinose/hegma/
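To illustrate why a Gray code helps with grouping similar oligomers, the sketch below builds a reflected base-4 Gray code over the DNA alphabet, in which neighbouring entries always differ at exactly one position. The abstract does not define the DNA Gray code used by Hegma, so this generic construction and the function name `dna_gray_code` are assumptions for illustration only.

```python
def dna_gray_code(k, alphabet="ACGT"):
    """List all k-mers so that consecutive entries differ at exactly one position."""
    if k == 0:
        return [""]
    shorter = dna_gray_code(k - 1, alphabet)
    words = []
    for index, letter in enumerate(alphabet):
        # Reverse the shorter list on every other letter (the "reflection"),
        # so the suffix is unchanged when the leading letter switches.
        block = shorter if index % 2 == 0 else shorter[::-1]
        words.extend(letter + suffix for suffix in block)
    return words


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


dimers = dna_gray_code(2)
print(dimers[:5])  # ['AA', 'AC', 'AG', 'AT', 'CT']
```

With this ordering, oligomers that are one substitution apart tend to sit next to each other in the list, so scanning neighbours in the ordering is a cheap way to find some of the similar words without comparing all pairs.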