A Resource of Quantitative Functional Annotation for Homo sapiens Genes.
ABSTRACT: The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and made quantitative associations of each to 4,333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, by prospective validation, and by manual evaluation against the biological literature. Functional-linkage networks (FLNs) were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented, alongside existing validated annotations, in a publicly accessible and searchable web interface.
Project description:We investigated the evidence of recent positive selection in the human phototransduction system at the single nucleotide polymorphism (SNP) and gene levels. SNP genotyping data from the International HapMap Project for European, Eastern Asian, and African populations were used to discover differences in haplotype length and allele frequency between these populations. Numeric selection metrics were computed for each SNP and aggregated into gene-level metrics to measure evidence of recent positive selection. The level of recent positive selection in phototransduction genes was evaluated and compared with a set of genes previously shown to be under recent selection and a set of highly conserved genes, serving as positive and negative controls, respectively. Six of 20 phototransduction genes evaluated had gene-level selection metrics above the 90th percentile: RGS9, GNB1, RHO, PDE6G, GNAT1, and SLC24A1. The selection signal across these genes was found to be of similar magnitude to that of the positive control genes and much greater than that of the negative control genes. There is evidence for selective pressure in the genes involved in retinal phototransduction, and traces of this selective pressure can be demonstrated using SNP-level and gene-level metrics of allelic variation. We hypothesize that the selective pressure on these genes was related to their role in low light vision and retinal adaptation to ambient light changes. Uncovering the underlying genetics of evolutionary adaptations in phototransduction allows not only greater understanding of vision and visual diseases, but also the development of patient-specific diagnostic and intervention strategies.
Project description:Diabetes is one of the most prevalent diseases in the world. Type 1 diabetes is characterized by a failure to synthesize and secrete insulin because of the destruction of pancreatic β-cells. Type 2 diabetes, on the other hand, is characterized by decreased synthesis and secretion of insulin because of defects in pancreatic β-cells, as well as by a failure to respond to insulin because of malfunctioning insulin signaling. In order to understand the signaling mechanisms of the insulin response, it is necessary to identify all components in the insulin signaling network. Here, an interaction network consisting of proteins that have a statistically high probability of being biologically related to insulin signaling in Homo sapiens was reconstructed by integrating Gene Ontology (GO) annotations and interactome data. Furthermore, within this reconstructed network, interacting proteins that mediate the signal from the insulin hormone to glucose transport were identified using linear paths. The identification of key components functioning in insulin action on glucose metabolism is crucial for efforts to prevent and treat type 2 diabetes mellitus.
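The linear-path step described above can be illustrated with a minimal sketch. It assumes the reconstructed network is available as a directed adjacency mapping and that a "linear path" means a simple (non-repeating) path from a source node to a target node. The function name, the `max_edges` cutoff, and the toy node labels are hypothetical and are not taken from the study.

```python
def linear_paths(adjacency, source, target, max_edges=6):
    """Enumerate simple paths from source to target by depth-first search.

    adjacency: dict mapping a node to an iterable of its neighbours
               (assumed representation of the interaction network).
    max_edges: hypothetical cutoff on path length, counted in edges.
    """
    paths = []

    def extend(path):
        node = path[-1]
        if node == target:
            paths.append(list(path))
            return
        if len(path) - 1 >= max_edges:
            return  # path already uses the maximum number of edges
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in path:  # keep the path simple (no revisits)
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return paths


# Toy network with placeholder node names, for illustration only.
toy_network = {
    "insulin": {"receptor"},
    "receptor": {"a", "b"},
    "a": {"transporter"},
    "b": {"a"},
}
all_paths = linear_paths(toy_network, "insulin", "transporter")
short_paths = linear_paths(toy_network, "insulin", "transporter", max_edges=3)
```

Proteins that appear on many of the enumerated paths would be natural candidates for signal mediators under this reading, although the study may rank or filter paths differently.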
Project description:Next-generation sequencing projects continue to drive a vast accumulation of metagenomic sequence data. Given the growth rate of this data, automated approaches to functional annotation are indispensable, and a cornerstone heuristic of many computational protocols is the concept of guilt by association. The guilt by association paradigm has been heavily exploited by genomic context methods, which offer functional predictions that are complementary to homology-based annotations and thereby provide a means to extend functional annotation. In particular, operon methods that exploit co-directional intergenic distances can provide homology-free functional annotation through the transfer of functions among co-operonic genes, under the assumption that guilt by association is indeed applicable. Although guilt by association is a well-accepted annotative device, its applicability to metagenomic functional annotation has not been definitively demonstrated. Here a large-scale assessment of metagenomic guilt by association is undertaken in which functional associations are predicted on the basis of co-directional intergenic distances. Specifically, functional annotations are compared within pairs of adjacent co-directional genes, as well as within operons of various lengths (i.e. number of member genes), in order to reveal new information about annotative cohesion versus operon length. The results suggest that co-directional gene pairs offer reduced confidence for metagenomic guilt by association because functional associations are difficult to resolve when intergenic distance is the sole predictor of pairwise gene interactions. However, metagenomic operons, particularly those of substantial length, appear to provide a superior basis for metagenomic guilt by association due to increased annotative stability. The need for improved recognition of metagenomic operons is discussed, as well as the limitations of the present work.
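As a rough illustration of operon prediction from co-directional intergenic distances, the sketch below groups adjacent genes on one contig into putative operons when they lie on the same strand and are separated by at most a distance threshold. The `Gene` record, the function name, and the 50 bp default threshold are assumptions made for this example and are not parameters reported in the study.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # first base of the gene on the contig
    end: int     # last base of the gene on the contig
    strand: str  # '+' or '-'


def group_operons(genes, max_gap=50):
    """Group adjacent co-directional genes whose intergenic gap is <= max_gap.

    max_gap is a hypothetical threshold in base pairs. Overlapping genes
    have a negative gap and are therefore kept in the same group.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    operons = []
    current = []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            operons.append(current)  # close the previous group
        current = [gene]
    if current:
        operons.append(current)
    return operons


# Toy contig for illustration only.
contig = [
    Gene("a", 1, 100, "+"),
    Gene("b", 120, 300, "+"),
    Gene("c", 320, 500, "-"),
    Gene("d", 900, 1000, "-"),
]
operon_names = [[g.name for g in op] for op in group_operons(contig)]
```

Groups of length two correspond to the co-directional gene pairs discussed above, and longer groups correspond to multi-gene operons, so annotation agreement can be compared as a function of group length.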
Project description:The recently held Critical Assessment of Function Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins regardless of whether they have previous annotations or not. This is in contrast to the original CAFA challenge, in which participants were asked to submit predictions for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty, and determine whether cross-validation provides a good estimate of performance. The CAFA2 task is a combination of two subtasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In this study we analyze the performance of several function prediction methods in these two scenarios. Our results show that several methods (structured support vector machine, binary support vector machines and guilt-by-association methods) do not usually achieve the same level of accuracy on these two tasks as that achieved by cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We also find that different methods have different performance characteristics in these tasks, and that cross-validation is not adequate for estimating performance and ranking methods. These results have implications for the design of computational experiments in the area of automated function prediction and can provide useful insight for the understanding and design of future CAFA competitions.
Project description:BACKGROUND: Emergence of multiple drug resistant strains of M. tuberculosis (MDR-TB) threatens to derail global efforts aimed at reining in the pathogen. Co-infections of M. tuberculosis with HIV are difficult to treat. To counter these new challenges, it is essential to study the interactions between M. tuberculosis and the host to learn how these bacteria cause disease. RESULTS: We report a systematic workflow to predict the host-pathogen interactions (HPIs) between M. tuberculosis and Homo sapiens based on sequence motifs. First, protein sequences were used as initial input for identifying the HPIs by the 'interolog' method. HPIs were further filtered by prediction of domain-domain interactions (DDIs). Functional annotations of proteins and publicly available experimental results were applied to filter the remaining HPIs. Using this strategy, 118 pairs of HPIs were identified, which involve 43 proteins from M. tuberculosis and 48 proteins from Homo sapiens. A biological interaction network between M. tuberculosis and Homo sapiens was then constructed using the predicted inter- and intra-species interactions based on the 118 pairs of HPIs. Finally, a web-accessible database named PATH (Protein interactions of M. tuberculosis and Human) was constructed to store these predicted interactions and proteins. CONCLUSIONS: This interaction network will facilitate research on host-pathogen protein-protein interactions, and may throw light on how M. tuberculosis interacts with its host.
Project description:Kynureninase is a member of a large family of catalytically diverse but structurally homologous pyridoxal 5'-phosphate (PLP) dependent enzymes known as the aspartate aminotransferase superfamily or alpha-family. The Homo sapiens and other eukaryotic constitutive kynureninases preferentially catalyze the hydrolytic cleavage of 3-hydroxy-l-kynurenine to produce 3-hydroxyanthranilate and l-alanine, while l-kynurenine is the substrate of many prokaryotic inducible kynureninases. The human enzyme was cloned with an N-terminal hexahistidine tag, expressed, and purified from a bacterial expression system using Ni metal ion affinity chromatography. Kinetic characterization of the recombinant enzyme reveals classic Michaelis-Menten behavior, with a Km of 28.3 ± 1.9 μM and a specific activity of 1.75 μmol min⁻¹ mg⁻¹ for 3-hydroxy-dl-kynurenine. Crystals of recombinant kynureninase that diffracted to 2.0 Å were obtained, and the atomic structure of the PLP-bound holoenzyme was determined by molecular replacement using the Pseudomonas fluorescens kynureninase structure (PDB entry 1qz9) as the phasing model. A structural superposition with the P. fluorescens kynureninase revealed that these two structures resemble the "open" and "closed" conformations of aspartate aminotransferase. The comparison illustrates the dynamic nature of these proteins' small domains and reveals a role for Arg-434 similar to its role in other AAT alpha-family members. Docking of 3-hydroxy-l-kynurenine into the human kynureninase active site suggests that Asn-333 and His-102 are involved in substrate binding and molecular discrimination between inducible and constitutive kynureninase substrates.
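For readers less familiar with Michaelis-Menten kinetics, the rate law v = Vmax·[S] / (Km + [S]) can be evaluated with the reported Km, as in the short sketch below. Treating the reported specific activity as Vmax is an assumption made purely for illustration, since the abstract does not state the substrate concentration at which it was measured. The function name is hypothetical.

```python
def michaelis_menten_rate(substrate_um, vmax, km_um=28.3):
    """Return the Michaelis-Menten rate v = vmax * [S] / (Km + [S]).

    substrate_um: substrate concentration in micromolar.
    vmax: maximal rate; the result has the same units as vmax.
    km_um: Michaelis constant in micromolar (default: the reported 28.3).
    """
    if substrate_um < 0:
        raise ValueError("substrate concentration must be non-negative")
    return vmax * substrate_um / (km_um + substrate_um)


# Assumption for illustration: use the reported specific activity as vmax.
ASSUMED_VMAX = 1.75  # micromol per minute per mg

rate_at_km = michaelis_menten_rate(28.3, ASSUMED_VMAX)        # half of vmax
rate_at_10x_km = michaelis_menten_rate(283.0, ASSUMED_VMAX)   # 10/11 of vmax
```

At a substrate concentration equal to Km the rate is half of Vmax by definition, which is a quick sanity check on any fitted parameters.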
Project description:Learning the function of genes is a major goal of computational genomics. Methods for inferring gene function have typically fallen into two categories: 'guilt-by-profiling', which exploits correlation between function and other gene characteristics; and 'guilt-by-association', which transfers function from one gene to another via biological relationships. We have developed a strategy ('Funckenstein') that performs guilt-by-profiling and guilt-by-association and combines the results. Using a benchmark set of functional categories and input data for protein-coding genes in Saccharomyces cerevisiae, Funckenstein was compared with a previous combined strategy. Subsequently, we applied Funckenstein to 2,455 Gene Ontology terms. In the process, we developed 2,455 guilt-by-profiling classifiers based on 8,848 gene characteristics and 12 functional linkage graphs based on 23 biological relationships. Funckenstein outperforms a previous combined strategy using a common benchmark dataset. The combination of 'guilt-by-profiling' and 'guilt-by-association' gave significant improvement over the component classifiers, showing the greatest synergy for the most specific functions. Performance was evaluated by cross-validation and by literature examination of the top-scoring novel predictions. These quantitative predictions should help prioritize experimental study of yeast gene functions.
Project description:BACKGROUND: Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques, and (2) development of a decision rule for the constructed FLN to optimize functional annotation. RESULTS: We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight estimates the functional coupling of linked proteins more accurately than individual data sources do alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverage. In particular, at low coverage all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30% down to 3%. In addition, a scoring scheme to estimate the precision of individual predictions is provided. Finally, tests of the robustness of the framework indicate that it can be successfully applied to less studied organisms.
CONCLUSION: We provide a general two-step function-annotation framework, and show that high-coverage, high-precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration and then applying a maximum weight decision rule.
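One plausible reading of a maximum weight decision rule is sketched below: each unannotated protein inherits the functions of its annotated neighbor with the highest linkage weight, and that weight is kept as a confidence score. This is an interpretation made for illustration, not the published implementation, and the function name, input layout, and protein labels are hypothetical.

```python
def annotate_by_max_weight(edges, known):
    """Transfer functions across the single heaviest edge to an annotated neighbour.

    edges: iterable of (protein_u, protein_v, weight) for an undirected FLN.
    known: dict mapping annotated proteins to their sets of functions.
    Returns a dict: unannotated protein -> (predicted functions, edge weight).
    """
    best = {}  # unannotated protein -> (weight, annotated neighbour)
    for u, v, weight in edges:
        # Consider the edge in both directions, since the FLN is undirected.
        for target, source in ((u, v), (v, u)):
            if target in known or source not in known:
                continue
            if target not in best or weight > best[target][0]:
                best[target] = (weight, source)
    return {t: (set(known[s]), w) for t, (w, s) in best.items()}


# Toy network with placeholder labels, for illustration only.
toy_edges = [("P1", "P2", 0.9), ("P1", "P3", 0.4), ("P4", "P3", 0.7)]
toy_known = {"P2": {"kinase activity"}, "P3": {"transport"}}
predictions = annotate_by_max_weight(toy_edges, toy_known)
```

Keeping the edge weight alongside each prediction allows a threshold to be applied afterwards, which is one simple way to trade coverage against precision.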
Project description:By using the 1.6 million single-nucleotide polymorphism (SNP) genotype data set from Perlegen Sciences [Hinds, D. A., Stuve, L. L., Nilsen, G. B., Halperin, E., Eskin, E., Ballinger, D. G., Frazer, K. A. & Cox, D. R. (2005) Science 307, 1072-1079], a probabilistic search for the landscape exhibited by positive Darwinian selection was conducted. By sorting each high-frequency allele by homozygosity, we search for the expected decay of adjacent SNP linkage disequilibrium (LD) at recently selected alleles, eliminating the need for inferring haplotype. We designate this approach the LD decay (LDD) test. By these criteria, 1.6% of Perlegen SNPs were found to exhibit the genetic architecture of selection. These results were confirmed on an independently generated data set of 1.0 million SNP genotypes (International Human Haplotype Map Phase I freeze). Simulation studies indicate that the LDD test, at the megabase scale used, effectively distinguishes selection from other causes of extensive LD, such as inversions, population bottlenecks, and admixture. The approximately 1,800 genes identified by the LDD test were clustered according to Gene Ontology (GO) categories. Based on overrepresentation analysis, several predominant biological themes are common in these selected alleles, including host-pathogen interactions, reproduction, DNA metabolism/cell cycle, protein metabolism, and neuronal function.