Comment on 'Hayai-Annotation Plants: an ultra-fast and comprehensive functional gene annotation system in plants': the importance of taking the GO graph structure into account.
ABSTRACT: Supplementary data are available at Bioinformatics online.
Project description:As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with a main focus on annotation precision, and heuristic alternatives, with a main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
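The idea of turning raw GO-term scores into hierarchy-consistent marginal probabilities can be illustrated on a toy graph. The sketch below is not the factor graph or message passing algorithm of the abstract: it swaps in brute-force enumeration, which is only feasible for a handful of terms, and it assumes that raw scores are independent probabilities and that the only constraint is "a term applies only if all its parents apply". The function name and the term identifiers are hypothetical.

```python
from itertools import product


def consistent_marginals(raw, parents):
    """Marginal probability of each term, conditioned on hierarchy consistency.

    raw     -- dict: term -> raw probability from a base classifier (assumed independent)
    parents -- dict: term -> list of parent terms in the ontology graph
    Brute force over all 2^n labelings, so only suitable for tiny examples.
    """
    terms = sorted(raw)
    total = 0.0
    mass = {t: 0.0 for t in terms}
    for bits in product((0, 1), repeat=len(terms)):
        labels = dict(zip(terms, bits))
        # Hard constraint: a term may be switched on only if all its parents are on.
        if any(labels[t] and not labels[p]
               for t in terms for p in parents.get(t, [])):
            continue
        # Weight of this labeling under the independent raw scores.
        weight = 1.0
        for t in terms:
            weight *= raw[t] if labels[t] else 1.0 - raw[t]
        total += weight
        for t in terms:
            if labels[t]:
                mass[t] += weight
    return {t: mass[t] / total for t in terms}


# Hypothetical two-term hierarchy: both classifiers are undecided (0.5).
raw_scores = {"GO:parent": 0.5, "GO:child": 0.5}
marginals = consistent_marginals(raw_scores, {"GO:child": ["GO:parent"]})
```

With these inputs the parent ends up more probable than the child, which is the qualitative behaviour a hierarchy-aware method is expected to show; message passing is what makes the same computation tractable on the full GO graph.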
Project description:Summary: Hayai-Annotation Plants is a browser-based interface for an ultra-fast and accurate functional gene annotation system for plant species using R. The pipeline combines sequence-similarity searches, using USEARCH against UniProtKB (taxonomy Embryophyta), with a functional annotation step. Hayai-Annotation Plants provides five layers of annotation: i) protein name; ii) gene ontology terms consisting of its three main domains (Biological Process, Molecular Function and Cellular Component); iii) enzyme commission number; iv) protein existence level; and v) evidence type. It implements a new algorithm that gives priority to protein existence level to propagate GO and EC information and annotated Arabidopsis thaliana representative peptide sequences (Araport11) within 5 min at the PC level. Availability and implementation: The software is implemented in R and runs on Macintosh and Linux systems. It is freely available at https://github.com/kdri-genomics/Hayai-Annotation-Plants under the GPLv3 license. Supplementary information: Supplementary data are available at Bioinformatics online.
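What "giving priority to protein existence level" could look like is sketched below. This is a guess at the general idea, not the tool's implementation (which is written in R): it assumes each similarity hit carries a UniProt protein existence (PE) level from 1 (evidence at protein level) to 5 (uncertain), and that the annotation source is the hit with the best PE level, with sequence identity as a tie-breaker. The `Hit` class, the identity threshold and the accessions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    accession: str    # hypothetical UniProtKB accession of the subject sequence
    identity: float   # sequence identity as a fraction in [0, 1]
    existence: int    # UniProt PE level: 1 (protein-level evidence) .. 5 (uncertain)
    go_terms: tuple   # GO identifiers attached to the subject sequence


def pick_annotation_source(hits, min_identity=0.4):
    """Choose the hit whose annotations are transferred to the query.

    Assumed rule: discard weak hits, then prefer the lowest PE level and
    break ties by the highest identity. Returns None if nothing qualifies.
    """
    candidates = [h for h in hits if h.identity >= min_identity]
    if not candidates:
        return None
    return min(candidates, key=lambda h: (h.existence, -h.identity))


# Hypothetical hits for one query sequence.
example_hits = [
    Hit("ACC_A", 0.95, 3, ("GO:0003677",)),
    Hit("ACC_B", 0.80, 1, ("GO:0003677", "GO:0005634")),
    Hit("ACC_C", 0.30, 1, ("GO:0005634",)),
]
best = pick_annotation_source(example_hits)
```

Under this assumed rule the second hit wins even though the first has higher identity, because its annotation rests on stronger evidence that the protein exists.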
Project description:Transcranial alternating current stimulation (tACS) has found widespread use as a basic tool in the exploration of the role of brain oscillations. Many studies have shown that frequency-specific tACS is able to not only alter cognitive processes during stimulation, but also cause specific physiological aftereffects visible in the electroencephalogram (EEG). The relationship between the emergence of these aftereffects and the necessary duration of stimulation is inconclusive. Our goal in this study was to narrow down the crucial length of tACS-blocks, by which aftereffects can be elicited. We stimulated participants with α-tACS in four blocks of 1-, 3-, 5-, and 10-min length, once in increasing and once in decreasing order. After each block, we measured the resting EEG for 10 min during a visual vigilance task. We could not find lasting enhancement of α-power following any stimulation block, when comparing the stimulated groups to the sham group. These findings offer no information regarding the crucial stimulation duration. In addition, this conflicts with previous findings, showing a power increase following 10 min of tACS in the alpha range. We performed additional explorative analyses, based on known confounds of (1) mismatches between stimulation frequency and individual alpha frequency and (2) abnormalities in baseline α-activity. The results of an ANCOVA suggested that both factors explain variance, but could not resolve how exactly both factors interfere with the stimulation effect. Employing a linear mixed model, we found a significant effect of stimulation following 10 min of α-tACS in the increasing sequence and a significant effect of the mismatch between stimulated frequency and individual alpha frequency. The implications of these findings for future research are discussed.
Project description:Day-to-day variability in microarray experiments is a recognized source of variation that can impede the analysis of large microarray studies where samples are processed on different days. In this study, we have applied an algorithm, called D2Dsum, which is based on a log-linear fixed effect model to cope with this kind of issue on a data set of 45 microarrays. Keywords: time course
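A fixed day effect on the log scale can be illustrated with a very small example. The sketch below is not the D2Dsum algorithm, whose details are not given here; it assumes the simplest one-way model, log2(intensity) = gene mean + day effect + noise, estimates each day effect as the day mean minus the grand mean, and subtracts it. The function name and the example values are hypothetical.

```python
from collections import defaultdict
from math import log2


def remove_day_effect(intensities, days):
    """Return log2 intensities of one gene with the estimated day effect removed.

    intensities -- positive raw intensities, one per array
    days        -- processing-day label for each array (same order)
    Assumes a one-way fixed effect model with no other covariates.
    """
    logs = [log2(x) for x in intensities]
    grand_mean = sum(logs) / len(logs)
    by_day = defaultdict(list)
    for value, day in zip(logs, days):
        by_day[day].append(value)
    # Day effect = mean on that day minus the overall mean.
    effect = {d: sum(v) / len(v) - grand_mean for d, v in by_day.items()}
    return [value - effect[day] for value, day in zip(logs, days)]


# Hypothetical gene measured on two arrays per day; day "b" runs brighter.
adjusted = remove_day_effect([1, 2, 4, 8], ["a", "a", "b", "b"])
```

In this toy case the adjusted values for the two days coincide, because the whole difference between days is attributed to the day effect. In a real study the day effect would have to be separated from biological factors that are partly confounded with processing day.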
Project description:Simulations are used to generate plausible realisations of soil and climatic variables for input into an enterprise land suitability assessment (LSA). Subsequently we present a case study demonstrating an LSA (for hazelnuts) which takes into account the quantified uncertainties of the biophysical model input variables. This study is carried out in the Meander Valley Irrigation District, Tasmania, Australia. It is found that, compared with an LSA that assumes inputs to be error free, there is a significant difference in the assessment of suitability. Using an approach that assumes inputs to be error free, 56% of the study area was predicted to be suitable for hazelnuts. Using the simulation approach it is revealed that there is considerable uncertainty about the 'error free' assessment, where a prediction of 'unsuitable' was made 66% of the time (on average) at each grid cell of the study area. The cause of this difference is that digital soil mapping of both soil pH and conductivity has a high quantified uncertainty in this study area. Despite differences between the comparative methods, taking account of the prediction uncertainties provides a realistic appraisal of enterprise suitability. It is advantageous also because suitability assessments are provided as continuous variables as opposed to discrete classifications. We would recommend for other studies that consider similar FAO (Food and Agriculture Organisation of the United Nations) land evaluation framework type suitability assessments, that parameter membership functions (as opposed to discrete threshold cutoffs) together with the simulation approach are used in concert.
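The combination of membership functions and simulation recommended above can be sketched for a single grid cell and a single soil variable. This is a minimal illustration under stated assumptions, not the study's model: the mapped value is assumed to be normally distributed around its prediction, the membership function is a simple trapezoid, and the pH bounds used in the example are hypothetical rather than actual hazelnut requirements.

```python
import random


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def simulate_suitability(mean, sd, bounds, n=1000, seed=0):
    """Monte Carlo suitability for one cell.

    Draws n plausible values around the mapped prediction and returns
    (mean membership, fraction of draws that are fully unsuitable).
    """
    rng = random.Random(seed)
    scores = [trapezoid(rng.gauss(mean, sd), *bounds) for _ in range(n)]
    return sum(scores) / n, sum(s == 0.0 for s in scores) / n


PH_BOUNDS = (5.0, 6.0, 7.0, 8.0)  # hypothetical thresholds, for illustration only
certain = simulate_suitability(6.5, 0.0, PH_BOUNDS)    # no mapping uncertainty
uncertain = simulate_suitability(6.5, 1.0, PH_BOUNDS)  # large mapping uncertainty
```

With zero uncertainty the cell is rated fully suitable, which mirrors the 'error free' assessment; once the prediction uncertainty is large, part of the simulated realisations fall outside the suitable range and the continuous score drops accordingly.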
Project description:Currently, food waste is estimated at more than one-third of all food produced, and the primary responsibility for this phenomenon is attributed to households. Therefore, it seems reasonable to take action to limit food waste and to raise awareness about this link in the chain. To develop and implement educational programs addressed at consumers it is necessary to understand the factors determining food waste in households. Segmentation, which identifies homogeneous clusters of consumers, is a tool that can help to effectively reach the consumers who waste the most food. The aim of this study was to perform segmentation to identify consumer groups with similar behaviors in relation to food, with particular emphasis on food wastage. We carried out segmentation on a representative sample of Polish people over 18 years of age and identified three clusters of consumers. The three consumer segments diagnosed differed in sociodemographic terms, i.e., number of adults, number of children, subjective assessment of the financial situation, and percentage of spending on food. The segment exhibiting a high frequency of discarding food due to too large package size included single and double households.
Project description:Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
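The basic association rule learning step mentioned above can be sketched as follows. This corresponds only to the plain methodology (the one reported to have low precision), not to the authors' refined algorithm: a rule "A implies B" is kept when enough proteins carry both terms (support) and nearly every protein annotated with A also carries B (confidence). The thresholds, protein names and term identifiers are hypothetical.

```python
from itertools import permutations


def mine_rules(annotations, min_support=2, min_confidence=0.9):
    """Find candidate 'A implies B' relationships between annotation terms.

    annotations -- dict: protein -> set of terms annotated to it
    Returns a sorted list of (A, B, support, confidence) tuples.
    """
    term_count = {}
    pair_count = {}
    for terms in annotations.values():
        for term in terms:
            term_count[term] = term_count.get(term, 0) + 1
        # Ordered pairs, because 'A implies B' differs from 'B implies A'.
        for a, b in permutations(terms, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = []
    for (a, b), support in pair_count.items():
        confidence = support / term_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical annotation sets for three proteins.
toy_annotations = {
    "protein1": {"GO:A", "GO:B"},
    "protein2": {"GO:A", "GO:B"},
    "protein3": {"GO:B"},
}
rules = mine_rules(toy_annotations)
```

Here only "GO:A implies GO:B" survives: every protein with GO:A also has GO:B, whereas the reverse rule holds for just two of three proteins. A rule of this kind that is not already encoded in the ontology points either to a missing relationship or to inconsistently annotated proteins.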
Project description:The GO annotation dataset provided by the UniProt Consortium (GOA: http://www.ebi.ac.uk/GOA) is a comprehensive set of evidence-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360,000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set.
Project description:InterPro amalgamates predictive protein signatures from a number of well-known partner databases into a single resource. To aid with interpretation of results, InterPro entries are manually annotated with terms from the Gene Ontology (GO). The InterPro2GO mappings comprise the cross-references between these two resources and are the largest source of GO annotation predictions for proteins. Here, we describe the protocol by which InterPro curators integrate GO terms into the InterPro database. We discuss the unique challenges involved in integrating specific GO terms with entries that may describe a diverse set of proteins, and we illustrate, with examples, how InterPro hierarchies reflect GO terms of increasing specificity. We describe a revised protocol for GO mapping that enables us to assign GO terms to domains based on the function of the individual domain, rather than the function of the families in which the domain is found. We also discuss how taxonomic constraints are dealt with and those cases where we are unable to add any appropriate GO terms. Expert manual annotation of InterPro entries with GO terms enables users to infer function, process or subcellular information for uncharacterized sequences based on sequence matches to predictive models. Database URL: http://www.ebi.ac.uk/interpro. The complete InterPro2GO mappings are available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go.
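Reading such a mapping file into a lookup table might look like the sketch below. The line layout is an assumption that should be checked against the downloaded file: comment lines start with `!`, and each mapping line has the form `InterPro:<entry id> <entry name> > GO:<term name> ; <GO id>`. The function name is hypothetical and the sample lines are illustrative, not copied from the resource.

```python
def parse_external2go(lines):
    """Build a dict: database entry id -> list of GO ids.

    Assumed line layout: 'InterPro:IPR000003 Some name > GO:term name ; GO:0003677'.
    Comment lines (starting with '!') and malformed lines are skipped.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        left, separator, right = line.partition(" > ")
        if not separator or ";" not in right or ":" not in left:
            continue  # not a mapping line in the assumed layout
        entry_id = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(";", 1)[1].strip()
        mapping.setdefault(entry_id, []).append(go_id)
    return mapping


# Illustrative lines in the assumed layout (entry names are placeholders).
sample_lines = [
    "!date: illustrative header line",
    "InterPro:IPR000003 Example entry name > GO:DNA binding ; GO:0003677",
    "InterPro:IPR000003 Example entry name > GO:nucleus ; GO:0005634",
    "a line that does not follow the layout",
]
interpro_to_go = parse_external2go(sample_lines)
```

Given a set of InterPro matches for an uncharacterized sequence, GO terms can then be collected by looking up each matched entry in the resulting dictionary.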