Scaling up data curation using deep learning: An application to literature triage in genomic variation resources.
ABSTRACT: Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often yields unsatisfactory precision and recall, and it is difficult to manually craft optimal queries. To address this, we propose a machine-learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that of the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method makes the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks that improve the quality of knowledge bases.
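The triage pipeline described above — a convolutional network that scores abstracts and ranks them for curation — can be sketched in miniature as follows. This is a hypothetical illustration, not the authors' code: the vocabulary size, embedding and filter dimensions, and the random toy documents are all assumptions, and a real system would learn the parameters from the gold-standard corpus rather than use random ones.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, EMB, FILTERS, WIDTH = 1000, 16, 8, 3  # illustrative sizes

# Randomly initialised parameters: embedding table, conv filters, linear head.
# In practice these would be trained on curated vs. non-curated publications.
E = rng.normal(0, 0.1, (VOCAB, EMB))
W = rng.normal(0, 0.1, (FILTERS, WIDTH, EMB))
w_out = rng.normal(0, 0.1, FILTERS)

def triage_score(token_ids):
    """Score one abstract (a sequence of token ids) for curation relevance."""
    x = E[token_ids]                                  # (seq_len, EMB)
    # 1-D convolution over token windows, one feature map per filter.
    n = len(token_ids) - WIDTH + 1
    maps = np.array([[np.sum(x[i:i + WIDTH] * W[f]) for i in range(n)]
                     for f in range(FILTERS)])
    pooled = np.tanh(maps).max(axis=1)                # global max pooling
    logit = pooled @ w_out
    return 1.0 / (1.0 + np.exp(-logit))               # sigmoid: P(relevant)

# Rank a batch of documents by predicted relevance, as in document triage.
docs = [rng.integers(0, VOCAB, size=50) for _ in range(5)]
ranking = sorted(range(len(docs)), key=lambda i: -triage_score(docs[i]))
```

Ranking by score, rather than applying a hard cutoff, matches the workflow in the abstract: curators work down the ranked list, so recall is preserved while high-scoring papers surface first.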
Project description:Motivation:Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, their ability to keep up with the growth of biomedical literature is under scrutiny. Using UniProtKB/Swiss-Prot as a case study, we address this concern via multiple literature triage approaches. Results:With the assistance of the PubTator text-mining tool, we tagged more than 10,000 articles to assess the ratio of papers relevant for curation. We first show that curators read and evaluate many more papers than they curate, and that measuring the number of curated publications is insufficient to provide a complete picture, as demonstrated by the fact that 8,000-10,000 papers are curated in UniProt each year while curators evaluate 50,000-70,000 papers per year. We show that 90% of the papers in PubMed are out of the scope of UniProt, that a maximum of 2-3% of the papers indexed in PubMed each year are relevant for UniProt curation, and that, despite appearances, expert curation in UniProt is scalable. Availability and implementation:UniProt is freely available at http://www.uniprot.org/. Contact:email@example.com. Supplementary information:Supplementary data are available at Bioinformatics online.
Project description:Personalized genomic medicine depends on integrated analyses that combine genetic and phenotypic data from individual patients with reference knowledge of the functional and clinical significance of sequence variants. Sources of this reference knowledge include the ClinVar repository of human genetic variants, a community resource that accepts submissions from external groups, and UniProtKB/Swiss-Prot, an expert-curated resource of protein sequences and functional annotation. UniProtKB/Swiss-Prot provides knowledge on the functional impact and clinical significance of over 30,000 human protein-coding sequence variants, curated from peer-reviewed literature reports. Here we present a pilot study that lays the groundwork for the integration of curated knowledge of protein sequence variation from UniProtKB/Swiss-Prot with ClinVar. We show that existing interpretations of variant pathogenicity in UniProtKB/Swiss-Prot and ClinVar are highly concordant, with 88% of variants that are common to the two resources having interpretations of clinical significance that agree. Re-curation of a subset of UniProtKB/Swiss-Prot variants according to American College of Medical Genetics and Genomics (ACMG) guidelines using ClinGen tools further increases this level of agreement, mainly due to the reclassification of supposedly pathogenic variants as benign, based on newly available population frequency data. We have now incorporated ACMG guidelines and ClinGen tools into the UniProt Knowledgebase (UniProtKB) curation workflow and routinely submit variant data from UniProtKB/Swiss-Prot to ClinVar. These efforts will increase the usability and utilization of UniProtKB variant data and will facilitate the continuing (re-)evaluation of clinical variant interpretations as data sets and knowledge evolve.
Project description:UniProtKB/Swiss-Prot provides expert curation with information extracted from literature and curator-evaluated computational analysis. As knowledgebases continue to play an increasingly important role in scientific research, a number of studies have evaluated their accuracy and revealed various errors. While some are curation errors, others are the result of incorrect information published in the scientific literature. By taking the example of sirtuin-5, a complex annotation case, we will describe the curation procedure of UniProtKB/Swiss-Prot and detail how we report conflicting information in the database. We will demonstrate the importance of collaboration between resources to ensure curation consistency and the value of contributions from the user community in helping maintain error-free resources. Database URL: www.uniprot.org.
Project description:Background: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. Results: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. Conclusions: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
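A pattern-matching approach of the kind described above — extracting the modification type and residue site from sentences — can be sketched with a couple of regular expressions. This is a minimal illustrative assumption, not the actual rule set: real systems use many more patterns plus protein-name recognition.

```python
import re

# One simple pattern per modification type (illustrative, not exhaustive):
# a verb form of the modification plus a residue position such as "Ser-123".
PTM_PATTERNS = {
    "phosphorylation": re.compile(
        r"phosphorylat\w+ (?:at|of|on) ((?:Ser|Thr|Tyr)-?\d+)", re.I),
    "ubiquitination": re.compile(
        r"ubiquitinat\w+ (?:at|of|on) ((?:Lys|K)-?\d+)", re.I),
}

def extract_ptms(sentence):
    """Return (modification type, residue site) pairs found in a sentence."""
    hits = []
    for ptm, pattern in PTM_PATTERNS.items():
        for match in pattern.finditer(sentence):
            hits.append((ptm, match.group(1)))
    return hits

print(extract_ptms("SIRT5 is phosphorylated at Ser-123 in vivo."))
# → [('phosphorylation', 'Ser-123')]
```

The per-modification precision and recall figures quoted in the abstract reflect exactly this trade-off: looser patterns raise recall but admit false matches, tighter ones do the reverse.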
Project description:There is a frequent need to obtain sets of functionally equivalent homologous proteins (FEPs) from different species. While orthology usually implies functional equivalence, this is not always true; therefore datasets of orthologous proteins are not appropriate. The information relevant to extracting FEPs is contained in databanks such as UniProtKB/Swiss-Prot, and a manual analysis of these data allows FEPs to be extracted on a one-off basis. However, there has been no resource allowing the easy, automatic extraction of groups of FEPs - for example, all instances of protein C. We have developed FOSTA, an automatically generated database of FEPs annotated as having the same function in UniProtKB/Swiss-Prot, which can be used for large-scale analysis. The method builds a candidate list of homologues and filters out functionally diverged proteins on the basis of functional annotations using a simple text-mining approach. Large-scale evaluation of our FEP extraction method is difficult as there is no gold-standard dataset against which the method can be benchmarked. However, a manual analysis of five protein families confirmed a high level of performance, and a more extensive comparison with two manually verified functional equivalence datasets also demonstrated very good performance. In summary, FOSTA provides an automated analysis of annotations in UniProtKB/Swiss-Prot, enabling groups of proteins already annotated as functionally equivalent to be extracted. Our results demonstrate that the vast majority of UniProtKB/Swiss-Prot functional annotations are of high quality and that FOSTA can interpret annotations successfully. Where FOSTA is not successful, we are able to highlight inconsistencies in UniProtKB/Swiss-Prot annotation, most of which would have presented equal difficulties for manual interpretation. We discuss limitations and possible future extensions to FOSTA, and recommend changes to the UniProtKB/Swiss-Prot format that would facilitate text mining of UniProtKB/Swiss-Prot.
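The filtering step FOSTA performs — keeping candidate homologues whose functional annotation text agrees with the query protein's — can be sketched as a token-overlap comparison. Everything here is an illustrative assumption (the Jaccard similarity measure, the 0.5 threshold, the toy annotations), not the actual FOSTA algorithm.

```python
def annotation_tokens(text):
    """Lower-case an annotation string and split it into a set of tokens."""
    return set(text.lower().replace(",", " ").split())

def filter_feps(query_annotation, candidates, min_overlap=0.5):
    """Keep candidates whose annotation overlaps the query's (Jaccard)."""
    q = annotation_tokens(query_annotation)
    feps = []
    for name, annotation in candidates:
        c = annotation_tokens(annotation)
        jaccard = len(q & c) / len(q | c)
        if jaccard >= min_overlap:       # functionally diverged proteins fall below
            feps.append(name)
    return feps

# Hypothetical candidate homologues of human protein C.
candidates = [
    ("protC_mouse", "vitamin K-dependent serine protease, anticoagulation"),
    ("protC_yeast", "putative transporter, unknown function"),
]
print(filter_feps(
    "vitamin K-dependent serine protease, anticoagulation", candidates))
# → ['protC_mouse']
```

The point of the sketch is the design choice, not the measure itself: comparing curated annotation text lets the candidate list built from sequence homology be pruned of functionally diverged proteins, which is exactly why orthology-only datasets are insufficient for FEPs.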
Project description:BACKGROUND: Functional annotation of proteins remains a challenging task. Currently the scientific literature serves as the main source for as-yet uncurated functional annotations, but curation work is slow and expensive. Automatic techniques that support this work still lack reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature. RESULTS: Using this procedure, more than 80% of the GO annotations for proteins with highly conserved orthologs that are available in UniProtKB/Swiss-Prot could be verified automatically. For a subset of proteins we predicted new GO annotations that were not available in UniProtKB/Swiss-Prot. All predictions were correct (100% precision) according to the verifications from a trained curator. CONCLUSION: Our method of integrating conserved protein interaction graphs with literature mining is thus a highly reliable approach to predict GO annotations for weakly characterized proteins with orthologs.
Project description:Advances in high-throughput and advanced technologies allow researchers to routinely perform whole genome and proteome analysis. For this purpose, they need high-quality resources providing comprehensive gene and protein sets for their organisms of interest. Using the example of the human proteome, we will describe the content of a complete proteome in the UniProt Knowledgebase (UniProtKB). We will show how manual expert curation of UniProtKB/Swiss-Prot is complemented by expert-driven automatic annotation to build a comprehensive, high-quality and traceable resource. We will also illustrate how the complexity of the human proteome is captured and structured in UniProtKB. Database URL: www.uniprot.org.
Project description:UniProtKB/Swiss-Prot, a curated protein database, and dictyBase, the Model Organism Database for Dictyostelium discoideum, have established a collaboration to improve data sharing. One of the major steps in this effort was the 'Dicty annotation marathon', a week-long exercise with 30 annotators aimed at achieving a major increase in the number of D. discoideum proteins represented in UniProtKB/Swiss-Prot. The marathon led to the annotation of over 1000 D. discoideum proteins in UniProtKB/Swiss-Prot. Concomitantly, there were a large number of updates in dictyBase concerning gene symbols, protein names and gene models. This exercise demonstrates how UniProtKB/Swiss-Prot can work in very close cooperation with model organism databases and how the annotation of proteins can be accelerated through those collaborations.
Project description:In 2004, when the protein estimate from the finished human genome was only 24,000, the surprise was compounded as reviewed estimates fell to 19,000 by 2014. However, variability persists in the total canonical protein counts (i.e. excluding alternative splice forms) of open reading frames (ORFs) in different annotation portals. This work assesses these differences and possible causes. A 16-year analysis of Ensembl and UniProtKB/Swiss-Prot shows convergence to a protein number of ~20,000. The former had shown some yo-yoing, but both have now plateaued. Nine major annotation portals, reviewed at the beginning of 2017, gave a spread of counts from 21,819 down to 18,891. The 4-way cross-reference concordance (within UniProt) between Ensembl, Swiss-Prot, Entrez Gene and the Human Gene Nomenclature Committee (HGNC) drops to 18,690, indicating methodological differences in protein definitions and experimental existence support between sources. The Swiss-Prot and neXtProt evidence criteria include mass spectrometry peptide verification and also cross-references for antibody detection from the Human Protein Atlas. Notwithstanding, hundreds of Swiss-Prot entries are classified as non-coding biotypes by HGNC. The only inference that protein numbers might still rise comes from numerous reports of small ORF (smORF) discovery. However, while there have been recent cases of protein verifications from previous mis-annotation of non-coding RNA, very few have passed the Swiss-Prot curation and genome annotation thresholds. The post-genomic era has seen both advances in data generation and improvements in the human reference assembly. Notwithstanding, current numbers, while persistently discordant, show that the earlier yo-yoing has largely ceased. Given the importance to biology and biomedicine of defining the canonical human proteome, the task will need more collaborative inter-source curation combined with broader and deeper experimental confirmation, in vivo and in vitro, of proteins predicted in silico. The eventual closure could well be below ~19,000.
Project description:SUMMARY: The SwissVar portal provides access to a comprehensive collection of single amino acid polymorphisms and diseases in the UniProtKB/Swiss-Prot database via a unique search engine. In particular, it gives direct access to the newly improved Swiss-Prot variant pages. The key strength of this portal is that it provides a possibility to query for similar diseases, as well as the underlying protein products and the molecular details of each variant. In the context of the recently proposed molecular view on diseases, the SwissVar portal should be in a unique position to provide valuable information for researchers and to advance research in this area. AVAILABILITY: The SwissVar portal is available at www.expasy.org/swissvar. CONTACT: firstname.lastname@example.org; email@example.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.