Finding one's way in proteomics: a protein species nomenclature.
ABSTRACT: Our knowledge of proteins has greatly improved in recent years, driven by new technologies in the fields of molecular biology and proteome research. It has become clear that from a single gene not only one single gene product but many different ones - termed protein species - are generated, all of which may be associated with different functions. Nonetheless, an unambiguous nomenclature for describing individual protein species is still lacking. With the present paper we therefore propose a systematic nomenclature for the comprehensive description of protein species. The protein species nomenclature is flexible and adaptable to every level of knowledge and of experimental data in accordance with the exact chemical composition of individual protein species. As a minimum description the entry name (gene name + species according to the UniProt knowledgebase) can be used, if no analytical data about the target protein species are available.
Project description:Most protein in hair and wool is of two broad types: keratin intermediate filament-forming proteins (commonly known as keratins) and keratin-associated proteins (KAPs). Keratin nomenclature was reviewed in 2006, but the KAP nomenclature has not been revised since 1993. Recently there has been an increase in the number of KAP genes (KRTAPs) identified in humans and other species, and an increasing number of reports of variation in these genes. We therefore propose that an updated naming system is needed to accommodate the complexity of the KAPs. We propose that the system build on the previous nomenclature, but with the abbreviation sp-KAPm-nL*x for KAP proteins and sp-KRTAPm-n(p/L)*x for KAP genes. In this system "sp" is a unique letter-based code for the species, as described by the UniProt protein knowledgebase; "m" is a number identifying the gene or protein family; "n" is a constituent member of that family; "p", if present, signifies a pseudogene; "L", if present, signifies "like" and serves as a temporary place-holder until the family is confirmed; and "x" signifies a genetic variant or allele. We support the use of non-italicised text for the proteins and italicised text for the genes. This nomenclature differs little from the existing system, but it includes species information and describes genetic variation where identified, and hence is more informative. For example, GenBank sequence JN091630 would historically have been named KRTAP7-1 for the gene and KAP7-1 for the protein, but with the proposed nomenclature would be SHEEP-KRTAP7-1*A and SHEEP-KAP7-1*A for the gene and protein respectively. This nomenclature will facilitate more efficient storage and retrieval of data and define a common language for the KAP proteins and genes from all mammalian species.
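To make the sp-KRTAPm-n(p/L)*x pattern concrete, the following is a minimal Python sketch that splits a gene name into its components and derives the matching protein name. The regular expression, the function names, and the choice to reject pseudogenes when deriving a protein name are illustrative assumptions, not part of the proposal; italics for gene names cannot be represented in a plain string and are ignored here.

```python
import re

# Assumed pattern for the gene form sp-KRTAPm-n(p/L)*x.
# The species code is taken to be upper-case letters or digits
# (an assumption about UniProt-style codes).
KRTAP_PATTERN = re.compile(
    r"^(?P<species>[A-Z0-9]+)-KRTAP(?P<family>\d+)-(?P<member>\d+)"
    r"(?P<flag>[pL])?(?:\*(?P<variant>[A-Za-z0-9]+))?$"
)


def parse_krtap(name):
    """Split a gene name into species, family, member, flag and variant."""
    match = KRTAP_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised KRTAP gene name: {name!r}")
    return match.groupdict()


def protein_name(gene_name):
    """Derive the sp-KAPm-nL*x protein name from a gene name."""
    parts = parse_krtap(gene_name)
    if parts["flag"] == "p":
        # Assumption of this sketch: a pseudogene has no protein name.
        raise ValueError("pseudogenes have no protein name in this sketch")
    like = "L" if parts["flag"] == "L" else ""
    variant = f"*{parts['variant']}" if parts["variant"] else ""
    return f"{parts['species']}-KAP{parts['family']}-{parts['member']}{like}{variant}"
```

Applied to the example in the abstract, `protein_name("SHEEP-KRTAP7-1*A")` yields `SHEEP-KAP7-1*A`.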
Project description:Comparative genomics is an essential component of the post-genomic era. The chicken genome is the first avian genome to be sequenced and it will serve as a model for other avian species. Moreover, due to its unique evolutionary niche, the chicken genome can be used to understand evolution of functional elements and gene regulation in mammalian species. However, comparative biology both within avian species and within amniotes is hampered by the difficulty of recognising functional orthologs. This problem is compounded as databases and sequence repositories proliferate, and the names they assign to functional elements proliferate along with them. Currently, a gene can be published under more than one name, and one name sometimes refers to unrelated genes. Standardized gene nomenclature is necessary to facilitate communication between scientists and genomic resources. Moreover, it is important that this nomenclature be based on existing nomenclature efforts where possible to truly facilitate studies between different species. We report here the formation of the Chicken Gene Nomenclature Committee (CGNC), an international and centralized effort to provide standardized nomenclature for chicken genes. The CGNC works in conjunction with public resources such as NCBI and Ensembl and in consultation with existing nomenclature committees for human and mouse. The CGNC will develop standardized nomenclature in consultation with the research community and relies on the support of the research community to ensure that the nomenclature facilitates comparative and genomic studies.
Project description:Fungal mitochondrial genes are often invaded by group I or II introns, which represent an ideal marker for understanding fungal evolution. A standard nomenclature of mitochondrial introns is needed to avoid confusion when comparing different fungal mitogenomes. Currently, there is a standard nomenclature for introns present in rRNA genes, but none for introns present in protein-coding genes. In this study, we propose a new nomenclature system for introns in fungal mitochondrial protein-coding genes based on (1) a three-letter abbreviation of the host scientific name, (2) the host gene name, (3) one capital letter: P (for group I introns), S (for group II introns), or U (for introns of unknown type), and (4) the intron insertion site in the host gene, numbered according to the cyclosporin-producing fungus Tolypocladium inflatum. The suggested nomenclature proved feasible for naming introns present in the mitogenomes of 16 fungi of different phyla, including both basal and higher fungal lineages, although minor adjustment of the nomenclature is needed to fit certain special conditions. The nomenclature also has the potential to name plant/protist/animal mitochondrial introns. We hope future studies follow the proposed nomenclature to ensure direct comparison across different studies.
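The four components listed above can be sketched as a small Python record. The abstract does not state how the components are joined or how the three-letter abbreviation is derived, so the separator, the field names, and the example values below are hypothetical; only the P/S/U letter assignment comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Letter assignment stated in the abstract: group I -> P, group II -> S,
# unknown type -> U. The dictionary keys are this sketch's own encoding.
TYPE_LETTER = {"I": "P", "II": "S", None: "U"}


@dataclass(frozen=True)
class IntronName:
    host_abbrev: str      # (1) three-letter host abbreviation (supplied by the caller)
    gene: str             # (2) host gene name
    group: Optional[str]  # "I", "II", or None when the intron type is unknown
    site: int             # (4) insertion site relative to the reference gene

    def type_letter(self):
        """Return the capital letter for component (3)."""
        return TYPE_LETTER[self.group]

    def label(self, sep="-"):
        """Join the components; the separator and order are assumptions."""
        return f"{self.host_abbrev}{sep}{self.gene}{self.type_letter()}{self.site}"
```

For instance, a hypothetical group I intron at site 100 of cox1 in a host abbreviated "Abc" would be rendered by this sketch as `Abc-cox1P100`; the actual layout should be taken from the published nomenclature.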
Project description:MOTIVATION: To provide high quality computationally tractable enzyme annotation in UniProtKB using Rhea, a comprehensive expert-curated knowledgebase of biochemical reactions which describes reaction participants using the ChEBI (Chemical Entities of Biological Interest) ontology. RESULTS: We replaced existing textual descriptions of biochemical reactions in UniProtKB with their equivalents from Rhea, which is now the standard for annotation of enzymatic reactions in UniProtKB. We developed improved search and query facilities for the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that Rhea and ChEBI provide. AVAILABILITY AND IMPLEMENTATION: UniProtKB at https://www.uniprot.org; UniProt REST API at https://www.uniprot.org/help/api; UniProt SPARQL endpoint at https://sparql.uniprot.org/; Rhea at https://www.rhea-db.org.
Project description:Genew, the Human Gene Nomenclature Database, is the only resource that provides data for all human genes which have approved symbols. It is managed by the HUGO Gene Nomenclature Committee (HGNC) as a confidential database, containing over 16 000 records, 80% of which are represented on the Web by searchable text files. The data in Genew are highly curated by HGNC editors and gene records can be searched on the Web by symbol or name to directly retrieve information on gene symbol, gene name, cytogenetic location, OMIM number and PubMed ID. Data are integrated with other human gene databases, e.g. GDB, LocusLink and SWISS-PROT, and approved gene symbols are carefully co-ordinated with the Mouse Genome Database (MGD). Approved gene symbols are available for querying and browsing at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl.
Project description:The Gene Ontology (GO) is the de facto standard for the functional description of gene products, providing a consistent, information-rich terminology applicable across species and information repositories. The UniProt Consortium uses both manual and automatic GO annotation approaches to curate UniProt Knowledgebase (UniProtKB) entries. The selection of a protein set prioritized for manual annotation has implications for the characteristics of the information provided to users working in a specific field or interested in particular pathways or processes. In this article, we describe an organelle-focused, manual curation initiative targeting proteins from the human peroxisome. We discuss the steps taken to define the peroxisome proteome and the challenges encountered in defining the boundaries of this protein set. We illustrate with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users. Database URL: http://www.ebi.ac.uk/GOA/ and http://www.uniprot.org.
Project description:Lysophospholipids encompass a diverse range of small, membrane-derived phospholipids that act as extracellular signals. The signalling properties are mediated by 7-transmembrane GPCRs, constituent members of which have continued to be identified after their initial discovery in the mid-1990s. Here we briefly review this class of receptors, with a particular emphasis on their protein and gene nomenclatures that reflect their cognate ligands. There are six lysophospholipid receptors that interact with lysophosphatidic acid (LPA): protein names LPA1-LPA6 and italicized gene names LPAR1-LPAR6 (human) and Lpar1-Lpar6 (non-human). There are five sphingosine 1-phosphate (S1P) receptors: protein names S1P1-S1P5 and italicized gene names S1PR1-S1PR5 (human) and S1pr1-S1pr5 (non-human). Recent additions to the lysophospholipid receptor family have resulted in the proposed names for a lysophosphatidyl inositol (LPI) receptor - protein name LPI1 and gene name LPIR1 (human) and Lpir1 (non-human) - and three lysophosphatidyl serine receptors - protein names LyPS1, LyPS2, LyPS3 and gene names LYPSR1-LYPSR3 (human) and Lypsr1-Lypsr3 (non-human) - along with a variant form, provisionally named LyPS2L, that does not appear to exist in humans. This nomenclature incorporates previous recommendations from the International Union of Basic and Clinical Pharmacology, the Human Genome Organization Gene Nomenclature Committee, and Mouse Genome Informatics.
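The protein-to-gene naming pattern in the examples above (LPA1 to LPAR1/Lpar1, S1P1 to S1PR1/S1pr1, LyPS1 to LYPSR1/Lypsr1) can be expressed as a short Python sketch. The rule encoded here (upper-case the protein root, append "R", then lower-case all but the first letter for non-human genes) is inferred from those examples and is an assumption; the function name is hypothetical, and the provisional LyPS2L form is deliberately not handled because no gene name is given for it.

```python
import re

# Protein roots named in the abstract; anything else is rejected.
PROTEIN_PATTERN = re.compile(r"^(LPA|S1P|LPI|LyPS)(\d+)$")


def gene_symbol(protein, human=True):
    """Map a receptor protein name to its gene symbol (italics not shown)."""
    match = PROTEIN_PATTERN.match(protein)
    if match is None:
        raise ValueError(f"unrecognised receptor protein name: {protein!r}")
    root, number = match.groups()
    symbol = f"{root.upper()}R{number}"
    # Non-human symbols keep only the first letter in upper case.
    return symbol if human else symbol.capitalize()
```

With this sketch, `gene_symbol("S1P5", human=False)` gives `S1pr5` and `gene_symbol("LyPS3")` gives `LYPSR3`, matching the ranges quoted in the text.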
Project description:The Universal Protein Resource (UniProt) provides the scientific community with a single, centralized, authoritative resource for protein sequences and functional information. Formed by uniting the Swiss-Prot, TrEMBL and PIR protein database activities, the UniProt consortium produces three layers of protein sequence databases: the UniProt Archive (UniParc), the UniProt Knowledgebase (UniProt) and the UniProt Reference (UniRef) databases. The UniProt Knowledgebase is a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase with extensive cross-references. This centrepiece consists of two sections: UniProt/Swiss-Prot, with fully, manually curated entries; and UniProt/TrEMBL, enriched with automated classification and annotation. During 2004, tens of thousands of Knowledgebase records were manually annotated or updated; we introduced a new comment line topic, TOXIC DOSE, to store information on the acute toxicity of a toxin; the UniProt keyword list was augmented with additional keywords; we improved the documentation of the keywords; and we are continuously overhauling and standardizing the annotation of post-translational modifications. Furthermore, we introduced a new documentation file of the strains and their synonyms. Many new database cross-references were introduced and we started to make use of Digital Object Identifiers. In collaboration with the Macromolecular Structure Database group at EBI, we also achieved improved integration with structural databases through residue-level mapping of sequences from Protein Data Bank entries onto the corresponding UniProt entries. For convenient sequence searches we provide the UniRef non-redundant sequence databases. The comprehensive UniParc database stores the complete body of publicly available protein sequence data. The UniProt databases can be accessed online (http://www.uniprot.org) or downloaded in several formats (ftp://ftp.uniprot.org/pub). New releases are published every two weeks.
Project description:To provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information, the Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt) consortium. Our mission is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces. The central database will have two sections, corresponding to the familiar Swiss-Prot (fully manually curated entries) and TrEMBL (enriched with automated classification, annotation and extensive cross-references). For convenient sequence searches, UniProt also provides several non-redundant sequence databases. The UniProt NREF (UniRef) databases provide representative subsets of the knowledgebase suitable for efficient searching. The comprehensive UniProt Archive (UniParc) is updated daily from many public source databases. The UniProt databases can be accessed online (http://www.uniprot.org) or downloaded in several formats (ftp://ftp.uniprot.org/pub). The scientific community is encouraged to submit data for inclusion in UniProt.
Project description:Mammalian carboxylesterase (CES or Ces) genes encode enzymes that participate in xenobiotic, drug, and lipid metabolism in the body and are members of at least five gene families. Tandem duplications have added more genes for some families, particularly for mouse and rat genomes, which has caused confusion in naming rodent Ces genes. This article describes a new nomenclature system for human, mouse, and rat carboxylesterase genes that identifies homolog gene families and allocates a unique name for each gene. The guidelines of human, mouse, and rat gene nomenclature committees were followed and "CES" (human) and "Ces" (mouse and rat) root symbols were used followed by the family number (e.g., human CES1). Where multiple genes were identified for a family or where a clash occurred with an existing gene name, a letter was added (e.g., human CES4A; mouse and rat Ces1a) that reflected gene relatedness among rodent species (e.g., mouse and rat Ces1a). Pseudogenes were named by adding "P" and a number to the human gene name (e.g., human CES1P1) or by using a new letter followed by ps for mouse and rat Ces pseudogenes (e.g., Ces2d-ps). Gene transcript isoforms were named by adding the GenBank accession ID to the gene symbol (e.g., human CES1_AB119995 or mouse Ces1e_BC019208). This nomenclature improves our understanding of human, mouse, and rat CES/Ces gene families and facilitates research into the structure, function, and evolution of these gene families. It also serves as a model for naming CES genes from other mammalian species.
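The naming rules described above can be illustrated with a minimal Python sketch. The function names and parameters are hypothetical; the formatting rules (root symbol, family number, optional letter, "P" plus a number for human pseudogenes, "-ps" for rodent pseudogenes, and an underscore before the GenBank accession for transcript isoforms) are read directly from the examples in the abstract.

```python
def human_ces_symbol(family, letter="", pseudogene_number=None):
    """Build a human symbol such as CES1, CES4A or CES1P1."""
    symbol = f"CES{family}{letter.upper()}"
    if pseudogene_number is not None:
        # Human pseudogenes: "P" and a number appended to the gene name.
        symbol += f"P{pseudogene_number}"
    return symbol


def rodent_ces_symbol(family, letter="", pseudogene=False):
    """Build a mouse or rat symbol such as Ces1a or Ces2d-ps."""
    symbol = f"Ces{family}{letter.lower()}"
    if pseudogene:
        # Rodent pseudogenes: a letter followed by "-ps".
        symbol += "-ps"
    return symbol


def transcript_name(gene_symbol, genbank_accession):
    """Name a transcript isoform by appending the GenBank accession."""
    return f"{gene_symbol}_{genbank_accession}"
```

For example, `transcript_name(rodent_ces_symbol(1, "e"), "BC019208")` reproduces the mouse isoform name `Ces1e_BC019208` quoted in the abstract.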