Project description: The dark proteome, as we define it, is the part of the proteome for which no 3D structure has been observed, either by experimental characterization or by homology modeling. Of the 550,116 proteins available in Swiss-Prot (as of July 2016), 43.2% of the eukaryotic and 49.2% of the viral proteins are part of the dark proteome. In bacteria and archaea, the proportion of dark proteome is markedly lower, at 12.6% and 13.3% respectively. In this work, we present a necessary step towards completing the dark proteome picture by introducing the map of the dark proteome in human and in other model organisms of special importance to mankind. The most significant result is that around 40% to 50% of the proteome of these organisms is still dark, with the higher percentages belonging to higher eukaryotes (mouse and human). Because more than 50% of the human proteome is dark, we carried out deeper analyses, including the identification of 'dark' genes that are responsible for the production of so-called dark proteins, as well as the identification of the 'dark' tissues where dark proteins are overrepresented, namely the heart, cervical mucosa, and natural killer cells. This is a step towards a deeper knowledge of the human dark proteome.
Project description: Motivation: Intrinsically disordered proteins (IDPs) are involved in numerous processes crucial for living organisms. Bias in the amino acid composition of these proteins determines their unique biophysical and functional features. Distinct intrinsically disordered regions (IDRs) with compositional bias play different important roles in various biological processes. IDRs enriched in particular amino acids in the human proteome have not been described consistently. Results: We developed DisEnrich, the database of human proteome IDRs that are significantly enriched in particular amino acids. Each human protein is described using Gene Ontology (GO) function terms, disorder prediction for the full-length sequence using three methods, enriched IDR composition and ranks of human proteins with similar enriched IDRs. Distribution analysis of enriched IDRs among broad functional categories revealed significant overrepresentation of R- and Y-enriched IDRs in metabolic and enzymatic activities and F-enriched IDRs in transport. About 75% of functional categories contain IDPs with IDRs significantly enriched in hydrophobic residues that are important for protein-protein interactions. Availability and implementation: The database is available at http://prodata.swmed.edu/DisEnrichDB/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
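To make the notion of an IDR "significantly enriched" in one amino acid concrete, the following is a minimal sketch only. The abstract above does not state which statistical test DisEnrich uses, so the one-sided binomial test here is an assumption, and the function name, the example region, and the background frequency are all hypothetical.

```python
from math import comb


def binomial_enrichment_p(region, residue, background_freq):
    """One-sided binomial p-value for seeing at least the observed count
    of `residue` in `region`, given a background frequency.

    This is an illustrative choice of test, not the DisEnrich method.
    """
    n = len(region)
    k = region.count(residue)
    # P(X >= k) for X ~ Binomial(n, background_freq)
    return sum(
        comb(n, i) * background_freq ** i * (1 - background_freq) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical arginine-rich region and an assumed background frequency.
p_value = binomial_enrichment_p("RRRRRRRRGS", "R", 0.05)
print(p_value)
```

In practice a database-scale analysis would also need a correction for multiple testing across residues and regions, which this sketch leaves out.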
Project description: We surveyed the "dark" proteome, that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44-54% of the proteome in eukaryotes and viruses was dark, compared with only ∼14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology.
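The distinction drawn above between dark regions and fully dark proteins can be sketched as a per-protein "dark fraction". This is an illustration under stated assumptions, not the authors' pipeline: the function name and the interval representation (1-based, inclusive residue ranges covered by an experimental structure or a homology model) are hypothetical.

```python
def dark_fraction(length, covered):
    """Fraction of residues not covered by any structural evidence.

    `covered` is a list of (start, end) residue ranges, 1-based and
    inclusive; ranges may overlap and are clipped to the sequence.
    """
    is_covered = [False] * length
    for start, end in covered:
        for pos in range(max(start, 1), min(end, length) + 1):
            is_covered[pos - 1] = True
    return 1 - sum(is_covered) / length


# Hypothetical 100-residue protein with two overlapping covered ranges.
print(dark_fraction(100, [(1, 50), (40, 60)]))
# A protein with no coverage at all is a "dark protein" in the sense above.
print(dark_fraction(100, []) == 1.0)
```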
Project description: In the framework of the Human Proteome Project initiative, we aim to improve the mapping and characterization of the mitochondrial proteome. In this work we implemented an experimental workflow, combining classical biochemical enrichments and mass spectrometry, to pursue a much deeper definition of the mitochondrial proteome and possibly mine uncharacterized mitochondrial dark proteins. We fractionated mitochondria enriched from HeLa cells into two compartments and annotated 4230 proteins across both fractions by means of a multiple-enzyme digestion (trypsin, chymotrypsin and Glu-C) followed by mass spectrometry analysis using a combination of Data Dependent Acquisition (DDA) and Data Independent Acquisition (DIA). We detected 22 mitochondrial dark proteins not annotated for their function, and we provide their relative abundance inside the mitochondrial organelle. Considering this work as a pilot study, we expect that the same approach, applied to different biological systems, could represent an advancement in the characterization of the human mitochondrial proteome, providing uncharted ground to explore the mitonuclear phenotypic relationships. All spectra have been deposited to ProteomeXchange with identifiers PXD014201 and PXD014200.
Project description: Proteome-pI is an online database containing information about predicted isoelectric points for 5029 proteomes calculated using 18 methods. The isoelectric point, the pH at which a particular molecule carries no net electrical charge, is an important parameter for many analytical biochemistry and proteomics techniques, especially for 2D gel electrophoresis (2D-PAGE), capillary isoelectric focusing, liquid chromatography-mass spectrometry and X-ray protein crystallography. The database, available at http://isoelectricpointdb.org, allows the retrieval of virtual 2D-PAGE plots and the development of customised fractions of a proteome based on isoelectric point and molecular weight. Moreover, Proteome-pI facilitates statistical comparisons of the various prediction methods as well as biological investigation of protein isoelectric point space in all kingdoms of life. For instance, using Proteome-pI data, it is clear that Eukaryotes, which evolved tight control of homeostasis, encode proteins with pI values near the cell pH. In contrast, Archaea, which frequently live in extreme environments, can possess proteins with a wide range of isoelectric points. The database includes various statistics and tools for interactive browsing, searching and sorting. Apart from data for individual proteomes, datasets corresponding to major protein databases such as UniProtKB/TrEMBL and the NCBI non-redundant (nr) database have also been precalculated and made available in CSV format.
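As an illustration of the definition above (the pH at which the net charge is zero), the sketch below estimates an isoelectric point by bisection on the Henderson-Hasselbalch net charge. It is not one of the 18 methods in the database. The pKa values are an illustrative assumption (published scales differ from one another), and the function names are hypothetical.

```python
# Illustrative pKa values (an assumption; published scales differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # C-terminal carboxyl
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases as pH increases."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: the pI lies higher
        else:
            high = mid  # zero or negative charge: the pI lies lower
    return (low + high) / 2


# Hypothetical peptide used only to exercise the functions.
print(round(isoelectric_point("ACDEFGHIK"), 2))
```

Bisection works here because each term of the net charge falls monotonically with pH, so the charge crosses zero exactly once in the searched range.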
Project description: Chikungunya virus (CHIKV) is a mosquito-borne alphavirus. Outbreaks of CHIKV infection have been seen in many tropical and subtropical regions of the world. Recent reports indicate that, after the outbreaks in 2005-06, the fitness of this virus propagating in Aedes albopictus increased owing to epistatic mutational changes in its envelope protein. In our study, we evaluated the prevalence of intrinsically disordered proteins (IDPs) and IDP regions (IDPRs) in the CHIKV proteome. IDPs/IDPRs are known as members of a 'Dark Proteome', which is defined as a set of polypeptide segments or whole proteins without a unique three-dimensional structure within the cellular milieu but with significant biological functions, such as cell cycle regulation, control of signaling pathways, and maintenance of viral proteomes. However, the intrinsically disordered aspects of the CHIKV proteome and the roles of IDPs/IDPRs in the pathogenic mechanism of this important virus have not yet been evaluated, and there are no existing reports on the intrinsic disorder status of CHIKV. To fill this gap, we have analyzed the abundance and functionality of IDPs/IDPRs in the CHIKV proteins involved in replication and maturation. It is likely that these IDPs/IDPRs can serve as novel targets for disorder-based drug design.
Project description: The Nucleolar Proteome Database (NOPdb) archives data on >700 proteins that were identified by multiple mass spectrometry (MS) analyses from highly purified preparations of human nucleoli, the most prominent nuclear organelle. Each protein entry is annotated with information about its corresponding gene, its domain structures and relevant protein homologues across species, and documents its MS identification history, including all the peptides sequenced by tandem MS (MS/MS). Moreover, data showing the quantitative changes in the relative levels of approximately 500 nucleolar proteins are compared at different timepoints upon transcriptional inhibition. Correlating changes in protein abundance at multiple timepoints, highlighted by visualization means in the NOPdb, provides clues regarding the potential interactions and relationships between nucleolar proteins and thereby suggests putative functions for factors within the 30% of the proteome which comprises novel/uncharacterized proteins. The NOPdb (http://www.lamondlab.com/NOPdb) is searchable by gene names, nucleotide or protein sequences, Gene Ontology terms or motifs, or by limiting the range for isoelectric points and/or molecular weights, and it links to other databases (e.g. LocusLink, OMIM and PubMed).
Project description: Proteome-pI 2.0 is an update of an online database containing predicted isoelectric points and pKa dissociation constants of proteins and peptides. The isoelectric point, the pH at which a particular molecule carries no net electrical charge, is an important parameter for many analytical biochemistry and proteomics techniques. Additionally, it can be obtained directly from the pKa values of individual charged residues of the protein. The Proteome-pI 2.0 database includes data for over 61 million protein sequences from 20 115 proteomes (three to four times more than the previous release). The isoelectric point for proteins is predicted by 21 methods, whereas pKa values are inferred by one method. To facilitate bottom-up proteomics analysis, individual proteomes were digested in silico with the five most commonly used proteases (trypsin, chymotrypsin, trypsin + LysC, LysN, ArgC), and the peptides' isoelectric points and molecular weights were calculated. The database enables the retrieval of virtual 2D-PAGE plots and customized fractions of a proteome based on the isoelectric point and molecular weight. In addition, isoelectric points for proteins in NCBI non-redundant (nr), UniProt, SwissProt, and Protein Data Bank are available in both CSV and FASTA formats. The database can be accessed at http://isoelectricpointdb2.org.
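The in silico digestion mentioned above can be sketched for trypsin alone. This is a simplified illustration, not the database's implementation: it assumes the commonly used rule of cleaving after K or R unless the next residue is P, and the function name and the `missed_cleavages` parameter are hypothetical.

```python
def trypsin_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides under a simplified cleavage rule.

    Assumed rule: cleave C-terminal to K or R, except before P.
    """
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # trailing fragment without K/R

    # Join neighbouring fragments to model up to N missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical sequence; the R before P is not cleaved.
print(trypsin_digest("AKRPGKLR"))  # ['AK', 'RPGK', 'LR']
```

Each resulting peptide could then be passed to an isoelectric point or mass calculation, which is the step the database precomputes.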
Project description: The chordate proteome history database (http://ioda.univ-provence.fr) comprises some 20,000 evolutionary analyses of proteins from chordate species. Our main objective was to characterize and study the evolutionary histories of the chordate proteome, and in particular to detect genomic events and enable automatic functional searches. Firstly, phylogenetic analyses based on high quality multiple sequence alignments and a robust phylogenetic pipeline were performed for the whole protein and for each individual domain. Novel approaches were developed to identify orthologs/paralogs, and predict gene duplication/gain/loss events and the occurrence of new protein architectures (domain gains, losses and shuffling). These important genetic events were localized on the phylogenetic trees and on the genomic sequence. Secondly, the phylogenetic trees were enhanced by the creation of phylogroups, whereby groups of orthologous sequences created using OrthoMCL were corrected based on the phylogenetic trees; gene family size and gene gain/loss in a given lineage could be deduced from the phylogroups. For each ortholog group obtained from the phylogenetic or the phylogroup analysis, functional information and expression data can be retrieved. Database searches can be performed easily using biological objects (protein identifier, keyword or domain), but can also be based on events; e.g., domain exchange events can be retrieved. To our knowledge, this is the first database that links group clustering, phylogeny and automatic functional searches along with the detection of important events occurring during genome evolution, such as the appearance of a new domain architecture.
Project description: Our current knowledge of complex biological systems is stored in a computable form through the Gene Ontology (GO), which provides a comprehensive description of gene function. Prediction of GO terms from the sequence remains, however, a challenging task, which is particularly critical for novel genomes. Here we present INGA 2.0, a new version of the INGA software for protein function prediction. INGA exploits homology, domain architecture, interaction networks and information from the 'dark proteome', such as transmembrane and intrinsically disordered regions, to generate a consensus prediction. INGA was ranked in the top ten methods on both the CAFA2 and CAFA3 blind tests. The new algorithm can process entire genomes in a few hours, or even less when additional input files are provided. The new interface provides a better user experience by integrating filters and widgets to explore the graph structure of the predicted terms. The INGA web server, databases and benchmarking are available at https://inga.bio.unipd.it/.