Predicting Functions of Uncharacterized Human Proteins: From Canonical to Proteoforms.
ABSTRACT: Despite tremendous efforts in the genomics, transcriptomics, and proteomics communities, there are still no comprehensive data on the exact number of protein-coding genes, translated proteoforms, and their functions. In addition, functional annotation is still lacking for 1193 genes whose expression has been confirmed at the proteomic level (uPE1 proteins). We re-analyzed the results of AP-MS experiments from the BioPlex 2.0 database to predict the functions of uPE1 proteins and their splice forms. By building a protein-protein interaction network for about 12,000 identified proteins encoded by about 11,000 genes, we were able to predict Gene Ontology categories for a total of 387 uPE1 genes. We predicted different functions for the canonical and alternatively spliced forms of four uPE1 genes. In total, functional differences were revealed for 62 proteoforms encoded by 31 genes. Based on these results, it can be cautiously concluded that the dynamics and versatility of the interactome are ensured by changes in the dominant splice form. Overall, we propose that analysis of large-scale AP-MS experiments performed for various cell lines and under various conditions is key to understanding the full potential of genes' roles in cellular processes.
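The abstract does not spell out how Gene Ontology categories were derived from the interaction network. As a rough illustration only, the sketch below uses a simple "guilt by association" vote: an unannotated protein inherits the GO terms carried by at least `min_support` of its interaction partners. The function name, the `min_support` threshold, and the toy protein identifiers are all assumptions made for this example, not the authors' method.

```python
from collections import Counter, defaultdict


def predict_go_terms(interactions, annotations, target, min_support=2):
    """Rank GO terms for `target` by the number of annotated partners carrying them.

    interactions: iterable of (protein_a, protein_b) pairs (undirected edges).
    annotations:  dict mapping protein -> set of GO term IDs.
    min_support:  hypothetical cutoff; terms seen in fewer partners are dropped.
    """
    # Build an undirected adjacency map from the edge list.
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Count how many direct partners of the target carry each GO term.
    counts = Counter()
    for partner in neighbors.get(target, ()):
        for term in annotations.get(partner, ()):
            counts[term] += 1

    # Sort by descending support, then by term ID for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, n) for term, n in ranked if n >= min_support]


# Toy data: "uPE1_X" and "P1".."P3" are made-up identifiers.
toy_interactions = [("uPE1_X", "P1"), ("uPE1_X", "P2"), ("uPE1_X", "P3"), ("P1", "P2")]
toy_annotations = {
    "P1": {"GO:0006412"},
    "P2": {"GO:0006412", "GO:0005840"},
    "P3": {"GO:0006412", "GO:0005840"},
}
print(predict_go_terms(toy_interactions, toy_annotations, "uPE1_X"))
```

A real analysis would normally replace the raw vote with an enrichment statistic and multiple-testing correction; the vote is used here only to keep the idea visible.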
Project description:The human genome harbors just 20,000 genes suggesting that the variety of possible protein products per gene plays a significant role in generating functional diversity. In bottom-up proteomics peptides are mapped back to proteins and proteoforms to describe a proteome; however, accurate quantitation of proteoforms is challenging due to incomplete protein sequence coverage and mapping ambiguities. Here, we demonstrate that a new software tool called ProteinClusterQuant (PCQ) can be used to deduce the presence of proteoforms that would have otherwise been missed, as exemplified in a proteomic comparison of two fly species, Drosophila melanogaster and D. virilis. PCQ was used to identify reduced levels of serine/threonine protein kinases PKN1 and PKN4 in CFBE41o- cells compared to HBE41o- cells and to elucidate that shorter proteoforms of full-length caspase-4 and ephrin B receptor are differentially expressed. Thus, PCQ extends current analyses in quantitative proteomics and facilitates finding differentially regulated proteins and proteoforms.
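The mapping ambiguity mentioned above can be made concrete with a small sketch. It groups proteins into clusters whenever they share at least one identified peptide, using a union-find structure over the peptide-to-protein map. This is a generic illustration of the problem, not a description of how PCQ works internally; the function name and the toy peptide and protein labels are assumptions.

```python
from collections import defaultdict


def cluster_by_shared_peptides(peptide_to_proteins):
    """Group proteins that are linked, directly or transitively, by shared peptides.

    peptide_to_proteins: dict mapping peptide ID -> set of protein IDs it maps to.
    Returns a sorted list of clusters, each a sorted list of protein IDs.
    """
    parent = {}

    def find(x):
        # Locate the cluster representative, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for proteins in peptide_to_proteins.values():
        ordered = sorted(proteins)
        if not ordered:
            continue
        find(ordered[0])  # register proteins that have only unique peptides
        for other in ordered[1:]:
            union(ordered[0], other)

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: PEP2 is shared, so proteins A, B and C cannot be told apart by it alone.
toy_map = {"PEP1": {"A", "B"}, "PEP2": {"B", "C"}, "PEP3": {"D"}}
print(cluster_by_shared_peptides(toy_map))
```

Quantifying at the level of such clusters, rather than forcing each shared peptide onto a single protein, is one way to avoid over-interpreting ambiguous evidence.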
Project description:Alternative pre-mRNA splicing has long been proposed to contribute greatly to proteome complexity. However, the extent to which mature mRNA isoforms are successfully translated into protein remains controversial. Here, we used high-throughput RNA sequencing and mass spectrometry (MS)-based proteomics to better evaluate the translation of alternatively spliced mRNAs. To increase proteome coverage and improve protein quantitation, we optimized cell fractionation and sample processing steps at both the protein and peptide level. Furthermore, we generated a custom peptide database trained on analysis of RNA-seq data with MAJIQ, an algorithm optimized to detect and quantify differential and unannotated splice junction usage. We matched tandem mass spectra acquired by data-dependent acquisition (DDA) against our custom RNA-seq-based database, as well as the SWISS-PROT and RefSeq databases, which improved identification of splicing-derived proteoforms by 28% compared with use of the SWISS-PROT database alone. Altogether, we identified peptide evidence for 554 alternate proteoforms corresponding to 274 genes. Our increased depth and detection of proteins also allowed us to track changes in the transcriptome and proteome induced by T-cell stimulation, as well as fluctuations in protein subcellular localization. In sum, our data confirm that the use of generic databases in proteomic studies underestimates the number of spliced mRNA isoforms that are translated into protein, and we provide a workflow that improves isoform detection in large-scale proteomic experiments.
Project description:PROTEOFORMER is a pipeline that enables the automated processing of data derived from ribosome profiling (RIBO-seq, i.e. the sequencing of ribosome-protected mRNA fragments). As such, genome-wide ribosome occupancies lead to the delineation of data-specific translation product candidates, and these can improve mass spectrometry-based identification. Since its first publication, different upgrades, new features and extensions have been added to the PROTEOFORMER pipeline. Some of the most important upgrades include P-site offset calculation during mapping, comprehensive data pre-exploration, the introduction of two alternative proteoform calling strategies and extended pipeline output features. These novelties are illustrated by analyzing ribosome profiling data from human HCT116 and Jurkat cells. The different proteoform calling strategies are used alongside one another and in the end combined with reference sequences from UniProt. Matching mass spectrometry data are searched against this extended search space with MaxQuant. Overall, besides annotated proteoforms, this pipeline leads to the identification and validation of different categories of new proteoforms, including translation products of up- and downstream open reading frames, 5' and 3' extended and truncated proteoforms, single amino acid variants, splice variants and translation products of so-called noncoding regions. Further, proof-of-concept is reported for the improvement of spectrum matching by including Prosit, a deep neural network strategy that adds extra fragmentation spectrum intensity features to the analysis. In the light of ribosome profiling-driven proteogenomics, it is shown that this allows validating the spectrum matches of newly identified proteoforms with elevated stringency. These updates and novel conclusions provide new insights and lessons for the ribosome profiling-based proteogenomic research field.
More practical information on the pipeline, raw code, the user manual (README) and explanations on the different modes of availability can be found at the GitHub repository of PROTEOFORMER: https://github.com/Biobix/proteoformer.
Project description:Proteoforms, the primary effectors of biological processes, are the different forms of proteins that arise from molecular processing events such as alternative splicing and post-translational modifications. Heart diseases exhibit changes in proteoform levels, motivating the development of a deeper understanding of the heart proteoform landscape. Our recently developed two-dimensional top-down proteomics platform coupling serial size exclusion chromatography (sSEC) to reversed-phase chromatography (RPC) expanded coverage of the human heart proteome and allowed observation of high-molecular weight proteoforms. However, most of these observed proteoforms were not identified due to the difficulty in obtaining quality tandem mass spectrometry (MS2) fragmentation data for large proteoforms from complex biological mixtures on a chromatographic time scale. Herein, we sought to identify human heart proteoforms in this data set using an enhanced version of Proteoform Suite, which identifies proteoforms by intact mass alone. Specifically, we added a new feature to Proteoform Suite to determine candidate identifications for isotopically unresolved proteoforms larger than 50 kDa, enabling subsequent MS2 identification of important high-molecular weight human heart proteoforms such as lamin A (72 kDa) and trifunctional enzyme subunit α (79 kDa). With this new workflow for large proteoform identification, endogenous human cardiac myosin binding protein C (140 kDa) was identified for the first time. This study demonstrates the integration of our sSEC-RPC-MS proteomics platform with intact-mass analysis through Proteoform Suite to create a catalog of human heart proteoforms and facilitate the identification of large proteoforms in complex systems.
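Identification "by intact mass alone" amounts to comparing an observed mass with the theoretical masses of candidate proteoforms and keeping those within a tolerance. The sketch below shows that comparison in its simplest form, with the error expressed in parts per million. It is not Proteoform Suite's implementation: the function name, the default tolerance, and the candidate names and masses are hypothetical, and isotopically unresolved species above 50 kDa would in practice call for average masses and a wider tolerance.

```python
def match_by_intact_mass(observed_mass, candidates, tolerance_ppm=10.0):
    """Return candidate proteoforms whose theoretical mass is within tolerance.

    observed_mass: measured intact mass in Da.
    candidates:    dict mapping proteoform name -> theoretical mass in Da.
    tolerance_ppm: hypothetical default; choose it to suit the instrument.
    Returns (name, signed error in ppm) pairs, best match first.
    """
    matches = []
    for name, theoretical_mass in candidates.items():
        error_ppm = (observed_mass - theoretical_mass) / theoretical_mass * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            matches.append((name, error_ppm))
    # Smallest absolute mass error first.
    return sorted(matches, key=lambda match: abs(match[1]))


# Made-up candidate masses, chosen only to exercise the function.
toy_candidates = {
    "proteoform_A": 72000.0,
    "proteoform_B": 72042.0,
    "proteoform_C": 79000.0,
}
for name, error in match_by_intact_mass(72000.36, toy_candidates):
    print(f"{name}: {error:+.1f} ppm")
```

Because several proteoforms can fall inside the same mass window, a hit of this kind is best treated as a candidate to be confirmed by MS2 fragmentation, which is the role intact-mass matching plays in the workflow described above.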
Project description:BACKGROUND: Immunization with attenuated malaria sporozoites protects humans from experimental malaria challenge by mosquito bite. Protection in humans is strongly correlated with the production of T cells targeting a heterogeneous population of pre-erythrocyte antigen proteoforms, including liver stage antigens. Currently, few T cell epitopes derived from Plasmodium falciparum, the major aetiologic agent of malaria in humans, are known. METHODS: In this study, both in vitro and in vivo malaria liver stage models were used to sequence host and pathogen proteoforms. Proteoforms from these diverse models were subjected to mild acid elution (of soluble forms), multi-dimensional fractionation, tandem mass spectrometry, and top-down bioinformatics analysis to identify proteoforms in their intact state. RESULTS: These results identify a group of host and malaria liver stage proteoforms that meet a 5% false discovery rate threshold. CONCLUSIONS: This work provides proof-of-concept for the validity of this mass spectrometry/bioinformatic approach for future studies seeking to reveal malaria liver stage antigens towards vaccine development.
Project description:Amyloid-beta (Aβ) plays a key role in the pathogenesis of Alzheimer's disease (AD), but little is known about the proteoforms present in AD brain. We used high-resolution mass spectrometry to analyze intact Aβ from soluble aggregates and insoluble material in brains of six cases with severe dementia and pathologically confirmed AD. The soluble aggregates are especially relevant because they are believed to be the most toxic form of Aβ. We found a diversity of Aβ peptides, with 26 unique proteoforms including various N- and C-terminal truncations. N- and C-terminal truncations comprised 73% and 30%, respectively, of the total Aβ proteoforms detected. The Aβ proteoforms segregated between the soluble and more insoluble aggregates with N-terminal truncations predominating in the insoluble material and C-terminal truncations segregating into the soluble aggregates. In contrast, canonical Aβ comprised the minority of the identified proteoforms (15.3%) and did not distinguish between the soluble and more insoluble aggregates. The relative abundance of many truncated Aβ proteoforms did not correlate with post-mortem interval, suggesting they are not artefacts. This heterogeneity of Aβ proteoforms deepens our understanding of AD and offers many new avenues for investigation into pathological mechanisms of the disease, with implications for therapeutic development.
Project description:The development of large-scale data sets requires a new means to display and disseminate research studies to large audiences. Knowledge of protein-protein interaction (PPI) networks has become a principal interest of many groups within the field of proteomics. At the confluence of technologies, such as cross-linking mass spectrometry, yeast two-hybrid, protein cofractionation, and affinity purification mass spectrometry (AP-MS), detection of PPIs can uncover novel biological inferences at high throughput. Thus, new platforms to provide community access to large data sets are necessary. To this end, we have developed a web application that enables exploration and dissemination of the growing BioPlex interaction network. BioPlex is a large-scale interactome data set based on AP-MS of baits from the human ORFeome. The latest BioPlex data set release (BioPlex 2.0) contains 56,553 interactions from 5891 AP-MS experiments. To improve community access to this vast compendium of interactions, we developed BioPlex Display, which integrates individual protein querying, access to empirical data, and on-the-fly annotation of networks within an easy-to-use and mobile web application. BioPlex Display enables rapid acquisition of data from BioPlex and development of hypotheses based on protein interactions.
Project description:We have previously developed an approach in which two-dimensional gel electrophoresis (2DE) is followed by sectional analysis of the whole gel using high-resolution nano-liquid chromatography-mass spectrometry (ESI LC-MS/MS). In this study, we applied this approach to the panoramic analysis of proteins and their proteoforms from normal (liver) and cancer (HepG2) cells. This allowed us to detect, in a single proteome, about 20,000 proteoforms coded by more than 4000 genes. A set of 3D-graphs showing the distribution of these proteoforms in 2DE maps (profiles) was generated. A comparative analysis of these profiles between normal and cancer cells showed high variability and dynamics of many proteins. Among these proteins are some well-known ones, such as alpha-fetoprotein (FETA) and glypican-3 (GPC3), as well as potential hepatocellular carcinoma (HCC) markers. More detailed information about their proteoforms could be used to generate panels of more specific biomarkers.
Project description:A top-down proteomic strategy with semiautomated analysis of data sets has proven successful for the global identification of truncated proteins without the use of chemical derivatization, enzymatic manipulation, immunoprecipitation, or other enrichment. This approach provides reliable identification of internal polypeptides formed from precursor gene products by proteolytic cleavage of both the N- and C-termini, as well as truncated proteoforms that retain one terminus or the other. The strategy has been evaluated by application to the immunosuppressive extracellular vesicles released by myeloid-derived suppressor cells. More than 1000 truncated proteoforms have been identified, from which binding motifs are derived to allow characterization of the putative proteases responsible for truncation.
Project description:Proteins can exist as multiple proteoforms in vivo, as a result of alternative splicing and single-nucleotide polymorphisms (SNPs), as well as posttranslational processing. To address their clinical significance in a context of diagnostic information, proteoforms require a more in-depth analysis. Mass spectrometric immunoassays (MSIA) have been devised for studying structural diversity in human proteins. MSIA enables protein profiling in a simple and high-throughput manner, by combining the selectivity of targeted immunoassays, with the specificity of mass spectrometric detection. MSIA has been used for qualitative and quantitative analysis of single and multiple proteoforms, distinguishing between normal fluctuations and changes related to clinical conditions. This mini review offers an overview of the development and application of mass spectrometric immunoassays for clinical and population proteomics studies. Provided are examples of some recent developments, and also discussed are the trends and challenges in mass spectrometry-based immunoassays for the next-phase of clinical applications.