Integrated Transcriptomic-Proteomic Analysis Using a Proteogenomic Workflow Refines Rat Genome Annotation.
ABSTRACT: Proteogenomic re-annotation and mRNA splicing information can lead to the discovery of various protein forms for eukaryotic model organisms like rat. However, detection of novel proteoforms using mass spectrometry proteomics data remains a formidable challenge. We developed EuGenoSuite, an open-source, multi-algorithm proteomic search tool, and utilized it in our in-house integrated transcriptomic-proteomic pipeline to facilitate automated proteogenomic analysis. Using four proteogenomic pipelines (integrated transcriptomic-proteomic, Peppy, Enosi, and ProteoAnnotator) on publicly available RNA-sequence and MS proteomics data, we discovered 363 novel peptides in rat brain microglia representing novel proteoforms for 249 gene loci in the rat genome. These novel peptides aided in the discovery of novel exons, translation of annotated untranslated regions, pseudogenes, and splice variants for various loci, many of which have known disease associations, including neurological disorders such as schizophrenia and amyotrophic lateral sclerosis. Novel isoforms were also discovered for genes implicated in cardiovascular diseases and breast cancer, for which rats are considered model organisms. Our integrative multi-omics data analysis not only enables the discovery of new proteoforms but also generates an improved reference for human disease studies in the rat model.
Project description:Recent advances in mass spectrometry (MS) have enabled extensive analysis of cancer proteomes. Here, we employed quantitative proteomics to profile protein expression across 24 breast cancer patient-derived xenograft (PDX) models. Integrated proteogenomic analysis shows positive correlation between expression measurements from transcriptomic and proteomic analyses; further, gene expression-based intrinsic subtypes are largely recapitulated using non-stromal protein markers. Proteogenomic analysis also validates a number of predicted genomic targets in multiple receptor tyrosine kinases. However, several protein/phosphoprotein events, such as overexpression of AKT proteins and of ARAF, BRAF, and HSP90AB1 phosphosites, are not readily explainable by genomic analysis, suggesting that druggable translational and/or post-translational regulatory events may be uniquely diagnosed by MS. Drug treatment experiments targeting HER2 and components of the PI3K pathway supported proteogenomic response predictions in seven xenograft models. Our study demonstrates that MS-based proteomics can identify therapeutic targets and highlights the potential of PDX drug response evaluation to annotate MS-based pathway activities.
Project description:We present GenoSuite, an integrated proteogenomic pipeline to validate, refine and discover protein coding genes using high-throughput mass spectrometry (MS) data from prokaryotes. To demonstrate the effectiveness of GenoSuite, we analyzed proteomics data of Bradyrhizobium japonicum (USDA110), a model organism for studying the agriculturally important rhizobium-legume symbiosis. Our analysis confirmed 31% of known genes, refined 49 gene models for their translation initiation site (TIS) and discovered 59 novel protein coding genes. Notably, a novel protein which redefined the boundary of a crucial cytochrome P450 system related operon was discovered, known to be highly expressed in the anaerobic symbiotic bacteroids. A focused analysis on N-terminally acetylated peptides indicated a downstream TIS for gene blr0594. Finally, ortho-proteogenomic analysis revealed three novel genes in the recently sequenced B. japonicum USDA6(T) genome. The discovery of a large number of missing genes and the correction of gene models have expanded the proteomic landscape of B. japonicum and demonstrate the utility of proteogenomic analyses and the versatility of GenoSuite for annotating prokaryotic genomes, including pathogens.
Project description:Extensive genomic characterization of human cancers presents the problem of inference from genomic abnormalities to cancer phenotypes. To address this problem, we analysed proteomes of colon and rectal tumours characterized previously by The Cancer Genome Atlas (TCGA) and performed integrated proteogenomic analyses. Somatic variants displayed reduced protein abundance compared to germline variants. Messenger RNA transcript abundance did not reliably predict protein abundance differences between tumours. Proteomics identified five proteomic subtypes in the TCGA cohort, two of which overlapped with the TCGA 'microsatellite instability/CpG island methylation phenotype' transcriptomic subtype, but had distinct mutation, methylation and protein expression patterns associated with different clinical outcomes. Although copy number alterations showed strong cis- and trans-effects on mRNA abundance, relatively few of these extended to the protein level. Thus, proteomics data enabled prioritization of candidate driver genes. The chromosome 20q amplicon was associated with the largest global changes at both mRNA and protein levels; proteomics data highlighted potential 20q candidates, including HNF4A (hepatocyte nuclear factor 4, alpha), TOMM34 (translocase of outer mitochondrial membrane 34) and SRC (SRC proto-oncogene, non-receptor tyrosine kinase). Integrated proteogenomic analysis provides functional context to interpret genomic abnormalities and affords a new paradigm for understanding cancer biology.
Project description:Background: The majority of genetic biomarkers for human cancers are defined by statistical screening of high-throughput genomics data. While a large number of genetic biomarkers have been proposed for diagnostic and prognostic applications, only a small number have been applied in the clinic. Similarly, the use of proteomics methods for the discovery of cancer biomarkers is increasing. The emerging field of proteogenomics seeks to enrich the value of genomics and proteomics approaches by studying the intersection of genomics and proteomics data. This task is challenging due to the complex nature of transcriptional and translational regulatory mechanisms and the disparities between genomic and proteomic data from the same samples. In this study, we have examined tumor antigens as potential biomarkers for breast cancer using genomics and proteomics data from previously reported laser capture microdissected ER+ tumor samples. Results: We applied proteogenomic analyses to study the genetic aberrations of 32 tumor antigens determined in the proteomic data. We found that tumor antigens that are aberrantly expressed at the genetic level and expressed at the protein level are likely involved in perturbing pathways directly linked to the hallmarks of cancer. The proteogenomic analysis of the 32 tumor antigens studied here captures largely the same pathway irregularities as those elucidated by large-scale genomic screens, in which several thousand genes are often found to be perturbed. Conclusion: Tumor antigens are a group of proteins recognized by the cells of the immune system. Specifically, they are recognized in tumor cells where they are present in larger than usual amounts, or are physicochemically altered to a degree at which they no longer resemble native human proteins. This proteogenomic analysis of 32 tumor antigens suggests that tumor antigens have the potential to be highly specific biomarkers for different cancers.
Project description:Proteogenomic searching is a useful method for identifying novel proteins, annotating genes and detecting peptides unique to an individual genome. The approach, however, can be laborious, as it often requires search segmentation and the use of several unintegrated tools. Furthermore, many proteogenomic efforts have been limited to small genomes, as large genomes can prove impractical due to the required amount of computer memory and computation time. We present Peppy, a software tool designed to perform every necessary task of proteogenomic searches quickly, accurately and automatically. The software generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns confidence values to those matches. Peppy automatically performs decoy database generation, search and analysis to return identifications at the desired false discovery rate threshold. Written in Java for cross-platform execution, the software is fully multithreaded for enhanced speed. The program can run on regular desktop computers, opening the doors of proteogenomic searching to a wider audience of proteomics and genomics researchers. Peppy is available at http://geneffects.com/peppy.
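The decoy-based false discovery rate (FDR) thresholding mentioned above can be illustrated with a minimal Python sketch. This is not Peppy's implementation (Peppy is written in Java and its internals are not described here); it assumes a generic target-decoy approach in which the FDR at a score cutoff is estimated as the number of decoy matches divided by the number of target matches above that cutoff. The function name `filter_at_fdr` and the `(score, is_decoy)` tuple layout are hypothetical.

```python
from typing import List, Tuple

# A peptide-spectrum match (PSM): (score, is_decoy). Higher score = better match.
# This layout is an assumption for illustration only.
PSM = Tuple[float, bool]


def filter_at_fdr(psms: List[PSM], fdr_threshold: float = 0.01) -> List[PSM]:
    """Return target PSMs accepted at the given estimated FDR.

    The FDR at each rank is estimated as decoys / targets among all PSMs
    scoring at least as well. The deepest rank whose estimate is within
    the threshold defines the cutoff.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0  # number of top-ranked PSMs to keep
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr_threshold:
            best_cut = rank
    # Report only target hits; decoys exist solely to estimate the error rate.
    return [p for p in ranked[:best_cut] if not p[1]]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, False),
               (7.0, True), (6.0, False), (5.0, True)]
    accepted = filter_at_fdr(example, fdr_threshold=0.25)
    print([score for score, _ in accepted])
```

In practice the threshold is usually far stricter for novel peptides than for known ones, because the search space derived from a whole genome is much larger than a reference proteome.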
Project description:Proteogenomics combines large-scale genomic and transcriptomic data with mass-spectrometry-based proteomic data to discover novel protein sequence variants and improve genome annotation. In contrast with conventional proteomic applications, proteogenomic analysis requires a number of additional data processing steps. Ideally, these required steps would be integrated and automated via a single software platform offering accessibility for wet-bench researchers as well as flexibility for user-specific customization and integration of new software tools as they emerge. Toward this end, we have extended the Galaxy bioinformatics framework to facilitate proteogenomic analysis. Using analysis of whole human saliva as an example, we demonstrate Galaxy's flexibility through the creation of a modular workflow incorporating both established and customized software tools that improve the depth and quality of proteogenomic results. Our customized Galaxy-based software includes automated, batch-mode BLASTP searching and a Peptide Sequence Match Evaluator tool, both useful for evaluating the veracity of putative novel peptide identifications. Our complex workflow (approximately 140 steps) can be easily shared using built-in Galaxy functions, enabling its use and customization by others. Our results provide a blueprint for the establishment of the Galaxy framework as an ideal solution for the emerging field of proteogenomics.
Project description:We performed the first proteogenomic study on a prospectively collected colon cancer cohort. Comparative proteomic and phosphoproteomic analysis of paired tumor and normal adjacent tissues produced a catalog of colon cancer-associated proteins and phosphosites, including known and putative new biomarkers, drug targets, and cancer/testis antigens. Proteogenomic integration not only prioritized genomically inferred targets, such as copy-number drivers and mutation-derived neoantigens, but also yielded novel findings. Phosphoproteomics data associated Rb phosphorylation with increased proliferation and decreased apoptosis in colon cancer, which explains why this classical tumor suppressor is amplified in colon tumors and suggests a rationale for targeting Rb phosphorylation in colon cancer. Proteomics identified an association between decreased CD8 T cell infiltration and increased glycolysis in microsatellite instability-high (MSI-H) tumors, suggesting glycolysis as a potential target to overcome the resistance of MSI-H tumors to immune checkpoint blockade. Proteogenomics presents new avenues for biological discoveries and therapeutic development.
Project description:Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry-based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.
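As a rough illustration of what "novel" means in this setting, the following Python sketch flags identified peptides that do not occur in any reference protein sequence. It is an assumption-laden toy, not a method taken from the review: the function names are hypothetical, matching is by plain substring search, and isoleucine and leucine are treated as equivalent because they have identical mass and are generally not distinguished by standard MS/MS.

```python
from typing import Iterable, List


def normalize(sequence: str) -> str:
    """Upper-case a sequence and collapse I to L (isobaric residues)."""
    return sequence.upper().replace("I", "L")


def find_novel_peptides(identified: Iterable[str],
                        reference_proteins: Iterable[str]) -> List[str]:
    """Return identified peptides absent from every reference protein.

    A peptide is reported as novel only if it is not a substring of any
    reference protein after I/L normalization.
    """
    reference = [normalize(protein) for protein in reference_proteins]
    novel = []
    for peptide in identified:
        key = normalize(peptide)
        if not any(key in protein for protein in reference):
            novel.append(peptide)
    return novel


if __name__ == "__main__":
    reference = ["MKTAYIAKQR", "GSHMLEDPK"]          # made-up sequences
    peptides = ["TAYLAK", "LEDPK", "QQQWER"]          # made-up identifications
    print(find_novel_peptides(peptides, reference))
```

A candidate that passes such a filter is only a starting point; it still needs the stricter statistical and manual scrutiny that the false-positive concern above calls for.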
Project description:To elucidate the deregulated functional modules that drive clear cell renal cell carcinoma (ccRCC), we performed comprehensive genomic, epigenomic, transcriptomic, proteomic, and phosphoproteomic characterization of treatment-naive ccRCC and paired normal adjacent tissue samples. Genomic analyses identified a distinct molecular subgroup associated with genomic instability. Integration of proteogenomic measurements uniquely identified protein dysregulation of cellular mechanisms impacted by genomic alterations, including oxidative phosphorylation-related metabolism, protein translation processes, and phospho-signaling modules. To assess the degree of immune infiltration in individual tumors, we identified microenvironment cell signatures that delineated four immune-based ccRCC subtypes characterized by distinct cellular pathways. This study reports a large-scale proteogenomic analysis of ccRCC to discern the functional impact of genomic alterations and provides evidence for rational treatment selection stemming from ccRCC pathobiology.
Project description:Experimental evidence is increasingly being used to reassess the quality and accuracy of genome annotation. The use of proteomics data for this purpose, called proteogenomics, can alleviate many of the problematic areas of genome annotation, e.g. short protein validation and start site assignment. We performed a proteogenomic analysis of 46 genomes spanning eight bacterial and archaeal phyla across the tree of life. These diverse datasets facilitated the development of a robust approach for proteogenomics that is functional across genomes varying in %GC, gene content, proteomic sampling depth, phylogeny, and genome size. In addition to finding evidence for 682 novel proteins, 1336 new start sites, and numerous dubious genes, we discovered sites of post-translational maturation in the form of proteolytic cleavage of 1175 signal peptides. The number of novel proteins per genome is highly variable (median 7, mean 15, stdev 20). Moreover, comparison of novel genes with the current genes did not reveal any consistent abnormalities. Thus, we conclude that proteogenomics addresses a deficiency in gene prediction that is not yet understood. With the adoption of new sequencing technologies, which have higher error rates than Sanger-based methods, and the advances in proteomics, proteogenomics may become even more important in the future.