Cell-Type-Specific Proteogenomic Signal Diffusion for Integrating Multi-Omics Data Predicts Novel Schizophrenia Risk Genes.
ABSTRACT: Accumulation of diverse types of omics data on schizophrenia (SCZ) requires a systems approach to model the interplay between genome, transcriptome, and proteome. We introduce Markov affinity-based proteogenomic signal diffusion (MAPSD), a method to model intra-cellular protein trafficking paradigms and tissue-wise single-cell protein abundances. MAPSD integrates multi-omics data to amplify the signals at SCZ risk loci with small effect sizes and to reveal convergent disease-associated gene modules in the brain. We predicted a set of high-confidence SCZ risk loci, characterized the subcellular localization of proteins encoded by candidate SCZ risk genes, and showed that most are enriched in neuronal cells in the cerebral cortex as well as in Purkinje cells in the cerebellum. We demonstrated how the identified genes may be involved in neurodevelopment, how they may alter SCZ-related biological pathways, and how they may facilitate drug repurposing. MAPSD is applicable to other polygenic diseases and can facilitate our understanding of disease mechanisms.
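The abstract does not spell out the MAPSD update rule, so the following is only a generic illustration of Markov-style signal diffusion on a gene network: a random walk with restart over a row-normalized affinity matrix. The function names, the `restart` parameter, and the toy four-gene chain are assumptions made for this sketch and are not taken from the paper.

```python
def row_normalize(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic Markov matrix."""
    markov = []
    for row in affinity:
        total = sum(row)
        markov.append([v / total if total > 0 else 0.0 for v in row])
    return markov


def diffuse(affinity, seed_scores, restart=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart (a stand-in technique, not MAPSD itself).

    seed_scores: non-negative initial signal per gene, e.g. association strength.
    Returns the diffused score per gene after convergence (or max_iter steps).
    """
    markov = row_normalize(affinity)
    n = len(seed_scores)
    total = sum(seed_scores)
    p0 = [s / total for s in seed_scores]  # normalized restart distribution
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            restart * p0[j]
            + (1 - restart) * sum(p[i] * markov[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy network: genes A-B-C-D connected in a chain, with signal only at A.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = diffuse(chain, seed_scores=[1.0, 0.0, 0.0, 0.0])
```

In this toy case the signal placed on gene A spreads to its network neighbours with decreasing strength, which is the sense in which diffusion can lend support to weakly associated genes that sit close to stronger ones.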
Project description:Proteogenomic re-annotation and mRNA splicing information can lead to the discovery of various protein forms for eukaryotic model organisms such as the rat. However, detection of novel proteoforms using mass spectrometry proteomics data remains a formidable challenge. We developed EuGenoSuite, an open-source, multi-algorithm proteomic search tool, and utilized it in our in-house integrated transcriptomic-proteomic pipeline to facilitate automated proteogenomic analysis. Using four proteogenomic pipelines (integrated transcriptomic-proteomic, Peppy, Enosi, and ProteoAnnotator) on publicly available RNA-sequence and MS proteomics data, we discovered 363 novel peptides in rat brain microglia representing novel proteoforms for 249 gene loci in the rat genome. These novel peptides aided in the discovery of novel exons, translation of annotated untranslated regions, pseudogenes, and splice variants for various loci, many of which have known disease associations, including neurological disorders such as schizophrenia and amyotrophic lateral sclerosis. Novel isoforms were also discovered for genes implicated in cardiovascular diseases and breast cancer, for which rats are considered model organisms. Our integrative multi-omics data analysis not only enables the discovery of new proteoforms but also generates an improved reference for human disease studies in the rat model.
Project description:Genome-wide association studies (GWAS) have identified more than 100 schizophrenia (SCZ)-associated loci, but using these findings to illuminate disease biology remains a challenge. Here we present integrative risk gene selector (iRIGS), a Bayesian framework that integrates multi-omics data and gene networks to infer risk genes in GWAS loci. By applying iRIGS to SCZ GWAS data, we predicted a set of high-confidence risk genes, most of which are not the nearest genes to the GWAS index variants. High-confidence risk genes account for a significantly enriched heritability, as estimated by stratified linkage disequilibrium score regression. Moreover, high-confidence risk genes are predominantly expressed in brain tissues, especially prenatally, and are enriched for targets of approved drugs, suggesting opportunities to reposition existing drugs for SCZ. Thus, iRIGS can leverage accumulating functional genomics and GWAS data to advance our understanding of SCZ etiology and potential therapeutics.
Project description:Proteogenomic searching is a useful method for identifying novel proteins, annotating genes and detecting peptides unique to an individual genome. The approach, however, can be laborious, as it often requires search segmentation and the use of several unintegrated tools. Furthermore, many proteogenomic efforts have been limited to small genomes, as large genomes can prove impractical due to the required amount of computer memory and computation time. We present Peppy, a software tool designed to perform every necessary task of proteogenomic searches quickly, accurately and automatically. The software generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns confidence values to those matches. Peppy automatically performs a decoy database generation, search and analysis to return identifications at the desired false discovery rate threshold. Written in Java for cross-platform execution, the software is fully multithreaded for enhanced speed. The program can run on regular desktop computers, opening the doors of proteogenomic searching to a wider audience of proteomics and genomics researchers. Peppy is available at http://geneffects.com/peppy .
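The decoy-based false discovery rate step mentioned above can be sketched as follows. This is a common target-decoy estimate (decoy hits divided by target hits above a score cutoff), written under the assumption that higher scores are better. It is not Peppy's actual scoring or implementation, and the function name and input layout are hypothetical.

```python
def accepted_at_fdr(psms, max_fdr=0.01):
    """Return target match scores accepted at the requested FDR.

    psms: list of (score, is_decoy) tuples, one per peptide-spectrum match.
    The FDR at a cutoff is estimated as decoys / targets among all matches
    scoring at or above that cutoff.
    """
    ranked = sorted(psms, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    best_cut = 0  # number of top-ranked matches to keep
    for idx, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest cutoff at which the estimated FDR is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = idx
    return [score for score, is_decoy in ranked[:best_cut] if not is_decoy]


# Toy example: four target hits and two decoy hits.
toy_psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
strict = accepted_at_fdr(toy_psms, max_fdr=0.0)
relaxed = accepted_at_fdr(toy_psms, max_fdr=0.25)
```

With the toy data, a threshold of zero keeps only the matches ranked above the first decoy, while a threshold of 0.25 admits one more target match further down the list.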
Project description:In this work, we propose iProFun, an integrative analysis tool to screen for proteogenomic functional traits perturbed by DNA copy number alterations (CNAs) and DNA methylations. The goal is to characterize functional consequences of DNA copy number and methylation alterations in tumors and to facilitate screening for cancer drivers contributing to tumor initiation and progression. Specifically, we consider three functional molecular quantitative traits: mRNA expression levels, global protein abundances, and phosphoprotein abundances. We aim to identify those genes whose CNAs and/or DNA methylations have cis-associations with some or all of the three types of molecular traits. Compared with analyzing each molecular trait separately, the joint modeling of multi-omics data has several benefits: iProFun has enhanced power for detecting significant cis-associations shared across different omics data types, and it achieves better accuracy in inferring cis-associations unique to certain type(s) of molecular trait(s). For example, unique associations of CNAs/methylations to global/phospho protein abundances may imply posttranslational regulations. We applied iProFun to ovarian high-grade serous carcinoma tumor data from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium and identified CNAs and methylations of 500 and 121 genes, respectively, affecting the cis-functional molecular quantitative traits of the corresponding genes. We observed substantial power gain via the joint analysis of iProFun. For example, iProFun identified 117 genes whose CNAs were associated with phosphoprotein abundances by leveraging mRNA expression levels and global protein abundances. By comparison, analyses based on phosphoprotein data alone identified none. A network analysis of these 117 genes revealed the known oncogene AKT1 as a key hub node interacting with many of the rest. In addition, iProFun identified one gene, BIN2, whose DNA methylation has cis-associations with its mRNA expression, global protein, and phosphoprotein abundances. These and other genes identified by iProFun could serve as potential drug targets for ovarian cancer.
Project description:BACKGROUND: Proteogenomics combines cutting-edge methods from genomics and proteomics. While it has become cheap to sequence whole genomes, the correct annotation of protein-coding regions in the genome is still tedious and error-prone. Mass spectrometry, on the other hand, relies on good characterizations of proteins derived from the genome, but can also be used to help improve the annotation of genomes or to find species-specific peptides. Additionally, proteomics is widely used to find evidence for differential expression of proteins under different conditions, e.g. growth conditions for bacteria. The concept of proteogenomics is not altogether new: in-house scripts are used by different labs, and some special tools for eukaryotic and human analyses are available. RESULTS: The Bacterial Proteogenomic Pipeline, which is written entirely in Java, simplifies proteogenomic analyses of bacteria. From a given genome sequence, a naïve six-frame translation is performed and, if desired, a decoy database is generated. This database is used to identify MS/MS spectra by common peptide identification algorithms. After combination of the search results and optional flagging for different experimental conditions, the results can be browsed and further inspected. In particular, for each peptide the number of identifications for each condition and the positions in the corresponding protein sequences are shown. Intermediate and final results can be exported into GFF3 format for visualization in common genome browsers. CONCLUSIONS: To facilitate proteogenomic analyses, the Bacterial Proteogenomic Pipeline provides a comprehensive set of tools that run on common desktop computers; it is written in Java and thus platform independent. The pipeline allows integrating peptide identifications from various algorithms and emphasizes the visualization of spectral counts from different experimental conditions.
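The naïve six-frame translation described above can be illustrated with a short sketch. It uses the standard genetic code and translates each frame end to end, marking stop codons with `*`. The pipeline itself is written in Java; this Python version, including the function names and the use of `X` for codons containing ambiguous bases, is an assumption made for illustration only. Bacterial genomes are often annotated with a slightly different translation table (for alternative start codons), which is ignored here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1..+3, -1..-3) to their translated sequences."""
    seq = seq.upper()
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

For a real search database, each frame would typically be split at stop codons into candidate open reading frames, with genomic coordinates retained so that identified peptides can be mapped back, for example for GFF3 export.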
Project description:Omics approaches, including genomics, transcriptomics, proteomics, epigenomics, microbiomics, and metabolomics, generate large data sets. Once they have been used to address initial study aims, these large data sets are extremely valuable to the greater research community for ancillary investigations. Repurposing available omics data sets provides data to address research questions, generate and test hypotheses, replicate findings, and conduct mega-analyses. Many well-characterized, longitudinal, epidemiological studies collected extensive phenotype data related to symptom occurrence and severity. While the main phenotype of interest for many of these studies was often not symptom related, these data were collected to better understand the primary phenotype of interest. A search for symptom data (i.e., cognitive impairment, fatigue, gastrointestinal distress/nausea, sleep, and pain) in the database of genotypes and phenotypes (dbGaP) revealed many studies that collected symptom and omics data. There is thus a real possibility for nurse scientists to be able to look at symptom data over time from thousands of individuals and use omics data to identify key biological underpinnings that account for the development and severity of symptoms without recruiting participants or generating any new data. The purpose of this article is to introduce the reader to resources that provide omics data to the research community for repurposing, provide guidance on using these databases, and encourage the use of these data to move symptom science forward.
Project description:To explore the biology of lung adenocarcinoma (LUAD) and identify new therapeutic opportunities, we performed comprehensive proteogenomic characterization of 110 tumors and 101 matched normal adjacent tissues (NATs) incorporating genomics, epigenomics, deep-scale proteomics, phosphoproteomics, and acetylproteomics. Multi-omics clustering revealed four subgroups defined by key driver mutations, country, and gender. Proteomic and phosphoproteomic data illuminated biology downstream of copy number aberrations, somatic mutations, and fusions and identified therapeutic vulnerabilities associated with driver events involving KRAS, EGFR, and ALK. Immune subtyping revealed a complex landscape, reinforced the association of STK11 with immune-cold behavior, and underscored a potential immunosuppressive role of neutrophil degranulation. Smoking-associated LUADs showed correlation with other environmental exposure signatures and a field effect in NATs. Matched NATs allowed identification of differentially expressed proteins with potential diagnostic and therapeutic utility. This proteogenomics dataset represents a unique public resource for researchers and clinicians seeking to better understand and treat lung adenocarcinomas.
Project description:Proteogenomics combines large-scale genomic and transcriptomic data with mass-spectrometry-based proteomic data to discover novel protein sequence variants and improve genome annotation. In contrast with conventional proteomic applications, proteogenomic analysis requires a number of additional data processing steps. Ideally, these required steps would be integrated and automated via a single software platform offering accessibility for wet-bench researchers as well as flexibility for user-specific customization and integration of new software tools as they emerge. Toward this end, we have extended the Galaxy bioinformatics framework to facilitate proteogenomic analysis. Using analysis of whole human saliva as an example, we demonstrate Galaxy's flexibility through the creation of a modular workflow incorporating both established and customized software tools that improve depth and quality of proteogenomic results. Our customized Galaxy-based software includes automated, batch-mode BLASTP searching and a Peptide Sequence Match Evaluator tool, both useful for evaluating the veracity of putative novel peptide identifications. Our complex workflow (approximately 140 steps) can be easily shared using built-in Galaxy functions, enabling their use and customization by others. Our results provide a blueprint for the establishment of the Galaxy framework as an ideal solution for the emerging field of proteogenomics.
Project description:An integrated analysis of DNA, RNA and protein, in so-called proteogenomic studies, has the potential to greatly increase our understanding of both normal physiology and disease development. However, such studies are challenged by the lack of a systematic approach to credential individual samples, which introduces noise that limits the ability to identify important biological signals. Indeed, a recent proteogenomic CPTAC study identified 26% of samples as unsatisfactory, resulting in a marked increase in cost and loss of information content. Based on a large-scale analysis of RNA-seq and proteomic data generated by reverse phase protein arrays (RPPA) and by mass spectrometry, we propose a protein-mRNA correlation-based (PMC) score as a robust metric to credential single samples for integrated proteogenomic studies. Samples with high PMC scores have significantly higher protein-mRNA correlation, total protein content and tumor purity. Our results highlight the importance of credentialing individual samples prior to proteogenomic analysis.
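The abstract does not give the formula for the PMC score. As a rough illustration of the underlying idea only, the sketch below computes a per-sample Spearman rank correlation between mRNA and protein levels across the genes measured in both assays. The function names and the dictionary input layout are hypothetical, and the actual PMC score may be defined differently.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def sample_protein_mrna_correlation(mrna, protein):
    """Spearman correlation for one sample over genes present in both dicts."""
    genes = sorted(set(mrna) & set(protein))
    return pearson(
        average_ranks([mrna[g] for g in genes]),
        average_ranks([protein[g] for g in genes]),
    )


# Toy sample: gene E has no protein measurement and is ignored.
mrna = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 9.0}
concordant = sample_protein_mrna_correlation(
    mrna, {"A": 10.0, "B": 20.0, "C": 35.0, "D": 50.0}
)
discordant = sample_protein_mrna_correlation(
    mrna, {"A": 50.0, "B": 35.0, "C": 20.0, "D": 10.0}
)
```

A sample whose protein levels track its mRNA levels scores near 1, while a sample with scrambled or inverted levels scores lower, which is the kind of per-sample signal a credentialing step could flag for review.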