Dataset Information

Enhanced protein isoform characterization through long-read proteogenomics.

ABSTRACT:

Background

The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g., PacBio or Oxford Nanopore) provides full-length transcripts which can be used to predict full-length protein isoforms.
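As a rough illustration of how a full-length transcript can be turned into a predicted protein isoform, the sketch below translates the longest open reading frame that starts at an ATG. This is a minimal sketch under stated assumptions, not the ORF-calling method used in the study: the function names and the example transcript are hypothetical, and real pipelines weigh additional evidence when choosing an ORF.

```python
# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def longest_orf_protein(transcript):
    """Return the longest ATG-initiated translation on the forward strand.

    Assumption: the longest ORF is taken as the predicted isoform; an ORF
    that runs off the end of the transcript without a stop is still counted.
    """
    seq = transcript.upper().replace("U", "T")
    best = ""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] == "ATG":
            protein = translate(seq[start:])
            if len(protein) > len(best):
                best = protein
    return best


# Hypothetical toy transcript: 5' UTR "GG", ORF ATG GCC AAA TAG, 3' UTR "CC"
print(longest_orf_protein("GGATGGCCAAATAGCC"))
```

Collecting such predicted sequences for every transcript observed in a sample is one way to build a sample-matched protein database of the kind described above.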

Results

We describe here a long-read proteogenomics approach for integrating sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data to enable detection of protein isoforms previously intractable to MS-based detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.

Conclusions

Our work suggests that the incorporation of long-read sequencing and proteomic data can facilitate improved characterization of human protein isoform diversity. Our first-generation pipeline provides a strong foundation for future development of long-read proteogenomics and its adoption for both basic and translational research.

SUBMITTER: Miller RM

PROVIDER: S-EPMC8892804 | biostudies-literature | 2022 Mar

REPOSITORIES: biostudies-literature

Publications

Enhanced protein isoform characterization through long-read proteogenomics.

Miller RM, Jordan BT, Mehlferber MM, Jeffery ED, Chatzipantsiou C, Kaur S, Millikin RJ, Dai Y, Tiberi S, Castaldi PJ, Shortreed MR, Luckey CJ, Conesa A, Smith LM, Deslattes Mays A, Sheynkman GM

Genome Biology, 2022 Mar 3, issue 1

PMID: 35241129

Similar Datasets

Project description: Genome-wide association studies (GWASs) have revealed thousands of associations in many complex traits and diseases. Previous studies suggest that a subset of associations are due to alterations in splicing; however, interpreting the effects of splicing on protein isoforms is hindered by limitations in defining full-length transcript isoforms using short-read RNA-seq data. Long-read RNA-seq represents a powerful approach to define and quantify transcript isoforms. In this study, we developed a novel approach that integrates information from GWAS, splicing QTL (sQTL), and PacBio long-read RNA-seq in a disease-relevant model to infer the effects of sQTLs on the protein isoform products they ultimately encode. Such information enables identification of genes potentially responsible for GWAS associations. As a proof of concept, we generated deep-coverage (approximately 22 million full-length reads) PacBio long-read RNA-seq data on human fetal osteoblasts (hFOBs), a cell line relevant to the regulation of bone mineral density (BMD). We identified 68,326 protein-coding isoforms, of which 17,375 (25%) were novel. Next, we used Bayesian colocalization to identify 1,863 sQTLs from the Genotype-Tissue Expression (GTEx) project in 732 protein-coding genes that colocalized with BMD associations (H4PP > 0.75). A total of 836 junctions with colocalizing sQTLs in 459 (of the 732) genes were expressed in the hFOB long-read RNA-seq data. With these data, we formulated hypotheses regarding the potential mechanism of action of each sQTL. For example, in TPM2 we identified 7 junctions with colocalizing sQTLs (maximum H4PP = 0.98-0.99); these involve splice junctions between two nearly mutually exclusive exons and two different transcript termination sites, making them impossible to interpret without long-read RNA-seq data. siRNA-mediated knockdown in hFOBs showed that two TPM2 isoforms have opposing effects on mineralization.
Our results suggest that splicing is a major mechanism underlying GWAS associations and that long-read proteogenomics data are critical to precisely define the protein isoforms produced by splicing alterations.
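The colocalization cut-off and the novel-isoform fraction quoted in this description can be sketched as follows. The record layout, gene placeholders, and junction identifiers are hypothetical and serve only to illustrate the strict `H4PP > 0.75` filter; only the threshold and the isoform counts come from the text.

```python
H4PP_THRESHOLD = 0.75  # cut-off stated in the description


def colocalized(records, threshold=H4PP_THRESHOLD):
    """Keep sQTL records whose H4 posterior probability exceeds the threshold."""
    return [r for r in records if r["h4pp"] > threshold]


# Hypothetical records; field names and junction IDs are illustrative only.
example = [
    {"gene": "TPM2", "junction": "j1", "h4pp": 0.98},
    {"gene": "GENE_A", "junction": "j2", "h4pp": 0.75},  # not strictly above
    {"gene": "GENE_B", "junction": "j3", "h4pp": 0.40},
]
kept = colocalized(example)
print([r["gene"] for r in kept])

# Fraction of novel protein-coding isoforms, using the reported counts.
novel_fraction = 17_375 / 68_326
print(f"{novel_fraction:.1%}")
```

The computed fraction is consistent with the 25% stated above after rounding.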

Project description:

Mass spectrometry-based proteomics sample preparation

Harvested HUVECs, approximately 5 million cells each, were pelleted and frozen at -80 °C. The sample pellet was lysed according to the Filter Aided Sample Preparation (FASP) protocol [59], with the lysis buffer changed to 6% SDS, 150 mM DTT, 75 mM Tris-HCl. A 60 µL aliquot of lysis buffer was added to the 30 µL pellet of 5 million cells, and the mixture was probe-sonicated to lyse the cells and shear the nucleic acid material. Sonication continued for 1-5 minutes until the sample was clear and no longer viscous. The lysate was then incubated at 95 °C for 5 minutes. Total protein was estimated by BCA assay at approximately 500-600 µg. Quadruplicate aliquots of 20 µL each were subjected to FASP and trypsin digestion (1 µg per aliquot) and incubated at 37 °C overnight. Nanodrop analysis estimated peptide content at 22 µg per trypsin digest (88 µg in total).

Offline HPLC fractionation

The tryptic digests were pooled, dried down to a volume of 40 µL, and subjected to offline high-pH RP-HPLC fractionation using an Agilent 1200 HPLC. The sample was loaded onto a Thermo Scientific Hypersil Gold C18 column (150 mm x 3 mm x 3 µm C18) equilibrated with 95% solvent A (20 mM NH4 formate, pH 10) and 5% solvent B (70% acetonitrile/30% solvent A), and eluted at a flow rate of 400 µL/min, with fractions collected every 1 minute from RT 38-63 min. The following gradient was used: 5% B from 0-30 min, 5-65% B from 30-63 min, 65-100% B from 64-69 min, 100-5% B from 69-70 min, 5% B from 70-73 min. Samples containing peptide, according to UV absorbance at 214 nm, corresponding to the HUVEC pellet were digested with trypsin. Collected fractions 4-20 were selected for LC-MS/MS analysis.
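The offline fractionation gradient above is piecewise linear, so the solvent B percentage at any time point can be sketched by interpolating between the stated breakpoints. This is an illustrative sketch only: the function name is hypothetical, and because the description leaves 63-64 min unspecified, the code assumes a hold at 65% B over that minute.

```python
from bisect import bisect_right

# (time in min, % solvent B) breakpoints taken from the gradient description.
# Assumption: 65% B is held from 63 to 64 min, which the text does not specify.
GRADIENT = [
    (0.0, 5.0),
    (30.0, 5.0),
    (63.0, 65.0),
    (64.0, 65.0),
    (69.0, 100.0),
    (70.0, 5.0),
    (73.0, 5.0),
]


def percent_b(t_min):
    """Linearly interpolate % solvent B at time t_min (minutes)."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-73 min gradient")
    times = [t for t, _ in GRADIENT]
    # Index of the breakpoint that closes the segment containing t_min.
    i = min(bisect_right(times, t_min), len(GRADIENT) - 1)
    (t0, b0), (t1, b1) = GRADIENT[i - 1], GRADIENT[i]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


# %B at the start and end of the fraction-collection window (38-63 min).
print(round(percent_b(38.0), 1), percent_b(63.0))
```

Under these assumptions, fraction collection spans roughly 19.5% to 65% solvent B.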
NanoLC-MS/MS analysis

The resulting peptides were dried to 12 µL and analyzed by nanoLC-MS/MS using a Dionex Ultimate 3000 (Thermo Fisher Scientific, Bremen, Germany) coupled to an Orbitrap Eclipse Tribrid mass spectrometer (Thermo Fisher Scientific, Bremen, Germany). Three microliters of each peptide-containing sample were loaded onto an Acclaim PepMap 100 trap column (300 µm x 5 mm x 5 µm C18) and gradient-eluted from an Acclaim PepMap 100 analytical column (75 µm x 25 cm, 3 µm C18) equilibrated in 96% solvent A (0.1% formic acid in water) and 4% solvent B (80% acetonitrile in 0.1% formic acid). The peptides were eluted at 300 nL/min using the following gradient: 4% B from 0-5 min, 4-28% B from 5-210 min, 28-40% B from 210-240 min, 40-95% B from 240-250 min, and 95% B from 250-260 min. The Orbitrap Eclipse was operated in positive ion mode with 1.9 kV at the spray source, the RF lens at 30%, and data-dependent MS/MS acquisition with Xcalibur version 4.3.73.11. Positive-ion full MS scans were acquired in the Orbitrap from 375-1500 m/z at 120,000 resolution. Data-dependent selection of precursor ions was performed in Cycle Time mode, with 3 seconds between Master Scans, using an intensity threshold of 2 × 10^4 ion counts and applying dynamic exclusion (n=1 scans within 30 seconds for an exclusion duration of 60 seconds and ±10 ppm mass tolerance). Monoisotopic peak determination was applied, and charge states 2-6 were included for HCD scans (quadrupole isolation mode; 1.6 m/z isolation window). The resulting fragments were detected in the Orbitrap at 15,000 resolution with the standard AGC target.
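The dynamic exclusion settings above (exclusion after one selection, 60 second duration, ±10 ppm tolerance) can be illustrated with a simplified model. This is a sketch of the general idea, not the instrument's control logic: the class and function names are hypothetical, and the model ignores charge state, intensity thresholds, and the 30 second counting window, which has no effect when n=1.

```python
def within_ppm(mz_a, mz_b, tol_ppm=10.0):
    """True if mz_a lies within tol_ppm parts per million of mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6 <= tol_ppm


class DynamicExclusion:
    """Simplified exclusion list: a precursor selected once is skipped
    for duration_s seconds if it reappears within tol_ppm."""

    def __init__(self, duration_s=60.0, tol_ppm=10.0):
        self.duration_s = duration_s
        self.tol_ppm = tol_ppm
        self._entries = []  # list of (m/z, expiry time in seconds)

    def should_fragment(self, mz, time_s):
        # Drop entries whose exclusion window has expired.
        self._entries = [(m, exp) for m, exp in self._entries if exp > time_s]
        if any(within_ppm(mz, m, self.tol_ppm) for m, _ in self._entries):
            return False
        self._entries.append((mz, time_s + self.duration_s))
        return True


# Hypothetical precursor m/z values and retention times (seconds).
de = DynamicExclusion()
print(de.should_fragment(500.000, 0.0))   # first observation
print(de.should_fragment(500.004, 10.0))  # 8 ppm away, inside the window
print(de.should_fragment(500.004, 61.0))  # window has expired
```

In this model the second call is skipped because the precursor falls within 10 ppm of one selected 10 seconds earlier, while the third is selected again once the 60 second window has passed.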
