Project description:Discovering noncanonical peptides has been a common application of proteogenomics. Recent studies suggest that certain noncanonical peptides, known as ncMAPs (noncanonical MHC-I-associated peptides), that bind to major histocompatibility complex I may make good immunotherapeutic targets. De novo peptide sequencing is a great way to find ncMAPs since it can detect peptide sequences from their tandem mass spectra without using any sequence databases. However, this strategy hasn’t been widely applied for ncMAP identification because there is not a good way to estimate its false-positive rates. In order to completely and accurately identify immunopeptides using de novo peptide sequencing, we describe a unique pipeline called pXg. In contrast to current pipelines, it makes use of genomic data, RNA-Seq abundance and sequencing quality, in addition to proteomic features to increase the sensitivity and specificity of peptide identification. We show that the peptide-spectrum match quality and genetic traits have a clear relationship, showing that they can be utilized to evaluate peptide-spectrum matches. From ten samples, we found 24,449 cMAPs (canonical MHC-I-associated peptides) and 956 ncMAPs by using a target-decoy competition. 387 ncMAPs and 1,611 cMAPs were novel identifications that had not yet been published. We discovered 11 ncMAPs produced from a squirrel monkey retrovirus in human cell lines in addition to the 2 ncMAPs originating from a complementarity determining region 3 in an antibody thanks to the unrestricted search space assumed by de novo sequencing. These entirely new identifications show that pXg can make the most of de novo peptide sequencing's advantages and its potential use in the search for new immunotherapeutic targets.
Project description:Dependent on concise, pre-defined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large scale proteomics datasets, and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) which leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.
Project description:To uncover the mechanism of SPD-induced autophagy in FGSCs, we used RNA sequencing technology to compare the mRNA expression differences between the control groups and the SPD treated groups. FGSCs mRNA profiles of the control groups and the SPD treated groups were generated by deep sequencing, in two replicates. The library quality was determined using a Bioanalyzer 2100 (Agilent). The Illumina HiSeq 2500 platform was used for RNA sequencing. The quality of RNA-seq reads was examined using FastQC.
Project description:In multiple myeloma diseases, monoclonal immunoglobulin light chains (LCs) are abundantly produced, with as a consequence in some cases the formation of deposits affecting various organs, such as kidney, while in other cases to remain soluble up to concentrations of several g.L-1 in plasma. The exact factors crucial for the solubility of light chains are poorly understood, but it can be hypothesized that their amino acid sequence plays an important role. Determining the precise sequences of patient-derived light chains is therefore highly desirable. We establish here a novel de novo sequencing workflow for patient-derived LCs, based on the combination of bottom-up and top-down proteomics without database search. PEAKS is used for the de novo sequencing of peptides that are further assembled into full length LC sequences using ALPS. Top-down proteomics provides the molecular masses of proteoforms and allows the exact determination of the amino acid sequence including all post translational modifications. This pipeline is then used for the complete de novo sequencing of LCs extracted from the urine of 10 patients with multiple myeloma. We show that for the bottom-up part, digestions with trypsin and Nepenthes digestive fluid are sufficient to produce overlapping peptides able to generate the best sequence candidates. Top-down proteomics is absolutely required to achieve 100% final sequence coverage and characterize clinical samples containing several LCs . Our work highlights an unexpected range of modifications.
Project description:Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein using from bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows require machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models' performances using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of highly N- and C-termini diverse peptides generalized 76% more over the termini-restricted ones. Moreover, expanding the training set's size by adding peptides from the multienzymatic digestion with five proteases of several species samples led to a 2-3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10000 matching and overlapped peptides across six different proteases mAb samples, achieving a 100% sequence coverage for 8 of the ten polypeptide chains. We foretell that the MEMs' proven improvements to de novo analysis will positively impact several applications, such as analyzing samples of high complexity, unknown nature, or the peptidomics field.
Project description:Precision de novo peptide sequencing using mirror proteases of Ac-LysargiNase and trypsin for large-scale proteomicsPrecision de novo peptide sequencing using mirror proteases of Ac-LysargiNase and trypsin for large-scale proteomics
Project description:The analysis of samples from unsequenced and/or understudied species as well as samples where the proteome is derived from multiple organisms poses two key questions. The first is whether the proteomic data obtained from an unusual sample type even contains peptide tandem mass spectra. The second question is whether an appropriate protein sequence database is available for proteomic searches. We describe the use of automated de novo sequencing for evaluating both the quality of a collection of tandem mass spectra and the suitability of a given protein sequence database for searching that data. Applications of this method include the proteome analysis of closely related species, metaproteomics, and proteomics of extant organisms.
Project description:We selected humann intervertebral disc samples to perform proteomics analysis. There were 1 case of grade I , 1 case of grade II, 3 cases of grade Ⅲ and 3 cases of grade Ⅳ according to Pfirrmann classfication. RNA seqencing analysis and single-cell RNA sequencing were integrated with proteomics data to identify the hub genes for intervertebral disc degeneration using bioinformatic method.