Project description:Dependent on concise, pre-defined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large scale proteomics datasets, and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) which leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.
Project description:De novo peptide sequencing is a fundamental research area in mass spectrometry (MS) based proteomics. However, these methods have often been evaluated using only a few simple metrics that do not fully reflect their overall performance. Moreover, there has not been an established method to estimate the false discovery rate (FDR) and the significance of de novo peptide-spectrum matches (PSMs). Here we propose NovoBoard, a comprehensive framework to evaluate the performance of de novo peptide sequencing methods. The framework consists of diverse benchmark datasets (including tryptic, nontryptic, immunopeptidomics, and different species), and a standard set of accuracy metrics to evaluate the fragment ions, amino acids, and peptides of the de novo results. More importantly, a new approach is designed to evaluate de novo peptide sequencing methods on target-decoy spectra and to estimate their FDRs. Our results thoroughly reveal the strengths and weaknesses of different de novo peptide sequencing methods, and how their performance depends on specific applications and the types of data. Our FDR estimation also shows that some tools may perform better than others in distinguishing between de novo PSMs and random matches, and that it can be used to assess the significance of de novo PSMs.
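The target-decoy idea mentioned above can be illustrated with a minimal sketch. This is the generic decoy-over-target estimate used widely in proteomics, not NovoBoard's actual procedure, which the description does not spell out. The function names and the `(score, is_decoy)` input layout are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Each PSM is (score, is_decoy); higher scores are assumed to be better.
PSM = Tuple[float, bool]


def estimate_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR as decoys / targets among PSMs scoring >= threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing can be a false discovery
    return min(1.0, decoys / targets)


def score_cutoff_for_fdr(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


example = [(0.9, False), (0.8, False), (0.7, True),
           (0.6, False), (0.5, False), (0.4, True)]
print(estimate_fdr(example, 0.5))          # 1 decoy / 4 targets
print(score_cutoff_for_fdr(example, 0.3))  # lowest cutoff meeting 30% FDR
```

Scanning every observed score, rather than stopping at the first failure, accounts for the fact that the raw decoy/target ratio is not monotone in the threshold.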
Project description:Precision de novo peptide sequencing using mirror proteases of Ac-LysargiNase and trypsin for large-scale proteomics
Project description:Demonstration of novel experimental and computational pipeline for high performance de novo peptide sequencing. E. coli whole cell lysate is carbamylated to block lysine side chains, digested with trypsin (now active only at Arg residues), and N-terminally tagged with the chromophore AMCA. Three technical replicates were analyzed using a Thermo Velos Pro dual linear ion trap mass spectrometer coupled to a Coherent 351 nm excimer laser. 351 nm ultraviolet photodissociation (UVPD) of parent peptides produces MS2 spectra dominated by the y-type ion series. We developed the software tool UVnovo for de novo sequencing of these spectra and used Proteome Discoverer SEQUEST/Percolator to generate a dataset for training and validation. UVnovo results provided here derive from a 3-fold cross validation regime. Our methods and dataset are described in the accompanying publication.
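The effect of blocking lysine on tryptic cleavage can be sketched with a small in silico digestion. This is a simplified illustration, not part of the UVnovo pipeline: it cleaves after every listed residue and ignores missed cleavages and the common proline exception. The function name and the example sequence are hypothetical.

```python
from typing import List


def digest_after_residues(protein: str, cleave_after: str = "R") -> List[str]:
    """Split a protein sequence C-terminal to each residue in `cleave_after`.

    Simplification: no missed cleavages and no proline rule.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in cleave_after:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


sequence = "MKTAYRAKQRLLK"  # hypothetical sequence for illustration
print(digest_after_residues(sequence, "KR"))  # ordinary trypsin: Lys and Arg
print(digest_after_residues(sequence, "R"))   # lysines blocked: Arg only
```

With lysines blocked, the same sequence yields fewer and longer peptides, each ending in Arg except possibly the C-terminal one.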
Project description:Shotgun protein sequencing with meta-contig assembly.
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.
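As a loose analogy for the contig-overlap step described above, the sketch below merges two sequence strings by their longest suffix-prefix overlap. Meta-SPS itself assembles contigs at the level of spectra and tolerates sequencing errors, so this exact-match string version is only a simplified illustration under that assumption. The function names and the `min_overlap` parameter are hypothetical.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_contigs(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two contig sequences if they overlap by at least `min_overlap`."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None  # no sufficiently long exact overlap
    return left + right[k:]


print(merge_contigs("PEPTIDEK", "IDEKLAGR"))  # overlap "IDEK"
print(merge_contigs("PEPTIDEK", "AAAA"))      # no overlap
```

A realistic assembler would also need to handle mass gaps, isobaric residues such as Leu/Ile, and mismatches, all of which this sketch omits.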
Project description:For this manuscript, the Prochlorococcus MED4 strain shotgun proteome dataset was used for benchmarking a de novo-directed sequencing approach. In de novo peptide sequencing, the sequence of amino acids is determined directly from mass spectra rather than by comparison (peptide-spectrum matching) to a selected database. We perform a benchmarking experiment using Prochlorococcus culture data, demonstrating that de novo peptides are sufficiently accurate and taxonomically specific to be useful in environmental studies. The MED4 dataset herein represents the output from peptide-spectrum matching using COMET within the Trans-Proteomic Pipeline (TPP). Additional MED4 data outside this manuscript are included for both trypsin and Glu-C protease digestions, as well as TPP output for post-translational modification searches. De novo output data derived from Peaks Studio can be found by referencing the manuscript publication.
Project description:Predicted peptides for the 9-species de novo sequencing benchmark MSV000090982 as described in Yilmaz et al. [Yilmaz2023]. FTP directory contains outputs of 5 de novo peptide sequencing methods on the 9-species benchmark: Casanovo, Casanovo_bm (benchmark), PointNovo, DeepNovo and Novor. Output files for Casanovo contain scan numbers and run names to allow matching to spectra files. [Yilmaz2023] M. Yilmaz*, W. Fondrie*, W. Bittremieux*, R. Nelson, V. Ananth, S. Oh, and W. Noble, "Sequence-to-sequence translation from mass spectra to peptides with a transformer model", bioRxiv, 2023.
Project description:Predicted peptides for the 9-species de novo sequencing benchmark MSV000090982 as described in Yilmaz et al. [Yilmaz2023]. FTP directory contains outputs of 5 de novo peptide sequencing methods on the 9-species benchmark: Casanovo, Casanovo_bm (benchmark), PointNovo, DeepNovo and Novor. [Yilmaz2023] M. Yilmaz*, W. Fondrie*, W. Bittremieux*, R. Nelson, V. Ananth, S. Oh, and W. Noble, "Sequence-to-sequence translation from mass spectra to peptides with a transformer model", bioRxiv, 2023.
Project description:Complex MS-based proteomics datasets are usually analyzed by protein database searches. While this approach performs well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e. de novo peptide sequencing, is often the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as a lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which, in conjunction with minor tweaks in data acquisition, yields a lower empirical FDR than analysis with any single algorithm. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. We thereby increased the number of PSMs meeting a stringent FDR of 5% more than threefold compared to the single best de novo sequencing algorithm alone, yielding an average of 11,120 PSMs (combined) instead of 3,476 PSMs (alone) in triplicate 2 h LC-MS runs of a tryptic HeLa digest.
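The description does not state how the algorithms' results are combined, so the following is only one plausible, hypothetical rule: accept a spectrum's sequence when at least `min_agree` tools report the same peptide, treating the isobaric residues Leu and Ile as equivalent. The tool names are placeholders. The last lines simply check the reported fold increase from the two PSM counts quoted above.

```python
from collections import Counter
from typing import Dict, Optional


def normalize(sequence: str) -> str:
    """Uppercase and map Ile to Leu, which have identical residue masses."""
    return sequence.upper().replace("I", "L")


def consensus_sequence(predictions: Dict[str, str], min_agree: int = 2) -> Optional[str]:
    """Return the most common normalized sequence if enough tools agree.

    `predictions` maps a (hypothetical) tool name to its sequence for one spectrum.
    """
    counts = Counter(normalize(seq) for seq in predictions.values() if seq)
    if not counts:
        return None
    sequence, votes = counts.most_common(1)[0]
    return sequence if votes >= min_agree else None


calls = {"tool_a": "PEPTIDEK", "tool_b": "PEPTLDEK", "tool_c": "PEPTGDEK"}
print(consensus_sequence(calls))               # two tools agree after I/L merging
print(consensus_sequence(calls, min_agree=3))  # not unanimous

# Fold increase implied by the reported averages (combined vs. single best).
fold_increase = 11120 / 3476
print(round(fold_increase, 2))
```

Requiring agreement trades sensitivity for precision; a stricter `min_agree` accepts fewer spectra but makes a chance match across independent tools less likely.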