Pepitome: evaluating improved spectral library search for identification complementarity and quality assessment.
ABSTRACT: Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
Project description:Methods harnessing protein cross-linking and mass spectrometry (XL-MS) offer high-throughput means to identify protein-protein interactions (PPIs) and structural interfaces of protein complexes. Yet, specialized data dependent methods and search algorithms are often required to confidently assign peptide identifications to spectra. To improve the efficiency of matching high confidence spectra, we developed a spectral library based approach to search cross-linked peptide data derived from Protein Interaction Reporter (PIR) methods using the spectral library search algorithm, SpectraST. Spectral library matching of cross-linked peptide data from query spectra increased the absolute number of confident peptide relationships matched to spectra and thereby the number of PPIs identified. By matching library spectra from bona fide, previously established PIR-cross-linked peptide relationships, spectral library searching reduces the need for continued, complex mass spectrometric methods to identify peptide relationships, increases coverage of relationship identifications, and improves the accessibility of XL-MS technologies.
Project description:Tandem mass spectrometry (MS/MS) has been used in analysis of proteins and their post-translational modifications. A recently developed data analysis method, which simulates MS/MS spectra of phosphopeptides and performs spectral library searching using SpectraST, facilitates confident localization of phosphorylation sites. However, its performance has been evaluated only on MS/MS spectra acquired using Orbitrap HCD mass spectrometers so far. In this study, we have investigated whether this approach would be applicable to another type of mass spectrometers, and optimized the simulation and search conditions to achieve sensitive and confident site localization. Synthetic phosphopeptides and enriched K562 cell phosphopeptides were analyzed using a TripleTOF 6600 mass spectrometer before and after enzymatic dephosphorylation. Dephosphorylated peptides identified by X!Tandem database searching were subjected to spectral simulation of all possible single phosphorylations using SimPhospho software. Phosphopeptides were identified and localized by SpectraST searching against a library of the simulated spectra. Although no synthetic phosphopeptide was localized at 1% false localization rate under the previous conditions, optimization of the spectral simulation and search conditions for the TripleTOF datasets achieved the localization and improved the sensitivity. Furthermore, the optimized conditions enabled sensitive localization of K562 phosphopeptides at 1% false discovery and localization rates. These results suggest that accurate phosphopeptide simulation of TripleTOF MS/MS spectra is possible and the simulated spectral libraries can be used in SpectraST searching for confident localization of phosphorylation sites.
Project description:Chagas disease also known as American trypanosomiasis is caused by the protozoan Trypanosoma cruzi. Over the last 30 years, Chagas disease has expanded from a neglected parasitic infection of the rural population to an urbanized chronic disease, becoming a potentially emergent global health problem. T. cruzi strains were assigned to seven genetic groups (TcI-TcVI and TcBat), named discrete typing units (DTUs), which represent a set of isolates that differ in virulence, pathogenicity and immunological features. Indeed, diverse clinical manifestations (from asymptomatic to highly severe disease) have been attempted to be related to T.cruzi genetic variability. Due to that, several DTU typing methods have been introduced. Each method has its own advantages and drawbacks such as high complexity and analysis time and all of them are based on genetic signatures. Recently, a novel method discriminated bacterial strains using a peptide identification-free, genome sequence-independent shotgun proteomics workflow. Here, we aimed to develop a Trypanosoma cruzi Strain Typing Assay using MS/MS peptide spectral libraries, named Tc-STAMS2.The Tc-STAMS2 method uses shotgun proteomics combined with spectral library search to assign and discriminate T. cruzi strains independently on the genome knowledge. The method is based on the construction of a library of MS/MS peptide spectra built using genotyped T. cruzi reference strains. For identification, the MS/MS peptide spectra of unknown T. cruzi cells are identified using the spectral matching algorithm SpectraST. The Tc-STAMS2 method allowed correct identification of all DTUs with high confidence. The method was robust towards different sample preparations, length of chromatographic gradients and fragmentation techniques. Moreover, a pilot inter-laboratory study showed the applicability to different MS platforms.This is the first study that develops a MS-based platform for T. cruzi strain typing. Indeed, the Tc-STAMS2 method allows T. cruzi strain typing using MS/MS spectra as discriminatory features and allows the differentiation of TcI-TcVI DTUs. Similar to genomic-based strategies, the Tc-STAMS2 method allows identification of strains within DTUs. Its robustness towards different experimental and biological variables makes it a valuable complementary strategy to the current T. cruzi genotyping assays. Moreover, this method can be used to identify DTU-specific features correlated with the strain phenotype.
Project description:With great biological interest in post-translational modifications (PTMs), various approaches have been introduced to identify PTMs using MS/MS. Recent developments for PTM identification have focused on an unrestrictive approach that searches MS/MS spectra for all known and possibly even unknown types of PTMs at once. However, the resulting expanded search space requires much longer search time and also increases the number of false positives (incorrect identifications) and false negatives (missed true identifications), thus creating a bottleneck in high throughput analysis. Here we introduce MODa, a novel "multi-blind" spectral alignment algorithm that allows for fast unrestrictive PTM searches with no limitation on the number of modifications per peptide while featuring over an order of magnitude speedup in relation to existing approaches. We demonstrate the sensitivity of MODa on human shotgun proteomics data where it reveals multiple mutations, a wide range of modifications (including glycosylation), and evidence for several putative novel modifications. Based on the reported findings, we argue that the efficiency and sensitivity of MODa make it the first unrestrictive search tool with the potential to fully replace conventional restrictive identification of proteomics mass spectrometry data.
Project description:A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives ( spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to Delta-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity tradeoff of existing MS/MS search tools, addresses the notoriously difficult problem of "one-hit-wonders" in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.
Project description:Open modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by allowing a modified spectrum to match against its unmodified variant by using a very wide precursor mass window. A drawback of this strategy, however, is that it leads to a large increase in search time. Although performing an open search can be done using existing spectral library search engines by simply setting a wide precursor mass window, none of these tools have been optimized for OMS, leading to excessive runtimes and suboptimal identification results. We present the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. This approach is combined with a cascade search strategy to maximize the number of identified unmodified and modified spectra while strictly controlling the false discovery rate as well as a shifted dot product score to sensitively match modified spectra to their unmodified counterparts. ANN-SoLo achieves state-of-the-art performance in terms of speed and the number of identifications. On a previously published human cell line data set, ANN-SoLo confidently identifies more spectra than SpectraST or MSFragger and achieves a speedup of an order of magnitude compared with SpectraST. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo .
Project description:MOTIVATION: Identification of post-translationally modified proteins has become one of the central issues of current proteomics. Spectral library search is a new and promising computational approach to mass spectrometry-based protein identification. However, its potential in identification of unanticipated post-translational modifications has rarely been explored. The existing spectral library search tools are designed to match the query spectrum to the reference library spectra with the same peptide mass. Thus, spectra of peptides with unanticipated modifications cannot be identified. RESULTS: In this article, we present an open spectral library search tool, named pMatch. It extends the existing library search algorithms in at least three aspects to support the identification of unanticipated modifications. First, the spectra in library are optimized with the full peptide sequence information to better tolerate the peptide fragmentation pattern variations caused by some modification(s). Second, a new scoring system is devised, which uses charge-dependent mass shifts for peak matching and combines a probability-based model with the general spectral dot-product for scoring. Third, a target-decoy strategy is used for false discovery rate control. To demonstrate the effectiveness of pMatch, a library search experiment was conducted on a public dataset with over 40,000 spectra in comparison with SpectraST, the most popular library search engine. Additional validations were done on four published datasets including over 150,000 spectra. The results showed that pMatch can effectively identify unanticipated modifications and significantly increase spectral identification rate. AVAILABILITY: http://pfind.ict.ac.cn/pmatch/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description:Neuropeptides are essential for cell-cell communication in neurological and endocrine physiological processes in health and disease. While many neuropeptides have been identified in previous studies, the resulting data has not been structured to facilitate further analysis by tandem mass spectrometry (MS/MS), the main technology for high-throughput neuropeptide identification. Many neuropeptides are difficult to identify when searching MS/MS spectra against large protein databases because of their atypical lengths (e.g. shorter/longer than common tryptic peptides) and lack of tryptic residues to facilitate peptide ionization/fragmentation. NeuroPedia is a neuropeptide encyclopedia of peptide sequences (including genomic and taxonomic information) and spectral libraries of identified MS/MS spectra of homolog neuropeptides from multiple species. Searching neuropeptide MS/MS data against known NeuroPedia sequences will improve the sensitivity of database search tools. Moreover, the availability of neuropeptide spectral libraries will also enable the utilization of spectral library search tools, which are known to further improve the sensitivity of peptide identification. These will also reinforce the confidence in peptide identifications by enabling visual comparisons between new and previously identified neuropeptide MS/MS spectra.http://proteomics.ucsd.edu/Software/NeuroPedia.firstname.lastname@example.orgSupplementary materials are available at Bioinformatics online.
Project description:Spectral counting has become a popular method for LC-MS/MS based proteome quantification; however, this methodology is often not reliable when proteins are identified by a small number of spectra. Here, we present a simple strategy to improve spectral counting based quantification for low-abundance proteins by recovering low-quality or low-scoring spectra for confidently identified peptides. In this approach, stringent data filtering criteria were initially applied to achieve confident peptide identifications with low false discovery rate (e.g., < 1% at peptide level) after LC-MS/MS analysis and database search by SEQUEST. Then, all low-scoring MS/MS spectra that matched to this set of confidently identified peptides were recovered, leading to more than 20% increase of total identified spectra. The validity of these recovered spectra was assessed by the parent ion mass measurement error distribution, retention time distribution, and by comparing the individual low score and high score spectra that correspond to the same peptides. The results support that the recovered low-scoring spectra have similar confidence levels in peptide identifications as the spectra passing the initial stringent filter. The application of this strategy of recovering low-scoring spectra significantly improved the spectral count quantification statistics for low-abundance proteins, as illustrated in the identification of mouse brain region specific proteins.
Project description:Currently data-dependent acquisition (DDA) is the method of choice for mass spectrometry-based proteomics discovery experiments, but data-independent acquisition (DIA) is steadily becoming more important. One of the most important requirements to perform a DIA analysis is the availability of suitable spectral libraries for peptide identification and quantification. Several studies were performed addressing the evaluation of spectral library performance for protein identification in DIA measurements. But so far only few experiments estimate the effect of these libraries on the quantitative level.In this work we created a gold standard spike-in sample set with known contents and ratios of proteins in a complex protein matrix that allowed a detailed comparison of DIA quantification data obtained with different spectral library approaches. We used in-house generated sample-specific spectral libraries created using varying sample preparation approaches and repeated DDA measurement. In addition, two different search engines were tested for protein identification from DDA data and subsequent library generation. In total, eight different spectral libraries were generated, and the quantification results compared with a library free method, as well as a default DDA analysis. Not only the number of identifications on peptide and protein level in the spectral libraries and the corresponding DIA analysis results was inspected, but also the number of expected and identified differentially abundant protein groups and their ratios.We found, that while libraries of prefractionated samples were generally larger, there was no significant increase in DIA identifications compared with repetitive non-fractionated measurements. Furthermore, we show that the accuracy of the quantification is strongly dependent on the applied spectral library and whether the quantification is based on peptide or protein level. Overall, the reproducibility and accuracy of DIA quantification is superior to DDA in all applied approaches.Data has been deposited to the ProteomeXchange repository with identifiers PXD012986, PXD012987, PXD012988 and PXD014956.