A novel approach to denoising ion trap tandem mass spectra.
ABSTRACT: BACKGROUND: Mass spectrometers can produce a large number of tandem mass spectra, which are unfortunately contaminated by noise. Noise degrades the quality of tandem mass spectra and thus increases the false positives and false negatives in peptide identification. Therefore, it is appealing to develop an approach to denoising tandem mass spectra. RESULTS: We propose a novel approach to denoising tandem mass spectra. The proposed approach consists of two modules: spectral peak intensity adjustment and intensity local maximum extraction. In the spectral peak intensity adjustment module, we introduce five features to describe the quality of each peak. Based on these features, a score is calculated for each peak and is used to adjust its intensity. As a result, the intensity is adjusted to a local maximum if the peak is a signal peak, and it is decreased if the peak is a noise peak. The second module uses a morphological reconstruction filter to remove the peaks whose intensities are not local maxima of the spectrum. Experiments have been conducted on two ion trap tandem mass spectral datasets: ISB and TOV. Experimental results show that our algorithm can remove about 69% of the peaks of a spectrum. At the same time, the number of spectra that can be identified by the Mascot algorithm increases by 31.23% and 14.12% for the two tandem mass spectral datasets, respectively. CONCLUSION: The proposed denoising algorithm can be integrated into current popular peptide identification algorithms such as Mascot to improve the reliability of assigning peptides to spectra. AVAILABILITY OF THE SOFTWARE: The software created from this work is available upon request.
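The local maximum extraction step can be illustrated with a much simpler stand-in. The sketch below is not the morphological reconstruction filter used in the paper: it keeps a peak only if no more intense peak lies within a fixed m/z window around it. The function name `local_maxima` and the `window` parameter are hypothetical, chosen for illustration only.

```python
def local_maxima(peaks, window=1.0):
    """Keep peaks that are the most intense within +/- `window` m/z units.

    `peaks` is a list of (mz, intensity) tuples. This is a simplified
    stand-in for a morphological reconstruction filter, not the paper's
    method. Peaks tied for the highest intensity in a window are all kept.
    """
    peaks = sorted(peaks)
    kept = []
    for i, (mz, intensity) in enumerate(peaks):
        # A peak is dropped as soon as a stronger neighbour is found nearby.
        dominated = any(
            j != i and abs(other_mz - mz) <= window and other_int > intensity
            for j, (other_mz, other_int) in enumerate(peaks)
        )
        if not dominated:
            kept.append((mz, intensity))
    return kept


spectrum = [(100.0, 5.0), (100.5, 20.0), (101.2, 3.0), (150.0, 8.0), (150.4, 2.0)]
denoised = local_maxima(spectrum, window=1.0)
```

With this toy spectrum, only the dominant peak of each cluster survives. The quadratic scan is kept for readability; a sliding window over the sorted peaks would be the natural optimization for real spectra.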
Project description:BACKGROUND: High-throughput shotgun proteomics data contain a significant number of spectra from non-peptide ions or spectra of too poor quality to obtain highly confident peptide identifications. These spectra cannot be identified with any positive peptide matches in some database search programs or are identified with false positives in others. Removing these spectra can improve the database search results and lower computational expense. RESULTS: A new algorithm has been developed to filter tandem mass spectra of poor quality from shotgun proteomic experiments. The algorithm determines the noise level dynamically and independently for each spectrum in a tandem mass spectrometric data set. Spectra are filtered based on a minimum number of required signal peaks with a signal-to-noise ratio of 2. The algorithm was tested with 23 sample data sets containing 62,117 total spectra. CONCLUSIONS: The spectral screening removed 89.0% of the tandem mass spectra that did not yield a peptide match when searched with the MassMatrix database search software. Only 6.0% of tandem mass spectra that yielded peptide matches considered to be true positive matches were lost after spectral screening. The algorithm was found to be very effective at removal of unidentified spectra in other database search programs including Mascot, OMSSA, and X!Tandem (75.93%-91.00%) with a small loss (3.59%-9.40%) of true positive matches.
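The screening rule described above (a per-spectrum noise level, then a minimum count of peaks with a signal-to-noise ratio of at least 2) can be sketched as follows. The abstract does not say how the noise level is estimated, so this sketch assumes the median peak intensity of each spectrum as the noise level. The function names and the `min_signal_peaks` default are likewise hypothetical.

```python
from statistics import median


def count_signal_peaks(intensities, snr=2.0):
    """Count peaks whose intensity is at least `snr` times the noise level.

    Assumption (not from the source): the noise level of a spectrum is
    its median peak intensity, estimated independently per spectrum.
    """
    if not intensities:
        return 0
    noise = median(intensities)
    if noise <= 0:
        # Degenerate case: treat every positive peak as signal.
        return sum(1 for value in intensities if value > 0)
    return sum(1 for value in intensities if value / noise >= snr)


def passes_screen(intensities, min_signal_peaks=5, snr=2.0):
    """Return True if the spectrum has enough signal peaks to be kept."""
    return count_signal_peaks(intensities, snr) >= min_signal_peaks


good_spectrum = [1, 1, 1, 1, 1, 1, 10, 12, 9, 15, 20]
poor_spectrum = [3, 4, 3, 5, 4, 3, 4]
```

Here `good_spectrum` has a median of 1 and five peaks well above twice that level, while `poor_spectrum` has no peak reaching twice its median of 4, so only the first would be passed on to a database search.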
Project description:Phosphorylation site assignment of high throughput tandem mass spectrometry (LC-MS/MS) data is one of the most common and critical aspects of phosphoproteomics. Correctly assigning phosphorylated residues helps us understand their biological significance. The design of common search algorithms (such as Sequest and Mascot) does not incorporate site assignment; therefore, additional algorithms are essential to assign phosphorylation sites for mass spectrometry data. The main contribution of this study is the design and implementation of a linear time and space dynamic programming strategy for phosphorylation site assignment referred to as PhosSA. The proposed algorithm uses summation of peak intensities associated with theoretical spectra as an objective function. Quality control of the assigned sites is achieved using a post-processing redundancy criterion that indicates the signal-to-noise ratio properties of the fragmented spectra. The quality of the algorithm was assessed on experimentally generated data sets from synthetic peptides for which phosphorylation sites were known. We report that PhosSA was able to achieve a high degree of accuracy and sensitivity with all the experimentally generated mass spectrometry data sets. The implemented algorithm is shown to be extremely fast and scalable with an increasing number of spectra (we report up to 0.5 million spectra/hour on a moderate workstation). The algorithm is designed to accept results from both Sequest and Mascot search engines. An executable is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic research purposes.
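To make the dynamic programming idea concrete, here is a minimal sketch of site assignment that maximizes summed matched intensity. It is not the PhosSA implementation. It assumes a caller-supplied, hypothetical function `cleavage_score(i, j)` that returns the summed intensity of fragment peaks matched when the first `i` residues carry `j` phospho groups, and it treats S, T, and Y as the only candidate sites.

```python
def assign_sites(peptide, n_phospho, cleavage_score):
    """Pick phosphorylation sites maximizing the summed cleavage scores.

    best[i][j] is the best score for the first i residues carrying j
    phospho groups. Runs in O(len(peptide) * n_phospho) time and space.
    Returns (sorted 1-based site positions, score), or (None, -inf) if
    the peptide has too few S/T/Y residues.
    """
    n = len(peptide)
    neg_inf = float("-inf")
    best = [[neg_inf] * (n_phospho + 1) for _ in range(n + 1)]
    took = [[False] * (n_phospho + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        can_modify = peptide[i - 1] in "STY"
        for j in range(n_phospho + 1):
            skip = best[i - 1][j]  # residue i left unmodified
            take = best[i - 1][j - 1] if (can_modify and j > 0) else neg_inf
            prev = max(skip, take)
            if prev == neg_inf:
                continue  # state unreachable
            took[i][j] = take > skip
            # No fragment corresponds to a "cleavage" after the last residue.
            gain = cleavage_score(i, j) if i < n else 0.0
            best[i][j] = prev + gain
    if best[n][n_phospho] == neg_inf:
        return None, neg_inf
    # Backtrack to recover which residues were modified.
    sites, j = [], n_phospho
    for i in range(n, 0, -1):
        if took[i][j]:
            sites.append(i)
            j -= 1
    return sorted(sites), best[n][n_phospho]


def toy_score(i, j):
    """Toy scores: fragments match only if the group sits on residue 4."""
    expected = 1 if i >= 4 else 0
    return 1.0 if j == expected else 0.0


sites, score = assign_sites("ASATA", 1, toy_score)
```

In this toy case the peptide has two candidate residues (S at position 2, T at position 4). Placing the group on T matches all four cleavages, so the sketch selects position 4. In practice `cleavage_score` would be computed by matching theoretical b and y ion masses against the observed spectrum.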
Project description:MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The MS/MS search software was evaluated by use of a high mass accuracy dataset and its results compared with those from MASCOT, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than MASCOT, SEQUEST, X!Tandem, and OMSSA for a given specificity and the percentage of false positives was 2%. More importantly, all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequences and additional variable PTMs did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with MASCOT, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on distributed memory clusters and achieved search speeds of approximately 100,000 spectra per hour when searching against a complete human database with eight variable modifications. The algorithm is available for public searches at http://www.massmatrix.net.
Project description:Today's highly accurate spectra provided by modern tandem mass spectrometers offer considerable advantages for the analysis of proteomic samples of increased complexity. Among other factors, the quantity of reliably identified peptides is considerably influenced by the peptide identification algorithm. While most widely used search engines were developed when high-resolution mass spectrometry data were not readily available for fragment ion masses, we have designed a scoring algorithm particularly suitable for high mass accuracy. Our algorithm, MS Amanda, is generally applicable to HCD, ETD, and CID fragmentation type data. The algorithm confidently explains more spectra at the same false discovery rate than Mascot or SEQUEST on examined high mass accuracy data sets, with excellent overlap and identical peptide sequence identification for most spectra also explained by Mascot or SEQUEST. MS Amanda, available at http://ms.imp.ac.at/?goto=msamanda , is provided free of charge both as a standalone version for integration into custom workflows and as a plugin for the Proteome Discoverer platform.
Project description:A novel hierarchical MS(2)/MS(3) database search algorithm has been developed to analyze MS(2)/MS(3) phosphopeptide proteomic data. The algorithm is incorporated in an automated database search program, MassMatrix. The algorithm matches experimental MS(2) spectra against a supplied protein database to determine candidate peptide matches. It then matches the corresponding experimental MS(3) spectra against those candidate peptide matches. The MS(2) and MS(3) spectra are used in concert to arrive at peptide matches with overall higher confidence rather than combining MS(2) and MS(3) data searched separately. Receiver operating characteristic analysis showed that hierarchical MS(2)/MS(3) database searches with MassMatrix had better sensitivity and specificity than the two-stage MS(2)/MS(3) database searches obtained with MassMatrix, MASCOT, and X!Tandem. A greater number of true peptide matches at a given false rate were identified by use of this new algorithm for data collected on both LCQ and LTQ-FTICR mass spectrometers. The additional MS(3) spectral data also improved the overall reliability and the number of true positives (TPs) because the TPs of the MS(2)/MS(3) search results had higher scores than those of the MS(2) search results.
Project description:The recent emergence of new mass spectrometry techniques (e.g. electron transfer dissociation, ETD) and the improved availability of additional proteases (e.g. Lys-N) for protein digestion in high-throughput experiments have raised the challenge of designing new algorithms for interpreting the resulting new types of tandem mass (MS/MS) spectra. Traditional MS/MS database search algorithms such as SEQUEST and Mascot were originally designed for collision induced dissociation (CID) of tryptic peptides and are largely based on expert knowledge about fragmentation of tryptic peptides (rather than machine learning techniques) to design CID-specific scoring functions. As a result, the performance of these algorithms is suboptimal for new mass spectrometry technologies or nontryptic peptides. We recently proposed the generating function approach (MS-GF) for CID spectra of tryptic peptides. In this study, we extend MS-GF to automatically derive scoring parameters from a set of annotated MS/MS spectra of any type (e.g. CID, ETD, etc.), and present a new database search tool, MS-GFDB, based on MS-GF. We show that MS-GFDB outperforms Mascot for ETD spectra or peptides digested with Lys-N. For example, in the case of ETD spectra, the number of tryptic and Lys-N peptides identified by MS-GFDB increased by a factor of 2.7 and 2.6, respectively, as compared with Mascot. Moreover, even following a decade of Mascot developments for analyzing CID spectra of tryptic peptides, MS-GFDB (which is not particularly tailored for CID spectra or tryptic peptides) resulted in a 28% increase over Mascot in the number of peptide identifications. Finally, we propose a statistical framework for analyzing multiple spectra from the same precursor (e.g. CID/ETD spectral pairs) and assigning p values to peptide-spectrum-spectrum matches.
Project description:BACKGROUND: High-resolution tandem mass spectra can now be readily acquired with hybrid instruments, such as LTQ-Orbitrap and LTQ-FT, in high-throughput shotgun proteomics workflows. The improved spectral quality enables more accurate de novo sequencing for identification of post-translational modifications and amino acid polymorphisms. RESULTS: In this study, a new de novo sequencing algorithm, called Vonode, has been developed specifically for analysis of such high-resolution tandem mass spectra. To fully exploit the high mass accuracy of these spectra, a unique scoring system is proposed to evaluate sequence tags based primarily on mass accuracy information of fragment ions. Consensus sequence tags were inferred for 11,422 spectra with an average peptide length of 5.5 residues from a total of 40,297 input spectra acquired in a 24-hour proteomics measurement of Rhodopseudomonas palustris. The accuracy of inferred consensus sequence tags was 84%. According to our comparison, the performance of Vonode was shown to be superior to the PepNovo v2.0 algorithm, in terms of the number of de novo sequenced spectra and the sequencing accuracy. CONCLUSIONS: Here, we improved de novo sequencing performance by developing a new algorithm specifically for high-resolution tandem mass spectral data. The Vonode algorithm is freely available for download at http://compbio.ornl.gov/Vonode.
Project description:Mixture modeling of mass spectra is an approach with many potential applications, including peak detection and quantification, smoothing, de-noising, feature extraction, and spectral signal compression. However, existing algorithms do not allow for automated analyses of whole spectra. Therefore, although several papers have highlighted potential advantages of mixture modeling of mass spectra of peptide/protein mixtures and presented preliminary results, the approach has so far not been developed to a stage that enables systematic comparisons with existing software packages for proteomic mass spectra analyses. In this paper we present an efficient algorithm for Gaussian mixture modeling of proteomic mass spectra of different types (e.g., MALDI-ToF profiling, MALDI-IMS). The main idea is automated partitioning of the protein mass spectral signal into fragments. The obtained fragments are separately decomposed into Gaussian mixture models. The parameters of the mixture models of the fragments are then aggregated to form the mixture model of the whole spectrum. We compare the developed algorithm to existing algorithms for peak detection and demonstrate improvements in peak detection efficiency obtained by using Gaussian mixture modeling. We also show applications of the developed algorithm to real proteomic datasets of low and high resolution.
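The aggregation step (fragment-level mixture models combined into one whole-spectrum model) can be sketched under a simple assumption that the abstract does not state: each fragment's component weights are rescaled by that fragment's share of the total signal area. The data layout and the function names `aggregate` and `mixture_density` are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Normal probability density with mean `mu` and std. dev. `sigma`."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def aggregate(fragment_models):
    """Merge per-fragment mixtures into one whole-spectrum mixture.

    `fragment_models` is a list of (signal_area, components), where
    components is a list of (weight, mu, sigma) with weights summing to 1
    within each fragment. Assumption: each fragment contributes in
    proportion to its share of the total signal area.
    """
    total_area = sum(area for area, _ in fragment_models)
    whole = []
    for area, components in fragment_models:
        for weight, mu, sigma in components:
            whole.append((weight * area / total_area, mu, sigma))
    return whole


def mixture_density(x, components):
    """Evaluate the mixture density at m/z value `x`."""
    return sum(w * gaussian(x, mu, sigma) for w, mu, sigma in components)


fragments = [
    (30.0, [(0.5, 1000.0, 2.0), (0.5, 1010.0, 2.0)]),
    (10.0, [(1.0, 2000.0, 4.0)]),
]
whole_model = aggregate(fragments)
```

The merged weights (0.375, 0.375, 0.25 in this toy case) again sum to 1, so `mixture_density` remains a valid density over the whole spectrum. Fitting the per-fragment models themselves, for example by expectation-maximization, is outside the scope of this sketch.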
Project description:Label-free quantification has become common practice in many mass spectrometry-based proteomics experiments. In recent years, we and others have shown that spectral clustering can considerably improve the analysis of (primarily large-scale) proteomics data sets. Here we show that spectral clustering can be used to infer additional peptide-spectrum matches and improve the quality of label-free quantitative proteomics data even in data sets containing only tens of MS runs. We analyzed four well-known public benchmark data sets that represent different experimental settings using spectral counting and peak intensity based label-free quantification. In both approaches, the peptide-spectrum matches additionally inferred through our spectra-cluster algorithm improved the detectability of low abundant proteins while increasing the accuracy of the derived quantitative data, without increasing the data sets' noise. Additionally, we developed a Proteome Discoverer node for our spectra-cluster algorithm, which allows anyone to rebuild our proposed pipeline using the free version of Proteome Discoverer.
Project description:In shotgun proteomics, tandem mass spectra of peptides are typically identified through database search algorithms such as Sequest. We have developed DirecTag, an open-source algorithm to infer partial sequence tags directly from observed fragment ions. This algorithm is unique in its implementation of three separate scoring systems to evaluate each tag on the basis of peak intensity, m/z fidelity, and complementarity. In data sets from several types of mass spectrometers, DirecTag reproducibly exceeded the accuracy and speed of InsPecT and GutenTag, two previously published algorithms for this purpose. The source code and binaries for DirecTag are available from http://fenchurch.mc.vanderbilt.edu.
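The basic idea behind sequence tagging, reading residues off the mass gaps between fragment peaks, can be sketched as follows. This is not DirecTag and implements none of its three scoring systems: it only enumerates tags whose consecutive peak gaps match a residue mass within a tolerance. The function name `find_tags`, the tolerance, and the reduced residue table (monoisotopic masses, singly charged fragments assumed, isobaric I omitted in favor of L) are illustrative assumptions.

```python
# Monoisotopic residue masses in Da for a subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
    "R": 156.10111,
}


def find_tags(mzs, tag_length=3, tol=0.02):
    """Enumerate sequence tags from gaps between singly charged peaks.

    A tag of length n is a chain of n + 1 peaks in increasing m/z order
    whose consecutive differences each match one residue mass within
    `tol`. Tags are read in increasing m/z order only.
    """
    mzs = sorted(mzs)

    def extend(last_mz, tag):
        if len(tag) == tag_length:
            yield tag
            return
        for mz in mzs:
            gap = mz - last_mz
            if gap <= 0:
                continue
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tol:
                    yield from extend(mz, tag + residue)

    tags = set()
    for start in mzs:
        tags.update(extend(start, ""))
    return sorted(tags)


# Ladder of peaks spelling A, S, L, plus one unrelated noise peak at 300.5.
peaks = [200.0, 271.03711, 358.06914, 471.1532, 300.5]
tags = find_tags(peaks)
```

For this ladder the only chain of three matching gaps spells the tag `ASL`; the noise peak does not fit any residue gap and is ignored. A real tagger would also have to rank competing tags, which is where scores such as peak intensity and complementarity come in.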