Enhanced peptide quantification using spectral count clustering and cluster abundance.
ABSTRACT: Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectra feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database search and inaccurate estimation of the liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on the spectral library search algorithms such as SEQUEST or SpectraST have been found to provide neither the best match nor high-scored matches. Lastly, these methods are limited in the sense that target peptides cannot be identified unless they have been previously generated and stored in the database or spectral libraries. To overcome these limitations, we propose a novel method, namely Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH), to estimate a peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through the direct comparison of experimental spectra. Intuitively, our Q-FISH method compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets. We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,747 clusters.
Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC and normal tissue samples. While it would be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess differentially expressed peptides quantitatively. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly differential spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human liver cancer. Comparing Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides found by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides discovered by both Q-FISH and SEQUEST, 13 peptides are known for human liver cancer and the remaining 9 peptides are known to be associated with other cancers. We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be a useful tool for both peptide identification and quantification in mass spectrometry data analysis. It may also prove to be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.
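The grouping step described in this abstract (comparing all pairs of experimental spectra, clustering replicates, then counting spectra per sample) can be illustrated with a minimal Python sketch. It is not the Q-FISH implementation: the abstract does not state the similarity measure or the clustering rule, so the cosine similarity on 1 Da bins, the 0.8 threshold, the single-linkage grouping, and the `peaks`/`sample` record keys are all assumptions made for illustration. The beta-binomial test is not shown.

```python
from collections import Counter
from itertools import combinations
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Collapse (m/z, intensity) peaks into a sparse binned vector (assumed 1 Da bins)."""
    vec = Counter()
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_spectra(spectra, threshold=0.8):
    """Single-linkage grouping: any pair with similarity >= threshold is merged.

    Returns one Counter per cluster giving the spectral count per sample.
    """
    vectors = [bin_spectrum(s["peaks"]) for s in spectra]
    parent = list(range(len(spectra)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of experimental spectra (quadratic in the number of spectra).
    for i, j in combinations(range(len(spectra)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for idx, s in enumerate(spectra):
        clusters.setdefault(find(idx), Counter())[s["sample"]] += 1
    return list(clusters.values())
```

Clusters whose counter contains both sample labels correspond to the "identified in both samples" group; their per-sample counts are the kind of input a differential test such as the beta-binomial would take.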
Project description:Spectral counting has become a popular method for LC-MS/MS based proteome quantification; however, this methodology is often not reliable when proteins are identified by a small number of spectra. Here, we present a simple strategy to improve spectral counting based quantification for low-abundance proteins by recovering low-quality or low-scoring spectra for confidently identified peptides. In this approach, stringent data filtering criteria were initially applied to achieve confident peptide identifications with low false discovery rate (e.g., < 1% at peptide level) after LC-MS/MS analysis and database search by SEQUEST. Then, all low-scoring MS/MS spectra that matched to this set of confidently identified peptides were recovered, leading to more than 20% increase of total identified spectra. The validity of these recovered spectra was assessed by the parent ion mass measurement error distribution, retention time distribution, and by comparing the individual low score and high score spectra that correspond to the same peptides. The results support that the recovered low-scoring spectra have similar confidence levels in peptide identifications as the spectra passing the initial stringent filter. The application of this strategy of recovering low-scoring spectra significantly improved the spectral count quantification statistics for low-abundance proteins, as illustrated in the identification of mouse brain region specific proteins.
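The recovery strategy in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the authors' pipeline: a single score cutoff stands in for the FDR-controlled filtering, the record keys `spectrum`, `peptide`, and `score` are hypothetical, and the validation steps (mass error, retention time, spectrum comparison) are omitted.

```python
def recover_low_scoring(psms, score_cutoff):
    """Split peptide-spectrum matches into (confident, recovered) lists.

    psms: list of dicts with hypothetical keys 'spectrum', 'peptide', 'score'.
    A PSM is recovered when it falls below the cutoff but its peptide was
    already identified confidently by at least one other spectrum.
    """
    confident = [p for p in psms if p["score"] >= score_cutoff]
    trusted_peptides = {p["peptide"] for p in confident}
    recovered = [
        p for p in psms
        if p["score"] < score_cutoff and p["peptide"] in trusted_peptides
    ]
    return confident, recovered

def spectral_counts(psms):
    """Count spectra per peptide."""
    counts = {}
    for p in psms:
        counts[p["peptide"]] = counts.get(p["peptide"], 0) + 1
    return counts
```

Calling `spectral_counts(confident + recovered)` gives the augmented counts; low-scoring matches to peptides that never passed the stringent filter stay excluded.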
Project description:The analysis of tandem mass (MS/MS) data to identify and quantify proteins is hampered by the heterogeneity of file formats at the raw spectral data, peptide identification, and protein identification levels. Different mass spectrometers output their raw spectral data in a variety of proprietary formats, and alternative methods that assign peptides to MS/MS spectra and infer protein identifications from those peptide assignments each write their results in different formats. Here we describe an MS/MS analysis platform, the Trans-Proteomic Pipeline, which makes use of open XML file formats for storage of data at the raw spectral data, peptide, and protein levels. This platform enables uniform analysis and exchange of MS/MS data generated from a variety of different instruments, and assigned peptides using a variety of different database search programs. We demonstrate this by applying the pipeline to data sets generated by ThermoFinnigan LCQ, ABI 4700 MALDI-TOF/TOF, and Waters Q-TOF instruments, and searched in turn using SEQUEST, Mascot, and COMET.
Project description:Identifying peptides from mass spectrometric fragmentation data (MS/MS spectra) using search strategies that map protein sequences to spectra is computationally expensive. An alternative strategy uses direct spectrum-to-spectrum matching against a reference library of previously observed MS/MS that has the advantage of evaluating matches using fragment ion intensities and other ion types than the simple set normally used. However, this approach is limited by the small sizes of the available peptide MS/MS libraries and the inability to evaluate the rate of false assignments. In this study, we observed good performance of simulated spectra generated by the kinetic model implemented in MassAnalyzer (Zhang, Z. (2004) Prediction of low-energy collision-induced dissociation spectra of peptides. Anal. Chem. 76, 3908-3922; Zhang, Z. (2005) Prediction of low-energy collision-induced dissociation spectra of peptides with three or more charges. Anal. Chem. 77, 6364-6373) as a substitute for the reference libraries used by the spectrum-to-spectrum search programs X!Hunter and BiblioSpec and similar results in comparison with the spectrum-to-sequence program Mascot. We also demonstrate the use of simulated spectra for searching against decoy sequences to estimate false discovery rates. Although we found lower score discrimination with spectrum-to-spectrum searches than with Mascot, particularly for higher charge forms, comparable peptide assignments with low false discovery rate were achieved by examining consensus between X!Hunter and Mascot, filtering results by mass accuracy, and ignoring score thresholds. Protein identification results are comparable to those achieved when evaluating consensus between Sequest and Mascot. Run times with large scale data sets using X!Hunter with the simulated spectral library are 7 times faster than Mascot and 80 times faster than Sequest with the human International Protein Index (IPI) database. 
We conclude that simulated spectral libraries greatly expand the search space available for spectrum-to-spectrum searching while enabling principled analyses and that the approach can be used in consensus strategies for large scale studies while reducing search times.
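The decoy-based false discovery rate estimate mentioned in this abstract can be illustrated as follows. The abstract does not say which estimator was used; the sketch assumes the common convention of dividing the number of decoy matches by the number of target matches above a score threshold, and represents each match as a hypothetical `(score, is_decoy)` tuple.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR among matches scoring >= threshold as decoys / targets.

    psms: iterable of (score, is_decoy) tuples (hypothetical layout).
    Returns 0.0 when no target match passes the threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Raising the threshold until the estimate drops below a chosen level (for example 0.01) is the usual way such an estimate is turned into a score cutoff.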
Project description:Recent emergence of new mass spectrometry techniques (e.g. electron transfer dissociation, ETD) and improved availability of additional proteases (e.g. Lys-N) for protein digestion in high-throughput experiments raised the challenge of designing new algorithms for interpreting the resulting new types of tandem mass (MS/MS) spectra. Traditional MS/MS database search algorithms such as SEQUEST and Mascot were originally designed for collision induced dissociation (CID) of tryptic peptides and are largely based on expert knowledge about fragmentation of tryptic peptides (rather than machine learning techniques) to design CID-specific scoring functions. As a result, the performance of these algorithms is suboptimal for new mass spectrometry technologies or nontryptic peptides. We recently proposed the generating function approach (MS-GF) for CID spectra of tryptic peptides. In this study, we extend MS-GF to automatically derive scoring parameters from a set of annotated MS/MS spectra of any type (e.g. CID, ETD, etc.), and present a new database search tool MS-GFDB based on MS-GF. We show that MS-GFDB outperforms Mascot for ETD spectra or peptides digested with Lys-N. For example, in the case of ETD spectra, the number of tryptic and Lys-N peptides identified by MS-GFDB increased by a factor of 2.7 and 2.6 as compared with Mascot. Moreover, even following a decade of Mascot developments for analyzing CID spectra of tryptic peptides, MS-GFDB (that is not particularly tailored for CID spectra or tryptic peptides) resulted in 28% increase over Mascot in the number of peptide identifications. Finally, we propose a statistical framework for analyzing multiple spectra from the same precursor (e.g. CID/ETD spectral pairs) and assigning p values to peptide-spectrum-spectrum matches.
Project description:SEQUEST has long been used to identify peptides/proteins from their tandem mass spectra and protein sequence databases. The algorithm has proven to be hugely successful for its sensitivity and specificity in identifying peptides/proteins, the sequences of which are present in the protein sequence databases. Here we report an attempt at a new use for the algorithm: applying it to search a complete list of theoretically possible peptides, a de novo-like sequencing approach. We used freely available mass spectral data and determined a number of unique peptides as identified by SEQUEST. Using the masses of these peptides and a mass accuracy of 0.001 Da, we created a database of all theoretically possible peptide sequences corresponding to the precursor masses. We used our recently developed algorithm for determining all amino acid compositions corresponding to a mass interval, and used a lexicographic ordering to generate theoretical sequences from the compositions. The newly generated theoretical database was many-fold more complex than the original protein sequence database. We used SEQUEST to search and identify the best matches to the spectra from all theoretically possible peptide sequences. We found that the SEQUEST cross-correlation score ranked the correct peptide match among the top sequence matches. The results testify to the high specificity of SEQUEST when combined with high mass accuracy for intact peptides.
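The idea of listing every amino acid composition that fits a precursor mass within 0.001 Da can be illustrated with a naive recursive enumeration in Python. This is not the authors' algorithm, which the abstract does not describe; it is a brute-force sketch assuming unmodified residues, standard monoisotopic residue masses, and a neutral peptide mass (residues plus one water). Leucine and isoleucine share a mass and are represented by `L`.

```python
# Monoisotopic residue masses in Da; "L" stands for both leucine and isoleucine.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def compositions(target_mass, tol=0.001):
    """Return all residue-count dicts whose peptide mass is within tol of target_mass."""
    names = sorted(RESIDUE_MASS, key=RESIDUE_MASS.get)
    results = []

    def extend(start, remaining, counts):
        if counts and abs(remaining) <= tol:
            results.append(dict(counts))
        # Only add residues at index >= start, so each composition is built once.
        for i in range(start, len(names)):
            name = names[i]
            mass = RESIDUE_MASS[name]
            if mass <= remaining + tol:
                counts[name] = counts.get(name, 0) + 1
                extend(i, remaining - mass, counts)
                counts[name] -= 1
                if counts[name] == 0:
                    del counts[name]

    extend(0, target_mass - WATER, {})
    return results
```

For a neutral mass of 146.06913 Da the enumeration returns two compositions, one glycine plus one alanine and a single glutamine, which have the same elemental formula. The number of compositions, and of sequences per composition, grows quickly with mass, which is consistent with the abstract's remark that the theoretical database is many-fold more complex than the protein database.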
Project description:A concept of unique peptides (CUP) was proposed and implemented to identify whole-cell proteins from tandem mass spectrometry (MS/MS) ion spectra. A unique peptide is defined as a peptide, irrespective of its length, that exists in only one protein of a proteome of interest, even though it may appear more than once in that protein. Integrating CUP, a two-step whole-cell protein identification strategy was developed to further increase the confidence of identified proteins. A dataset containing 40,243 MS/MS ion spectra of Saccharomyces cerevisiae and protein identification tools including Mascot and SEQUEST were used to illustrate the proposed concept and strategy. Without implementing CUP, the proteins identified by SEQUEST were 2.26-fold those identified by Mascot. When CUP was applied, the proteins bearing unique peptides identified by SEQUEST were 3.89-fold those identified by Mascot. By cross-comparing the two sets of identified proteins, only 89 common proteins derived from CUP were found. The key discrepancy between the identified proteins resulted from the filtering criteria employed by each protein identification tool. According to the origin of peptides classified by CUP and the commonality of proteins recognized by the protein identification tools, all identified proteins were cross-compared, resulting in four groups of proteins possessing different levels of assigned confidence.
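The unique-peptide definition above lends itself to a short Python sketch. It rests on assumptions not stated in the abstract: peptides are generated by a simplified in-silico tryptic digest (cleavage after K or R unless followed by P, no missed cleavages), and the proteome is a hypothetical dict mapping protein ids to sequences. Splitting on a zero-width pattern with `re.split` requires Python 3.7 or later.

```python
import re

def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def unique_peptides(proteome):
    """Map each peptide that occurs in exactly one protein to that protein's id.

    A peptide repeated within a single protein still counts as unique,
    matching the definition in the abstract.
    """
    owners = {}
    for protein_id, seq in proteome.items():
        for pep in set(tryptic_peptides(seq)):
            owners.setdefault(pep, set()).add(protein_id)
    return {pep: next(iter(ids)) for pep, ids in owners.items() if len(ids) == 1}
```

A protein supported only by peptides shared with other proteins would get no entry here, which is the situation the two-step strategy is meant to flag with lower confidence.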
Project description:MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The MS/MS search software was evaluated by use of a high mass accuracy dataset and its results compared with those from MASCOT, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than MASCOT, SEQUEST, X!Tandem, and OMSSA for a given specificity, and the percentage of false positives was 2%. More importantly, all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequences and additional variable PTMs did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with MASCOT, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on a distributed memory cluster and achieved search speeds of approximately 100,000 spectra per hour when searching against a complete human database with eight variable modifications. The algorithm is available for public searches at (http://www.massmatrix.net).
Project description:Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has revolutionized the proteomics analysis of complexes, cells, and tissues. In a typical proteomic analysis, the tandem mass spectra from an LC-MS/MS experiment are assigned to a peptide by a search engine that compares the experimental MS/MS peptide data to theoretical peptide sequences in a protein database. The peptide-spectrum matches are then used to infer a list of identified proteins in the original sample. However, the search engines often fail to distinguish between correct and incorrect peptide assignments. In this study, we designed and implemented a novel algorithm called De-Noise to reduce the number of incorrect peptide matches and maximize the number of correct peptides at a fixed false discovery rate using a minimal number of scoring outputs from the SEQUEST search engine. The novel algorithm uses a three-step process: data cleaning, data refining through an SVM-based decision function, and a final data refining step based on proteolytic peptide patterns. Using proteomics data generated on different types of mass spectrometers, we optimized the De-Noise algorithm on the basis of the resolution and mass accuracy of the mass spectrometer employed in the LC-MS/MS experiment. Our results demonstrate that De-Noise improves peptide identification compared to other methods used to process the peptide sequence matches assigned by SEQUEST. Because De-Noise uses a limited number of scoring attributes, it can be easily implemented with other search engines.
Project description:This article provides information regarding the effect of four common high-abundance protein (albumin and immunoglobulins (Ig)) depletion strategies upon serum proteomics datasets derived from normal, non-diseased rat or human serum. After tryptic digest, peptides were separated using C18 reverse phase liquid chromatography-tandem mass spectrometry (rpLC-MS/MS). Peptide spectral matching (PSM) and database searching were conducted using MS Amanda 2.0 and Sequest HT. Peptide and protein false discovery rates (FDR) were set at 0.01%, with at least two peptides assigned per protein. Protein quantitation and the extent of albumin and Ig removal were defined by PSM counts. Venn diagram analysis of the core proteomes, derived from proteins identified by both search engines, was performed using Venny. Ontological characterization and gene set enrichment were performed using WebGestalt. The dataset resulting from each depletion column is provided.
Project description:Fewer than half of all tandem mass spectrometry (MS/MS) spectra acquired in shotgun proteomics experiments are typically matched to a peptide with high confidence. Here we determine the identity of unassigned peptides using an ultra-tolerant Sequest database search that allows peptide matching even with modifications of unknown masses up to ± 500 Da. In a proteome-wide data set on HEK293 cells (9,513 proteins and 396,736 peptides), this approach matched an additional 184,000 modified peptides, which were linked to biological and chemical modifications representing 523 distinct mass bins, including phosphorylation, glycosylation and methylation. We localized all unknown modification masses to specific regions within a peptide. Known modifications were assigned to the correct amino acids with frequencies >90%. We conclude that at least one-third of unassigned spectra arise from peptides with substoichiometric modifications.
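The mass-bin bookkeeping behind such an ultra-tolerant search can be sketched in Python: each match contributes the difference between its observed precursor mass and the theoretical mass of the unmodified peptide, and differences within the ±500 Da window are tallied into bins. The 1 Da bin width and the `(observed, theoretical)` tuple layout are assumptions made for illustration; the abstract does not state how its 523 bins were defined.

```python
from collections import Counter

def delta_mass_bins(matches, window=500.0, bin_width=1.0):
    """Histogram of (observed - theoretical) precursor mass differences in Da.

    matches: iterable of (observed_mass, theoretical_mass) tuples.
    Differences outside +/- window are ignored; keys are bin centers.
    """
    bins = Counter()
    for observed, theoretical in matches:
        delta = observed - theoretical
        if abs(delta) <= window:
            bins[round(delta / bin_width) * bin_width] += 1
    return bins
```

With 1 Da bins, a phosphorylation (about +79.966 Da) lands in the bin centered at 80 and a methylation (about +14.016 Da) in the bin centered at 14. Separating modifications of similar nominal mass would need narrower bins.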