Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution.
ABSTRACT: There is a growing trend for biomedical researchers to extract evidence and draw conclusions from mass spectrometry based proteomics experiments, the cornerstone of which is peptide identification. Inaccurate assignments of peptide identification confidence thus may have far-reaching and adverse consequences. Although some peptide identification methods report accurate statistics, they have been limited to certain types of scoring functions. The extreme value statistics based method, while more general in the scoring functions it allows, demands accurate parameter estimates and requires, at least in its original design, excessive computational resources. Improving the parameter estimate accuracy and reducing the computational cost for this method has two advantages: it provides another feasible route to accurate significance assessment, and it could provide reliable statistics for scoring functions yet to be developed. We have formulated and implemented an efficient algorithm for calculating the extreme value statistics for peptide identification applicable to various scoring functions, bypassing the need for searching large random databases. The source code, implemented in C++ on a Linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit. Contact: yyu@ncbi.nlm.nih.gov. Supplementary data are available at Bioinformatics online.
Project description:Motivation: Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of scientific conclusions thus drawn. From the perspective of mass spectrometry-based proteomics, even though accurate statistics for peptide identification can now be achieved, accurate protein-level statistics remain challenging. Results: We have constructed a protein ID method that combines the peptide evidence for a candidate protein based on a rigorous formula derived earlier; in this formula the database P-value of every peptide is weighted, prior to the final combination, according to the number of proteins it maps to. We have also shown that this protein ID method provides accurate protein-level E-values, eliminating the need for empirical post-processing methods for type-I error control. Using a known protein mixture, we find that this protein ID method, when combined with the Sorić formula, yields accurate values for the proportion of false discoveries. In terms of retrieval efficacy, the results from our method are comparable with those of the other methods tested. Availability and implementation: The source code, implemented in C++ on a Linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit.
Project description:Mass spectrometry-based proteomics starts with identifications of peptides and proteins, which provide the bases for forming the next-level hypotheses whose "validations" are often employed for forming even higher level hypotheses and so forth. Scientifically meaningful conclusions are thus attainable only if the number of falsely identified peptides/proteins is accurately controlled. For this reason, RAId has been under continuous development over the past decade. RAId employs rigorous statistics for peptide/protein identification, hence assigning accurate P-values/E-values that can be used confidently to control the number of falsely identified peptides and proteins. The RAId web service is a versatile tool built to identify peptides and proteins from tandem mass spectrometry data. In addition to recognizing various spectrum file formats, the web service offers four peptide scoring functions and a choice of three statistical methods for assigning P-values/E-values to identified peptides. Users may upload their own protein database or use one of the available knowledge-integrated organismal databases that contain annotated information such as single amino acid polymorphisms, post-translational modifications, and their disease associations. The web service also provides a friendly interface to display, sort by different criteria, and download the identified peptides and proteins. The RAId web service is freely available at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid.
Project description:Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given an MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given an MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
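Mode (i) above, counting all possible peptides in a mass range, lends itself to a short dynamic-programming illustration. The sketch below is not RAId_aPS code: it assumes integer (nominal) residue masses and a reduced five-residue alphabet, and the names `RESIDUE_MASS`, `count_peptides`, and `count_in_range` are hypothetical.

```python
# Nominal (integer) residue masses in Da for a small illustrative alphabet.
# Assumption: integer masses and this reduced alphabet are for illustration only.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def count_peptides(max_mass, residue_mass=RESIDUE_MASS):
    """Return counts where counts[m] is the number of ordered residue
    sequences whose residue masses sum to exactly m."""
    counts = [0] * (max_mass + 1)
    counts[0] = 1  # the empty sequence has mass 0
    for m in range(1, max_mass + 1):
        # A sequence of mass m ends in some residue of mass w; the rest
        # of the sequence then has mass m - w.
        counts[m] = sum(counts[m - w] for w in residue_mass.values() if w <= m)
    return counts


def count_in_range(lo, hi, residue_mass=RESIDUE_MASS):
    """Total number of sequences with summed residue mass in [lo, hi]."""
    counts = count_peptides(hi, residue_mass)
    return sum(counts[lo:hi + 1])


if __name__ == "__main__":
    print(count_in_range(57, 128))
```

For an additive scoring function, the same recurrence could in principle carry a second index for a discretized score, which would turn the per-mass counts into a score histogram; that extension is not shown here.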
Project description:The Rating Anxiety in Dementia (RAID; Shankar, K.K., Walker, M., Frost, D., & Orrell, M.W. (1999). The development of a valid and reliable scale for rating anxiety in dementia (RAID). Aging and Mental Health, 3, 39-49.) is a clinical rating scale developed to evaluate anxiety in persons with dementia. This report explores the psychometric properties and clinical utility of a new structured interview format of the RAID (RAID-SI), developed to standardize administration and scoring based on information obtained from the patient, an identified collateral, and rater observation. The RAID-SI was administered by trained master's level raters. Participants were 32 persons with dementia who qualified for an anxiety treatment outcome study. Self-report anxiety, depression, and quality of life measures were administered to both the person with dementia and a collateral. The RAID-SI exhibited adequate internal consistency reliability and inter-rater reliability. There was also some evidence of construct validity as indicated by significant correlations with other measures of patient-reported and collateral-reported anxiety, and non-significant correlations with collateral reports of patient depression and quality of life. Further, RAID-SI scores were significantly higher in persons with an anxiety diagnosis compared to those without an anxiety diagnosis. There is evidence that the RAID-SI exhibits good reliability and validity in older adults with dementia. The advantage of the structured interview format is increased standardization in administration and scoring, which may be particularly important when RAID raters are not experienced clinicians.
Project description:The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides. Using a simple scoring scheme, we propose a database search method, RAId_DbS, with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.
Project description:In the analysis of differential peptide peak intensities (i.e. abundance measures), LC-MS analyses with poor quality peptide abundance data can bias downstream statistical analyses and hence the biological interpretation for an otherwise high-quality dataset. Although considerable effort has been placed on assuring the quality of the peptide identification with respect to spectral processing, to date quality assessment of the subsequent peptide abundance data matrix has been limited to a subjective visual inspection of run-by-run correlation or individual peptide components. Identifying statistical outliers is a critical step in the processing of proteomics data as many of the downstream statistical analyses [e.g. analysis of variance (ANOVA)] rely upon accurate estimates of sample variance, and their results are influenced by extreme values. We describe a novel multivariate statistical strategy for the identification of LC-MS runs with extreme peptide abundance distributions. Comparison with the current method (run-by-run correlation) demonstrates a significantly better rate of identification of outlier runs by the multivariate strategy. Simulation studies also suggest that this strategy significantly outperforms correlation alone in the identification of statistically extreme liquid chromatography-mass spectrometry (LC-MS) runs. The software is available at https://www.biopilot.org/docs/Software/RMD. Supplementary material is available at Bioinformatics online.
Project description:Although many methods and statistical approaches have been developed for protein identification by mass spectrometry, the problem of accurate assessment of statistical significance of protein identifications remains an open question. The main issues are as follows: (i) statistical significance of inferring peptides from experimental mass spectra must be platform independent and spectrum specific and (ii) individual spectrum matches at the peptide level must be combined into a single statistical measure at the protein level. We present a method and software to assign statistical significance to protein identifications from search engines for mass spectrometric data. The approach is based on the asymptotic theory of order statistics. The parameters of the asymptotic distributions of identification scores are estimated for each spectrum individually. The method relies on new unbiased estimators for the parameters of the extreme value distribution. The estimated parameters are used to assign a spectrum-specific P-value to each peptide-spectrum match. The protein-level confidence measure combines P-values of peptide-to-spectrum matches. We extensively tested the method using triplicate mouse and yeast high-throughput proteomic experiments. The proposed statistical approach improves the sensitivity of protein identifications without compromising specificity. While the method was primarily designed to work with Mascot, it is platform-independent and is applicable to any search engine which outputs a single score for a peptide-spectrum match. We demonstrate this by testing the method in conjunction with X!Tandem. The software is available for download at ftp://genetics.bwh.harvard.edu/SSPV. Supplementary data are available at Bioinformatics online.
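As a rough illustration of the spectrum-specific P-value step described above, the sketch below fits a Gumbel (extreme value) distribution to a sample of null scores for one spectrum and converts an observed score into a P-value and an E-value. It uses plain method-of-moments estimates, not the unbiased estimators the abstract refers to. The function names, the idea of supplying a ready-made sample of null best scores, and the relation P = 1 - exp(-E) between the two statistics are all assumptions made for this example.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(null_scores):
    """Method-of-moments estimates (mu, beta) for a Gumbel (maximum)
    distribution: variance = (pi * beta)^2 / 6, mean = mu + gamma * beta.
    Assumption: null_scores is a sample of null best scores for one spectrum."""
    beta = statistics.stdev(null_scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(null_scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_e_value(score, mu, beta):
    """Expected number of null hits scoring >= score, assuming P = 1 - exp(-E)."""
    return math.exp(-(score - mu) / beta)


def gumbel_p_value(score, mu, beta):
    """P(best null score >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    # -expm1(-x) evaluates 1 - exp(-x) accurately when x is tiny.
    return -math.expm1(-gumbel_e_value(score, mu, beta))


if __name__ == "__main__":
    mu, beta = fit_gumbel_moments([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up scores
    print(mu, beta, gumbel_p_value(8.0, mu, beta))
```

Because the parameters are fitted per spectrum, the same raw score can map to different P-values for different spectra, which is the point of a spectrum-specific statistic.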
Project description:Spectral counting has become a popular method for LC-MS/MS based proteome quantification; however, this methodology is often not reliable when proteins are identified by a small number of spectra. Here, we present a simple strategy to improve spectral counting based quantification for low-abundance proteins by recovering low-quality or low-scoring spectra for confidently identified peptides. In this approach, stringent data filtering criteria were initially applied to achieve confident peptide identifications with low false discovery rate (e.g., < 1% at peptide level) after LC-MS/MS analysis and database search by SEQUEST. Then, all low-scoring MS/MS spectra that matched to this set of confidently identified peptides were recovered, leading to more than 20% increase of total identified spectra. The validity of these recovered spectra was assessed by the parent ion mass measurement error distribution, retention time distribution, and by comparing the individual low score and high score spectra that correspond to the same peptides. The results support that the recovered low-scoring spectra have similar confidence levels in peptide identifications as the spectra passing the initial stringent filter. The application of this strategy of recovering low-scoring spectra significantly improved the spectral count quantification statistics for low-abundance proteins, as illustrated in the identification of mouse brain region specific proteins.
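The two-pass recovery strategy described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tuple layout of a peptide-spectrum match, the single score cutoff standing in for the stringent filter, and the function name `recover_spectral_counts` are assumptions, and the validation steps (mass error and retention time checks) are not modeled.

```python
from collections import Counter


def recover_spectral_counts(psms, strict_cutoff):
    """Compare per-protein spectral counts before and after recovery.

    psms: iterable of (spectrum_id, peptide, protein, score) tuples
          (hypothetical layout; higher score means a better match).
    Returns (strict_counts, recovered_counts) as Counters keyed by protein.
    """
    psms = list(psms)
    # Pass 1: peptides with at least one spectrum passing the strict filter.
    confident_peptides = {pep for _, pep, _, score in psms if score >= strict_cutoff}
    strict_counts = Counter(prot for _, _, prot, score in psms if score >= strict_cutoff)
    # Pass 2: count every spectrum matched to a confident peptide,
    # including those that scored below the strict cutoff.
    recovered_counts = Counter(
        prot for _, pep, prot, _ in psms if pep in confident_peptides
    )
    return strict_counts, recovered_counts


if __name__ == "__main__":
    example = [
        ("s1", "PEPA", "P1", 3.0),
        ("s2", "PEPA", "P1", 1.2),  # low score, but PEPA is confidently identified
        ("s3", "PEPB", "P2", 0.9),  # PEPB never passes the filter, so not recovered
        ("s4", "PEPC", "P2", 2.8),
    ]
    print(recover_spectral_counts(example, strict_cutoff=2.5))
```

Spectra for peptides that never pass the strict filter stay excluded, so recovery can only raise counts for proteins that were already confidently identified.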
Project description:MS dissociation methods, including collision induced dissociation (CID), high energy collision dissociation (HCD), and electron transfer dissociation (ETD), can each contribute distinct peptidome identifications using conventional peptide identification methods (Shen et al. J. Proteome Res. 2011), but such samples still pose significant informatics challenges. In this work, we explored utilization of high accuracy fragment ion mass measurements, in this case provided by Fourier transform MS/MS, to improve peptidome peptide data set size and consistency relative to conventional descriptive and probabilistic scoring methods. For example, we identified 20-40% more peptides than SEQUEST, Mascot, and MS_GF scoring methods using high accuracy fragment ion information and the same false discovery rate (FDR) from CID, HCD, and ETD spectra. Identified species covered >90% of the collective identifications obtained using various conventional peptide identification methods, which significantly addresses the common issue of different data analysis methods generating different peptide data sets. Choice of peptide dissociation and high-precision measurement-based identification methods presently available for degradomic-peptidomic analyses needs to be based on the coverage and confidence (or specificity) afforded by the method, as well as practical issues (e.g., throughput). By using accurate fragment information, >1000 peptidome components can be identified from a single human blood plasma analysis with low peptide-level FDRs (e.g., 0.6%), providing an improved basis for investigating potential disease-related peptidome components.
Project description:RAId is a software package that has been actively developed for the past 10 years for computationally and visually analyzing MS/MS data. Founded on rigorous statistical methods, RAId's core program computes accurate E-values for peptides and proteins identified during database searches. Making this robust tool readily accessible to the proteomics community by developing a graphical user interface (GUI) is our main goal here. We have constructed a graphical user interface to facilitate the use of RAId on users' local machines. Written in Java, RAId_GUI not only simplifies running RAId but also provides tools for data/spectra visualization, MS-product analysis, molecular isotopic distribution analysis, and graphing the retrieval versus the proportion of false discoveries. The results viewer displays the analysis results and allows users to download them. Both the knowledge-integrated organismal databases and the code package (containing source code, the graphical user interface, and a user manual) are available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/raid.html.