ABSTRACT: We adopted the method used in the O-Pair article to validate the FDR calculations of the O-glycopeptide search in MS-Decipher. A detailed description of the method is provided in the ‘Methods’ and ‘Evaluating O-Pair Search performance’ sections of the O-Pair article (Lu et al., 2020).
The results were first filtered for target hits with a GPSM (glycopeptide-spectrum match)-level q-value < 0.01 and a score > 0. The ‘Estimated FDRs’ were then calculated by dividing the number of GPSMs whose sequences matched entrapment proteins by the total number of GPSMs.
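The filtering and ratio described above can be sketched as follows. This is a minimal illustration, not the MS-Decipher or O-Pair implementation: the `Gpsm` record, its field names, and the `estimated_fdr` function are hypothetical, and the estimate is taken as the share of filtered GPSMs that match entrapment proteins, since an FDR must lie between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Gpsm:
    """Hypothetical record for one glycopeptide-spectrum match."""
    protein: str
    is_target: bool
    q_value: float
    score: float


def estimated_fdr(gpsms, entrapment_proteins, q_cutoff=0.01):
    """Entrapment-based FDR estimate over the filtered GPSMs."""
    # Keep target hits with q-value below the cutoff and a positive score.
    kept = [g for g in gpsms
            if g.is_target and g.q_value < q_cutoff and g.score > 0]
    if not kept:
        return 0.0
    # Count filtered GPSMs whose protein belongs to the entrapment set.
    n_entrapment = sum(1 for g in kept if g.protein in entrapment_proteins)
    return n_entrapment / len(kept)


example = [
    Gpsm("P1", True, 0.001, 12.0),
    Gpsm("P2", True, 0.005, 8.0),
    Gpsm("ENTRAP_1", True, 0.009, 3.0),
    Gpsm("P3", True, 0.002, 5.0),
    Gpsm("P4", True, 0.050, 9.0),    # dropped: q-value too high
    Gpsm("P5", False, 0.001, 9.0),   # dropped: decoy hit
    Gpsm("ENTRAP_2", True, 0.001, 0.0),  # dropped: score not above 0
]
fdr = estimated_fdr(example, {"ENTRAP_1", "ENTRAP_2"})
```

In this toy input, four GPSMs survive the filter and one of them matches an entrapment protein, so the estimate is 0.25.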
Project description:Although many methods and statistical approaches have been developed for protein identification by mass spectrometry, the problem of accurate assessment of statistical significance of protein identifications remains an open question. The main issues are as follows: (i) statistical significance of inferring peptide from experimental mass spectra must be platform independent and spectrum specific and (ii) individual spectrum matches at the peptide level must be combined into a single statistical measure at the protein level. We present a method and software to assign statistical significance to protein identifications from search engines for mass spectrometric data. The approach is based on asymptotic theory of order statistics. The parameters of the asymptotic distributions of identification scores are estimated for each spectrum individually. The method relies on new unbiased estimators for parameters of extreme value distribution. The estimated parameters are used to assign a spectrum-specific P-value to each peptide-spectrum match. The protein-level confidence measure combines P-values of peptide-to-spectrum matches. We extensively tested the method using triplicate mouse and yeast high-throughput proteomic experiments. The proposed statistical approach improves the sensitivity of protein identifications without compromising specificity. While the method was primarily designed to work with Mascot, it is platform-independent and is applicable to any search engine which outputs a single score for a peptide-spectrum match. We demonstrate this by testing the method in conjunction with X!Tandem. The software is available for download at ftp://genetics.bwh.harvard.edu/SSPV (contact: email@example.com). Supplementary data are available at Bioinformatics online.
Project description:Accurate assignment of peptide sequences to observed fragmentation spectra is hindered by the large number of hypotheses that must be considered for each observed spectrum. A high score assigned to a particular peptide-spectrum match (PSM) may not end up being statistically significant after multiple testing correction. Researchers can mitigate this problem by controlling the hypothesis space in various ways: considering only peptides resulting from enzymatic cleavages, ignoring possible post-translational modifications or single nucleotide variants, etc. However, these strategies sacrifice identifications of spectra generated by rarer types of peptides. In this work, we introduce a statistical testing framework, cascade search, that directly addresses this problem. The method requires that the user specify a priori a statistical confidence threshold as well as a series of peptide databases. For instance, such a cascade of databases could include fully tryptic, semitryptic, and nonenzymatic peptides or peptides with increasing numbers of modifications. Cascaded search then gradually expands the list of candidate peptides from more likely peptides toward rare peptides, sequestering at each stage any spectrum that is identified with a specified statistical confidence. We compare cascade search to a standard procedure that lumps all of the peptides into a single database, as well as to a previously described group FDR procedure that computes the FDR separately within each database. We demonstrate, using simulated and real data, that cascade search identifies more spectra at a fixed FDR threshold than with either the ungrouped or grouped approach. Cascade search thus provides a general method for maximizing the number of identified spectra in a statistically rigorous fashion.
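The staged procedure described above can be sketched roughly as below. This is a simplified illustration rather than the published cascade search algorithm: the per-stage search callables, the `accept_at_fdr` helper, and the plain decoys-over-targets FDR estimate are all assumptions made for the example.

```python
def accept_at_fdr(matches, alpha):
    """Return ids of target matches accepted at an estimated FDR <= alpha.

    matches: list of (spectrum_id, score, is_decoy) tuples.
    The FDR at each rank is estimated as decoys / targets (an assumption).
    """
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimate still passes.
        if targets and decoys / targets <= alpha:
            best_cut = i
    return {sid for sid, _, is_decoy in ranked[:best_cut] if not is_decoy}


def cascade_search(spectra, stages, alpha):
    """Search databases in order, sequestering confident spectra per stage.

    stages: list of (stage_name, search) pairs, where search(spectrum_id)
    returns (score, is_decoy) for the best match in that stage's database.
    """
    remaining = set(spectra)
    assigned = {}
    for stage_name, search in stages:
        matches = [(s, *search(s)) for s in sorted(remaining)]
        accepted = accept_at_fdr(matches, alpha)
        for s in accepted:
            assigned[s] = stage_name
        # Accepted spectra are not searched again in later stages.
        remaining -= accepted
    return assigned, remaining


# Toy best-match tables standing in for real database searches.
tryptic = {"s1": (10, False), "s2": (9, False), "s3": (2, True), "s4": (1, False)}
semitryptic = {"s3": (8, False), "s4": (3, True)}
stages = [("tryptic", tryptic.__getitem__), ("semitryptic", semitryptic.__getitem__)]
assigned, unassigned = cascade_search(["s1", "s2", "s3", "s4"], stages, alpha=0.01)
```

Here the first stage sequesters `s1` and `s2`, the second stage picks up `s3` from the expanded candidate list, and `s4` remains unidentified.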
Project description:We report numerical and experimental studies of dissipative-soliton-resonance (DSR) in a fiber laser with a nonlinear optical loop mirror. The DSR pulse presents temporally a flat-top profile and a clamped peak power. Its spectrum has a rectangle profile with characteristic steep edges. It shows a unique behavior as pulse energy increases: The rectangle part of the spectrum is unchanged while the newly emerging spectrum sits on the center part and forms a peak. Experimental observations match well with the numerical results. Moreover, the detailed evolution of the DSR pulse compression is both numerically and experimentally demonstrated for the first time. An experimentally obtained DSR pulse of 63 ps duration is compressed down to 760 fs, with low-intensity pedestals using a grating pair. Before being compressed to its narrowest width, the pulse firstly evolves into a cat-ear profile, and the corresponding autocorrelation trace shows a crown shape, which distinguishes itself from properties of other solitons formed in fiber lasers.
Project description:Open modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by allowing a modified spectrum to match against its unmodified variant by using a very wide precursor mass window. A drawback of this strategy, however, is that it leads to a large increase in search time. Although performing an open search can be done using existing spectral library search engines by simply setting a wide precursor mass window, none of these tools have been optimized for OMS, leading to excessive runtimes and suboptimal identification results. We present the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. This approach is combined with a cascade search strategy to maximize the number of identified unmodified and modified spectra while strictly controlling the false discovery rate as well as a shifted dot product score to sensitively match modified spectra to their unmodified counterparts. ANN-SoLo achieves state-of-the-art performance in terms of speed and the number of identifications. On a previously published human cell line data set, ANN-SoLo confidently identifies more spectra than SpectraST or MSFragger and achieves a speedup of an order of magnitude compared with SpectraST. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo .
Project description:Gas chromatography-mass spectrometry (GC-MS) acquisitions routinely yield hundreds to thousands of Electron Ionization (EI) mass spectra. The chemical identification of these spectra typically involves a search protocol that seeks an exact match to a reference spectrum. Reference spectra are found in comprehensive libraries of small molecule EI spectra curated by commercial and public entities. We developed ARISTO (Automatic Reduction of Ion Spectra To Ontology), a webtool, which provides information regarding the general chemical nature of the compound underlying an input EI mass spectrum. Importantly, ARISTO can provide such annotation without necessitating an exact match to a specific compound. ARISTO provides assignments to a subset of the ChEBI (Chemical Entities of Biological Interest) dictionary, an ontology, which aims to cover biologically relevant small molecules. Our system takes as input a mass spectrum represented as a series of mass and intensity pairs; the system returns a graphical representation of the supported ontology as well as a detailed table of suggested annotations along with their associated statistical evidence. ARISTO is accessible at this URL: http://www.ionspectra.org/aristo. The system is free, open to all and does not require registration of any sort.
Project description:The rapidly increasing volume of sequence and structure information available for proteins poses the daunting task of determining their functional importance. Computational methods can prove to be very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly at functionally important sites. Therefore, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function level for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to decipher the varying evolutionary constraints at functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates. We developed a module that predicts functionally important columns (FIC) of an alignment based on the detection of a significant "template match score" to a library template. Our template match score measures an alignment column's similarity to a library template and combines a term explicitly representing a column's residue composition with various evolutionary conservation scores (information content and position-specific scoring matrix-derived statistics). Our benchmarking studies show good sensitivity/specificity for the prediction of functional sites and high accuracy in attributing correct molecular function type to the predicted sites. This prediction method is based on information derived from homologous sequences and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.
Project description:BACKGROUND: Studies find that identifying additional study data is possible by contacting study authors or experts. What is less certain is the time taken, costs involved and value found by using this supplementary search method. The purpose of this study is to determine the effectiveness, efficiency, cost and value of contacting study authors by e-mail, updating the evidence available for this search method. METHODS: Eighty-eight study authors, whose studies met title/abstract inclusion in a systematic review, were contacted by e-mail. Effectiveness was assessed by comparing the number of study authors contacted with the number of replies received; efficiency was assessed by recording the time taken to contact study authors; cost was assessed by comparing the efficiency of contacting authors with the effectiveness; and value was assessed by reading and comparing the published studies with the replies received to see if any unique data was identified. RESULTS: Contacting study authors took 6 h, 54 min and 25 s across 7 weeks. 38 answers (46%) were received from 83 possible contacts. Contacting study authors cost £80.33 or £2.11 per reply. We identified unique data from author replies when compared with data reported in published studies, determining this method as 'valuable'. CONCLUSIONS: Whilst our effectiveness findings differ from other studies, we believe that this study demonstrates the effectiveness of contacting study authors. By linking effectiveness to value and cost, we offer a new way to interpret the 'effectiveness' of this supplementary search method.
Project description:The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. By modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database searches, high mass accuracy searches, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
Project description:A reverse intensity correction method was developed for spectral library searches to correct for instrument response without the side effect of magnifying the noise in the low responsivity region of test spectra. Instead of applying relative intensity correction to the sample test spectra to match the standardized library spectra, a reverse intensity correction is applied to the standardized library spectra to match the uncorrected sample spectrum. This simple procedural change improves library search performance, especially for dispersive charge-coupled device Raman analyzers using near-infrared excitations, where the instrument response often varies greatly across the spectral range, and signal-to-noise ratio in the low responsivity regions is typically poor.
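The procedural change described above can be illustrated with a small sketch. It assumes spectra and the instrument response are sampled on the same axis as equal-length lists, that the uncorrected sample equals the true spectrum multiplied point by point by the response, and that cosine similarity is the match score; the function names are hypothetical and none of this is taken from the original work.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reverse_correct(library_spectrum, response):
    """Apply the instrument response to a standardized library spectrum."""
    return [l * r for l, r in zip(library_spectrum, response)]


def forward_correct(sample_spectrum, response):
    """Conventional correction: divide the raw sample by the response.

    Division by small response values also scales up any noise there.
    """
    return [s / r for s, r in zip(sample_spectrum, response)]


library = [1.0, 2.0, 3.0]
response = [1.0, 0.5, 0.1]            # low responsivity in the last channel
noise = [0.0, 0.0, 0.05]              # small noise in the weak channel
sample_raw = [l * r + n for l, r, n in zip(library, response, noise)]

# Reverse approach: compare the raw sample with the response-weighted library.
score_reverse = cosine(sample_raw, reverse_correct(library, response))
# Conventional approach: compare the corrected sample with the library.
score_forward = cosine(forward_correct(sample_raw, response), library)
```

With these toy numbers the noise in the low-responsivity channel is multiplied by ten in the conventional path, so the reverse path gives the higher match score.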
Project description:Much progress has been made in protein structure prediction during the last few decades. As the predicted models can span a broad range of accuracy, the accuracy of quality estimation becomes one of the key elements of successful protein structure prediction. Over the past years, a number of methods have been developed to address this issue, and these methods can be roughly divided into three categories: single-model methods, clustering-based methods and quasi single-model methods. In this study, we first develop a single-model method, MQAPRank, based on the learning-to-rank algorithm, and then implement a quasi single-model method, Quasi-MQAPRank. The proposed methods are benchmarked on the 3DRobot and CASP11 datasets. Five-fold cross-validation on the 3DRobot dataset shows that the proposed single-model method outperforms the other methods whose outputs are taken as its features, and that the quasi single-model method can further enhance the performance. On the CASP11 dataset, the proposed methods also perform well compared with other leading methods in the corresponding categories. In particular, the Quasi-MQAPRank method achieves considerable performance on the CASP11 Best150 dataset.