Metaprotein expression modeling for label-free quantitative proteomics.
ABSTRACT: Label-free quantitative proteomics holds a great deal of promise for the future study of both medicine and biology. However, the data generated is extremely intricate in its correlation structure, and its proper analysis is complex. There are issues with missing identifications. There are high levels of correlation between many, but not all, of the peptides derived from the same protein. Additionally, there may be systematic shifts in the sensitivity of the machine between experiments or even through time within the duration of a single experiment. We describe a hierarchical model for analyzing unbiased, label-free proteomics data which utilizes the covariance of peptide expression across samples as well as MS/MS-based identifications to group peptides, a strategy we call metaprotein expression modeling. Our metaprotein model acknowledges the possibility of misidentifications, post-translational modifications and systematic differences between samples due to changes in instrument sensitivity or differences in total protein concentration. In addition, our approach allows us to validate findings from unbiased, label-free proteomics experiments with further unbiased, label-free proteomics experiments. Finally, we demonstrate the clinical/translational utility of the model for building predictors capable of differentiating biological phenotypes as well as for validating those findings in the context of three novel cohorts of patients with Hepatitis C. Mass-spectrometry proteomics is quickly becoming a powerful tool for studying biological and translational questions. Making use of all of the information contained in a particular set of data will be critical to the success of those endeavors. Our proposed model represents an advance in the ability of statistical models of proteomic data to identify and utilize correlation between features.
This allows validation of predictors without translation to targeted assays in addition to informing the choice of targets when it is appropriate to generate those assays.
Project description:Label-free quantitative methods are advantageous in bottom-up (shotgun) proteomics because they are robust and can easily be applied to different workflows without additional cost. Both label-based and label-free approaches are routinely applied to discovery-based proteomics experiments and are widely accepted as semiquantitative. Label-free quantitation approaches are segregated into two distinct approaches: peak-abundance-based approaches and spectral counting (SpC). Peak abundance approaches like MaxLFQ, which is integrated into the MaxQuant environment, require precursor peak alignment that is computationally intensive and cannot be routinely applied to low-resolution data. Not limited by these constraints, SpC approaches simply use the number of peptide identifications corresponding to a given protein as a measurement of protein abundance. We show here that spectral counts from multidimensional proteomic data sets have a mean-dispersion relationship that can be modeled in edgeR. Furthermore, by simulating spectral counts, we show that this approach can routinely be applied to large-scale discovery proteomics data sets to determine differential protein expression.
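The mean-dispersion relationship mentioned above can be illustrated with a minimal Python sketch. It assumes the negative binomial parameterization that edgeR's model is built on, where variance = mean + dispersion × mean². It does not reproduce edgeR's dispersion estimation or testing procedures. The function names and the example mean and dispersion values are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance for mean mu: mu + dispersion * mu**2."""
    return mu + dispersion * mu ** 2


def sample_poisson(lam, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_spectral_counts(mu, dispersion, n, seed=0):
    """Simulate n counts as a gamma-Poisson mixture (negative binomial).

    The Poisson rate is drawn from a gamma distribution with mean mu and
    variance dispersion * mu**2, so the counts have mean mu and variance
    mu + dispersion * mu**2.
    """
    rng = random.Random(seed)
    shape = 1.0 / dispersion   # gamma shape parameter
    scale = mu * dispersion    # gamma scale parameter; shape * scale == mu
    return [sample_poisson(rng.gammavariate(shape, scale), rng) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical values: mean spectral count 10, dispersion 0.2.
    counts = simulate_spectral_counts(mu=10.0, dispersion=0.2, n=20000)
    print("sample mean:", mean(counts))
    print("sample variance:", variance(counts))
    print("model variance:", nb_variance(10.0, 0.2))
```

With enough simulated counts, the sample variance should sit well above the sample mean (the Poisson expectation) and close to the value returned by `nb_variance`. That overdispersion is what a negative binomial model is meant to capture.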
Project description:Quantitative mass spectrometry-based proteomics requires protein-level estimates and associated confidence measures. Challenges include the presence of low quality or incorrectly identified peptides and informative missingness. Furthermore, models are required for rolling peptide-level information up to the protein level. We present a statistical model that carefully accounts for informative missingness in peak intensities and allows unbiased, model-based, protein-level estimation and inference. The model is applicable to both label-based and label-free quantitation experiments. We also provide automated, model-based algorithms for filtering of proteins and peptides as well as imputation of missing values. Two LC/MS datasets are used to illustrate the methods. In simulation studies, our methods are shown to achieve substantially more discoveries than standard alternatives. The software has been made available in the open-source proteomics platform DAnTE (http://omics.pnl.gov/software/).
Project description:High resolution proteomics approaches have been successfully utilized for the comprehensive characterization of the cell proteome. However, in the case of quantitative proteomics an open question remains: which quantification strategy is best suited for identifying biologically relevant changes, especially in clinical specimens? In this study, a thorough comparison of a label-free approach (intensity-based) and 8-plex iTRAQ was conducted as applied to the analysis of tumor tissue samples from non-muscle invasive and muscle-invasive bladder cancer. For the latter, two acquisition strategies were tested, including analysis of unfractionated and fractionated iTRAQ-labeled peptides. To reduce variability, aliquots of the same protein extract were used as starting material, whereas to obtain representative results per method further sample processing and MS analysis were conducted according to routinely applied protocols. Considering only multiple-peptide identifications, LC-MS/MS analysis resulted in the identification of 910, 1092 and 332 proteins by label-free, fractionated and unfractionated iTRAQ, respectively. The label-free strategy provided higher protein sequence coverage compared to both iTRAQ experiments. Even though pre-fractionation of the iTRAQ-labeled peptides allowed for a higher number of identifications, this was not accompanied by a corresponding increase in the number of differential expression changes detected. Validity of the proteomics output related to protein identification and differential expression was determined by comparison to existing data in the field (Protein Atlas and published data on the disease). All methods predicted changes which to a large extent agreed with published data, with label-free providing a higher number of significant changes than iTRAQ.
In conclusion, both label-free and iTRAQ (when combined with peptide fractionation) provide high proteome coverage and apparently valid predictions in terms of differential expression; nevertheless, label-free provides higher sequence coverage and ultimately detects a higher number of differentially expressed proteins. The risk of false associations still exists, particularly when analyzing highly heterogeneous biological samples, raising the need for the analysis of larger sample numbers and/or adjustment for multiple testing.
Project description:MS-based proteomics generates rapidly increasing amounts of precise and quantitative information. Analysis of individual proteomic experiments has made great strides, but the crucial ability to compare and store information across different proteome measurements still presents many challenges. For example, it has been difficult to avoid contamination of databases with low quality peptide identifications, to control for the inflation in false positive identifications when combining data sets, and to integrate quantitative data. Although, for example, the contamination with low quality identifications has been addressed by joint analysis of deposited raw data in some public repositories, we reasoned that there should be a role for a database specifically designed for high resolution and quantitative data. Here we describe a novel database termed MaxQB that stores and displays collections of large proteomics projects and allows joint analysis and comparison. We demonstrate the analysis tools of MaxQB using proteome data of 11 different human cell lines and 28 mouse tissues. The database-wide false discovery rate is controlled by adjusting the project specific cutoff scores for the combined data sets. The 11 cell line proteomes together identify proteins expressed from more than half of all human genes. For each protein of interest, expression levels estimated by label-free quantification can be visualized across the cell lines. Similarly, the expression rank order and estimated amount of each protein within each proteome are plotted. We used MaxQB to calculate the signal reproducibility of the detected peptides for the same proteins across different proteomes. Spearman rank correlation between peptide intensity and detection probability of identified proteins was greater than 0.8 for 64% of the proteome, whereas a minority of proteins have negative correlation. 
This information can be used to pinpoint false protein identifications, independently of peptide database scores. The information contained in MaxQB, including high resolution fragment spectra, is accessible to the community via a user-friendly web interface at http://www.biochem.mpg.de/maxqb.
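The Spearman rank correlation used in the reproducibility analysis above can be computed as a Pearson correlation on ranks. The Python sketch below is a generic illustration under that standard definition, with ties given their average rank. It is not MaxQB code, and the function names and example values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical peptide intensities and detection frequencies for one protein.
    intensity = [1.2e6, 3.4e5, 8.9e6, 5.0e4, 2.2e6]
    detection = [0.7, 0.4, 0.9, 0.1, 0.8]
    print(spearman(intensity, detection))
```

A protein whose peptides give a strongly negative value in such a comparison would be a candidate for closer inspection, in the spirit of the screening described above.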
Project description:Quantitative proteomics by mass spectrometry is widely used in biomarker research and basic biology research for investigation of phenotype-level cellular events. Despite the wide application, the methodology for statistical analysis of differentially expressed proteins has not been unified. Various methods such as the t test, linear models and mixed-effects models are used to define changes in proteomics experiments. However, none of these methods considers the specific structure of MS data. Choices between methods, often originally developed for other types of data, are based on compromises between features such as statistical power, general applicability and user friendliness. Furthermore, whether to include proteins identified with one peptide in statistical analysis of differential protein expression varies between studies. Here we present DEqMS, a robust statistical method developed specifically for differential protein expression analysis in mass spectrometry data. In all data sets investigated there is a clear dependence of variance on the number of PSMs or peptides used for protein quantification. DEqMS takes this feature into account when assessing differential protein expression. This allows for a more accurate data-dependent estimation of protein variance and inclusion of single peptide identifications without increasing false discoveries. The method was tested in several data sets including E. coli proteome spike-in data, using both label-free and TMT-labeled quantification. DEqMS showed consistently better accuracy in detecting altered protein levels than previous statistical methods used in quantitative proteomics, in both label-free and labeled data. DEqMS is available as an R package in Bioconductor.
Project description:Liquid chromatography (LC) prefractionation is often implemented to increase proteomic coverage; however, while effective, this approach is laborious, requires considerable sample amount, and can be cumbersome. We describe how interfacing a recently described high-field asymmetric waveform ion mobility spectrometry (FAIMS) device between a nanoelectrospray ionization (nanoESI) emitter and an Orbitrap hybrid mass spectrometer (MS) enables the collection of single-shot proteomic data with comparable depth to that of conventional two-dimensional LC approaches. This next generation FAIMS device incorporates improved ion sampling at the ESI-FAIMS interface, increased electric field strength, and a helium-free ion transport gas. With fast internal compensation voltage (CV) stepping (25 ms/transition), multiple unique gas-phase fractions may be analyzed simultaneously over the course of an MS analysis. We have comprehensively demonstrated how this device performs for bottom-up proteomics experiments as well as characterized the effects of peptide charge state, mass loading, analysis time, and additional variables. We also offer recommendations for the number of CVs and which CVs to use for different lengths of experiments. Internal CV stepping experiments increase protein identifications from a single-shot experiment to >8000, from over 100,000 peptide identifications in as little as 5 h. In single-shot 4 h label-free quantitation (LFQ) experiments of a human cell line, we quantified 7818 proteins with FAIMS using intra-analysis CV switching compared to 6809 without FAIMS. Single-shot FAIMS results also compare favorably with LC fractionation experiments. A 6 h single-shot FAIMS experiment generates 8007 protein identifications, while four fractions analyzed for 1.5 h each produce 7776 protein identifications.
Project description:Complex human biomolecular processes are made possible by the diversity of human proteoforms. Constructing proteoform families, groups of proteoforms derived from the same gene, is one way to represent this diversity. Comprehensive, high-confidence identification of human proteoforms remains a central challenge in mass spectrometry-based proteomics. We have previously reported a strategy for proteoform identification using intact-mass measurements, and we have since improved that strategy by mass calibration based on search results, the use of a global post-translational modification discovery database, and the integration of top-down proteomics results with intact-mass analysis. In the present study, we combine these strategies for enhanced proteoform identification in total cell lysate from the Jurkat human T lymphocyte cell line. We collected, processed, and integrated three types of proteomics data (NeuCode-labeled intact-mass, label-free top-down, and multi-protease bottom-up) to maximize the number of confident proteoform identifications. The integrated analysis revealed 5950 unique experimentally observed proteoforms, which were assembled into 848 proteoform families. Twenty percent of the observed proteoforms were confidently identified at a 3.9% false discovery rate, representing 1207 unique proteoforms derived from 484 genes.
Project description:Urine is an important, noninvasively collected body fluid source for the diagnosis and prognosis of human diseases. Liquid chromatography mass spectrometry (LC-MS) based shotgun proteomics has evolved as a sensitive and informative technique to discover candidate disease biomarkers from urine specimens. Filter-aided sample preparation (FASP) generates peptide samples from protein mixtures of cell lysate or body fluid origin. Here, we describe a FASP method adapted to 96-well filter plates, named 96FASP. Soluble urine concentrates containing ~10 µg of total protein were processed by 96FASP and LC-MS, resulting in 700-900 protein identifications at a 1% false discovery rate (FDR). The experimental repeatability, as assessed by label-free quantification and Pearson correlation analysis for shared proteins among replicates, was high (R ≥ 0.97). Application to urinary pellet lysates, which is of particular interest in the context of urinary tract infection analysis, was also demonstrated. On average, 1700 proteins (±398) were identified in five experiments. In a pilot study using 96FASP for analysis of eight soluble urine samples, we demonstrated that protein profiles of technical replicates invariably clustered; the protein profiles for distinct urine donors were very different from each other. Robust, highly parallel methods to generate peptide mixtures from urine and other body fluids are critical to increase cost-effectiveness in clinical proteomics projects. This 96FASP method has potential to become a gold standard for high-throughput quantitative clinical proteomics.
Project description:Interpreting the potentially vast number of hypotheses generated by a shotgun proteomics experiment requires a valid and accurate procedure for assigning statistical confidence estimates to identified tandem mass spectra. Despite the crucial role such procedures play in most high-throughput proteomics experiments, the scientific literature has not reached a consensus about the best confidence estimation methodology. In this work, we evaluate, using theoretical and empirical analysis, four previously proposed protocols for estimating the false discovery rate (FDR) associated with a set of identified tandem mass spectra: two variants of the target-decoy competition protocol (TDC) of Elias and Gygi and two variants of the separate target-decoy search protocol of Käll et al. Our analysis reveals significant biases in the two separate target-decoy search protocols. Moreover, the one TDC protocol that provides an unbiased FDR estimate among the target PSMs does so at the cost of forfeiting a random subset of high-scoring spectrum identifications. We therefore propose the mix-max procedure to provide unbiased, accurate FDR estimates in the presence of well-calibrated scores. The method avoids biases associated with the two separate target-decoy search protocols and also avoids the propensity for target-decoy competition to discard a random subset of high-scoring target identifications.
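As a rough illustration of the target-decoy competition idea discussed above, the Python sketch below computes the simplest form of the estimate: among the PSMs that survive a concatenated target-decoy search and score at or above a threshold, the ratio of decoy hits to target hits. This is an assumption-laden sketch, not the mix-max procedure and not any of the four specific protocol variants evaluated in the work. The function name, the `(score, is_decoy)` input layout and the example scores are hypothetical.

```python
def tdc_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    psms: iterable of (score, is_decoy) pairs, one per spectrum, taken from a
          concatenated target-decoy search (higher score = better match).
    Returns decoys / targets above the threshold, capped at 1.0, or 0.0 when
    no target PSM passes the threshold.
    """
    psms = list(psms)  # allow generators; the data is scanned twice
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


if __name__ == "__main__":
    # Hypothetical PSM scores; True marks a decoy match.
    example = [(9.1, False), (8.7, False), (8.0, True),
               (7.5, False), (6.9, False), (6.0, True)]
    for t in (9.0, 7.0, 6.0):
        print(t, tdc_fdr(example, t))
```

The estimate rests on the assumption that incorrect target matches and decoy matches are equally likely to win the competition, so the decoy count stands in for the unseen number of false target identifications.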
Project description:We report a significantly enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of high-throughput mass spectrometry-based proteomics research, ranging from a single laboratory, to a group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography-tandem mass spectrometry (LC-MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED's database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results.