Protein quantification across hundreds of experimental conditions.
ABSTRACT: Quantitative studies of protein abundance rarely span more than a small number of experimental conditions and replicates. In contrast, quantitative studies of transcript abundance often span hundreds of experimental conditions and replicates. This situation exists, in part, because extracting quantitative data from large proteomics datasets is significantly more difficult than reading quantitative data from a gene expression microarray. To address this problem, we introduce two algorithmic advances in the processing of quantitative proteomics data. First, we use space-partitioning data structures to handle the large size of these datasets. Second, we introduce techniques that combine graph-theoretic algorithms with space-partitioning data structures to collect relative protein abundance data across hundreds of experimental conditions and replicates. We validate these algorithmic techniques by analyzing several datasets and computing both internal and external measures of quantification accuracy. We demonstrate the scalability of these techniques by applying them to a large dataset that comprises a total of 472 experimental conditions and replicates.
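The abstract does not spell out the data structures, but the core idea of space partitioning applied to matching peptide features (e.g., by m/z and retention time) across many runs can be sketched as follows. The uniform-grid scheme, the tolerances, and the function names are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def build_grid(features, mz_tol=0.01, rt_tol=0.5):
    """Bucket (m/z, retention time) features into a uniform grid whose cell
    size equals the match tolerances (a simple space-partitioning index)."""
    grid = defaultdict(list)
    for i, (mz, rt) in enumerate(features):
        grid[(int(mz / mz_tol), int(rt / rt_tol))].append(i)
    return grid

def query(grid, features, mz, rt, mz_tol=0.01, rt_tol=0.5):
    """Return indices of features within tolerance of (mz, rt).
    Because cell size == tolerance, a match is at most one cell away,
    so scanning the 3x3 neighborhood suffices."""
    cx, cy = int(mz / mz_tol), int(rt / rt_tol)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for i in grid.get((cx + dx, cy + dy), []):
                fmz, frt = features[i]
                if abs(fmz - mz) <= mz_tol and abs(frt - rt) <= rt_tol:
                    hits.append(i)
    return hits
```

With an index like this, cross-run matches can then be treated as edges of a graph whose connected components group the same analyte across conditions, which is the flavor of graph-theoretic step the abstract alludes to.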
Project description: A fundamental challenge in calcium imaging has been to infer neuronal spike rates from measured, noisy fluorescence traces. We systematically evaluate different spike inference algorithms on a large benchmark dataset (>100,000 spikes) recorded from different neural tissues (V1 and retina) using different calcium indicators (OGB-1 and GCaMP6). In addition, we introduce a new algorithm based on supervised learning in flexible probabilistic models and find that it performs better than other published techniques. Importantly, it outperforms other algorithms even when applied to entirely new datasets for which no simultaneously recorded data are available. Future data acquired in new experimental conditions can be used to further improve the spike prediction accuracy and generalization performance of the model. Finally, we show that comparing algorithms on artificial data is not informative about performance on real data, suggesting that benchmarking different methods on real-world datasets may greatly facilitate future algorithmic developments in neuroscience.
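The supervised probabilistic model itself is not reproduced here, but the generative assumption most spike inference algorithms share, fluorescence as an exponentially decaying (AR(1)) response to spikes, suggests a naive baseline: invert the decay and threshold the residual. This is a deliberate simplification, not the authors' method; `gamma` and `threshold` are assumed values:

```python
def infer_spikes(trace, gamma=0.95, threshold=0.2):
    """Naive spike inference: invert the AR(1) calcium model
    c[t] = gamma * c[t-1] + s[t], then threshold the residual s[t]."""
    residual = [trace[0]] + [trace[t] - gamma * trace[t - 1]
                             for t in range(1, len(trace))]
    return [1 if r > threshold else 0 for r in residual]
```

On real, noisy traces such an inversion amplifies noise, which is precisely why learned probabilistic models like the one in this project outperform it.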
Project description: A common task in microarray data analysis is to identify informative genes that are differentially expressed between two states. Owing to the high-dimensional nature of microarray data, identifying significant genes is essential to the analysis. However, the performance of many gene selection techniques depends strongly on the experimental conditions, such as the presence of measurement error or a limited number of sample replicates. We propose new filter-based gene selection techniques, obtained by applying a simple modification to significance analysis of microarrays (SAM). To demonstrate the effectiveness of the proposed method, we considered a series of synthetic datasets with different noise levels and sample sizes, along with two real datasets. We made the following findings. First, our proposed methods outperform conventional methods in all simulation set-ups, and are markedly better when the data are noisy and the sample size is small: they show relatively robust performance regardless of noise level and sample size, whereas the performance of SAM degrades significantly as the noise level rises or the sample size shrinks. When sufficient sample replicates are available, SAM and our methods perform similarly. Finally, our proposed methods are competitive with traditional methods in classification tasks for microarrays. The results of the simulation study and real data analysis demonstrate that our proposed methods are effective for detecting significant genes and for classification tasks, especially when the data are noisy or have few sample replicates. By employing weighting schemes, we can obtain robust and reliable results for microarray data analysis.
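The abstract does not state the exact modification, but the statistic it modifies, SAM's relative difference with its fudge factor `s0`, can be sketched. The `s0` default here is an illustrative assumption:

```python
from statistics import mean, stdev

def sam_statistic(group1, group2, s0=0.1):
    """SAM-style relative difference d = (mean1 - mean2) / (s + s0).
    The fudge factor s0 damps the inflated scores that genes with
    near-zero variance would otherwise receive."""
    n1, n2 = len(group1), len(group2)
    # pooled variance, then standard error of the mean difference
    sp2 = ((n1 - 1) * stdev(group1) ** 2
           + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
    s = (sp2 * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(group1) - mean(group2)) / (s + s0)
```

A filter-based selection then simply ranks genes by |d| and keeps the top of the list; the proposed weighting schemes would alter how the terms of d are computed.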
Project description: Extensive genomic characterization of multi-species acid mine drainage microbial consortia, combined with laboratory cultivation, has enabled quantitative proteomic analysis at the community level. In this study, quantitative proteomic comparisons were used to functionally characterize laboratory-cultivated acidophilic communities sustained at pH 1.45 or 0.85. The distributions of all proteins identified for individual organisms indicated biases toward either high or low pH, suggesting pH-specific niche partitioning for low-abundance bacteria and archaea. Although the proteome of the dominant bacterium, Leptospirillum group II, was largely unaffected by the pH treatments, analysis of functional categories indicated that proteins involved in amino acid and nucleotide metabolism, as well as in cell membrane/envelope biogenesis, were overrepresented at high pH. Comparison of specific protein abundances indicates that higher pH favors Leptospirillum group III, whereas low pH promotes the growth of certain archaea. Thus, quantitative proteomic comparisons revealed distinct differences in community composition and in the metabolic function of individual organisms across pH treatments. Proteomic analysis also revealed other aspects of community function. Different numbers of phage proteins were identified across biological replicates, indicating stochastic spatial heterogeneity of phage outbreaks. Additionally, the proteomic data were used to identify a previously unknown genotypic variant of Leptospirillum group II, an indication of selection for a specific Leptospirillum group II population in laboratory communities. Our results confirm the importance of pH and related geochemical factors in fine-tuning acidophilic microbial community structure and function at the species and strain level, and demonstrate the broad utility of proteomics in laboratory community studies.
Project description: RNAi screening using pooled shRNA libraries is a valuable tool for identifying genetic regulators of biological processes. For a successful pooled shRNA screen, however, it is imperative to thoroughly optimize experimental conditions to obtain reproducible data. Here we performed viability screens with a library of ~10,000 shRNAs at two different fold representations (100- and 500-fold at transduction) and report the reproducibility of shRNA abundance changes between screening replicates as determined by microarray and next-generation sequencing analyses. We show that the technical reproducibility between PCR replicates from a pooled screen can be drastically improved by keeping the PCR amplification steps within the exponential phase and by using an amount of genomic DNA input that maintains the average template copies per shRNA used during library transduction. Using these optimized PCR conditions, we then show that both microarray and next-generation sequencing yield higher reproducibility of biological replicates when screening with a higher average shRNA fold representation. shRNAs that change abundance reproducibly in biological replicates (primary hits) are identified in screens performed at both 100- and 500-fold shRNA representation; however, a higher percentage of primary hits overlap between screening replicates at 500-fold representation. While strong hits with larger changes in relative abundance were generally identified in both screens, hits with smaller changes were identified only in the screens performed with the higher shRNA fold representation at transduction.
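The rule of matching the PCR's genomic DNA input to the fold representation used at transduction reduces to a short calculation. This sketch assumes one shRNA integration per cell and the standard figure of ~6.6 pg of DNA per diploid human genome; it is an illustration of the arithmetic, not a protocol from the study:

```python
def gdna_input_ug(num_shrnas, fold_representation, pg_per_genome=6.6):
    """Micrograms of genomic DNA needed so the PCR template pool preserves
    the average copies-per-shRNA used at transduction.
    Assumes one integration per cell and ~6.6 pg DNA per diploid genome."""
    genomes = num_shrnas * fold_representation  # total template copies needed
    return genomes * pg_per_genome / 1e6        # pg -> ug
```

For example, a 10,000-shRNA library screened at 500-fold representation would call for roughly 33 ug of genomic DNA per reaction under these assumptions.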
Project description: BACKGROUND: The number of publicly available metagenomic experiments in various environments has been growing rapidly, creating the potential to identify similar shifts in species abundance between different experiments. This could be a powerful way to interpret new experiments, by identifying common themes and causes behind changes in species abundance. RESULTS: We propose a novel framework for comparing microbial shifts between conditions. Using data from one of the largest human metagenome projects to date, the American Gut Project (AGP), we obtain differential abundance vectors for microbes using the experimental condition information provided with the AGP metadata, such as patient age, dietary habits, or health status. We show the framework can be used to identify similar and opposing shifts in microbial species and to infer putative interactions between microbes. Our results show that groups of shifts with similar effects on the microbiome can be identified, and that similar dietary interventions display similar microbial abundance shifts. CONCLUSIONS: Without comparison to prior data, it is difficult for experimentalists to know whether their observed changes in species abundance have been seen by others, both under comparable conditions and under conditions they would never consider comparable. Yet this can be a very important contextual factor in interpreting the significance of a shift. We have proposed and tested an algorithmic solution to this problem, which also allows comparison of metagenomic signature shifts between conditions across the existing body of data.
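The framework's specifics are not given in this summary, but its core objects, differential abundance vectors and a similarity measure for finding similar or opposing shifts, can be sketched. The log2 fold-change representation and cosine similarity are illustrative choices, not necessarily the paper's exact formulation:

```python
import math

def log2_shift(case_abund, control_abund, pseudo=1e-6):
    """Differential abundance vector: per-taxon log2 fold change between
    two conditions; a pseudocount avoids log of zero."""
    return [math.log2((c + pseudo) / (k + pseudo))
            for c, k in zip(case_abund, control_abund)]

def cosine(u, v):
    """Cosine similarity between two shift vectors:
    near +1 for similar shifts, near -1 for opposing ones."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Comparing a new experiment's shift vector against a library of previously computed ones then surfaces the "common themes" the abstract describes.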
Project description: BACKGROUND: Harmonization techniques make different gene expression profiles, and sets of profiles, compatible and ready for comparison. Here we present a new bioinformatic tool, termed Shambhala, for harmonizing multiple human gene expression datasets obtained with different experimental methods and platforms for microarray hybridization and RNA sequencing. RESULTS: Unlike previously published methods, which achieve good-quality harmonization for only two datasets, Shambhala converts multiple datasets into a universal form suitable for further comparison. Shambhala harmonization is based on calibrating gene expression profiles against an auxiliary standardization dataset. Each profile is transformed to resemble the output of the Affymetrix Human Gene microarray platform, chosen because it has the largest number of human gene expression profiles deposited in public databases. We evaluated Shambhala's ability to retain biologically important features after harmonization: the same four biological samples, taken in multiple replicates, were profiled independently on three and four different experimental platforms, respectively, then Shambhala-harmonized and investigated by hierarchical clustering. CONCLUSION: Our results showed that, unlike the frequently used quantile normalization and DESeq/DESeq2 normalization, Shambhala harmonization was the only method that supported sample-specific, platform-independent, biologically meaningful clustering of data obtained from multiple experimental platforms.
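Shambhala's calibration procedure is not detailed in this summary, but quantile normalization, one of the baselines it is compared against, is simple enough to sketch (ties are broken arbitrarily in this minimal version):

```python
def quantile_normalize(samples):
    """Quantile normalization: force every sample (a list of expression
    values over the same genes) onto a common distribution by replacing
    each value with the mean of the values sharing its rank."""
    n = len(samples[0])
    ranked = [sorted(s) for s in samples]
    # reference distribution: mean value at each rank across samples
    ref = [sum(r[i] for r in ranked) / len(samples) for i in range(n)]
    out = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = ref[rank]
        out.append(norm)
    return out
```

Note how this equalizes distributions but, unlike a calibration against an external standardization dataset, it cannot map heterogeneous platforms onto one reference platform's scale.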
Project description: Comprehensive and accurate evaluation of data quality and of false-positive biomarker discovery is critical for directing method development and optimization in quantitative proteomics, yet it remains challenging, largely because of the high complexity and unique features of proteomic data. Here we describe an experimental null (EN) method to address this need. Because the method experimentally measures the null distribution (in either technical or biological replicates) using the same proteomic samples, procedures, and batch as the case-vs-control experiment, it correctly reflects the collective effects of technical variability (e.g., variation/bias in sample preparation, LC-MS analysis, and data processing) and project-specific features (e.g., characteristics of the proteome and biological variation) on the performance of quantitative analysis. As a proof of concept, we employed the EN method to assess quantitative accuracy and precision, and the ability to quantify subtle ratio changes between groups, using different experimental and data-processing approaches and various cellular and tissue proteomes. We found that choices of quantitative features, sample size, experimental design, data-processing strategies, and quality of chromatographic separation can profoundly affect the precision and accuracy of label-free quantification. The EN method was also demonstrated as a practical tool for determining optimal experimental parameters and a rational ratio cutoff for reliable protein quantification in specific proteomic experiments, for example, by identifying the number of technical/biological replicates per group that affords sufficient power for discovery.
Furthermore, we assessed the ability of the EN method to estimate the level of false positives in the discovery of altered proteins, using two constructed sample sets that mimic proteomic profiling with technical and biological replicates, respectively, in which the true positives/negatives are known and span a wide concentration range. We observed that the EN method correctly reflects the null distribution in a proteomic system and accurately measures the false altered-protein discovery rate (FADR). In summary, the EN method provides a straightforward, practical, and accurate alternative to statistics-based approaches for developing and evaluating proteomic experiments, and can be universally adapted to various types of quantitative techniques.
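The EN idea in miniature: ratios between replicates of the same condition contain no true change, so their spread defines an empirical null from which a ratio cutoff can be read off. This sketch is a deliberately reduced illustration of that idea, not the full published procedure:

```python
import math

def null_ratio_cutoff(rep_a, rep_b, quantile=0.95):
    """Experimental-null sketch: log2 ratios between two replicates of the
    SAME condition form the null distribution; the chosen quantile of
    |log2 ratio| becomes the cutoff below which a case-vs-control change
    should not be trusted."""
    null = sorted(abs(math.log2(a / b)) for a, b in zip(rep_a, rep_b))
    idx = min(len(null) - 1, int(quantile * len(null)))
    return null[idx]
```

Because the null is measured on the same samples, procedures, and batch as the real comparison, the cutoff automatically absorbs the technical and project-specific variability the abstract enumerates.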
Project description: Unraveling the functional dynamics of phosphorylation networks is a crucial step in understanding how biological networks form a living cell. Recently there has been an enormous increase in the number of measured phosphorylation events. Nevertheless, comparative and integrative analysis of phosphoproteomes is confounded by incomplete coverage and by biases introduced by different experimental workflows. As a result, we cannot tell whether phosphosites identified in only one or two samples reflect condition- or species-specific phosphorylation, or simply missing data. Here, we evaluate the impact of incomplete phosphoproteomics datasets on comparative analysis, and we present bioinformatics strategies to quantify the impact of different experimental workflows on measured phosphoproteomes. We show that plotting the saturation in observed phosphosites across replicates provides a reproducible picture of the extent of a particular phosphoproteome; even so, we remain far from a complete picture of the total human phosphoproteome. The impact of different experimental techniques on the similarity between phosphoproteomes can be estimated by comparing datasets from different experimental pipelines against a common reference. Our results show that comparative analysis is most powerful when datasets have been generated using the same experimental workflow. We demonstrate this experimentally by measuring the tyrosine phosphoproteome of Caenorhabditis elegans and comparing it to the tyrosine phosphoproteome of HeLa cells, which yields an overlap of about 4%. This overlap between very different organisms represents a three-fold increase over older studies in which different workflows were used. The strategies we suggest enable an estimation of how differences in experimental workflows affect the overlap between datasets.
This will allow comparative analyses not only of datasets specifically generated for that purpose, but also of the ever-increasing wealth of publicly available phosphorylation data.
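The replicate-saturation plot described above reduces to counting cumulative unique phosphosites as replicates are added; a minimal sketch, where site identifiers are placeholders:

```python
def saturation_curve(replicates):
    """Cumulative count of unique phosphosites after each added replicate.
    A flattening curve suggests the workflow is approaching the coverage
    limit of the phosphoproteome it can measure."""
    seen, curve = set(), []
    for rep in replicates:       # each replicate: a set of site identifiers
        seen.update(rep)
        curve.append(len(seen))
    return curve
```

Plotting this curve for each workflow gives the reproducible picture of a phosphoproteome's extent that the abstract refers to.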
Project description:Protein quantification at proteome-wide scale is an important aim, enabling insights into fundamental cellular biology and serving to constrain experiments and theoretical models. While proteome-wide quantification is not yet fully routine, many datasets approaching proteome-wide coverage are becoming available through biophysical and MS techniques. Data of this type can be accessed via a variety of sources, including publication supplements and online data repositories. However, access to the data is still fragmentary, and comparisons across experiments and organisms are not straightforward. Here, we describe recent updates to our database resource "PaxDb" (Protein Abundances Across Organisms). PaxDb focuses on protein abundance information at proteome-wide scope, irrespective of the underlying measurement technique. Quantification data is reprocessed, unified, and quality-scored, and then integrated to build a meta-resource. PaxDb also allows evolutionary comparisons through precomputed gene orthology relations. Recently, we have expanded the scope of the database to include cell-line samples, and more systematically scan the literature for suitable datasets. We report that a significant fraction of published experiments cannot readily be accessed and/or parsed for quantitative information, requiring additional steps and efforts. The current update brings PaxDb to 414 datasets in 53 organisms, with (semi-) quantitative abundance information covering more than 300,000 proteins.
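Unifying abundance data "irrespective of the underlying measurement technique" requires a common relative unit; a normalization of the kind such meta-resources use, expressing each protein as parts per million of the whole sample, can be sketched as follows (the function name and toy input are illustrative):

```python
def to_ppm(raw_abundances):
    """Normalize raw abundance values (dict: protein -> value) to
    parts-per-million of the sample total, so datasets measured with
    different techniques and scales become comparable."""
    total = sum(raw_abundances.values())
    return {p: v / total * 1e6 for p, v in raw_abundances.items()}
```

After this step, values from spectral counting, intensity-based MS, or other techniques all live on the same relative scale, which is what makes cross-dataset and cross-organism integration feasible.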
Project description: Many recent RNA-seq studies have focused mainly on detecting differentially expressed genes (DEGs) between two or more conditions. In contrast, only a few attempts have been made to detect genes associated with quantitative traits, such as obesity index or milk yield, in RNA-seq experiments with large numbers of biological replicates. This study illustrates the application of linear models to detecting trait-associated genes (TAGs) in two real RNA-seq datasets: human obesity-related data with 89 replicates, and Holstein milk-production-related data with 21 replicates. On these two datasets, we compared the performance of the proposed methods, ordinary regression and robust regression, with that of the existing methods DESeq2 and voom. The results indicate that, in our simulation study and qRT-PCR experiment, the proposed methods yield far fewer false discoveries than the preceding approaches based on two-group comparisons. In particular, robust regression outperforms both the existing DEG-detection methods and ordinary regression in terms of precision. Given the current trend in RNA-seq pricing, we expect our methods to be successfully applied in RNA-seq studies with numerous biological replicates that address continuous response traits.
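The abstract does not give the estimator, but a robust linear regression of expression on a continuous trait can be sketched with iteratively reweighted least squares and Huber weights. The tuning constant 1.345 and the MAD-based scale are conventional choices, and the single-covariate form is a simplification of whatever model the study actually fits:

```python
def huber_fit(x, y, delta=1.345, iters=20):
    """Robust line fit y ~ a + b*x by iteratively reweighted least squares
    with Huber weights: outlying samples pull the fit less than in OLS."""
    w = [1.0] * len(x)
    a = b = 0.0
    for _ in range(iters):
        # weighted least squares with current weights
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx
        a = my - b * mx
        # re-estimate scale (MAD) and Huber weights from the residuals
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        w = [1.0 if abs(r) <= delta * scale else delta * scale / abs(r)
             for r in resid]
    return a, b
```

Per-gene, a TAG screen would fit each gene's (suitably transformed) expression against the trait and rank genes by the significance of the slope; the robustness is what keeps a handful of outlying samples from dominating the slope estimate.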