Project description: The use of RNA-seq as the preferred method for the discovery and validation of small RNA biomarkers is hindered by high variability and biased sequence counts. In this paper we develop a statistical model for sequence counts that accounts for ligase bias, stochastic variation in the library amplification steps, and variation in sequencing depth. Our analytical contributions are the description of the Linear Quadratic (LQ) relation between the mean and variance of the sequence counts in an RNA-seq experiment and the derivation of the Poisson truncated mixture as the underlying probability distribution for RNA-seq data. Using a large number of sequencing datasets, we demonstrate how this modeling framework can be used to calculate empirical correction factors for ligase bias while accounting for random variation in sequence counts. Bias correction may remove the majority of the bias in the absence of differential expression and more than 40% of the bias in the presence of variable expression of miRNAs. Empirical bias correction factors appear to be nearly constant over at least one and up to four orders of magnitude of total RNA input, and independent of sample composition.
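The LQ relation mentioned above can be illustrated with a minimal sketch. It assumes the relation takes the form variance = a*mean + b*mean**2 with no intercept, and it estimates a and b by unweighted least squares from per-sequence moments across replicate libraries. The parameterization, the function names, and the fitting method are assumptions made for illustration; the description does not state how the authors estimate the relation.

```python
from statistics import mean, variance


def per_sequence_moments(replicate_counts):
    """Map each sequence id to (mean, sample variance) of its counts.

    replicate_counts: dict of sequence id -> list of counts, one per
    replicate library (at least two replicates per sequence).
    """
    return {seq: (mean(c), variance(c)) for seq, c in replicate_counts.items()}


def fit_lq(moments):
    """Fit var = a*mean + b*mean**2 by unweighted least squares.

    moments: iterable of (mean, variance) pairs.
    Assumed form only; solves the 2x2 normal equations in closed form.
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for m, v in moments:
        s11 += m * m        # sum of mean^2
        s12 += m ** 3       # sum of mean^3
        s22 += m ** 4       # sum of mean^4
        t1 += m * v         # sum of mean * variance
        t2 += m * m * v     # sum of mean^2 * variance
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("need at least two distinct nonzero means")
    a = (t1 * s22 - t2 * s12) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b
```

Under pure Poisson sampling the fit would give a near 1 and b near 0; a clearly positive b indicates variance that grows faster than the mean. Because the fit is unweighted, highly expressed sequences dominate the estimate.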
Project description: Small RNA-seq is increasingly used for profiling small RNAs. The quantitative characteristics of long RNA-seq have been extensively described, but small RNA-seq relies on fundamentally different library preparation methods, with distinct protocols and technical variations that have not been fully and systematically studied. Using common sets of reference samples, we evaluated the accuracy, reproducibility and bias of small RNA-seq library preparation for five distinct protocols across nine different laboratories. As part of this larger study, we assessed sequencing bias and reproducibility using an equimolar pool of 1,152 small RNA sequences ranging from 15 to 90 nt and composed primarily of annotated human microRNAs. We observed extensive protocol-specific and sequence-specific bias, which was largely mitigated in protocols employing sequencing adapters with randomized end-nucleotides. We find that sequencing bias is highly reproducible across labs using the same library preparation technology, and we use the data to calculate inter-protocol bias correction factors. These results provide strong evidence for the feasibility of reproducible cross-laboratory small RNA-seq studies, even those involving analysis of data generated using different protocols.
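The idea of deriving correction factors from an equimolar pool can be sketched as follows. In an equimolar pool every sequence is expected to receive the same share of reads, so the ratio of observed to expected count gives a per-sequence bias estimate. This is a simple ratio estimator under that assumption, not the procedure used in the study, which the description does not specify; the function names and the pseudocount default are hypothetical.

```python
def equimolar_bias(counts, pseudocount=0.5):
    """Estimate per-sequence bias from an equimolar pool.

    counts: dict of sequence id -> observed read count.
    pseudocount: added to every count so undetected sequences do not
    yield a zero bias (an illustrative choice, not from the study).
    Returns observed / expected, where expected is the mean adjusted count.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    adjusted = {seq: c + pseudocount for seq, c in counts.items()}
    expected = sum(adjusted.values()) / len(adjusted)
    return {seq: value / expected for seq, value in adjusted.items()}


def apply_correction(counts, bias):
    """Divide each observed count by its estimated bias factor."""
    return {seq: counts[seq] / bias[seq] for seq in counts}


# Hypothetical counts for three sequences from one library.
pool = {"seq_a": 100, "seq_b": 200, "seq_c": 300}
bias = equimolar_bias(pool, pseudocount=0.0)
corrected = apply_correction(pool, bias)
```

A factor estimated this way is specific to one protocol. Dividing one protocol's factor by another's for the same sequence gives a rough inter-protocol factor, again only as an illustration of the concept.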