Project description: While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA and using GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina. Two human brain/liver/muscle RNA mixtures with a dynamic range of ERCC spike-in standards were sequenced on SOLiD.
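
The idea behind spike-in recalibration can be illustrated with a small sketch: because the spike-in sequences are known, any mismatch in a read mapped to a spike-in can be counted as a sequencing error, and the observed error rate can be converted to a Phred-scaled quality and compared with the reported quality. This is not GATK's implementation; the function names, the input layout, and the +1/+2 smoothing are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def empirical_phred(mismatches, total):
    """Phred-scaled quality from mismatch counts against known spike-in sequence.

    The +1/+2 smoothing (an assumption of this sketch) avoids log(0)
    when no mismatches are observed.
    """
    p_error = (mismatches + 1) / (total + 2)
    return -10.0 * math.log10(p_error)


def quality_shift_by_reported(observations):
    """Return empirical minus reported quality for each reported quality value.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base
    of a read mapped to a spike-in standard (hypothetical input layout).
    A negative shift means the reported quality was overestimated.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {q: empirical_phred(m, n) - q for q, (m, n) in counts.items()}
```

In a real pipeline the same counts would be stratified further, for example by dinucleotide context and machine cycle, which are the covariates the description reports SSEs for.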

Project description: Acute cellular stress is known to induce a global reduction in protein translation through suppression of cap-dependent translation. However, selective translation in response to acute stress has been shown to play important roles in regulating the stress response. An accurate transcriptome-wide profile of acute cellular stress-induced translational changes has been challenging to obtain. Commonly used data normalization methods, such as quantile normalization, operate on the assumption that any systematic shifts are artifacts introduced by experimental procedures. Consequently, if applied to profiling acute cellular stress-induced protein translation changes, these methods are expected to produce biased estimates. To address this issue, we designed, generated, and evaluated a panel of 16 oligomers to serve as external standards for ribosome profiling studies. Using sodium arsenite treatment-induced oxidative stress in lymphoblastoid cell lines as a model system, we applied spike-in oligomers as external standards based on quantifications of monosomal RNA extracted from each sample. We found that our spike-in oligomers display a linear correlation between observed and expected values, with small but significant ratio compression at the lower concentration range, and span the expected quantitative range in the observed data, which covers 97% of the quantitated endogenous genes. We found that popular global scaling normalization approaches introduce high levels of both false positives and false negatives in differential expression analysis. Using the expected fold changes constructed from spike-in external controls, we found in our dataset that TMM normalization produced 87.5% false positives when a P value cutoff of 0.1 is used (i.e. 10% expected false positive rate) and on average produced a systematic shift of fold change by 3.25-fold.
These results highlight the consequences of applying global scaling approaches to conditions that clearly violate their key assumptions. As an alternative, we found that using spike-in quantifications as control genes in RUVg normalization recapitulated the expected stress-induced global reduction of translation and resulted in little, if any, systematic shift in spike-in constructed true positives. Finally, using spike-in constructed true positives and true negatives, we explored alternative normalization approaches for acute cellular stress response ribo-seq studies. We found that a simple approach that quantile normalizes data from control and treated samples separately, which we termed respective quantile normalization, produced the expected results in spike-in quantification and resulted in little, if any, systematic bias on fold change in endogenous genes. Additionally, we found that under certain parameters, using endogenous control genes for RUVg normalization best recapitulates the expected results. Our results clearly demonstrate the utility of our spike-in oligomers, both for constructing expected results as controls and for data normalization. Our exploration of different normalization approaches highlights the issues in applying global scaling normalization when key assumptions are clearly not met. We show that respective quantile normalization and normalization with endogenous control genes are viable options worth considering as more generalizable approaches for stress response ribo-seq studies. This conclusion is likely applicable to other types of studies that involve global shifts in expression profiles between comparison groups of interest.
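
A minimal sketch of what respective quantile normalization could look like follows: samples are quantile normalized within their own group (control or treated) rather than across all samples, so a genuine global shift between groups is not erased. The function names and the dict-based data layout are assumptions for illustration, ties are broken by position rather than averaged, and this is not the authors' code.

```python
def quantile_normalize(columns):
    """Quantile normalize equal-length lists of values, one list per sample.

    Each value is replaced by the mean, across samples, of the values
    holding the same rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[i] for sc in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def respective_quantile_normalize(samples, groups):
    """Quantile normalize each group of samples separately.

    samples: dict of sample name -> list of per-gene values (same gene order)
    groups:  dict of sample name -> group label, e.g. "control" or "treated"
    """
    result = {}
    for label in sorted(set(groups.values())):
        names = [name for name in samples if groups[name] == label]
        normed = quantile_normalize([samples[name] for name in names])
        result.update(zip(names, normed))
    return result
```

Normalizing all samples together would force control and treated samples onto one shared distribution, which is the assumption the description argues is violated under acute stress.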

Dataset Information

Homo sapiens
