Dataset Information

IVT-seq reveals extreme bias in RNA-sequencing

ABSTRACT: Background: RNA sequencing (RNA-seq) is a powerful technique for identifying and quantifying transcription and splicing events, both known and novel. However, given its recent development and the proliferation of library construction methods, understanding the bias it introduces is incomplete but critical to realizing its value. Results: Here we present a method, in vitro transcription sequencing (IVT-seq), for identifying and assessing the technical biases in RNA-seq library generation and sequencing at scale. We created a pool of > 1000 in vitro transcribed (IVT) RNAs from a full-length human cDNA library and sequenced them with poly-A and total RNA-seq, the most common protocols. Because each cDNA is full length and we show IVT is incredibly processive, each base in each transcript should be equivalently represented. However, with common RNA-seq applications and platforms, we find ~50% of transcripts have > 2-fold and ~10% have > 10-fold differences in within-transcript sequence coverage. Strikingly, we also find > 6% of transcripts have regions of high, unpredictable sequencing coverage, where the same transcript varies dramatically in coverage between samples, confounding accurate determination of their expression. To get at causal factors, we used a combination of experimental and computational approaches to show that rRNA depletion is responsible for the most significant variability in coverage and that several sequence determinants also strongly influence representation. Conclusions: In sum, these results show the utility of IVT-seq in promoting better understanding of bias introduced by RNA-seq and suggest caution in its interpretation. Furthermore, we find that rRNA depletion is responsible for substantial, unappreciated biases in coverage. Perhaps most importantly, these coverage biases introduced during library preparation suggest exon-level expression analysis may be inadvisable.
Five rRNA-depleted samples, each in duplicate; one poly-A-selected sample, one total RNA sample, and one plasmid library, each without replicates.
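The abstract reports the share of transcripts with > 2-fold and > 10-fold differences in within-transcript coverage. The sketch below shows one plausible way to compute such a figure from per-base read depth. It is an illustration only: the max/min ratio, the pseudocount, the function names, and the toy depths are all assumptions, not the authors' published procedure or data.

```python
from typing import Mapping, Sequence


def coverage_fold_difference(depth: Sequence[float], pseudocount: float = 1.0) -> float:
    """Ratio of the highest to the lowest per-base depth within one transcript.

    The pseudocount (an assumption of this sketch) keeps the ratio finite
    when some positions have zero coverage.
    """
    if len(depth) == 0:
        raise ValueError("depth must contain at least one position")
    return (max(depth) + pseudocount) / (min(depth) + pseudocount)


def fraction_above(transcripts: Mapping[str, Sequence[float]], threshold: float) -> float:
    """Fraction of transcripts whose fold difference strictly exceeds threshold."""
    if len(transcripts) == 0:
        raise ValueError("transcripts must not be empty")
    folds = [coverage_fold_difference(d) for d in transcripts.values()]
    return sum(f > threshold for f in folds) / len(folds)


# Hypothetical per-base depths for three made-up transcripts.
toy = {
    "tx_even": [99, 99, 99, 99],
    "tx_skewed": [9, 19, 39, 99],
    "tx_extreme": [0, 4, 49, 99],
}

if __name__ == "__main__":
    for name, depth in toy.items():
        print(name, coverage_fold_difference(depth))
    print("share > 2-fold:", fraction_above(toy, 2.0))
    print("share > 10-fold:", fraction_above(toy, 10.0))
```

A max/min ratio is sensitive to single outlier positions; a percentile-based ratio would be a more robust variant of the same idea.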

ORGANISM(S): Homo sapiens

SUBMITTER: Nicholas Lahens

PROVIDER: E-GEOD-50445 | biostudies-arrayexpress

REPOSITORIES: biostudies-arrayexpress

Similar Datasets

Project description: To evaluate the effect of CG methylation on DNA binding of sequence-specific B-ZIP transcription factors (TFs) in a high-throughput manner, we enzymatically methylated the cytosine in the CG dinucleotide on protein binding microarrays. Two Agilent DNA array designs were used. One contained 40,000 features using de Bruijn sequences where each 8-mer occurs 32 times in various positions in the DNA sequence. The second contained 180,000 features with each CG-containing 8-mer present three times. The first design was better for identification of binding motifs, while the second was better for quantification. Using this novel technology, we show that CG methylation enhanced binding for CEBPA and CEBPB and inhibited binding for CREB, ATF4, JUN, JUND, CEBPD and CEBPG. The CEBPB|ATF4 heterodimer bound a novel motif CGAT|GCAA 10-fold better when methylated. EMSA confirmed these results. CEBPB ChIP-seq data using primary female mouse dermal fibroblasts with 50X methylome coverage for each strand indicate that the methylated sequences well-bound on the arrays are also bound in vivo. CEBPB bound 39% of the methylated canonical 10-mers ATTGC|GCAAT in the mouse genome. After ATF4 protein induction by thapsigargin, which results in ER stress, CEBPB binds methylated CGAT|GCAA in vivo, recapitulating what was observed on the arrays. This methodology can be used to identify new methylated DNA sequences preferentially bound by TFs, which may be functional in vivo.
Protein binding microarray (PBM) experiments were performed for a set of 8 mouse B-ZIP homodimers and one heterodimer transcription factor. Briefly, the PBMs involved binding GST-tagged DNA-binding proteins to double-stranded, methylated or unmethylated 44K Agilent microarrays containing a de Bruijn sequence design, in order to determine their sequence preferences. Details of the PBM protocol are described in Berger et al., Nature Biotechnology 2006.
