High-resolution transcriptome analysis with long-read RNA sequencing
ABSTRACT: Ongoing improvements to next-generation sequencing technologies are leading to longer sequencing read lengths, but a thorough understanding of the impact of longer reads on RNA sequencing analyses is lacking. To address this issue, we generated and compared two RNA sequencing datasets of differing read lengths -- 2x75 bp (L75) and 2x262 bp (L262) -- and investigated the impact of read length on various aspects of analysis, including the performance of currently available read-mapping tools, gene and transcript quantification, and detection of allele-specific expression patterns. Our results indicate that, while the scalability of read-mapping tools and the cost-effectiveness of long-read protocols are issues that require further attention, longer reads enable more accurate quantification of diverse aspects of gene expression, including individual-specific patterns of allele-specific expression and alternative splicing. Overall design: Two RNA-Seq datasets of differing read lengths (2x262 bp and 2x75 bp).
Project description:Background: Whole exome sequencing (WES) has proven to be a valuable basis for applications such as variant calling and copy number variation (CNV) analyses. For those analyses, read coverage should ideally be balanced throughout protein-coding regions at sufficient read depth. Unfortunately, WES is known for its uneven coverage within coding regions due to GC-rich regions or off-target enrichment. Results: To examine the irregularities of WES within genes, we applied Agilent SureSelectXT exome capture to human samples and sequenced these on an Illumina instrument in 2x101 paired-end mode. As we suspected the sequenced insert length to be crucial to the uneven coverage of exome-captured samples, we sheared 12 genomic DNA samples to two different insert sizes, namely 130 and 170 bp. Interestingly, although mean coverage of target regions was clearly higher in samples of 130 bp insert length, coverage was more even in the 170 bp samples. Moreover, merging overlapping paired-end reads had a positive effect on evenness, indicating that overlapping reads are another reason for the unevenness. In addition, mutation analysis was performed on a subset of the samples. In these isogenic subclones, almost twice as many mutation calls failed in the 130 bp samples as in the 170 bp samples. Visual inspection of the discarded mutation sites revealed low coverage at those sites, embedded in regions with large fluctuations in coverage depth. Conclusions: Producing longer inserts could be a good strategy to achieve more uniform read coverage in coding regions and thereby enhance the effective sequencing yield, providing an improved basis for further variant calling and CNV analyses.
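The link between insert length and overlapping mates described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that "insert" means the sheared DNA fragment without adapters and that both mates are sequenced to the full 101 bp; the function name `mate_overlap` is hypothetical and not part of the study.

```python
def mate_overlap(read_length: int, insert_size: int) -> int:
    """Number of insert bases covered by both mates of a read pair.

    Assumption: the insert is the DNA fragment between the adapters,
    and each mate reads `read_length` bases inward from its own end.
    """
    # If the insert is shorter than one read, both mates cover all of it.
    overlap = min(insert_size, 2 * read_length - insert_size)
    # Inserts longer than two read lengths leave a gap, not an overlap.
    return max(0, overlap)


# The two shearing targets from the description, sequenced as 2x101 bp.
for insert in (130, 170):
    print(insert, mate_overlap(101, insert))
```

Under these assumptions, a 130 bp insert gives 72 doubly covered bases per pair and a 170 bp insert gives 32, which is consistent with the observation that merging overlapping mates matters more for the shorter inserts.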
Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly. If you have questions about the Genome Browser track associated with this data, contact ENCODE. This track is produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on an Illumina Genome Analyzer (GAI or GAIIx) (Mortazavi et al., 2008). The transcriptome measurements shown on these tracks were performed on polyA-selected RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization) using two different protocols - one that preserves information about which strand the read is coming from and one that does not. Due to the specifics of the enzymology of library construction, gene and transcript quantification is more accurate based on the non-strand-specific protocol, while the strand-specific protocol is useful for assigning strandedness, but in general less reliable for quantification. Non-strand-specific protocol (deep "reference" transcriptome measurements, 2x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Data have been produced in two formats: single reads, each of which comes from one end of a cDNA molecule, and paired-end reads, which are obtained as pairs from both ends of cDNAs. This RNA-seq protocol does not specify the coding strand. 
As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is, apparently, not fully random. This is inferred from a sequence bias in the first residues of the read population, and this likely contributes to observed unevenness in sequence coverage across transcripts. Strand specific protocol (1x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis. 3' adapters were ligated to the 3' end of fragments, then 5' adapters were ligated to the 5' end. The resulting RNA molecules were converted to cDNA and amplified. This RNA-seq protocol does specify the coding strand as each read is in the same 5'-3' orientation as the original RNA strand. As a result, loci where both strands are transcribed can be disambiguated. However, RNA ligation is an inherently biased process and as a result greater unevenness in sequence coverage across transcripts is observed compared to the non-strand-specific data, and quantification is less accurate. Data Analysis: Reads were aligned to the hg19 human reference genome using TopHat, a program specifically designed to align RNA-seq reads and discover splice junctions de novo. Cufflinks, a de novo transcript assembly and quantification software package, was run on the TopHat alignments to discover and quantify novel transcripts and to obtain transcript expression estimates based on the GENCODE annotation. All sequence files, alignments, gene and transcript models and expression estimates files are available for download. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Experimental Procedures: Cells were grown according to the approved ENCODE cell culture protocols except for H1-hESC for which frozen cell pellets were purchased from Cellular Dynamics. 
Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. For 2x75 bp non-stranded RNA-seq, 100 ng of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The majority of paired-end libraries were size-selected around 200 bp (fragment length), with the exception of a few additional replicates that were size-selected at 400 bp with the specific intent of investigating the effect of fragment length on results. Strand-specific RNA-seq libraries were prepared from 100 ng of mRNA from the same preparation following Illumina's Strand-Specific RNA-seq protocol. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Reads of 75 bp length were obtained, single end for directional, strand-specific libraries (1x75D) and paired end for non-strand-specific libraries (2x75). Data Processing and Analysis: Reads were mapped to the reference human genome (version hg19), with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes and haplotypes in all cases, using TopHat (version 1.0.14). TopHat was used with default settings with the exception of specifying an empirically determined mean inner-mate distance. After mapping reads to the genome and identifying splice junctions, the data were further analyzed using the transcript assembly and quantification software Cufflinks (version 0.9.3) using the sequence bias detection and correction option. 
Cufflinks was used in two modes: first, expression for genes and individual transcripts was quantified based on the GENCODE annotation, for both versions v3c and v4 of GENCODE GRCh37, and second, Cufflinks was run in de novo transcript assembly and quantification mode to obtain candidate novel transcript and gene models and expression estimates for them.
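The mean inner-mate distance mentioned in the data processing notes is the gap between the two mates of a pair, that is, the fragment length minus both read lengths. The record states that the value passed to TopHat was determined empirically and does not report it, so the following is only a sketch of the nominal value implied by the stated size selections; the function name is hypothetical.

```python
def nominal_inner_mate_distance(fragment_length: int, read_length: int) -> int:
    """Expected gap between paired-end mates for a given fragment length.

    Assumption: `fragment_length` excludes adapters and both mates have
    the same length. A negative result means the mates overlap.
    """
    return fragment_length - 2 * read_length


# Size selections named in the record (200 bp and 400 bp), 2x75 bp reads.
for fragment in (200, 400):
    print(fragment, nominal_inner_mate_distance(fragment, 75))
```

For 2x75 bp reads this gives 50 bp for the 200 bp libraries and 250 bp for the 400 bp replicates. An empirically determined value, as used in the record, would normally be estimated from the observed mapping positions of the mates rather than from the nominal size selection.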
Project description:Two biological replicates of Madin-Darby Canine Kidney Epithelial Cells grown as 3D cysts in Collagen Type I (7 days old) were exposed to six different concentrations of Hepatocyte Growth Factor (HGF) (0, 1.03, 2.07, 4.15, 8.33 and 16.67 ng/ml). Total RNA was isolated from the cysts after 12 hours of HGF induction. The data submitted here are the raw sequence files of 50 bp single-end reads for the 12 samples (2 replicates x 6 conditions), obtained by RNA sequencing on an Illumina HiSeq 2000.
Project description:This data was produced by the Wold lab at Caltech as part of the ENCODE Project. RNA-Seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-Seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing. The resulting sequence reads are then informatically mapped onto the genome sequence. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf The transcriptome measurements shown on these tracks were performed on polyA-selected RNA from total cellular RNA. Data have been produced in two formats: single reads, each of which comes from one end of a randomly primed cDNA molecule; and paired-end reads, which are obtained as pairs from both ends of cDNAs resulting from random priming. The resulting sequence reads are then informatically mapped onto the genome sequence (Alignments). Those that don't map to the genome are mapped to known RNA splice junctions (Splice Sites). These mapped reads are then counted to determine their frequency of occurrence at known gene models. Sequence reads that cluster at genome locations that lack an existing transcript model are also identified informatically and quantified. RNA-Seq is especially suited for giving information about RNA splicing patterns and for determining unequivocally the presence or absence of lower-abundance RNA classes. As performed here, internal RNA standards are used to assist in quantification and to provide internal process controls. This RNA-Seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is, apparently, not fully random. 
This is inferred from a sequence bias in the first residues of the read population, and this likely contributes to observed unevenness in sequence coverage across transcripts. These tracks show 1x32 n.t. or 2x75 n.t. sequence reads of cDNA obtained from biological replicate samples (different culture plates) of the ENCODE cell lines. The 32 n.t. sequences were aligned to the human genome (hg18) and UCSC known-gene splice junctions using different sequence alignment programs. The 2x75 n.t. reads were mapped serially, first with the Bowtie program (Langmead et al., 2009) against the genome and UCSC known-gene splice junctions (Splice Sites). Bowtie-unmapped reads were then mapped using BLAT to find evidence of novel splicing, by requiring at least 10 bp on the short-side of the splice.
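The short-side requirement for novel splice evidence described above can be illustrated with a small filter. This is a sketch under stated assumptions, not the Wold lab's implementation: the `SplicedHit` type and its fields are hypothetical, and a real pipeline would derive the two block lengths from the BLAT output.

```python
from typing import NamedTuple


class SplicedHit(NamedTuple):
    """Hypothetical summary of a read aligned across one splice junction."""
    left_block: int   # bases aligned on the upstream side of the junction
    right_block: int  # bases aligned on the downstream side


MIN_SHORT_SIDE = 10  # threshold stated in the record, in bp


def supports_novel_splice(hit: SplicedHit, min_short_side: int = MIN_SHORT_SIDE) -> bool:
    """Accept a spliced hit only if its shorter block meets the threshold."""
    return min(hit.left_block, hit.right_block) >= min_short_side


hits = [SplicedHit(65, 10), SplicedHit(70, 5), SplicedHit(40, 35)]
kept = [h for h in hits if supports_novel_splice(h)]
print(kept)
```

Requiring a minimum anchor on the shorter side guards against reads whose last few bases happen to match somewhere downstream by chance, which would otherwise be reported as spurious junctions.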
Project description:Purpose: To dissect the mechanisms underlying altered gene expression in aneuploids, we measured transcript abundance in colonies of haploid yeast strain F45 and derived strains, including strains disomic for chromosomes XV and XVI, using RNA-seq. F45 colonies display complex “fluffy” morphologies, while the disomic colonies are smooth, resembling laboratory strains. Methods: RNA-seq analysis was carried out on RNA isolated from fully developed S. cerevisiae colonies, grown on solid medium for four days, either in triplicate or quadruplicate. Stranded, paired-end sequencing was carried out in two batches. In the first batch, 2x51 bp sequencing was carried out on an Illumina HiSeq 2000, and in the second batch, 2x75 bp sequencing was carried out on an Illumina NextSeq. Read pairs were aligned using Bowtie2 (version 2.1.0) with the parameters [-N 1 -I 50 -X 450 -p 6 --reorder -x -S] and allowing 1 mismatch per read. Differential transcription was detected and quantified using edgeR (v. 3.6.8). Results: Our two disomes displayed similar transcriptional profiles, a phenomenon not driven by their shared smooth colony morphology nor specified purely by the karyotype. Surprisingly, the environmental stress response (ESR) was induced in euploid F45, relative to the two disomes, rather than vice versa. We also identified genes whose expression reflected a non-linear interaction between the copy number of a transcriptional regulatory gene on chromosome XVI, DIG1, and the copy number of other chromosome XVI genes. DIG1 and the remaining chromosome XVI genes also demonstrated distinct contributions to the effect of the chromosome XVI disome on ESR gene expression. Conclusions: Expression changes in aneuploids reflect a mixture of effects shared between different aneuploidies, including stress responses, and effects unique to perturbing the copy number of particular chromosomes, including non-linear copy number interactions between genes. 
The balance between these two phenomena is likely to be genotype and environment specific. Overall design: mRNA profiles of 4-day-old haploid F45 colonies, and colonies derived from F45, were generated by deep sequencing, in triplicate or quadruplicate, using Illumina HiSeq 2000 or Illumina NextSeq sequencing.
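The Bowtie2 parameters listed in the Methods can be assembled into a full command line as follows. This is a sketch, not the authors' script: the record gives `-x` and `-S` without values and does not mention the mate inputs, so the index prefix, FASTQ names, output name, and the `-1`/`-2` arguments below are hypothetical placeholders.

```python
import shlex


def bowtie2_command(index_prefix: str, mate1: str, mate2: str, out_sam: str) -> list[str]:
    """Build a Bowtie2 paired-end command using the flags from the record."""
    return [
        "bowtie2",
        "-N", "1",        # mismatches allowed in a seed alignment
        "-I", "50",       # minimum fragment length for a valid pair
        "-X", "450",      # maximum fragment length for a valid pair
        "-p", "6",        # number of alignment threads
        "--reorder",      # keep SAM records in input order
        "-x", index_prefix,
        "-1", mate1,      # assumption: paired inputs passed with -1 / -2
        "-2", mate2,
        "-S", out_sam,
    ]


# Hypothetical file names, for illustration only.
cmd = bowtie2_command("yeast_index", "sample_R1.fastq", "sample_R2.fastq", "sample.sam")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag next to its value and lets it be passed directly to `subprocess.run(cmd, check=True)` without shell quoting issues.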
Project description:Using RNA-seq, gene expression profiles were compared between wild-type Salmonella enterica Typhimurium 14028s and its isogenic srfM2 mutant, and genes differentially expressed in the srfM2 mutant were identified. Overall design: Three independent total RNA preparations from wild-type and srfM2 mutant cells were sequenced on the Illumina HiSeq 4000 in paired-end mode (2x75 bp).
Project description:Up until now, the existence of Dnmt2-mediated DNA methylation has mostly been supported by focal analyses in organisms that contain Dnmt2, but no Dnmt1 or Dnmt3 DNA methyltransferase. In these organisms, several independent studies have also provided support for a biologically important function of Dnmt2-dependent DNA methylation. For example, Dnmt2-dependent methylation in Entamoeba histolytica, the causative agent of amebic dysentery, has been connected to the parasite's virulence. However, global DNA methylation levels in Entamoeba have been found to be very low. In addition, no specific features, such as CpG specificity or specificity for certain genetic subcompartments, have been described. This distinguishes Dnmt2-dependent methylation patterns from all other known methylomes and has raised questions about the validity of the underlying results. We have used whole-genome bisulfite sequencing for an unbiased characterization of the Entamoeba histolytica methylome at single-base resolution in an E. histolytica strain HM-1:IMSS devoid of a significant level of EhDnmt2 (Ehmeth) expression. Paired-end BS-sequencing was performed on an Illumina Genome Analyzer with read lengths of 105 base pairs and an average insert size of 200 bp.
Project description:Comparison of TopHat alignments and assessment of spurious splice junctions for 32nt and 76nt read lengths. Total RNA from 2-week-old Arabidopsis thaliana (ecotype Columbia) seedlings grown on MS plates was isolated using the RNeasy Plant Mini Kit from Qiagen. To remove any contaminating DNA, RNA was treated with DNase. Isolation of poly(A) mRNA and preparation of the cDNA library were carried out using the Illumina TruSeq RNA kit. Sequencing (72 cycles) was done on an Illumina Genome Analyzer II. 2 replicates
Project description:Background: Next-generation sequencing technologies have facilitated differential gene expression analysis through RNA-seq and Tag-seq methods. RNA-seq has biases associated with transcript lengths, lacks uniform coverage of regions in mRNA and requires 10–20 times more reads than a typical Tag-seq. Most existing Tag-seq methods either have biases or are not high-throughput, due to the use of restriction enzymes, enzymatic manipulation of the 5′ ends of mRNA, or RNA ligations. Results: We have developed EXpression Profiling through Randomly Sheared cDNA tag Sequencing (EXPRSS), which employs acoustic waves to randomly shear cDNA and generate sequence tags at a relatively defined position (~150-200 bp) from the 3′ end of each mRNA. Implementation of the method was verified through comparative analysis of expression data generated from EXPRSS, NlaIII-DGE and Affymetrix microarray and through qPCR quantification of selected genes. EXPRSS is a strand-specific and restriction-enzyme-independent tag sequencing method that does not require cDNA length-based data transformations. EXPRSS is highly reproducible and high-throughput, and it also reveals alternative polyadenylation and polyadenylated antisense transcripts. It is cost-effective using barcoded multiplexing, avoids the biases of existing SAGE and derivative methods and can reveal polyadenylation position from paired-end sequencing. Conclusions: EXPRSS Tag-seq provides sensitive and reliable gene expression data and enables high-throughput expression profiling with relatively simple downstream analysis. Five-week-old Arabidopsis (Col-0) leaf discs were treated with water or flg22 for 60 min; mRNA profiles were generated by deep sequencing on Illumina GAIIx using EXPRSS and NlaIII-DGE protocols, in quadruplicate.