Metabolomics,Unknown,Transcriptomics,Genomics,Proteomics

Dataset Information

Small RNA sequencing data of breast cancer cell lines (Illumina iDEA Challenge 2011)

ABSTRACT: Small RNA sequencing has been performed using eight commonly used breast cancer cell lines by Illumina as part of the Illumina iDEA Challange 2011. Total RNA of the cell lines was isolated and the sequencing library was prepared using Illuminas Small RNA v1.5 library prep protocol. This data was analyzed with Flicker, an add-on to the Illumina Genome Analyzer Pipeline software that allows for the preliminary analysis of small RNA runs on the GA. Flicker was used in this case for: * Trimming: Attempts to trim the adaptor sequence from small RNA reads Trimming is accomplished by aligning the adaptor with each read, both internally (the adaptor is fully contained within the read) and as a suffix (the adaptor overlaps the 3' end of the read). A similarity matrix is used to generate the matching score, taking into account both substitutions and indels. The alignment with the best matching score is returned, and that best alignment is trimmed from the read. A trimming threshold may be specified, and the adaptor is not trimmed unless the best score exceeds the threshold. * Aligning with Iterated ELAND: Attempts to aligns trimmed reads to targets (squashed genomes or databases such as miRBase) using the ELAND short read aligner. Each read is aligned by ELAND with specified alignment targets. ELAND was designed to align reads of a uniform length, and the trimmed reads from a small RNA run are several different lengths. We currently address this by running ELAND for reads of each length: reads are binned by length, and each bin of a given length is aligned against a specified target. Once all bins are aligned against all targets (in this case, gDNA and miRBase), the results are recombined into a single file with all the annotation information. ===== Key to Experiments HCT ID, Lane, Sample ID, Cell Line, ER +/- Status, Reads HCT20061, Lane 1, HS378, MCF7, +, 14,555,390 HCT20062, Lane 2, HS379, MDA-MB-231, -, 15,147,037 HCT20063, Lane 3, HS381, T47D, +, 14,701,254 HCT20064, Lane 4, HS377, BT-20, -, 16,529,823 HCT20065, Lane 5, HS375, BT-474, +, 16,443,361 HCT20066, Lane 6, HS380, MDA-MB-468, -, 16,746,981 HCT20067, Lane 7, HS382, ZR-75-1, +, 17,631,005 HCT20068, Lane 8, HS376, MCF10A, normal, 19,155,607

INSTRUMENT(S): Illumina Genome Analyzer

ORGANISM(S): Homo sapiens

SUBMITTER: Cindy Korner

PROVIDER: E-MTAB-4539 | biostudies-arrayexpress |

REPOSITORIES: biostudies-arrayexpress

ACCESS DATA

Similar Datasets

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Jonathan Preall jpreall@cshl.edu (Generation 0 Data from Hannon Lab), Carrie Davis davisc@cshl.edu (experimental), Alex Dobin dobin@cshl.edu (computational), Wei Lin wlin@cshl.edu (computational), Tom Gingeras gingeras@cshl.edu (primary investigator)). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). hg18: This data was produced by Hannon lab part of Cold Spring Harbor as part of the ENCODE Project. The series depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or sub cellular compartments of cell lines. hg19: This track depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or sub cellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. hg19: This cloning protocol generates directional libraries that are read from the 5' ends of the inserts, which should largely correspond to the 5' ends of the mature RNAs. The libraries were sequenced on a Solexa platform for a total of 36, 50 or 76 cycles however the reads undergo post-processing resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf hg18: Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and high abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5' cap structure. Poly-A Polymerase was used to catalyze the addition of C's to the 3' end. The 5' ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially 1 lane is run. If an appreciable number of mappable reads are obtained, additional lanes are run. Sequence reads underwent quality filtration using Illumina standard pipeline (Gerlad). The read lengths may exceed the insert sizes and consequently introduce 3' adaptor sequence into the 3' end of the reads. The 3' sequencing adaptor was removed from the reads using a custom clipper program, which aligned the adaptor sequence to the short-reads, allowing up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (NCBI build 36, hg18 unmasked) using Nexalign (Lassmann et al., not published). The alignment parameters are tuned to tolerate up to 2 mismatches with no indels and will allow for trimmed portions as small as 5 nucleotides to be mapped. We report reads that mapped 10 or fewer times. Data obtained from each lane is processed and mapped independently. The processed/mapped data from each lane is then complied as a single track without additional processing and submitted to UCSC. Consequently, identical reads within a lane were collapsed and their value is reported as the "transfrag" signal value. However, the redundancy between lanes has not been eliminated so the same transfrag may appear multiple times within a signal. hg19: Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and high abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5' cap structures. Poly-A Polymerase was used to catalyze the addition of C's to the 3' end. The 5' ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially, one lane was run. If an appreciable number of mappable reads were obtained, additional lanes were run. Sequence reads underwent quality filtration using Illumina standard pipeline (GERALD). The Illumina reads were initially trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. The read lengths may exceed the insert sizes and consequently introduce 3' adapter sequence into the 3' end of the reads. The 3' sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short-reads using up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. Terminal C nucleotides introduced at the 3' end of the RNA via the cloning procedure are also trimmed. The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (version hg19, using the gender build appropriate to the sample in question - female/male) using Bowtie (Langmead B. et al). The alignment parameter allowed 0, 1, or 2 mismatches iteratively. We report reads that mapped 20 or fewer times. Discrepancies between hg18 and hg19 versions of CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses different strategies to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. The read processing pipeline also varies slightly, in that we no longer retain information regarding whether a read was 'clipped' off adapter sequence.

Project description:Background High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. Results To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithmsâSolexaQA, Trimmomatic, and ConDeTriâto perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates. Conclusions We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates. Four biological replicates of 100 Drosophila melanogaster larval multi-dendritic sensory neurons were profiled by mRNA-Seq

Project description:Background High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. Results To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms—SolexaQA, Trimmomatic, and ConDeTri—to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates. Conclusions We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates.

Project description:This track depicts high throughput sequencing of long RNAs (>200 nt) from RNA samples from tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Sample preparation and sequencing: K562 and GM12878 total cell, total RNA: Standard Illumina Pair-end kit with the sole exception that a "tagged" random hexamer was used to prime the 1st strand synthesis: 5'-ACTGTAGGN6-3'. The addition of this tag is what permits us to make strand assignments for the reads. The sequence of the tag is reported in the 5' end of the read. Asymmetric PCR can place the tag on either the 1st or 2nd read depending on which strand it used as a template. Strand assignments are made by looking for the tag at the 5' end of either read 1 or read 2. Read 1 is physically linked to read 2. Therefore, if a tag is present on one end strand assignments are made for both ends. We noted during analysis that the tags are generally 5' truncated. We only "strand" reads that contain ACTGTAGG, CTGTAGG, TGTAGG, GTAGG. Between 63-68% of reads could be stranded in these libraries. It is possible to cull additional stranded reads that contain non-templated TAGG, AGG, GG, or G sequences at their 5' end. The peak in insert size distribution is between 200-250 nucleotides. K562 cytosol, polyA+ RNA: Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline pyrophosphatase to eliminate any 5' cap structures and hydrolyzed to ~200 bases via alkaline hydrolysis. The 3' end was repaired using calf intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the addition of Cs to the 3' end. The 5' end was phosphorylated using T4 PNK, and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension. This cloning protocol generated stranded reads that were read from the 5' ends of the inserts. The library was sequenced on a Solexa platform for a total of 36 cycles; however, the reads underwent post-processing, resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable. Analysis: K562 and GM12878 total cell, total RNA: Tags were removed from the 5' ends of the reads in accordance to their lengths and strand assignments made. Subsequently, the reads were trimmed from their 3' ends to a final length of 50 nucleotides and were mapped using NexAlign, a program developed by Timo Lassman, RIKEN. We allowed up to 2 mismatches across the entire length and only report reads that mapped to a single/unique locus in the assembled hg18 genome. K562 cytosol, polyA+ RNA: Reads were mapped to the human (hg18, March 2006) assembly using Nexalign, with only uniquely mapping (one loci), exactly matching (no mis-matches) aligned reads reported in the processed files, as follows: 1) Collect the read sequences from Illumina non-filtered output files. 2) Filter out all reads that contain undefined nucleotides ('N'). 3) Perform iterative alignment/C-tail chopping algorithm (below). On each alignment step, the reads are aligned to the genome with 100% identity. All reads that align to a single locus are withdrawn from the alignment pool and only the reads that could not be aligned continue to the next step. a) Align to the hg18 genome using Nexalign 1.3.3 (© Timo Lassmann) without chopping off any nucleotides. b) Chop off any C-blocks (until the first non-C) at the ends of the reads. c) Align to the genome -> remove and save those that align. d) Chop off any non-Cs until the next C. e) Chop off C-block until the next non-C. f) Align to the genome -> remove and save those that align. g) Repeat steps d, e, and f until the reads align to the genome, or chopping results in the reduction of the reads' lengths to below 16 (default), or there are no non-Cs left.

Dataset Information

Small RNA sequencing data of breast cancer cell lines (Illumina iDEA Challenge 2011)

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets