Project description:This track depicts high throughput sequencing of long RNAs (>200 nt) from RNA samples from tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Sample preparation and sequencing: K562 and GM12878 total cell, total RNA: Standard Illumina Pair-end kit with the sole exception that a "tagged" random hexamer was used to prime the 1st strand synthesis: 5'-ACTGTAGGN6-3'. The addition of this tag is what permits us to make strand assignments for the reads. The sequence of the tag is reported in the 5' end of the read. Asymmetric PCR can place the tag on either the 1st or 2nd read depending on which strand it used as a template. Strand assignments are made by looking for the tag at the 5' end of either read 1 or read 2. Read 1 is physically linked to read 2. Therefore, if a tag is present on one end strand assignments are made for both ends. We noted during analysis that the tags are generally 5' truncated. We only "strand" reads that contain ACTGTAGG, CTGTAGG, TGTAGG, GTAGG. Between 63-68% of reads could be stranded in these libraries. It is possible to cull additional stranded reads that contain non-templated TAGG, AGG, GG, or G sequences at their 5' end. The peak in insert size distribution is between 200-250 nucleotides. K562 cytosol, polyA+ RNA: Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline pyrophosphatase to eliminate any 5' cap structures and hydrolyzed to ~200 bases via alkaline hydrolysis. The 3' end was repaired using calf intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the addition of Cs to the 3' end. The 5' end was phosphorylated using T4 PNK, and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension. This cloning protocol generated stranded reads that were read from the 5' ends of the inserts. The library was sequenced on a Solexa platform for a total of 36 cycles; however, the reads underwent post-processing, resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable. Analysis: K562 and GM12878 total cell, total RNA: Tags were removed from the 5' ends of the reads in accordance to their lengths and strand assignments made. Subsequently, the reads were trimmed from their 3' ends to a final length of 50 nucleotides and were mapped using NexAlign, a program developed by Timo Lassman, RIKEN. We allowed up to 2 mismatches across the entire length and only report reads that mapped to a single/unique locus in the assembled hg18 genome. K562 cytosol, polyA+ RNA: Reads were mapped to the human (hg18, March 2006) assembly using Nexalign, with only uniquely mapping (one loci), exactly matching (no mis-matches) aligned reads reported in the processed files, as follows: 1) Collect the read sequences from Illumina non-filtered output files. 2) Filter out all reads that contain undefined nucleotides ('N'). 3) Perform iterative alignment/C-tail chopping algorithm (below). On each alignment step, the reads are aligned to the genome with 100% identity. All reads that align to a single locus are withdrawn from the alignment pool and only the reads that could not be aligned continue to the next step. a) Align to the hg18 genome using Nexalign 1.3.3 (© Timo Lassmann) without chopping off any nucleotides. b) Chop off any C-blocks (until the first non-C) at the ends of the reads. c) Align to the genome -> remove and save those that align. d) Chop off any non-Cs until the next C. e) Chop off C-block until the next non-C. f) Align to the genome -> remove and save those that align. g) Repeat steps d, e, and f until the reads align to the genome, or chopping results in the reduction of the reads' lengths to below 16 (default), or there are no non-Cs left.

Project description:This data was produced by the Wold lab at Caltech as part of the ENCODE Project. RNA-Seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-Seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high throughput DNA sequencing. The resulting sequence reads are then informatically mapped onto the genome sequence. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf The transcriptome measurements shown on these tracks were performed on polyA selected RNA from total cellular RNA. Data have been produced in two formats: single reads, each of which comes from one end of a randomly primed cDNA molecule; and paired-end reads, which are obtained as pairs from both ends cDNAs resulting from random priming. The resulting sequence reads are then informatically mapped onto the genome sequence (Alignments). Those that don't map to the genome are mapped to known RNA splice junctions (Splice Sites). These mapped reads are then counted to determine their frequency of occurrence at known gene models. Sequence reads that cluster at genome locations that lack an existing transcript model are also identified informatically and they are quantified. RNA-Seq is especially suited for giving information about RNA splicing patterns and for determining unequivocally the presence or absence of lower abundance class RNAs. As performed here, internal RNA standards are used to assist in quantification and to provide internal process controls. This RNA-Seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is, apparently, not fully random. This is inferred from a sequence bias in the first residues of the read population, and this likely contributes to observed unevenness in sequence coverage across transcripts. These tracks show 1x32 n.t. or 2x75 n.t. sequence reads of cDNA obtained from biological replicate samples (different culture plates) of the ENCODE cell lines. The 32 n.t. sequences were aligned to the human genome (hg18) and UCSC known-gene splice junctions using different sequence alignment programs. The 2x75 n.t. reads were mapped serially, first with the Bowtie program (Langmead et al., 2009) against the genome and UCSC known-gene splice junctions (Splice Sites). Bowtie-unmapped reads were then mapped using BLAT to find evidence of novel splicing, by requiring at least 10 bp on the short-side of the splice.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Yijun Ruan mailto:ruanyj@gis.a-star.edu.sg). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Transcriptome Project. It shows the starts and ends of full length mRNA transcripts determined by GIS paired-end ditag (PET) sequencing using RNA extracts (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=rnaExtract) from different sub-cellular localizations (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=localization) in different cell lines (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=cellType). The RNA-PET information provided in this track is composed of two different PET length versions based on how the PETs were extracted. The cloning-based PET (18 bp and 16 bp) is an earlier version and detailed information can be found from reference (Ng et al. 2006). The cloning-free PET (25 bp and 25 bp) is a recently modified version which uses Type II enzyme EcoP15I to generate a longer length of PET (unpublished), which results in a significant enhancement in both library construction and mapping efficiency. Both versions of PET templates were sequenced by Illumina platform at 2 x 36 bp Paired End sequencing. See the Methods and References sections below for more details. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell). Two different GIS RNA-PET protocols were used to generate the full length transcriptome PETs: one is based on a cloning-free RNA-PET library construction and sequencing strategy (unpublished), and the other is a cloning-based library construction (Ng et al. 2005) and recent Illumina paired end sequencing. Cloning-free RNA-PET (50 bp reads, 25 bp and 25 bp tag for each of the 5' and 3' ends)--Method: The cloning-free RNA-PET libraries were generated from polyA mRNA samples and constructed using a recently modified GIS protocol (unpublished). Total RNA in good quality was used as starting material and purified through MACs polyT column to obtain full length polyA mRNAs. Approximately 5 micrograms of enriched polyA mRNA were used for reverse transcription to convert polyA mRNA to full length cDNA. The obtained full length cDNA was modified and ligated with specific linker sequences, followed by circularization through ligation to generate circular cDNA molecules. The 25 bp tag from each end of the full length cDNA was extracted by type II enzyme EcoP15I digestion. The resulting PETs were ligated with sequencing adaptors at the both ends, amplified by PCR, and further purified as complex templates for paired end (PE) sequencing using Illumina platforms. Data: The sequenced RNA-PETs are unified in 25/25 bp length from each end of a cDNA. After filtering out redundant and noise tags, the unique PETs will proceed to analysis pipeline. Initially, the orientation of each tag will be screened out by the barcode built in the sequencing-template, then paired into a given orientation-PET. The orientation-determined RNA-PET is mapped onto reference genome allowing up to two mismatches. Majority of PETs are mapped on the known transcripts, or splice variants. A small portion of misaligned PETs, defined as discordant PETs, are mapped either too far from each tag, have wrong orientations, or mapped in different chromosomes, indicating exist some transcription variations which could be caused by genome structure variations: such as fusion, deletion, insertion, inversion, tandem repeat and translocation; or RNA trans-splicing etc. Cloning-based RNA-PET (34 bp reads, 18 bp and 16 bp tag for each of the 5' and 3' ends)--Method: The cloning-based RNA-PET (GIS-PET) libraries were generated from polyA RNA samples and constructed using the protocol described by Ng et al. (2005). Total RNA in good quality was used as starting material and further purified through MACs polyT column to enrich polyA mRNA and remove any contaminants (e.g., rRNA, tRNA, DNA, protein etc). Approximately 10 micrograms of polyA mRNA were then used for reverse transcription to convert polyA mRNA into full length cDNA. The obtained full length cDNA was modified with specific linker sequences, then, ligated to a GIS-developed (pGIS4) vector to form a complex full length cDNA library, which was cloned into E. coli. The plasmid DNA was then isolated from the library, followed by MmeI (a type II enzyme) digestion to generate a final length of 18 bp/16 bp ditags from each end of the full length cDNA. The single ditag (or called PET) was then ligated to form a diPET structure (a concatemer with two unrelated PET linked by a linker sequence) to facilitate Illuminaa Paired End sequencing. Data: The cloning-based RNA-PETs are unified in 18 bp and 16 bp length, respectively extracted from 5' and 3' end of each cDNA. The redundant reads were filtered out initially and unique ones were included for analysis. PET sequences were then mapped to (GRCh37, hg19, excluding mitochondirion, haplotypes, randoms and chromosome Y) reference genome using the following specific criteria (Ruan et al. 2007): A minimal continuous 16 bp match must exist for the 5' signature; the 3' signature must have a minimal continuous 14 bp match. Both 5' and 3' signatures must be present on the same chromosome. Their 5' to 3' orientation must be correct (5' signature followed by 3' signature). The maximal genomic span of a PET genomic alignment must be less than one million bp. PETs mapping to 2-10 locations are also included and may represent duplicated genes or pseudogenes in the genome. A majority of PETs mapped on the known transcripts or splice variants. A small portion of misaligned PETs, defined as discordant PETs, were mapped either too far from each other, mapped in the wrong orientation, or mapped to different chromosomes, indicating that some transcription variations exist which could be caused by genome structure variations: such as fusion, deletion, insertion, inversion, tandem repeat and translocation; or RNA trans-splicing etc. Clusters: To cluster the PETs the following procedure was applied: the mapping location of the 5' and 3' tag of a given PET was extended by 100 bp in both directions creating 5' and 3' search windows. If the 5' and 3' tags of a second PET mapped within the 5' and 3' search window of the first PET then the two PETs were clustered and the search windows were adjusted so that they contained the tag extensions of the second PET. PETs which subsequently mapped with their 5' and 3' tags within the adjusted 5' and 3' search window, respectively, were also assigned to this cluster and search window readjusted. This iterative process continued till no new PET was found to fall within the search window, at which stage all the found PETs are classified as belonging to a single cluster. This process is repeated till all PETs are assigned to a cluster. Verification: To assess overall PET quality and mapping specificity, the top ten most abundant PET clusters that mapped to well-characterized known genes were examined. Over 99% of the PETs represented full-length transcripts, and the majority fell within 10 bp of the known 5' and 3' boundaries of these transcripts. The PET mapping was further verified by confirming the existence of physical cDNA clones represented by the ditags. PCR primers were designed based on the PET sequences and amplified the corresponding cDNA inserts either from full length cDNA library (cloning-based PET) or from total RNA isolate (cloning-free PET) for sequencing confirmation.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. This track shows DNaseI sensitivity measured genome-wide in different cell lines using the Digital DNaseI methodology (see below), and DNaseI hypersensitive sites. DNaseI has long been used to map general chromatin accessibility and DNaseI hypersensitivity is a universal feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, insulators, promotors, locus control regions and novel elements. For each experiment (cell type) this track shows DNaseI sensitivity as a continuous function using sequencing tag density (Raw Signal), and discrete loci of DNaseI sensitive zones (HotSpots) and hypersensitive sites (Peaks)." For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (36 bp reads). Uniquely mapping high-quality reads were mapped to the genome. DNaseI sensitivity is directly reflected in raw tag density (Raw Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI sensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Piero Carninci mailto:carninci@riken.jp). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track shows 5' cap analysis gene expression (CAGE) tags and clusters in RNA extracts (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=rnaExtract) from different sub-cellular localizations (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=localization) in multiple cell lines (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=cellType). A CAGE cluster is a region of overlapping tags with an assigned value that represents the expression level. The data in this track were produced as part of the ENCODE Transcriptome Project. Release 2 has three new downloads only files per experiment (Clusters, TSS Gencode 7 and TSS HMM) and four new cell lines (A459, AG04450, BJ and SK-N-SH_RA). Release 1 on hg19 contained the original data on hg18 (http://hgwdev.cse.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeRikenCage) that was remapped and indicated in this release as Generation 0 since that data had no replicates. If there is both old and new generation data available for a particular experiment, only the new generation data is displayed and the older data is available for download. The new data for this track was done with a different process and has standard replicate numbers. The replicate labeling in the genome browser view is a counter indicating the total number of replicates submitted. The producing lab has replicate numbers that correspond to their internal bio-replicate numbering. Where these two numbering systems conflict, both are listed in the long label of the specific track. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell). RNA molecules longer than 200 nt were isolated from each subcellular compartment and then were fractionated into polyA+ and polyA- fractions as described in these protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/general/rnaExtracts.txt). The CAGE tags were sequenced from the 5' ends of cap-trapped cDNAs produced using RIKEN CAGE technology (Kodzius et al. 2006; Valen et al. 2009). To create the tag, a linker was attached to the 5' end of polyA+ or polyA- reverse-transcribed cDNAs which were selected by cap trapping (Carninci et al. 1996). The first 27 bp of the cDNA were cleaved using class II restriction enzymes. A linker was then attached to the 3' end of the cDNA. After PCR amplification, the tags were sequenced (36 bp single reads) using Illumina's Genome analyzer. Tags were mapped to the human genome (hg19) using the program delve (T. Lassmann manuscript in preparation). Delve is a new probabilistic aligner focused on giving the best possible alignment of reads to a genome rather than focusing on speed. It calculates the mapping accuracy (probability of each alignment being true or not) for each alignment. There is no set limit on the number of errors allowed and therefore the mapping rate is commonly 100%. However, for analysis it is recommended to discard alignments with low mapping qualities. Exceptions to the above protocol are the polyA- RNA samples from K562 cytosol, K562 nucleus, and prostate whole cell which were sequenced using ABI SOLiD (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing.html) technology. These reads were mapped using Bowtie using default parameters. Clusters were defined as regions of overlapping CAGE reads. The expression level was computed as the number of reads making up the cluster, divided by the total number of reads sequenced, times 1 million.

Project description:RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on the Illumina HiSeq sequencer. The transcriptome measurements shown on these tracks were performed on polyA selected RNA (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization). PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Paired-end 2x100 bp reads were obtained from each end of a cDNA fragment. Reads were aligned to the mm9 human reference genome using TopHat (Trapnell et al., 2009), a program specifically designed to align RNA-seq reads and discover splice junctions de novo. All sequence and alignments files are available at http://hgwdev.cse.ucsc.edu/cgi-bin/hgFileUi?db=mm9&g=wgEncodeCaltechRnaSeq. Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell/mouse). Cells were lysed in RLT buffer (Qiagen RNEasy kit), and processed on RNEasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNAse digestion step to remove residual genomic DNA. A quantity of 75 µgs of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. A quantity of 100 ngs of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Illumina GAIIx or HiSeq platforms according to the protocol for the ChIP-Seq DNA genomic DNA kit (Illumina). Paired-end libraries were size-selected around 200 bp (fragment length). Libraries were sequenced with the Illumina HiSeq according to the manufacturer's recommendations. Paired-end reads of 100 bp length were obtained. Reads were mapped to the reference mouse genome (version mm9 with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes in all cases) using TopHat (version 1.3.1) (http://tophat.cbcb.umd.edu/). TopHat was used with default settings with the exception of specifying an empirically determined mean inner-mate distance and supplying known ENSEMBL version 63 splice junctions.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Yijun Ruan mailto:ruanyj@gis.a-star.edu.sg). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track, produced as part of the ENCODE Project, contains deep sequencing DNase data that will be used to identify sites where regulatory factors bind to the genome (footprints). Footprinting is a technique used to define the DNA sequences that interact with and bind DNA-binding proteins, such as transcription factors, zinc-finger proteins, hormone-receptor complexes, and other chromatin-modulating factors like CTCF. The technique depends upon the strength and tight nature of protein-DNA interactions. In their native chromatin state, DNA sequences that interact directly with DNA-binding proteins are relatively protected from DNA degrading endonucleases, while the exposed/unbound portions are readily degraded by such endonucleases. A massively parallel next-generation sequencing technique to define the DNase hypersensitive sites in the genome was adopted. Sequencing these next-generation-sequencing DNase samples to significantly higher depths of 300-fold or greater produces a base-pair level resolution of the DNase susceptibility maps of the native chromatin state. These base-pair resolution maps represent and are dependent upon the nature and the specificity of interaction of the DNA with the regulatory/modulatory proteins binding at specific loci in the genome; thus they represent the native chromatin state of the genome under investigation. The deep sequencing approach has been used to define the footprint landscape of the genome by identifying DNA motifs that interact with known or novel DNA binding proteins. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, followed by isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (27 bp reads). High-quality reads were mapped to the GRCh37/hg19 human genome using Bowtie 0.12.5 (Eland was used to map to NCBI36/hg18); only unique mappings were kept. DNaseI sensitivity is directly reflected in raw tag density (Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI hypersensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). False discovery rate thresholds of 1.0% (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36-mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within 1.0% (FDR 0.01) hypersensitive zones using a peak-finding algorithm. Only DNase Solexa libraries from unique cell types producing the highest quality data, as defined by Percent Tags in Hotspots (PTIH ~40%) were designated for deep sequencing to a depth of over 200 million tags.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. This track displays maps of genome-wide binding of the CTCF transcription factor in different cell lines using ChIP-seq high-throughput sequencing For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Cells were crosslinked with 1% formaldehyde, and the reaction was quenched by the addition of glycine. Fixed cells were rinsed with PBS, lysed in nuclei lysis buffer, and the chromatin was sheared to 200-500 bp fragments using Fisher Dismembrator (model 500). Sheared chromatin fragments were immunoprecipitated with specific polyclonal antibodies at 4 degrees C with gentle rotation. Antibody-chromatin complexes were washed and eluted. The cross linking in immunoprecipitated DNA was reversed and treated with RNase-A. Following proteinase K treatment, the DNA fragments were purified by phenol-chloroform-isoamyl alcohol extraction and ethanol precipitation. 20-50 ng of ChIP DNA was end-repaired, adenine ligated to Illumina adapters was added, and then a Solexa library was made for sequencing. ChIP-seq affinity is directly reflected in raw tag density (Raw Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). ChIP-seq affinity zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. ChIP-seq affinity (Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Florencia Pauli mailto:fpauli@hudsonalpha.org). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE project. The track reports the percentage of DNA molecules that exhibit cytosine methylation at specific CpG dinucleotides. In general, DNA methylation within a gene's promoter is associated with gene silencing, and DNA methylation within the exons and introns of a gene is associated with gene expression. Proper regulation of DNA methylation is essential during development and aberrant DNA methylation is a hallmark of cancer. DNA methylation status is assayed at more than 500,000 CpG dinucleotides in the genome using Reduced Representation Bisulfite Sequencing (RRBS). Genomic DNA is digested with the methyl-insensitive restriction enzyme MspI, small genomic DNA fragments are purified by gel electrophoresis, and then used to construct an Illumina sequencing library. The library fragments are treated with sodium bisulfite and amplified by PCR to convert every unmethylated cytosine to a thymidine while leaving methylated cytosines intact. The sequenced fragments are aligned to a customized reference genome sequence and for each assayed CpG we report the number of sequencing reads covering that CpG and the percentage of those reads that are methylated. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf DNA methylation at CpG sites was assayed with a modified version of Reduced Representation Bisulfite Sequencing (RRBS; Meissner et al., 2008). RRBS was performed on cell lines grown by many ENCODE production groups. The production group that grew the cells and isolated genomic DNA is indicated in the "obtainedBy" field of the metadata. When a cell type was provided by more than one lab, the data for the cells from only one lab are displayed in the table above. However, the data for every cell type from every lab is available from the Downloads page. RRBS was carried out by the Myers production group at the HudsonAlpha Institute for Biotechnology. Isolation of genomic DNA Genomic DNA is isolated from biological replicates of each cell line using the QIAGEN DNeasy Blood & Tissue Kit according to the instructions provided by the manufacturer. DNA concentrations for each genomic DNA preparation are determined using fluorescent DNA binding dye and a fluorometer (Invitrogen Quant-iT dsDNA High Sensitivity Kit and Qubit Fluorometer). Typically, 1 µg of DNA is used to make an RRBS library; however, we have also had success in making libraries with 200 ng genomic DNA from rare or precious samples. RRBS library construction and sequencing RRBS library construction starts with MspI digestion of genomic DNA , which cuts at every CCGG regardless of methylation status. Klenow exo- DNA Polymerase is then used to fill in the recessed end of the genomic DNA and add an adenosine as a 3prime overhang. Next, a methylated version of the Illumina paired-end adapters is ligated onto the DNA. Adapter ligated genomic DNA fragments between 105 and 185 basepairs are selected using agarose gel electrophoresis and Qiagen Qiaquick Gel Extraction Kit. The selected adapter-ligated fragments are treated with sodium bisulfite using the Zymo Research EZ DNA Methylation Gold Kit, which converts unmethylated cytosines to uracils and leaves methylated cytosines unchanged. Bisulfite treated DNA is amplified in a final PCR reaction which has been optimized to uniformly amplify diverse fragment sizes and sequence contexts in the same reaction. During this final PCR reaction uracils are copied as thymines resulting in a thymine in the PCR products wherever an unmethylated cytosine existed in the genomic DNA. The sample is now ready for sequencing on the Illumina sequencing platform. These libraries were sequenced with an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Data analysis To analyze the sequence data, a reference genome is created that contains only the 36 base pairs adjacent to every MspI site and every C in those sequences is changed to T. A converted sequence read file is then created by changing each C in the original sequence reads to a T. The converted sequence reads are aligned to the converted reference genome, and only reads that map uniquely to the reference genome are kept. Once reads are aligned the percent methylation is calculated for each CpG using the original sequence reads. The percent methylation and number of reads is reported for each CpG.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track, produced as part of the ENCODE Project, contains deep sequencing DNase data that will be used to identify sites where regulatory factors bind to the genome (footprints). Footprinting is a technique used to define the DNA sequences that interact with and bind DNA-binding proteins, such as transcription factors, zinc-finger proteins, hormone-receptor complexes, and other chromatin-modulating factors like CTCF. The technique depends upon the strength and tight nature of protein-DNA interactions. In their native chromatin state, DNA sequences that interact directly with DNA-binding proteins are relatively protected from DNA degrading endonucleases, while the exposed/unbound portions are readily degraded by such endonucleases. A massively parallel next-generation sequencing technique to define the DNase hypersensitive sites in the genome was adopted. Sequencing these next-generation-sequencing DNase samples to significantly higher depths of 300-fold or greater produces a base-pair level resolution of the DNase susceptibility maps of the native chromatin state. These base-pair resolution maps represent and are dependent upon the nature and the specificity of interaction of the DNA with the regulatory/modulatory proteins binding at specific loci in the genome; thus they represent the native chromatin state of the genome under investigation. The deep sequencing approach has been used to define the footprint landscape of the genome by identifying DNA motifs that interact with known or novel DNA binding proteins. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, followed by isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (27 bp reads). High-quality reads were mapped to the GRCh37/hg19 human genome using Bowtie 0.12.5 (Eland was used to map to NCBI36/hg18); only unique mappings were kept. DNaseI sensitivity is directly reflected in raw tag density (Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI hypersensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). False discovery rate thresholds of 1.0% (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36-mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within 1.0% (FDR 0.01) hypersensitive zones using a peak-finding algorithm. Only DNase Solexa libraries from unique cell types producing the highest quality data, as defined by Percent Tags in Hotspots (PTIH ~40%) were designated for deep sequencing to a depth of over 200 million tags.

Dataset Information

ENCODE Genome Institute of Singapore RNA-Seq

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets