BioTile, a Perl-based tool for the identification of differentially enriched regions in tiling microarray data.
ABSTRACT: Genome-wide tiling array experiments are increasingly used for the analysis of DNA methylation. Because DNA methylation patterns are tissue and cell type specific, the detection of differentially methylated regions (DMRs) with small effect size is a necessary feature of tiling microarray 'peak' finding algorithms, as cellular heterogeneity within a studied tissue may lead to a dilution of the phenotypically relevant effects. Additionally, the ability to detect short DMRs is necessary, as biologically relevant signal may occur in focused regions throughout the genome. We present a free open-source Perl application, Binding Intensity Only Tile array analysis or "BioTile", for the identification of differentially enriched regions (DERs) in tiling array data. The application of BioTile to non-smoothed data allows for the identification of shorter and smaller effect-size DERs, while correcting for probe-specific variation by inversely weighting on probe variance through a permutation-corrected meta-analysis procedure employed at identified regions. BioTile exhibits higher power to identify significant DERs of low effect size and across shorter genomic stretches as compared to other peak finding algorithms, while not sacrificing power to detect longer DERs. BioTile represents an easy-to-use analysis option applicable to multiple microarray platforms, allowing for its integration into array data analysis workflows.
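The abstract does not give BioTile's formulas, so the following is only a minimal Python sketch of the general idea it names: combine per-probe group differences within a candidate region using inverse-variance weights, then judge the combined estimate against a label-permutation null. The function names (`ivw_estimate`, `permutation_p`), the use of a simple mean difference per probe, and the variance floor are all assumptions made for illustration and are not taken from the BioTile source.

```python
import random
from statistics import mean, variance


def ivw_estimate(case, control, eps=1e-12):
    """Inverse-variance-weighted mean difference across probes in one region.

    case, control: lists of samples; each sample is a list holding one
    intensity per probe (same probe order in every sample).
    eps is a hypothetical floor that avoids division by zero variance.
    """
    n_probes = len(case[0])
    num = 0.0
    den = 0.0
    for p in range(n_probes):
        a = [sample[p] for sample in case]
        b = [sample[p] for sample in control]
        diff = mean(a) - mean(b)
        # Variance of the difference of two group means for this probe.
        var = variance(a) / len(a) + variance(b) / len(b)
        weight = 1.0 / max(var, eps)  # noisy probes contribute less
        num += weight * diff
        den += weight
    return num / den


def permutation_p(case, control, n_perm=1000, seed=0):
    """Two-sided permutation p-value obtained by shuffling sample labels."""
    rng = random.Random(seed)
    observed = abs(ivw_estimate(case, control))
    pooled = case + control  # new outer list; inputs are not reordered
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_case = pooled[:len(case)]
        perm_control = pooled[len(case):]
        if abs(ivw_estimate(perm_case, perm_control)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

In this sketch a probe with high between-sample variance receives a small weight, so a single noisy probe cannot dominate a short region. Each group needs at least two samples, because the sample variance is undefined otherwise.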
Project description:Previous studies examining the reproductive health of alligators in Florida lakes indicate that a variety of developmental and health impacts can be attributed to a combination of environmental quality and exposures to environmental contaminants. The majority of these environmental contaminants have been shown to disrupt normal endocrine signaling. The potential that these environmental conditions and contaminants may influence epigenetic status and correlate to the health abnormalities was investigated in the current study. The red blood cell (RBC) (erythrocyte) in the alligator is nucleated, so it was used as an easily purified marker cell to investigate epigenetic programming. RBCs were collected from adult male alligators captured at three sites in Florida, each characterized by varying degrees of contamination. While Lake Woodruff (WO) has remained relatively pristine, Lake Apopka (AP) and Merritt Island (MI) convey exposures to different suites of contaminants. DNA was isolated and methylated DNA immunoprecipitation (MeDIP) was used to isolate methylated DNA that was then analyzed in a competitive hybridization using a genome-wide alligator tiling array for a MeDIP-Chip analysis. Pairwise comparisons of alligators from AP and MI to WO revealed alterations in the DNA methylome. The AP vs. WO comparison identified 85 differential DNA methylation regions (DMRs) with ≥3 adjacent oligonucleotide tiling array probes and 15,451 DMRs with a single oligo probe analysis. The MI vs. WO comparison identified 75 DMRs with the ≥3 oligo probe and 17,411 DMRs with the single oligo probe analysis. There was negligible overlap between the DMRs identified in AP vs. WO and MI vs. WO comparisons. In both comparisons DMRs were primarily associated with CpG deserts, which are regions of low CpG density (1-2 CpG/100 bp).
Although the alligator genome is not fully annotated, gene associations were identified and correlated to major gene class functional categories and pathways of endocrine relevance. Observations demonstrate that environmental quality may be associated with epigenetic programming and health status in the alligator. The epigenetic alterations may provide biomarkers to assess the environmental exposures and health impacts on these populations of alligators.
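The distinction between single-probe DMRs and DMRs supported by at least three adjacent probes can be illustrated with a short Python sketch. It assumes probes are already sorted by genomic position and that each probe has been flagged as significant or not by some upstream test; the study's actual significance criteria are not given in the abstract, and `find_runs` is a hypothetical helper name.

```python
def find_runs(flags, min_len=3):
    """Return inclusive (start, end) index pairs for runs of True values
    that are at least min_len probes long.

    flags: per-probe booleans in genomic order (an assumed input format).
    """
    runs = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i  # a new run begins
        elif not flagged and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None  # the run has ended
    # Close a run that extends to the last probe.
    if start is not None and len(flags) - start >= min_len:
        runs.append((start, len(flags) - 1))
    return runs
```

Calling the function with `min_len=1` reports every flagged stretch, including isolated probes, which is why a single-probe analysis yields far more candidate regions than the adjacent-probe criterion.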
Project description:Genomic imprinting arises from allele-specific epigenetic modifications that are established during gametogenesis and that are maintained throughout somatic development. These parental-specific modifications include DNA methylation and post-translational modifications to histones, which create allele-specific active and repressive domains at imprinted regions. Through the use of a high-density genomic tiling array, we generated DNA and histone methylation profiles at 11 imprinted gene clusters in the mouse from DNA and from chromatin immunoprecipitated from sperm, heart, and cerebellum. Our analysis revealed that despite high levels of differential DNA methylation at non-CpG islands within these regions, imprinting control regions (ICRs) and secondary differentially methylated regions (DMRs) were identified by an overlapping pattern of H3K4 trimethylation (active chromatin) and H3K9 trimethylation (repressive chromatin) modifications in somatic tissue, and a sperm differentially methylated region (sDMR; sperm not equal somatic tissue). Using these features as a common signature of DMRs, we identified 11 unique regions that mapped to known imprinted genes, to uncharacterized genes, and to intergenic regions flanking known imprinted genes. A common feature among these regions was the presence of a CpG island and an array of tandem repeats. Collectively, this study provides a comprehensive analysis of DNA methylation and histone H3K4me3 and H3K9me3 modifications at imprinted gene clusters, and identifies common epigenetic and genetic features of regions regulating genomic imprinting.
Project description:BACKGROUND: Tiling-arrays are applicable to multiple types of biological research questions. Owing to its advantages (high sensitivity, high resolution, and lack of bias), the technology is often employed in genome-wide investigations. A major challenge in the analysis of tiling-array data is to define regions-of-interest, i.e., contiguous probes with increased signal intensity (as a result of hybridization of labeled DNA) in a region. Currently, no standard criteria are available to define these regions-of-interest: there is no single probe intensity cut-off level, and different regions-of-interest can contain various numbers of probes and can vary in genomic width. Furthermore, the chromosomal distance between neighboring probes can vary across the genome and among different arrays. RESULTS: We have developed Hypergeometric Analysis of Tiling-arrays (HAT), and first evaluated its performance for tiling-array datasets from a Chromatin Immunoprecipitation study on chip (ChIP-on-chip) for the identification of genome-wide DNA binding profiles of transcription factor Cebpa (used for method comparison). Using this assay, we can refine the detection of regions-of-interest by illustrating that regions detected by HAT are more highly enriched for expected motifs in comparison with an alternative detection method (MAT). Subsequently, data from a retroviral insertional mutagenesis screen were used to examine the performance of HAT among different applications of tiling-array datasets. In both studies, detected regions-of-interest have been validated with (q)PCR. CONCLUSIONS: We demonstrate that HAT has increased specificity for analysis of tiling-array data in comparison with the alternative method, and that it accurately detects regions-of-interest in two different applications of tiling-arrays.
HAT has several advantages over previous methods: i) as there is no single cut-off level for probe-intensity, HAT can detect regions-of-interest at various thresholds, ii) it can detect regions-of-interest of any size, iii) it is independent of probe-resolution across the genome, and across tiling-array platforms and iv) it employs a single user defined parameter: the significance level. Regions-of-interest are detected by computing the hypergeometric-probability, while controlling the Family Wise Error. Furthermore, the method does not require experimental replicates, common regions-of-interest are indicated, a sequence-of-interest can be examined for every detected region-of-interest, and flanking genes can be reported.
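As a rough illustration of a hypergeometric probability for a candidate region, the Python sketch below computes the standard upper-tail probability: given N probes on the array of which K exceed some intensity threshold, how likely is it that a window of n probes contains k or more such probes by chance? This is the textbook calculation, not HAT's published procedure, and the Bonferroni step is only a simple stand-in for family-wise error control; both function names are hypothetical.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total probes, K: probes above threshold on the whole array,
    n: probes in the window, k: above-threshold probes in the window.
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Assumed stand-in for family-wise error control (Bonferroni)."""
    return p_value <= alpha / n_tests
```

Because the calculation depends only on counts of probes, it is unaffected by uneven probe spacing, which is in line with the abstract's claim that the method is independent of probe resolution.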
Project description:The most widely used method for detecting genome-wide protein-DNA interactions is chromatin immunoprecipitation on tiling microarrays, commonly known as ChIP-chip. Here, we conducted the first objective analysis of tiling array platforms, amplification procedures, and signal detection algorithms in a simulated ChIP-chip experiment. Mixtures of human genomic DNA and "spike-ins" comprised of nearly 100 human sequences at various concentrations were hybridized to four tiling array platforms by eight independent groups. Blind to the number of spike-ins, their locations, and the range of concentrations, each group made predictions of the spike-in locations. We found that microarray platform choice is not the primary determinant of overall performance. In fact, variation in performance between labs, protocols, and algorithms within the same array platform was greater than the variation in performance between array platforms. However, each array platform had unique performance characteristics that varied with tiling resolution and the number of replicates, which have implications for cost versus detection power. Long oligonucleotide arrays were slightly more sensitive at detecting very low enrichment. On all platforms, simple sequence repeats and genome redundancy tended to result in false positives. LM-PCR and WGA, the most popular sample amplification techniques, reproduced relative enrichment levels with high fidelity. Performance among signal detection algorithms was heavily dependent on array platform. The spike-in DNA samples and the data presented here provide a stable benchmark against which future ChIP platforms, protocol improvements, and analysis methods can be evaluated.
Project description:Changes in DNA methylation patterns are a common characteristic of cancer cells. Recent studies suggest that DNA methylation affects not only discrete genes, but it can also affect large chromosomal regions, potentially leading to long-range epigenetic silencing (LRES). It is unclear whether such long-range epigenetic events are relatively rare or frequent occurrences in cancer. Here, we use a high-resolution promoter tiling array approach to analyze DNA methylation in breast cancer specimens and normal breast tissue to address this question. We identified 3,506 cancer-specific differentially methylated regions (DMR) in human breast cancer with 2,033 being hypermethylation events and 1,473 hypomethylation events. Most of these DMRs are recurrent in breast cancer; 90% of the identified DMRs occurred in at least 33% of the samples. Interestingly, we found a nonrandom spatial distribution of aberrantly methylated regions across the genome that showed a tendency to concentrate in relatively small genomic regions. Such agglomerates of hypermethylated and hypomethylated DMRs spanned up to several hundred kilobases and were frequently found at gene family clusters. The hypermethylation events usually occurred in the proximity of the transcription start site in CpG island promoters, whereas hypomethylation events were frequently found in regions of segmental duplication. One example of a newly discovered agglomerate of hypermethylated DMRs associated with gene silencing in breast cancer that we examined in greater detail involved the protocadherin gene family clusters on chromosome 5 (PCDHA, PCDHB, and PCDHG). Taken together, our results suggest that agglomerative epigenetic aberrations are frequent events in human breast cancer.
Project description:BACKGROUND: Growing evidence suggests that DNA methylation plays a role in tissue-specific differentiation. Current approaches to methylome analysis using enrichment with the methyl-binding domain protein (MBD) are restricted to large (≥1 μg) DNA samples, limiting the analysis of small tissue samples. Here we present a technique that enables characterization of genome-wide tissue-specific methylation patterns from nanogram quantities of DNA. RESULTS: We have developed a methodology utilizing MBD2b/MBD3L1 enrichment for methylated DNA, kinase pre-treated ligation-mediated PCR amplification (MeKL) and hybridization to the comprehensive high-throughput array for relative methylation (CHARM) customized tiling arrays, which we termed MeKL-chip. Kinase modification in combination with the addition of PEG has increased ligation-mediated PCR amplification over 20-fold, enabling >400-fold amplification of starting DNA. We have shown that MeKL-chip can be applied to as little as 20 ng of DNA, enabling comprehensive analysis of small DNA samples. Applying MeKL-chip to the mouse retina (a limited tissue source) and brain, 2,498 tissue-specific differentially methylated regions (T-DMRs) were characterized. The top five T-DMRs (Rgs20, Hes2, Nfic, Cckbr and Six3os1) were validated by pyrosequencing. CONCLUSIONS: MeKL-chip enables genome-wide methylation analysis of nanogram quantities of DNA with a wide range of observed-to-expected CpG ratios due to the binding properties of the MBD2b/MBD3L1 protein complex. This methodology enabled the first analysis of genome-wide methylation in the mouse retina, characterizing novel T-DMRs.
Project description:Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) or ChIP followed by genome tiling array analysis (ChIP-chip) have become standard technologies for genome-wide identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that allow identification of binding sites from ChIP-seq or ChIP-chip datasets and subsequent visualization in the University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are distributed widely across the genome. We have developed ChIPpeakAnno as a Bioconductor package within the statistical programming environment R to facilitate batch annotation of enriched peaks identified from ChIP-seq, ChIP-chip, cap analysis of gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding sites annotated with ChIPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented functionalities for determining the significance of overlap between replicates or binding sites among transcription factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent genes. ChIPpeakAnno enables batch annotation of the binding sites identified from ChIP-seq, ChIP-chip, CAGE or any technology that results in a large number of enriched genomic regions within the statistical programming environment R.
Allowing users to pass their own annotation data such as a different Chromatin immunoprecipitation (ChIP) preparation and a dataset from literature, or existing annotation packages, such as GenomicFeatures and BSgenome, provides flexibility. Tight integration to the biomaRt package enables up-to-date annotation retrieval from the BioMart database.
Project description:Gene expression maps for model organisms, including Arabidopsis thaliana, have typically been created using gene-centric expression arrays. Here, we describe a comprehensive expression atlas, Arabidopsis thaliana Tiling Array Express (At-TAX), which is based on whole-genome tiling arrays. We demonstrate that tiling arrays are accurate tools for gene expression analysis and identified more than 1,000 unannotated transcribed regions. Visualizations of gene expression estimates, transcribed regions, and tiling probe measurements are accessible online at the At-TAX homepage.
Project description:Statistical analysis on tiling array data is extremely challenging due to the astronomically large number of sequence probes, high noise levels of individual probes and limited number of replicates in these data. To overcome these difficulties, we first developed statistical error estimation and weighted ANOVA modeling approaches to high-density tiling array data, especially the former based on an advanced error-pooling method to accurately obtain heterogeneous technical error of small-sample tiling array data. Based on these approaches, we analyzed the high-density tiling array data of the temporal replication patterns during cell-cycle S phase of synchronized HeLa cells on human chromosomes 21 and 22. We found many novel temporal replication patterns, identifying about 26% of over 1 million tiling array sequence probes with significant differential replication during the four 2-h time periods of S phase. Among these differentially replicated probes, 126 941 sequence probes were matched to 417 known genes. The majority of these genes were found to be replicated within one or two consecutive time periods, while the others were replicated at two non-consecutive time periods. Also, coding regions were found to be more differentially replicated in particular time periods than noncoding regions in the gene-poor chromosome 21 (25% differentially replicated among genic probes versus 18.6% among intergenic probes), while such a phenomenon was less prominent in gene-rich chromosome 22. Rigorous statistical testing for local proximity of differentially replicated genic and intergenic probes was performed to identify significant stretches of differentially replicated sequence regions. From this analysis, we found that adjacent genes were frequently replicated at different time periods, potentially implying the existence of quite dense replication origins.
Evaluating the conditional probability significance of identified gene ontology terms on chromosomes 21 and 22, we detected some over-represented molecular functions and biological processes among these differentially replicated genes, such as the ones relevant to hydrolase, transferase and receptor-binding activities. Some of these results were confirmed showing >70% consistency with cDNA microarray data that were independently generated in parallel with the tiling arrays. Thus, our improved analysis approaches specifically designed for high-density tiling array data enabled us to reliably and sensitively identify many novel temporal replication patterns on human chromosomes.
Project description:The genetic basis of phenotypic variation can be partially explained by the presence of copy-number variations (CNVs). Currently available methods for CNV assessment include high-density single-nucleotide polymorphism (SNP) microarrays that have become an indispensable tool in genome-wide association studies (GWAS). However, insufficient concordance rates between different CNV assessment methods call for cautious interpretation of results from CNV-based genetic association studies. Here we provide a cross-population, microarray-based map of copy-number variant regions (CNVRs) to enable reliable interpretation of CNV association findings. We used the Affymetrix Genome-Wide Human SNP Array 6.0 to scan the genomes of 1167 individuals from two ethnically distinct populations (Europe, N=717; Rwanda, N=450). Three different CNV-finding algorithms were tested and compared for sensitivity, specificity, and feasibility. Two algorithms were subsequently used to construct CNVR maps, which were also validated by processing subsamples with additional microarray platforms (Illumina 1M-Duo BeadChip, Nimblegen 385K aCGH array) and by comparing our data with publicly available information. Both algorithms detected a total of 42669 CNVs, 74% of which clustered in 385 CNVRs of a cross-population map. These CNVRs overlap with 862 annotated genes and account for approximately 3.3% of the haploid human genome. We created comprehensive cross-populational CNVR-maps. They represent an extendable framework that can leverage the detection of common CNVs and additionally assist in interpreting CNV-based association studies.