Differential motif enrichment analysis of paired ChIP-seq experiments.
ABSTRACT: Motif enrichment analysis of transcription factor ChIP-seq data can help identify transcription factors that cooperate or compete. Previously, little attention has been given to comparative motif enrichment analysis of pairs of ChIP-seq experiments, where the binding of the same transcription factor is assayed under different conditions. Such comparative analysis could potentially identify the distinct regulatory partners/competitors of the assayed transcription factor under different conditions or at different stages of development. We describe a new methodology for identifying sequence motifs that are differentially enriched in one set of DNA or RNA sequences relative to another set, and apply it to paired ChIP-seq experiments. We show that, using paired ChIP-seq data for a single transcription factor, differential motif enrichment analysis identifies all the known key transcription factors involved in the transformation of non-cancerous immortalized breast cells (MCF10A-ER-Src cells) into cancer stem cells, whereas non-differential motif enrichment analysis does not. We also show that differential motif enrichment analysis identifies regulatory motifs that are significantly enriched at constrained locations within the bound promoters, and that these motifs are not identified by non-differential motif enrichment analysis. Our methodology differs from other approaches in that it leverages both comparative enrichment and positional enrichment of motifs in ChIP-seq peak regions or in the promoters of genes bound by the transcription factor. We show that differential motif enrichment analysis of paired ChIP-seq experiments offers biological insights not available from non-differential analysis. In contrast to previous approaches, our method detects motifs that are enriched in a constrained region in one set of sequences, but not enriched in the same region in the comparative set.
We have enhanced the web-based CentriMo algorithm to allow it to perform the constrained differential motif enrichment analysis described in this paper, and CentriMo's on-line interface (http://meme.ebi.edu.au) provides dozens of databases of DNA- and RNA-binding motifs from a full range of organisms. All data and output files presented here are available at http://research.imb.uq.edu.au/t.bailey/supplementary_data/Lesluyes2014.
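To make the idea of differential enrichment in a constrained region concrete, here is a minimal sketch. It assumes that, for one motif and one region, we already know how many sequences in each set contain a site there, and it compares the two sets with a one-sided Fisher's exact test. The abstract does not state which statistic the method uses, so the choice of test, the function name and the argument layout are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical layout (not taken from the paper):
      a: primary-set sequences with a motif site in the region
      b: primary-set sequences without one
      c: comparative-set sequences with a site in the same region
      d: comparative-set sequences without one
    Returns P(X >= a) under the hypergeometric null, i.e. the chance of
    seeing at least this many sites in the primary set if the motif were
    equally frequent in both sets.
    """
    row1 = a + b          # size of the primary set
    col1 = a + c          # sequences with a site, both sets combined
    n = a + b + c + d     # all sequences
    denom = comb(n, row1)
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom
```

For example, `fisher_exact_greater(3, 0, 0, 3)` evaluates to 0.05: with three sequences per set, a motif found in all primary sequences and in none of the comparative ones is the single most extreme of 20 equally likely arrangements.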
Project description:BACKGROUND: Motif enrichment analysis (MEA) identifies over-represented transcription factor (TF) binding motifs in the DNA sequence of regulatory regions, enabling researchers to infer which transcription factors can regulate transcriptional response to a stimulus, or identify sequence features found near a target protein in a ChIP-seq experiment. Score-based MEA determines motifs enriched in regions exhibiting extreme differences in regulatory activity, but existing methods do not control for biases in GC content or dinucleotide composition. This lack of control for sequence biases, such as those often found in CpG islands, can obscure the enrichment of biologically relevant motifs. RESULTS: We developed Motif Enrichment In Ranked Lists of Peaks (MEIRLOP), a novel MEA method that determines enrichment of TF binding motifs in a list of scored regulatory regions, while controlling for sequence bias. In this study, we compare MEIRLOP against other MEA methods in identifying binding motifs found enriched in differentially active regulatory regions after interferon-beta stimulus, finding that using logistic regression and covariates improves the ability to call enrichment of ISGF3 binding motifs from differential acetylation ChIP-seq data compared to other methods. Our method achieves similar or better performance compared to other methods when quantifying the enrichment of TF binding motifs from ENCODE TF ChIP-seq datasets. We also demonstrate how MEIRLOP is broadly applicable to the analysis of numerous types of NGS assays and experimental designs. CONCLUSIONS: Our results demonstrate the importance of controlling for sequence bias when accurately identifying enriched DNA sequence motifs using score-based MEA. MEIRLOP is available for download from https://github.com/npdeloss/meirlop under the MIT license.
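The sequence-bias covariates mentioned in the MEIRLOP abstract above (GC content and dinucleotide composition) can be computed per region before any regression is fitted. The sketch below shows one plausible way to derive them. The function names are hypothetical and this is not MEIRLOP's own code; the logistic regression step itself is left out.

```python
from collections import Counter
from itertools import product


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # no unambiguous bases, e.g. a run of N
    return (seq.count("G") + seq.count("C")) / acgt


def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 dinucleotides in seq.

    Overlapping dinucleotides are counted; any pair containing an
    ambiguous base (such as N) is ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(pair) for pair in product("ACGT", repeat=2)]
    total = sum(counts[k] for k in keys)
    return {k: (counts[k] / total if total else 0.0) for k in keys}
```

Each scored region would then contribute one row of covariates (its GC fraction and, for instance, its CG dinucleotide frequency) alongside the indicator of whether the motif is present.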
Project description:Advances in high-throughput sequencing have resulted in rapid growth in large, high-quality datasets including those arising from transcription factor (TF) ChIP-seq experiments. While there are many existing tools for discovering TF binding site motifs in such datasets, most web-based tools cannot directly process such large datasets. The MEME-ChIP web service is designed to analyze ChIP-seq 'peak regions'--short genomic regions surrounding declared ChIP-seq 'peaks'. Given a set of genomic regions, it performs (i) ab initio motif discovery, (ii) motif enrichment analysis, (iii) motif visualization, (iv) binding affinity analysis and (v) motif identification. It runs two complementary motif discovery algorithms on the input data--MEME and DREME--and uses the motifs they discover in subsequent visualization, binding affinity and identification steps. MEME-ChIP also performs motif enrichment analysis using the AME algorithm, which can detect very low levels of enrichment of binding sites for TFs with known DNA-binding motifs. Importantly, unlike with the MEME web service, there is no restriction on the size or number of uploaded sequences, allowing very large ChIP-seq datasets to be analyzed. The analyses performed by MEME-ChIP provide the user with a varied view of the binding and regulatory activity of the ChIP-ed TF, as well as the possible involvement of other DNA-binding TFs. MEME-ChIP is available as part of the MEME Suite at http://meme.nbcr.net.
Project description:Deciphering the interplay between chromatin accessibility and transcription factor (TF) binding is fundamental to understanding transcriptional regulation, control of cellular states, and the establishment of new phenotypes. Recent genome-wide chromatin accessibility profiling studies have provided catalogs of putative open regions, where TFs can recognize their motifs and regulate gene expression programs. Here, we present motif enrichment in differential elements of accessibility (MEDEA), a computational tool that analyzes high-throughput chromatin accessibility genomic data to identify cell-type-specific accessible regions and lineage-specific motifs associated with TF binding therein. To benchmark MEDEA, we used a panel of reference cell lines profiled by ENCODE and curated by the ENCODE Project Consortium for the ENCODE-DREAM Challenge. By comparing results with RNA-seq data, ChIP-seq peaks, and DNase-seq footprints, we show that MEDEA improves the detection of motifs associated with known lineage specifiers. We then applied MEDEA to 610 ENCODE DNase-seq data sets, where it revealed significant motifs even when absolute enrichment was low and where it identified novel regulators, such as NRF1 in kidney development. Finally, we show that MEDEA performs well on both bulk and single-cell ATAC-seq data. MEDEA is publicly available as part of our Glossary-GENRE suite for motif enrichment analysis.
Project description:Recent advances in technology have led to a dramatic increase in the number of available transcription factor ChIP-seq and ChIP-chip data sets. Understanding the motif content of these data sets is an important step in understanding the underlying mechanisms of regulation. Here we provide a systematic motif analysis for 427 human ChIP-seq data sets using motifs curated from the literature and also discovered de novo using five established motif discovery tools. We use a systematic pipeline for calculating motif enrichment in each data set, providing a principled way for choosing between motif variants found in the literature and for flagging potentially problematic data sets. Our analysis confirms the known specificity of 41 of the 56 analyzed factor groups and reveals motifs of potential cofactors. We also use cell type-specific binding to find factors active in specific conditions. The resource we provide is accessible both for browsing a small number of factors and for performing large-scale systematic analyses. We provide motif matrices, instances and enrichments in each of the ENCODE data sets. The motifs discovered here have been used in parallel studies to validate the specificity of antibodies, understand cooperativity between data sets and measure the variation of motif binding across individuals and species.
Project description:Cis-regulatory modules (CRMs) function by binding sequence specific transcription factors, but the relationship between in vivo physical binding and the regulatory capacity of factor-bound DNA elements remains uncertain. We investigate this relationship for the well-studied Twist factor in Drosophila melanogaster embryos by analyzing genome-wide factor occupancy and testing the functional significance of Twist occupied regions and motifs within regions. Twist ChIP-seq data efficiently identified previously studied Twist-dependent CRMs and robustly predicted new CRM activity in transgenesis, with newly identified Twist-occupied regions supporting diverse spatiotemporal patterns (>74% positive, n = 31). Some, but not all, candidate CRMs require Twist for proper expression in the embryo. The Twist motifs most favored in genome ChIP data (in vivo) differed from those most favored by Systematic Evolution of Ligands by EXponential enrichment (SELEX) (in vitro). Furthermore, the majority of ChIP-seq signals could be parsimoniously explained by a CABVTG motif located within 50 bp of the ChIP summit and, of these, CACATG was most prevalent. Mutagenesis experiments demonstrated that different Twist E-box motif types are not fully interchangeable, suggesting that the ChIP-derived consensus (CABVTG) includes sites having distinct regulatory outputs. Further analysis of position, frequency of occurrence, and sequence conservation revealed significant enrichment and conservation of CABVTG E-box motifs near Twist ChIP-seq signal summits, preferential conservation of ±150 bp surrounding Twist occupied summits, and enrichment of GA- and CA-repeat sequences near Twist occupied summits. Our results show that high resolution in vivo occupancy data can be used to drive efficient discovery and dissection of global and local cis-regulatory logic.
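The CABVTG consensus in the Twist abstract above is written in IUPAC notation, where B stands for C, G or T and V stands for A, C or G. The following sketch expands such a consensus into a regular expression and lists the matches near a summit. Measuring "within 50 bp" from the midpoint of the site is an assumption, as are the function names; the abstract does not define the distance precisely.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that finds overlapping hits."""
    body = "".join(IUPAC[ch] for ch in consensus.upper())
    # A capturing group inside a lookahead reports overlapping matches too.
    return re.compile("(?=(" + body + "))")


def sites_near_summit(seq, summit, consensus="CABVTG", max_dist=50):
    """Return (start, site) pairs whose midpoint lies within max_dist of summit.

    Coordinates are 0-based offsets into seq. Only the given strand is
    scanned; CABVTG is its own reverse complement as a degenerate pattern,
    so for this particular consensus one strand is enough.
    """
    pattern = iupac_to_regex(consensus)
    half = len(consensus) // 2
    hits = []
    for match in pattern.finditer(seq.upper()):
        midpoint = match.start() + half
        if abs(midpoint - summit) <= max_dist:
            hits.append((match.start(), match.group(1)))
    return hits
```

With this expansion CACATG and CAGCTG both match, while CAAATG does not, because B excludes A at the third position.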
Project description:An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal key aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, genome wide event finding and motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1. 
The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.
Project description:Genome-wide binding data from transcription factor ChIP-seq experiments is the best source of information for inferring the relative DNA-binding affinity of these proteins in vivo. However, standard motif enrichment analysis and motif discovery approaches sometimes fail to correctly identify the binding motif for the ChIP-ed factor. To overcome this problem, we propose 'central motif enrichment analysis' (CMEA), which is based on the observation that the positional distribution of binding sites matching the direct-binding motif tends to be unimodal, well centered and maximal in the precise center of the ChIP-seq peak regions. We describe a novel visualization and statistical analysis tool--CentriMo--that identifies the region of maximum central enrichment in a set of ChIP-seq peak regions and displays the positional distributions of predicted sites. Using CentriMo for motif enrichment analysis, we provide evidence that one transcription factor (Nanog) has different binding affinity in vivo than in vitro, that another binds DNA cooperatively (E2f1), and confirm the in vivo affinity of NFIC, rescuing a difficult ChIP-seq data set. In another data set, CentriMo strongly suggests that there is no evidence of direct DNA binding by the ChIP-ed factor (Smad1). CentriMo is now part of the MEME Suite software package available at http://meme.nbcr.net. All data and output files presented here are available at: http://research.imb.uq.edu.au/t.bailey/sd/Bailey2011a.
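The central enrichment idea in the CentriMo abstract above can be illustrated with a simple test: if the best site in each peak region were placed uniformly at random, how surprising is the number that fall in a central window? The sketch below uses a binomial tail for one fixed window. The abstract does not describe the statistic, and the actual tool searches for the region of maximum enrichment, which this sketch does not attempt; the names and the uniform null model are assumptions.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, i) * p ** i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def central_enrichment_pvalue(best_site_starts, seq_len, motif_width, window):
    """Binomial p-value for best sites clustering in a central window.

    best_site_starts: 0-based start of the best-scoring site in each
        sequence (all sequences are assumed to have length seq_len).
    window: width in bp of the central window being tested.
    Null model (an assumption): each site start is equally likely at
    every position where the motif fits.
    """
    n_positions = seq_len - motif_width + 1
    centre = seq_len / 2

    def in_window(start):
        # A site counts as central if its midpoint is inside the window.
        return abs(start + motif_width / 2 - centre) <= window / 2

    p_null = sum(in_window(s) for s in range(n_positions)) / n_positions
    k = sum(in_window(s) for s in best_site_starts)
    return binomial_tail(k, len(best_site_starts), p_null)
```

A small p-value for a narrow window suggests the unimodal, well-centered site distribution that the abstract associates with direct binding by the ChIP-ed factor.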
Project description:Recognition of composite elements consisting of two transcription factor binding sites underlies studies of tissue-, stage- and condition-specific transcription. Genome-wide data on transcription factor binding generated by ChIP-seq facilitate the identification of composite elements, but existing bioinformatics tools either require ChIP-seq datasets for both partner transcription factors or omit composite elements with overlapping motifs. Here we present a universal Motifs Co-Occurrence Tool (MCOT) that retrieves maximum information about overrepresented composite elements from a single ChIP-seq dataset. This includes homo- and heterotypic composite elements in four mutual orientations of motifs, separated by a spacer or overlapping, even if recognition of the motifs within a composite element requires different stringencies. Analysis of 52 ChIP-seq datasets for 18 human transcription factors confirmed that, for over 60% of the analyzed datasets and transcription factors, the predicted co-occurrence of motifs implied an experimentally proven protein-protein interaction of the respective transcription factors. Analysis of 164 ChIP-seq datasets for 57 mammalian transcription factors showed that predicted composite elements with overlapping motifs were more than twice as abundant as those with a spacer, and that they had a 1.5-fold increase in asymmetrical pairs of motifs, with one more conserved 'leading' motif and another 'guided' one.
Project description:Although ChIP-seq has become a routine experimental approach for quantitatively characterizing the genome-wide binding of transcription factors (TFs), computational analysis procedures remain far from standardized, making it difficult to compare ChIP-seq results across experiments. In addition, although genome-wide binding patterns must ultimately be determined by local constellations of DNA-binding sites, current analysis is typically limited to identifying enriched motifs in ChIP-seq peaks. Here we present Crunch, a completely automated computational method that performs all ChIP-seq analysis from quality control through read mapping and peak detection and that integrates comprehensive modeling of the ChIP signal in terms of known and novel binding motifs, quantifying the contribution of each motif and annotating which combinations of motifs explain each binding peak. By applying Crunch to 128 data sets from the ENCODE Project, we show that Crunch outperforms current peak finders and find that TFs naturally separate into "solitary TFs," for which a single motif explains the ChIP peaks, and "cobinding TFs," for which multiple motifs co-occur within peaks. Moreover, for most data sets, the motifs that Crunch identified de novo outperform known motifs, and both the set of cobinding motifs and the top motif of solitary TFs are consistent across experiments and cell lines. Crunch is implemented as a web server, enabling standardized analysis of any collection of ChIP-seq data sets by simply uploading raw sequencing data. Results are provided both in a graphical web interface and as downloadable files.
Project description:Combinatorial transcription factor (TF) binding is essential for cell-type-specific gene regulation. However, much remains to be learned about the mechanisms of TF interactions, including to what extent constrained spacing and orientation of interacting TFs are critical for regulatory element activity. To examine the relative prevalence of the 'enhanceosome' versus the 'TF collective' model of combinatorial TF binding, a comprehensive analysis of TF binding site sequences in large scale datasets is necessary. We developed a motif-pair discovery pipeline to identify motif co-occurrences with preferential distance(s) between motifs in TF-bound regions. Utilizing a compendium of 289 mouse haematopoietic TF ChIP-seq datasets, we demonstrate that haematopoietic-related motif-pairs commonly occur with highly conserved constrained spacing and orientation between motifs. Furthermore, motif clustering revealed specific associations for both heterotypic and homotypic motif-pairs with particular haematopoietic cell types. We also showed that disrupting the spacing between motif-pairs significantly affects transcriptional activity in a well-known motif-pair (E-box and GATA) and in two previously unknown motif-pairs with constrained spacing (Ets and Homeobox, as well as Ets and E-box). In this study, we provide evidence for widespread sequence-specific TF pair interaction with DNA that conforms to the 'enhanceosome' model, and furthermore identify associations between specific haematopoietic cell-types and motif-pairs.
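A preferential distance between two motifs, as discussed in the motif-pair abstract above, shows up as a peak in the histogram of distances between their sites. The sketch below builds that histogram from site start positions that are assumed to be already known for each TF-bound region. It ignores orientation, which the abstract also considers, and the function name and input layout are hypothetical rather than taken from the authors' pipeline.

```python
from collections import Counter


def spacing_histogram(regions, max_gap=50):
    """Count signed start-to-start distances between motif A and motif B sites.

    regions: iterable of (starts_a, starts_b) pairs, one per region, where
        each element is a list of 0-based site start positions.
    max_gap: site pairs further apart than this (in bp) are ignored.
    A positive distance means the B site starts downstream of the A site.
    """
    hist = Counter()
    for starts_a, starts_b in regions:
        for a in starts_a:
            for b in starts_b:
                distance = b - a
                if abs(distance) <= max_gap:
                    hist[distance] += 1
    return hist
```

Calling `spacing_histogram(regions).most_common(1)` returns the most frequent spacing and its count. Whether that peak is statistically meaningful would still need to be tested against a background, for example shuffled site positions.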