ABSTRACT: MOTIVATION:Increasing evidence has shown that nucleotide modifications such as methylation and hydroxymethylation on cytosine would greatly impact the binding of transcription factors (TFs). However, there is a lack of motif finding algorithms with the function to search for motifs with modified bases. In this study, we expand on our previous motif finding pipeline Epigram to provide systematic de novo motif discovery and performance evaluation on methylated DNA motifs. RESULTS:mEpigram outperforms both MEME and DREME on finding modified motifs in simulated data that mimics various motif enrichment scenarios. Furthermore we were able to identify methylated motifs in Arabidopsis DNA affinity purification sequencing (DAP-seq) data that were previously demonstrated to contain such motifs. When applied to TF ChIP-seq and DNA methylome data in H1 and GM12878, our method successfully identified novel methylated motifs that can be recognized by the TFs or their co-factors. We also observed spacing constraint between the canonical motif of the TF and the newly discovered methylated motifs, which suggests operative recognition of these cis-elements by collaborative proteins. AVAILABILITY AND IMPLEMENTATION:The mEpigram program is available at http://wanglab.ucsd.edu/star/mEpigram. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:Advances in high-throughput sequencing have resulted in rapid growth in large, high-quality datasets including those arising from transcription factor (TF) ChIP-seq experiments. While there are many existing tools for discovering TF binding site motifs in such datasets, most web-based tools cannot directly process such large datasets.The MEME-ChIP web service is designed to analyze ChIP-seq 'peak regions'--short genomic regions surrounding declared ChIP-seq 'peaks'. Given a set of genomic regions, it performs (i) ab initio motif discovery, (ii) motif enrichment analysis, (iii) motif visualization, (iv) binding affinity analysis and (v) motif identification. It runs two complementary motif discovery algorithms on the input data--MEME and DREME--and uses the motifs they discover in subsequent visualization, binding affinity and identification steps. MEME-ChIP also performs motif enrichment analysis using the AME algorithm, which can detect very low levels of enrichment of binding sites for TFs with known DNA-binding motifs. Importantly, unlike with the MEME web service, there is no restriction on the size or number of uploaded sequences, allowing very large ChIP-seq datasets to be analyzed. The analyses performed by MEME-ChIP provide the user with a varied view of the binding and regulatory activity of the ChIP-ed TF, as well as the possible involvement of other DNA-binding TFs.MEME-ChIP is available as part of the MEME Suite at http://meme.nbcr.net.
Project description:High-throughput ChIP-seq studies typically identify thousands of peaks for a single transcription factor (TF). It is common for traditional motif discovery tools to predict motifs that are statistically significant against a naïve background distribution but are of questionable biological relevance.We describe a simple yet effective algorithm for discovering differential motifs between two sequence datasets that is effective in eliminating systematic biases and scalable to large datasets. Tested on 207 ENCODE ChIP-seq datasets, our method identifies correct motifs in 78% of the datasets with known motifs, demonstrating improvement in both accuracy and efficiency compared with DREME, another state-of-art discriminative motif discovery tool. More interestingly, on the remaining more challenging datasets, we identify common technical or biological factors that compromise the motif search results and use advanced features of our tool to control for these factors. We also present case studies demonstrating the ability of our method to detect single base pair differences in DNA specificity of two similar TFs. Lastly, we demonstrate discovery of key TF motifs involved in tissue specification by examination of high-throughput DNase accessibility data.The motifRG package is publically available via the bioconductor firstname.lastname@example.orgSupplementary data are available at Bioinformatics online.
Project description:MOTIVATION:The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. RESULTS:We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. AVAILABILITY AND IMPLEMENTATION:Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. SUPPLEMENTARY INFORMATION:Supplementary materials are available at Bioinformatics online.
Project description:Measuring motif similarity is essential for identifying functionally related transcription factors (TFs) and RNA-binding proteins, and for annotating de novo motifs. Here, we describe Motif Similarity Based on Affinity of Targets (MoSBAT), an approach for measuring the similarity of motifs by computing their affinity profiles across a large number of random sequences. We show that MoSBAT successfully associates de novo ChIP-seq motifs with their respective TFs, accurately identifies motifs that are obtained from the same TF in different in vitro assays, and quantitatively reflects the similarity of in vitro binding preferences for pairs of TFs.MoSBAT is available as a webserver at mosbat.ccbr.utoronto.ca, and for download at github.com/csglab/MoSBAT.email@example.com or firstname.lastname@example.orgSupplementary information: Supplementary data are available at Bioinformatics online.
Project description:Current methods for motif discovery from chromatin immunoprecipitation followed by sequencing (ChIP-seq) data often identify non-targeted transcription factor (TF) motifs, and are even further limited when peak sequences are similar due to common ancestry rather than common binding factors. The latter aspect particularly affects a large number of proteins from the Cys2His2 zinc finger (C2H2-ZF) class of TFs, as their binding sites are often dominated by endogenous retroelements that have highly similar sequences. Here, we present recognition code-assisted discovery of regulatory elements (RCADE) for motif discovery from C2H2-ZF ChIP-seq data. RCADE combines predictions from a DNA recognition code of C2H2-ZFs with ChIP-seq data to identify models that represent the genuine DNA binding preferences of C2H2-ZF proteins. We show that RCADE is able to identify generalizable binding models even from peaks that are exclusively located within the repeat regions of the genome, where state-of-the-art motif finding approaches largely fail.RCADE is available as a webserver and also for download at http://rcade.ccbr.utoronto.ca/.Supplementary data are available at Bioinformatics email@example.com.
Project description:(1) Background: Transcription factors (TFs) are main regulators of eukaryotic gene expression. The cooperative binding to genomic DNA of at least two TFs is the widespread mechanism of transcription regulation. Cooperating TFs can be revealed through the analysis of co-occurrence of their motifs. (2) Methods: We applied the motifs co-occurrence tool (MCOT) that predicted pairs of spaced or overlapped motifs (composite elements, CEs) for a single ChIP-seq dataset. We improved MCOT capability for the prediction of asymmetric CEs with one of the participating motifs possessing higher conservation than another does. (3) Results: Analysis of 119 ChIP-seq datasets for 45 human TFs revealed that almost for all families of TFs the co-occurrence with an overlap between motifs of target TFs and more conserved partner motifs was significantly higher than that for less conserved partner motifs. The asymmetry toward partner TFs was the most clear for partner motifs of TFs from the ETS (E26 Transformation Specific) family. (4) Conclusion: Co-occurrence with an overlap of less conserved motif of a target TF and more conserved motifs of partner TFs explained a substantial portion of ChIP-seq data lacking conserved motifs of target TFs. Among other TF families, conservative motifs of TFs from ETS family were the most prone to mediate interaction of target TFs with its weak motifs in ChIP-seq.
Project description:Chromatin endogenous cleavage (ChEC) uses fusion of a protein of interest to micrococcal nuclease (MNase) to target calcium-dependent cleavage to specific genomic loci in vivo. Here we report the combination of ChEC with high-throughput sequencing (ChEC-seq) to map budding yeast transcription factor (TF) binding. Temporal analysis of ChEC-seq data reveals two classes of sites for TFs, one displaying rapid cleavage at sites with robust consensus motifs and the second showing slow cleavage at largely unique sites with low-scoring motifs. Sites with high-scoring motifs also display asymmetric cleavage, indicating that ChEC-seq provides information on the directionality of TF-DNA interactions. Strikingly, similar DNA shape patterns are observed regardless of motif strength, indicating that the kinetics of ChEC-seq discriminates DNA recognition through sequence and/or shape. We propose that time-resolved ChEC-seq detects both high-affinity interactions of TFs with consensus motifs and sites preferentially sampled by TFs during diffusion and sliding.
Project description:A key step in the regulation of gene expression is the sequence-specific binding of transcription factors (TFs) to their DNA recognition sites. However, elucidating TF binding site (TFBS) motifs in higher eukaryotes has been challenging, even when employing cross-species sequence conservation. We hypothesized that for human and mouse, many orthologous genes expressed in a similarly tissue-specific manner in both human and mouse gene expression data, are likely to be co-regulated by orthologous TFs that bind to DNA sequence motifs present within noncoding sequence conserved between these genomes.We performed automated motif searching and merging across four different motif finding algorithms, followed by filtering of the resulting motifs for those that contain blocks of information content. Applying this motif finding strategy to conserved noncoding regions surrounding co-expressed tissue-specific human genes allowed us to discover both previously known, and many novel candidate, regulatory DNA motifs in all 18 tissue-specific expression clusters that we examined. For previously known TFBS motifs, we observed that if a TF was expressed in the specified tissue of interest, then in most cases we identified a motif that matched its TRANSFAC motif; conversely, of all those discovered motifs that matched TRANSFAC motifs, most of the corresponding TF transcripts were expressed in the tissue(s) corresponding to the expression cluster for which the motif was found.Our results indicate that the integration of the results from multiple motif finding tools identifies and ranks highly more known and novel motifs than does the use of just one of these tools. In addition, we believe that our simultaneous enrichment strategies helped to identify likely human cis regulatory elements. A number of the discovered motifs may correspond to novel binding site motifs for as yet uncharacterized tissue-specific TFs. We expect this strategy to be useful for identifying motifs in other metazoan genomes.
Project description:In response to activating signals, transcription factors (TFs) bind DNA and regulate gene expression. TF binding can be measured by protection of the bound sequence from DNase digestion (i.e., footprint). Here, we report that 80% of TF binding motifs do not show a measurable footprint, partly because of a variable cleavage pattern within the motif sequence. To more faithfully portray the effect of TFs on chromatin, we developed an algorithm that captures two TF-dependent effects on chromatin accessibility: footprinting and motif-flanking accessibility. The algorithm, termed bivariate genomic footprinting (BaGFoot), efficiently detects TF activity. BaGFoot is robust to different accessibility assays (DNase-seq, ATAC-seq), all examined peak-calling programs, and a variety of cut bias correction approaches. BaGFoot reliably predicts TF binding and provides valuable information regarding the TFs affecting chromatin accessibility in various biological systems and following various biological events, including in cases where an absolute footprint cannot be determined.
Project description:Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated.