The value of position-specific priors in motif discovery using MEME.
ABSTRACT: Position-specific priors have been shown to be a flexible and elegant way to extend the power of Gibbs sampler-based motif discovery algorithms. Information of many types-including sequence conservation, nucleosome positioning, and negative examples-can be converted into a prior over the location of motif sites, which then guides the sequence motif discovery algorithm. This approach has been shown to confer many of the benefits of conservation-based and discriminative motif discovery approaches on Gibbs sampler-based motif discovery methods, but has not previously been studied with methods based on expectation maximization (EM).We extend the popular EM-based MEME algorithm to utilize position-specific priors and demonstrate their effectiveness for discovering transcription factor (TF) motifs in yeast and mouse DNA sequences. Utilizing a discriminative, conservation-based prior dramatically improves MEME's ability to discover motifs in 156 yeast TF ChIP-chip datasets, more than doubling the number of datasets where it finds the correct motif. On these datasets, MEME using the prior has a higher success rate than eight other conservation-based motif discovery approaches. We also show that the same type of prior improves the accuracy of motifs discovered by MEME in mouse TF ChIP-seq data, and that the motifs tend to be of slightly higher quality those found by a Gibbs sampling algorithm using the same prior.We conclude that using position-specific priors can substantially increase the power of EM-based motif discovery algorithms such as MEME algorithm.
Project description:Position-specific priors (PSP) have been used with success to boost EM and Gibbs sampler-based motif discovery algorithms. PSP information has been computed from different sources, including orthologous conservation, DNA duplex stability, and nucleosome positioning. The use of prior information has not yet been used in the context of combinatorial algorithms. Moreover, priors have been used only independently, and the gain of combining priors from different sources has not yet been studied.We extend RISOTTO, a combinatorial algorithm for motif discovery, by post-processing its output with a greedy procedure that uses prior information. PSP's from different sources are combined into a scoring criterion that guides the greedy search procedure. The resulting method, called GRISOTTO, was evaluated over 156 yeast TF ChIP-chip sequence-sets commonly used to benchmark prior-based motif discovery algorithms. Results show that GRISOTTO is at least as accurate as other twelve state-of-the-art approaches for the same task, even without combining priors. Furthermore, by considering combined priors, GRISOTTO is considerably more accurate than the state-of-the-art approaches for the same task. We also show that PSP's improve GRISOTTO ability to retrieve motifs from mouse ChiP-seq data, indicating that the proposed algorithm can be applied to data from a different technology and for a higher eukaryote.The conclusions of this work are twofold. First, post-processing the output of combinatorial algorithms by incorporating prior information leads to a very efficient and effective motif discovery method. Second, combining priors from different sources is even more beneficial than considering them separately.
Project description:Advances in high-throughput sequencing have resulted in rapid growth in large, high-quality datasets including those arising from transcription factor (TF) ChIP-seq experiments. While there are many existing tools for discovering TF binding site motifs in such datasets, most web-based tools cannot directly process such large datasets.The MEME-ChIP web service is designed to analyze ChIP-seq 'peak regions'--short genomic regions surrounding declared ChIP-seq 'peaks'. Given a set of genomic regions, it performs (i) ab initio motif discovery, (ii) motif enrichment analysis, (iii) motif visualization, (iv) binding affinity analysis and (v) motif identification. It runs two complementary motif discovery algorithms on the input data--MEME and DREME--and uses the motifs they discover in subsequent visualization, binding affinity and identification steps. MEME-ChIP also performs motif enrichment analysis using the AME algorithm, which can detect very low levels of enrichment of binding sites for TFs with known DNA-binding motifs. Importantly, unlike with the MEME web service, there is no restriction on the size or number of uploaded sequences, allowing very large ChIP-seq datasets to be analyzed. The analyses performed by MEME-ChIP provide the user with a varied view of the binding and regulatory activity of the ChIP-ed TF, as well as the possible involvement of other DNA-binding TFs.MEME-ChIP is available as part of the MEME Suite at http://meme.nbcr.net.
Project description:As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
Project description:Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated "ChIP" sequences with embedded known P53 binding sites. The major advantage of GADEM is its computational efficiency on large ChIP datasets compared to competitors. We applied GADEM to six genome-wide ChIP datasets. Approximately, 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6-40 bp) and characteristics in these datasets containing from 0.5 to >13 million nucleotides with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at (www.niehs.nih.gov/research/resources/software/GADEM/).
Project description:Regulatory elements are responsible for regulating gene transcription. Therefore, identification of these elements is a tremendous challenge in the field of gene expression. Transcription factors (TFs) play a key role in gene regulation by binding to target promoter sequences. A set of conserved sequence patterns with a highly similar structure that is bound by a TF is called a motif. Motif discovery has been a difficult problem over the past decades. Meanwhile, it is a foundation stone in meeting this challenge. Recent advances in obtaining genomic sequences and high-throughput gene expression analysis techniques have enabled the rapid development of computational methods for motif discovery. As a result, a large number of motif-finding algorithms aiming at various motif models have sprung up in the past few years. However, most of them are not suitable for analysis of the large data sets generated by next-generation sequencing. To better handle large-scale ChIP-Seq data and achieve better performance in computational time and motif detection accuracy, we propose an excellent motif-finding algorithm known as GSMC (Combining Parallel Gibbs Sampling with Maximal Cliques for hunting DNA Motif). The GSMC algorithm consists of two steps. First, we employ the commonly used Gibbs sampling to generating initial motifs. Second, we utilize maximal cliques to cluster motifs according to Similarity with Position Information Contents (SPIC). Consequently, we raise the detection accuracy in a great degree, in the meantime holding comparative computation efficiency. In addition, we can find much more credible cofactor interacting motifs.
Project description:MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.
Project description:MOTIVATION:Increasing evidence has shown that nucleotide modifications such as methylation and hydroxymethylation on cytosine would greatly impact the binding of transcription factors (TFs). However, there is a lack of motif finding algorithms with the function to search for motifs with modified bases. In this study, we expand on our previous motif finding pipeline Epigram to provide systematic de novo motif discovery and performance evaluation on methylated DNA motifs. RESULTS:mEpigram outperforms both MEME and DREME on finding modified motifs in simulated data that mimics various motif enrichment scenarios. Furthermore we were able to identify methylated motifs in Arabidopsis DNA affinity purification sequencing (DAP-seq) data that were previously demonstrated to contain such motifs. When applied to TF ChIP-seq and DNA methylome data in H1 and GM12878, our method successfully identified novel methylated motifs that can be recognized by the TFs or their co-factors. We also observed spacing constraint between the canonical motif of the TF and the newly discovered methylated motifs, which suggests operative recognition of these cis-elements by collaborative proteins. AVAILABILITY AND IMPLEMENTATION:The mEpigram program is available at http://wanglab.ucsd.edu/star/mEpigram. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:BACKGROUND:Previous studies demonstrate the usefulness of using multiple tools and methods for improving the accuracy of motif detection. Over the past years, numerous motif discovery pipelines have been developed. However, they typically report only the top ranked results either from individual motif finders or from a combination of multiple tools and algorithms. RESULTS:Here we present MODSIDE, a motif discovery pipeline and similarity detector. The pipeline integrated four de novo motif finders: ChIPMunk, MEME, Weeder, and XXmotif. It also incorporated a motif similarity detection tool MOTIFSIM. MODSIDE was designed for delivering not only the predictive results from individual motif finders but also the comparison results for multiple tools. The results include the common significant motifs from multiple tools, the motifs detected by some tools but not by others, and the best matches for each motif in the motif collection of multiple tools. MODSIDE also possesses other useful features for merging similar motifs and clustering motifs into motif trees. CONCLUSIONS:We evaluated MODSIDE and its adopted motif finders on 16 benchmark datasets. The statistical results demonstrate MODSIDE achieves better accuracy than individual motif finders. We also compared MODSIDE with two popular motif discovery pipelines: MEME-ChIP and RSAT peak-motifs. The comparison results reveal MODSIDE attains similar performance as RSAT peak-motifs but better accuracy than MEME-ChIP. In addition, MODSIDE is able to deliver various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, and other existing motif discovery pipelines.
Project description:Finding functional DNA binding sites of transcription factors (TFs) throughout the genome is a crucial step in understanding transcriptional regulation. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known TF motifs occur in the genome than are actually functional. However, information about chromatin structure may help to identify the functional sites. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling TFs to bind DNA in those regions. Here, we describe a novel motif discovery algorithm that employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy. When a Gibbs sampling algorithm is applied to yeast sequence-sets identified by ChIP-chip, the correct motif is found in 52% more cases with our informative prior than with the commonly used uniform prior. This is the first demonstration that nucleosome occupancy information can be used to improve motif discovery. The improvement is dramatic, even though we are using only a statistical model to predict nucleosome occupancy; we expect our results to improve further as high-resolution genome-wide experimental nucleosome occupancy data becomes increasingly available.
Project description:Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P ? ?= ?1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.