The next generation of transcription factor binding site prediction.
ABSTRACT: Finding where transcription factors (TFs) bind to DNA is key to deciphering gene regulation at the transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs), which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible-length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible and can model both position interdependence within TFBSs and variable-length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence.
These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
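The independence assumption criticized in the abstract can be made concrete with a small sketch. The code below is not the TFFM implementation (TFFMs are hidden Markov models trained on ChIP-seq data); it is a simplified illustration that contrasts a position-independent PWM-style score with a first-order score in which each nucleotide is conditioned on its predecessor. All function names, the pseudocount value and the toy sites are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def pwm_log_probs(sites, pseudo=0.5):
    """Per-position log-probabilities, each column estimated independently."""
    columns = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total = len(sites) + 4 * pseudo
        columns.append({b: math.log((counts[b] + pseudo) / total) for b in BASES})
    return columns

def first_order_log_probs(sites, pseudo=0.5):
    """Per-position log P(current base | previous base), for positions 1..L-1."""
    transitions = []
    for i in range(1, len(sites[0])):
        pairs = Counter((site[i - 1], site[i]) for site in sites)
        prev = Counter(site[i - 1] for site in sites)
        transitions.append({
            (a, b): math.log((pairs[(a, b)] + pseudo) / (prev[a] + 4 * pseudo))
            for a in BASES for b in BASES
        })
    return transitions

def score_pwm(columns, seq):
    """Sum of independent per-position log-probabilities."""
    return sum(col[b] for col, b in zip(columns, seq))

def score_first_order(columns, transitions, seq):
    """First position scored independently, later ones given the previous base."""
    total = columns[0][seq[0]]
    for i in range(1, len(seq)):
        total += transitions[i - 1][(seq[i - 1], seq[i])]
    return total

# Toy sites (hypothetical) in which positions 0 and 1 co-vary: AC or TG.
sites = ["ACG", "ACG", "TGG", "TGG"]
columns = pwm_log_probs(sites)
transitions = first_order_log_probs(sites)
```

With these toy sites, the independent model cannot distinguish the observed `ACG` from the never-observed mixture `AGG`, whereas the first-order model scores `AGG` lower because `G` rarely follows `A` in the training sites.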
Project description:Accurate prediction of transcription factor binding sites (TFBSs) is a prerequisite for identifying cis-regulatory modules that underlie transcriptional regulatory circuits encoded in the genome. Here, we present a computational framework for detecting TFBSs when multiple position weight matrices (PWMs) for a transcription factor are available. Grouping multiple PWMs of a transcription factor (TF) based on their sequence similarity improves the specificity of TFBS prediction, as evaluated using multiple genome-wide ChIP-Seq data sets from 26 TFs. The Z-scores of the area under the receiver operating characteristic curve (AUC) values for 368 TFs were calculated and used to statistically identify co-occurring regulatory motifs in the TF-bound ChIP loci. Motifs co-occurring with the empirically determined binding of E2F, JUN or MYC were evaluated under basal or stimulated conditions. These results indicate that our method can be used to systematically identify motifs that co-occur with a TF under given conditions.
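The abstract does not spell out how the AUC values were converted to Z-scores, so the following is only a minimal sketch under stated assumptions: the AUC is computed as the fraction of (bound, background) score pairs ranked correctly, and each motif's AUC is standardized against the mean and population standard deviation of all motif AUCs. The function names and the example numbers are hypothetical.

```python
from statistics import mean, pstdev

def auc(pos_scores, neg_scores):
    """AUC as the probability that a bound sequence outscores a background one
    (ties count as half), which equals the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def z_scores(values):
    """Standardize values against their own mean and population std deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Hypothetical AUC values for three candidate co-occurring motifs.
motif_aucs = [0.5, 0.5, 0.8]
motif_z = z_scores(motif_aucs)
```

A motif whose Z-score is high relative to the others would be a candidate co-occurring motif; the significance threshold is left to the analyst.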
Project description:Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
Project description:Cells use thousands of regulatory sequences to recruit transcription factors (TFs) and produce specific transcriptional outcomes. Since TFs bind degenerate DNA sequences, discriminating functional TF binding sites (TFBSs) from background sequences represents a significant challenge. Here, we show that a Drosophila regulatory element that activates Epidermal Growth Factor signaling requires overlapping, low-affinity TFBSs for competing TFs (Pax2 and Senseless) to ensure cell- and segment-specific activity. Testing available TF binding models for Pax2 and Senseless, however, revealed variable accuracy in predicting such low-affinity TFBSs. To better define parameters that increase accuracy, we developed a method that systematically selects subsets of TFBSs based on predicted affinity to generate hundreds of position-weight matrices (PWMs). Counterintuitively, we found that degenerate PWMs produced from datasets depleted of high-affinity sequences were more accurate in identifying both low- and high-affinity TFBSs for the Pax2 and Senseless TFs. Taken together, these findings reveal how TFBS arrangement can be constrained by competition rather than cooperativity and that degenerate models of TF binding preferences can improve identification of biologically relevant low affinity TFBSs.
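The idea of building matrices from affinity-selected subsets of sites can be sketched as follows. The abstract does not describe the actual selection procedure, so this example makes its own assumptions: sites arrive as (sequence, predicted affinity) pairs, a chosen fraction of the highest-affinity sites is removed, and a simple frequency matrix is built from what remains. All names and data are hypothetical.

```python
from collections import Counter

def frequency_matrix(sites):
    """Per-position nucleotide frequencies for equal-length aligned sites."""
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: counts[b] / len(sites) for b in "ACGT"})
    return matrix

def affinity_depleted_matrices(scored_sites, fractions=(0.0, 0.25, 0.5)):
    """For each fraction, drop that share of the highest-affinity sites
    and build a frequency matrix from the remaining sites."""
    ranked = sorted(scored_sites, key=lambda item: item[1], reverse=True)
    matrices = {}
    for fraction in fractions:
        n_dropped = int(len(ranked) * fraction)
        kept = [site for site, _ in ranked[n_dropped:]]
        if kept:  # skip fractions that would leave no sites
            matrices[fraction] = frequency_matrix(kept)
    return matrices

# Hypothetical sites with predicted affinities.
scored = [("AAAA", 1.0), ("AAAA", 0.9), ("AAAT", 0.4), ("AACT", 0.2)]
matrices = affinity_depleted_matrices(scored, fractions=(0.0, 0.5))
```

In this toy case, removing the top half of the sites shifts the last position from an even A/T split to T only, showing how depletion of high-affinity sequences changes the resulting matrix.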
Project description:BACKGROUND: Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the resolutions of assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a PSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release do not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an alignment algorithm tailored to TFBSs. RESULTS: We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes position dependence. Results on 189 TFs of 5 species in the TRANSFAC database showed that our method significantly outperformed ClustalW2 and MEME. We further compared a PSSM method dependent on LASAGNA to an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed that our method is significantly more precise at fixed recall rates. Finally, we described LASAGNA-ChIP, a more sophisticated version for ChIP (Chromatin immunoprecipitation) experiments. Under the one-per-sequence model, it showed comparable performance with MEME in discovering motifs in ChIP-seq peak sequences. CONCLUSIONS: We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search and visualization called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. The webtool is available at http://biogrid.engr.uconn.edu/lasagna_search/.
Project description:Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs because of the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs while significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reductions in false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Project description:Most position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBSs) assume that each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, which usually produces many false-positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependency structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, consistently performs better than existing tools in detecting TFBSs. The improved accuracy results from modelling the complete dependency structure of the motifs and from a better true-positive rate. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
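Choosing a score cutoff by maximizing the Matthews correlation coefficient can be sketched as below. This is a generic illustration, not the TPD implementation: it assumes that each sequence in the positive and negative sets already has a single model score and that a sequence is called bound when its score is at or above the cutoff. The function names and example scores are hypothetical.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_cutoff(pos_scores, neg_scores):
    """Try every observed score as a cutoff and return (cutoff, mcc) with the
    highest MCC; the lowest such cutoff wins on ties."""
    best_value, best_cut = -2.0, None
    for cut in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cut for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cut for s in neg_scores)
        tn = len(neg_scores) - fp
        value = mcc(tp, fp, tn, fn)
        if value > best_value:
            best_value, best_cut = value, cut
    return best_cut, best_value

# Hypothetical scores for bound (positive) and background (negative) sequences.
cutoff, value = best_cutoff([5, 6, 7], [1, 2, 3])
```

For perfectly separated scores such as these, the selected cutoff is the lowest positive score and the MCC reaches 1.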
Project description:Several recent studies have portrayed DNA methylation as a new player in the recruitment of transcription factors (TFs) within chromatin, highlighting a need to connect TF binding sites (TFBSs) with their respective DNA methylation profiles. However, current TFBS databases are restricted to DNA binding motif sequences. Here, we present MethMotif, a two-dimensional TFBS database that records TFBS position weight matrices along with cell type-specific CpG methylation information computed from a combination of ChIP-seq and whole genome bisulfite sequencing datasets. Integrating TFBS motifs with TFBS DNA methylation better portrays the features of DNA loci recognised by TFs. In particular, we found that DNA methylation patterns within TFBSs can be cell specific (e.g. MAFF). Furthermore, for a given TF, different DNA methylation profiles are associated with different DNA binding motifs (e.g. REST). To date, the MethMotif database records over 500 TFBSs computed from over 2000 ChIP-seq datasets in 11 different cell types. The MethMotif portal is accessible through an open source web interface (https://bioinfo-csi.nus.edu.sg/methmotif) that allows users to intuitively explore the entire dataset and perform both single and batch queries.
Project description:Interactions of transcription factors (TFs) with DNA comprise a complex interplay between base-specific amino acid contacts and readout of DNA structure. Recent studies have highlighted the complementarity of DNA sequence and shape in modeling TF binding in vitro. Here, we have provided a comprehensive evaluation of in vivo datasets to assess the predictive power obtained by augmenting various DNA sequence-based models of TF binding sites (TFBSs) with DNA shape features (helix twist, minor groove width, propeller twist, and roll). Results from 400 human ChIP-seq datasets for 76 TFs show that combining DNA shape features with position-specific scoring matrix (PSSM) scores improves TFBS predictions. Improvement has also been observed using TF flexible models and a machine-learning approach using a binary encoding of nucleotides in lieu of PSSMs. Incorporating DNA shape information is most beneficial for E2F and MADS-domain TF families. Our findings indicate that incorporating DNA sequence and shape information benefits the modeling of TF binding under complex in vivo conditions.
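The "binary encoding of nucleotides" mentioned above is commonly realized as a one-hot encoding, and shape features can then be appended to the same feature vector before it is passed to a classifier. The sketch below assumes exactly that and nothing more: the shape values are supplied by the caller (this code does not compute helix twist, minor groove width, propeller twist or roll), and the function names and numbers are hypothetical.

```python
def one_hot(seq):
    """Encode each nucleotide as four binary values in A, C, G, T order.
    Any other character (for example N) becomes four zeros."""
    vector = []
    for base in seq.upper():
        vector.extend(1.0 if base == b else 0.0 for b in "ACGT")
    return vector

def feature_vector(seq, shape_features=()):
    """Concatenate the one-hot sequence encoding with caller-supplied
    DNA shape values for the same site."""
    return one_hot(seq) + [float(x) for x in shape_features]

# Placeholder shape values, for illustration only (not real measurements).
features = feature_vector("ACGT", shape_features=[0.1, 0.2])
```

A 4-bp site therefore yields 16 sequence features plus however many shape values are provided; any downstream classifier is outside the scope of this sketch.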
Project description:Transcription factors (TFs) are sequence-specific DNA binding proteins, fine-tuning spatiotemporal gene expression. Since genomic occupancy of a TF is highly dynamic, it is crucial to study TF binding sites (TFBSs) in a cell-specific context. To date, thousands of ChIP-seq datasets have portrayed the genomic binding landscapes of numerous TFs in different cell types. Although these datasets can be browsed via several platforms, tools that can operate on that data flow are still lacking. Here, we introduce TFregulomeR (https://github.com/benoukraflab/TFregulomeR), an R-library linked to an up-to-date compendium of cistrome and methylome datasets, implemented with functionalities that facilitate integrative analyses. In particular, TFregulomeR enables the characterization of TF binding partners and cell-specific TFBSs, along with the study of TF's functions in the context of different partnerships and DNA methylation levels. We demonstrated that TFs' target gene ontologies can differ notably depending on their partners and, by re-analyzing well characterized TFs, we brought to light that numerous leucine zipper TFBSs derived from ChIP-seq experiments documented in current databases were inadequately characterized, due to the fact that their position weight matrices were assembled using a mixture of homodimer and heterodimer binding sites. Altogether, analyses of context-specific transcription regulation with TFregulomeR foster our understanding of regulatory network-dependent TF functions.
Project description:BACKGROUND: Binding of transcription factors to transcription factor binding sites (TFBSs) is key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited, and consequently there is a need for accurate prediction of TFBSs for gene annotation and in applications such as evaluating the effects of single nucleotide variations in causing disease. TFBSs are generally recognized by scanning a position weight matrix (PWM) against DNA using one of a number of available computer programs. Thus we set out to evaluate the best tools that can be used locally (and are therefore suitable for large-scale analyses) for creating PWMs from high-throughput ChIP-Seq data and for scanning them against DNA. RESULTS: We evaluated a set of de novo motif discovery tools that could be downloaded and installed locally using ENCODE ChIP-Seq data and showed that rGADEM was the best-performing tool. TFBS prediction tools used to scan PWMs against DNA fall into two classes: those that predict individual TFBSs and those that identify clusters. Our evaluation showed that FIMO and MCAST performed best in these two classes, respectively. CONCLUSIONS: Selection of the best-performing tools for generating PWMs from ChIP-Seq data and for scanning PWMs against DNA has the potential to improve prediction of precise transcription factor binding sites within regions identified by ChIP-Seq experiments for gene finding, understanding regulation and evaluating the effects of single nucleotide variations in causing disease.
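Scanning a PWM against DNA, as described above, amounts to sliding a window along the sequence and scoring each window on both strands. The sketch below is a generic illustration and does not reproduce FIMO, MCAST or any other tool named in the abstract. It assumes an uppercase A/C/G/T sequence, a PWM given as a list of per-position score dictionaries, and a user-chosen threshold; the toy matrix is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def scan(pwm, sequence, threshold):
    """Return (start, strand, score) for every window whose score on the
    forward (+) or reverse (-) strand is at or above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(col[base] for col, base in zip(pwm, site))
            if score >= threshold:
                hits.append((start, strand, score))
    return hits

def toy_column(preferred):
    """Hypothetical log-odds column: +1 for the preferred base, -1 otherwise."""
    return {b: (1.0 if b == preferred else -1.0) for b in "ACGT"}

pwm = [toy_column(b) for b in "ACG"]
hits = scan(pwm, "TTACGTT", threshold=3.0)
```

Here the motif `ACG` is found on the forward strand at position 2 and, because `CGT` is its reverse complement, on the reverse strand at position 3. Lowering the threshold admits weaker matches, which is where the false positives discussed in several of these abstracts arise.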