Identification of co-occurring transcription factor binding sites from DNA sequence using clustered position weight matrices.
ABSTRACT: Accurate prediction of transcription factor binding sites (TFBSs) is a prerequisite for identifying cis-regulatory modules that underlie transcriptional regulatory circuits encoded in the genome. Here, we present a computational framework for detecting TFBSs, when multiple position weight matrices (PWMs) for a transcription factor are available. Grouping multiple PWMs of a transcription factor (TF) based on their sequence similarity improves the specificity of TFBS prediction, which was evaluated using multiple genome-wide ChIP-Seq data sets from 26 TFs. The Z-scores of the area under a receiver operating characteristic curve (AUC) values of 368 TFs were calculated and used to statistically identify co-occurring regulatory motifs in the TF bound ChIP loci. Motifs that are co-occurring along with the empirical bindings of E2F, JUN or MYC have been evaluated, in the basal or stimulated condition. Results prove our method can be useful to systematically identify the co-occurring motifs of the TF for the given conditions.
Project description:Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that convey properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
Project description:Cells use thousands of regulatory sequences to recruit transcription factors (TFs) and produce specific transcriptional outcomes. Since TFs bind degenerate DNA sequences, discriminating functional TF binding sites (TFBSs) from background sequences represents a significant challenge. Here, we show that a Drosophila regulatory element that activates Epidermal Growth Factor signaling requires overlapping, low-affinity TFBSs for competing TFs (Pax2 and Senseless) to ensure cell- and segment-specific activity. Testing available TF binding models for Pax2 and Senseless, however, revealed variable accuracy in predicting such low-affinity TFBSs. To better define parameters that increase accuracy, we developed a method that systematically selects subsets of TFBSs based on predicted affinity to generate hundreds of position-weight matrices (PWMs). Counterintuitively, we found that degenerate PWMs produced from datasets depleted of high-affinity sequences were more accurate in identifying both low- and high-affinity TFBSs for the Pax2 and Senseless TFs. Taken together, these findings reveal how TFBS arrangement can be constrained by competition rather than cooperativity and that degenerate models of TF binding preferences can improve identification of biologically relevant low affinity TFBSs.
Project description:Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
Project description:Human endogenous retroviruses (HERVs) and other long terminal repeat (LTR)-type retrotransposons (HERV/LTRs) have regulatory elements that possibly influence the transcription of host genes. We systematically identified and characterized these regulatory elements based on publicly available datasets of ChIP-Seq of 97 transcription factors (TFs) provided by ENCODE and Roadmap Epigenomics projects. We determined transcription factor-binding sites (TFBSs) using the ChIP-Seq datasets and identified TFBSs observed on HERV/LTR sequences (HERV-TFBSs). Overall, 794,972 HERV-TFBSs were identified. Subsequently, we identified "HERV/LTR-shared regulatory element (HSRE)," defined as a TF-binding motif in HERV-TFBSs, shared within a substantial fraction of a HERV/LTR type. HSREs could be an indication that the regulatory elements of HERV/LTRs are present before their insertions. We identified 2,201 HSREs, comprising specific associations of 354 HERV/LTRs and 84 TFs. Clustering analysis showed that HERV/LTRs can be grouped according to the TF binding patterns; HERV/LTR groups bounded to pluripotent TFs (e.g., SOX2, POU5F1, and NANOG), embryonic endoderm/mesendoderm TFs (e.g., GATA4/6, SOX17, and FOXA1/2), hematopoietic TFs (e.g., SPI1 (PU1), GATA1/2, and TAL1), and CTCF were identified. Regulatory elements of HERV/LTRs tended to locate nearby and/or interact three-dimensionally with the genes involved in immune responses, indicating that the regulatory elements play an important role in controlling the immune regulatory network. Further, we demonstrated subgroup-specific TF binding within LTR7, LTR5B, and LTR5_Hs, indicating that gains or losses of the regulatory elements occurred during genomic invasions of the HERV/LTRs. Finally, we constructed dbHERV-REs, an interactive database of HERV/LTR regulatory elements (http://herv-tfbs.com/). This study provides fundamental information in understanding the impact of HERV/LTRs on host transcription, and offers insights into the transcriptional modulation systems of HERV/LTRs and ancestral HERVs.
Project description:In multi-cellular organisms, spatiotemporal activity of cis-regulatory DNA elements depends on their occupancy by different transcription factors (TFs). In recent years, genome-wide ChIP-on-Chip, ChIP-Seq and DamID assays have been extensively used to unravel the combinatorial interaction of TFs with cis-regulatory modules (CRMs) in the genome. Even though genome-wide binding profiles are increasingly becoming available for different TFs, single TF binding profiles are in most cases not sufficient for dissecting complex regulatory networks. Thus, potent computational tools detecting statistically significant and biologically relevant TF-motif co-occurrences in genome-wide datasets are essential for analyzing context-dependent transcriptional regulation. We have developed COPS (Co-Occurrence Pattern Search), a new bioinformatics tool based on a combination of association rules and Markov chain models, which detects co-occurring TF binding sites (BSs) on genomic regions of interest. COPS scans DNA sequences for frequent motif patterns using a Frequent-Pattern tree based data mining approach, which allows efficient performance of the software with respect to both data structure and implementation speed, in particular when mining large datasets. Since transcriptional gene regulation very often relies on the formation of regulatory protein complexes mediated by closely adjoining TF binding sites on CRMs, COPS additionally detects preferred short distance between co-occurring TF motifs. The performance of our software with respect to biological significance was evaluated using three published datasets containing genomic regions that are independently bound by several TFs involved in a defined biological process. In sum, COPS is a fast, efficient and user-friendly tool mining statistically and biologically significant TFBS co-occurrences and therefore allows the identification of TFs that combinatorially regulate gene expression.
Project description:Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to a large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs, while at the same time significantly improves prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in a stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Project description:Interpreting the functional impact of noncoding variants is an ongoing challenge in the field of genome analysis. With most noncoding variants associated with complex traits and disease residing in regulatory regions, altered transcription factor (TF) binding has been proposed as a mechanism of action. It is therefore imperative to develop methods that predict the impact of noncoding variants at TF binding sites (TFBSs). Here, we describe the update of our MANTA database that stores: 1) TFBS predictions in the human genome, and 2) the potential impact on TF binding for all possible single nucleotide variants (SNVs) at these TFBSs. TFBSs were predicted by combining experimental ChIP-seq data from ReMap and computational position weight matrices (PWMs) derived from JASPAR. Impact of SNVs at these TFBSs was assessed by means of PWM scores computed on the alternate alleles. The updated database, MANTA2, provides the scientific community with a critical map of TFBSs and SNV impact scores to improve the interpretation of noncoding variants in the human genome.
Project description:The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse in vivo ChIPseq data, we show that in most cases the PWM model fails to reproduce the observed statistics of TFBSs. To overcome this issue, we introduce the pairwise interaction model (PIM), a generalization of the PWM model. The model is based on the principle of maximum entropy and explicitly describes pairwise correlations between nucleotides at different positions, while being otherwise as unconstrained as possible. It is mathematically equivalent to considering a TF-DNA binding energy that depends additively on each nucleotide identity at all positions in the TFBS, like the PWM model, but also additively on pairs of nucleotides. We find that the PIM significantly improves over the PWM model, and even provides an optimal description of TFBS statistics within statistical noise. The PIM generalizes previous approaches to interdependent positions: it accounts for co-variation of two or more base pairs, and predicts secondary motifs, while outperforming multiple-motif models consisting of mixtures of PWMs. We analyse the structure of pairwise interactions between nucleotides, and find that they are sparse and dominantly located between consecutive base pairs in the flanking region of TFBS. Nonetheless, interactions between pairs of non-consecutive nucleotides are found to play a significant role in the obtained accurate description of TFBS statistics. The PIM is computationally tractable, and provides a general framework that should be useful for describing and predicting TFBSs beyond PWMs.
Project description:The adaptation of microorganisms to their environment is controlled by complex transcriptional regulatory networks (TRNs), which are still only partially understood even for model species. Genome scale annotation of regulatory features of genes and TRN reconstruction are challenging tasks of microbial genomics. We used the knowledge-driven comparative-genomics approach implemented in the RegPredict Web server to infer TRN in the model Gram-positive bacterium Bacillus subtilis and 10 related Bacillales species. For transcription factor (TF) regulons, we combined the available information from the DBTBS database and the literature with bioinformatics tools, allowing inference of TF binding sites (TFBSs), comparative analysis of the genomic context of predicted TFBSs, functional assignment of target genes, and effector prediction. For RNA regulons, we used known RNA regulatory motifs collected in the Rfam database to scan genomes and analyze the genomic context of new RNA sites. The inferred TRN in B. subtilis comprises regulons for 129 TFs and 24 regulatory RNA families. First, we analyzed 66 TF regulons with previously known TFBSs in B. subtilis and projected them to other Bacillales genomes, resulting in refinement of TFBS motifs and identification of novel regulon members. Second, we inferred motifs and described regulons for 28 experimentally studied TFs with previously unknown TFBSs. Third, we discovered novel motifs and reconstructed regulons for 36 previously uncharacterized TFs. The inferred collection of regulons is available in the RegPrecise database (http://regprecise.lbl.gov/) and can be used in genetic experiments, metabolic modeling, and evolutionary analysis.
Project description:Transcription factors (TFs) are sequence-specific DNA binding proteins, fine-tuning spatiotemporal gene expression. Since genomic occupancy of a TF is highly dynamic, it is crucial to study TF binding sites (TFBSs) in a cell-specific context. To date, thousands of ChIP-seq datasets have portrayed the genomic binding landscapes of numerous TFs in different cell types. Although these datasets can be browsed via several platforms, tools that can operate on that data flow are still lacking. Here, we introduce TFregulomeR (https://github.com/benoukraflab/TFregulomeR), an R-library linked to an up-to-date compendium of cistrome and methylome datasets, implemented with functionalities that facilitate integrative analyses. In particular, TFregulomeR enables the characterization of TF binding partners and cell-specific TFBSs, along with the study of TF's functions in the context of different partnerships and DNA methylation levels. We demonstrated that TFs' target gene ontologies can differ notably depending on their partners and, by re-analyzing well characterized TFs, we brought to light that numerous leucine zipper TFBSs derived from ChIP-seq experiments documented in current databases were inadequately characterized, due to the fact that their position weight matrices were assembled using a mixture of homodimer and heterodimer binding sites. Altogether, analyses of context-specific transcription regulation with TFregulomeR foster our understanding of regulatory network-dependent TF functions.