MethMotif: an integrative cell specific database of transcription factor binding motifs coupled with DNA methylation profiles.
ABSTRACT: Several recent studies have portrayed DNA methylation as a new player in the recruitment of transcription factors (TF) within chromatin, highlighting a need to connect TF binding sites (TFBS) with their respective DNA methylation profiles. However, current TFBS databases are restricted to DNA binding motif sequences. Here, we present MethMotif, a two-dimensional TFBS database that records TFBS position weight matrices along with cell type specific CpG methylation information computed from a combination of ChIP-seq and whole genome bisulfite sequencing datasets. Integrating TFBS motifs with TFBS DNA methylation better portrays the features of DNA loci recognised by TFs. In particular, we found that DNA methylation patterns within TFBS can be cell specific (e.g. MAFF). Furthermore, for a given TF, different DNA methylation profiles are associated with different DNA binding motifs (e.g. REST). To date, MethMotif database records over 500 TFBSs computed from over 2000 ChIP-seq datasets in 11 different cell types. MethMotif portal is accessible through an open source web interface (https://bioinfo-csi.nus.edu.sg/methmotif) that allows users to intuitively explore the entire dataset and perform both single, and batch queries.
Project description:Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that convey properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
Project description:Chromatin immunoprecipitation (ChIP) coupled to high-throughput sequencing (ChIP-Seq) techniques can reveal DNA regions bound by transcription factors (TF). Analysis of the ChIP-Seq regions is now a central component in gene regulation studies. The need remains strong for methods to improve the interpretation of ChIP-Seq data and the study of specific TF binding sites (TFBS).We introduce a set of methods to improve the interpretation of ChIP-Seq data, including the inference of mediating TFs based on TFBS motif over-representation analysis and the subsequent study of spatial distribution of TFBSs. TFBS over-representation analysis applied to ChIP-Seq data is used to detect which TFBSs arise more frequently than expected by chance. Visualization of over-representation analysis results with new composition-bias plots reveals systematic bias in over-representation scores. We introduce the BiasAway background generating software to resolve the problem. A heuristic procedure based on topological motif enrichment relative to the ChIP-Seq peaks' local maximums highlights peaks likely to be directly bound by a TF of interest. The results suggest that on average two-thirds of a ChIP-Seq dataset's peaks are bound by the ChIP'd TF; the origin of the remaining peaks remaining undetermined. Additional visualization methods allow for the study of both inter-TFBS spatial relationships and motif-flanking sequence properties, as demonstrated in case studies for TBP and ZNF143/THAP11.Topological properties of TFBS within ChIP-Seq datasets can be harnessed to better interpret regulatory sequences. Using GC content corrected TFBS over-representation analysis, combined with visualization techniques and analysis of the topological distribution of TFBS, we can distinguish peaks likely to be directly bound by a TF. The new methods will empower researchers for exploration of gene regulation and TF binding.
Project description:Most of the position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBS) assume each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, usually producing high false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependent structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between TF and its binding site. The whole tree-structured PWM could be considered as a mixture of different conditional-PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from the ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthew Correlation Coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach on several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method starting from existing PWM has consistently better performance than existing tools in detecting the TFBSs. The improved accuracy is the result of modelling the complete dependent structure of the motifs and better prediction of true positive rate. The findings could lead to better understanding of the mechanisms of TF-DNA interactions.
Project description:BACKGROUND: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with application of Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from a bias in the information about binding loci availability, sample incompleteness and diverse sources of technical and biological noises. Therefore adequate mathematical models of ChIP-based high-throughput assay(s) and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of TFBS avidity distribution and a plausible estimation the total number of specific TFBS for a given TF in the genome for a given cell type. RESULTS: We developed an exploratory mixture probabilistic model for a specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChiP-seq data sets, the statistics of specific and non-specific DNA-protein binding is defined by a mixture of sample size-dependent skewed functions described by Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and exponential function, respectively. Using available Chip-seq data for eleven TFs, essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, sox2, KLf4, STAT3, E2F1, Tcfcp211, ZFX, n-Myc, c-Myc and Essrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChiP-seq binding assays and (ii) the number of specific but not identified in the current experiments binding sites (BSs) in the genome of mouse embryonic stem cells. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes. CONCLUSION: We provide a novel methodology of estimating the specificity and the sensitivity of TF-DNA binding in massively paralleled ChIP sequencing (ChIP-seq) binding assay. Goodness-of fit analysis of K-W functions suggests that a large fraction of low- and moderate- avidity TFBSs cannot be identified by the ChIP-based methods. Thus the task to identify the binding sensitivity of a TF cannot be technically resolved yet by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIp-seq data should be carefully revised.
Project description:Transcription of mitochondrial DNA (mtDNA)-encoded genes is thought to be regulated by a handful of dedicated transcription factors (TFs), suggesting that mtDNA genes are separately regulated from the nucleus. However, several TFs, with known nuclear activities, were found to bind mtDNA and regulate mitochondrial transcription. Additionally, mtDNA transcriptional regulatory elements, which were proved important in vitro, were harbored by a deletion that normally segregated among healthy individuals. Hence, mtDNA transcriptional regulation is more complex than once thought. Here, by analyzing ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) data, we identified strong binding sites of three bona fide nuclear TFs (c-Jun, Jun-D, and CEBPb) within human mtDNA protein-coding genes. We validated the binding of two TFs by ChIP-quantitative polymerase chain reaction (c-Jun and Jun-D) and showed their mitochondrial localization by electron microscopy and subcellular fractionation. As a step toward investigating the functionality of these TF-binding sites (TFBS), we assessed signatures of selection. By analyzing 9,868 human mtDNA sequences encompassing all major global populations, we recorded genetic variants in tips and nodes of mtDNA phylogeny within the TFBS. We next calculated the effects of variants on binding motif prediction scores. Finally, the mtDNA variation pattern in predicted TFBS, occurring within ChIP-seq negative-binding sites, was compared with ChIP-seq positive-TFBS (CPR). Motifs within CPRs of c-Jun, Jun-D, and CEBPb harbored either only tip variants or their nodal variants retained high motif prediction scores. This reflects negative selection within mtDNA CPRs, thus supporting their functionality. Hence, human mtDNA-coding sequences may have dual roles, namely coding for genes yet possibly also possessing regulatory potential.
Project description:Motivation:Transcription factor (TF) binds to the promoter region of a gene to control gene expression. Identifying precise TF binding sites (TFBSs) is essential for understanding the detailed mechanisms of TF-mediated gene regulation. However, there is a shortage of computational approach that can deliver single base pair resolution prediction of TFBS. Results:In this paper, we propose DeepSNR, a Deep Learning algorithm for predicting TF binding location at Single Nucleotide Resolution de novo from DNA sequence. DeepSNR adopts a novel deconvolutional network (deconvNet) model and is inspired by the similarity to image segmentation by deconvNet. The proposed deconvNet architecture is constructed on top of 'DeepBind' and we trained the entire model using TF-specific data from ChIP-exonuclease (ChIP-exo) experiments. DeepSNR has been shown to outperform motif search-based methods for several evaluation metrics. We have also demonstrated the usefulness of DeepSNR in the regulatory analysis of TFBS as well as in improving the TFBS prediction specificity using ChIP-seq data. Availability and implementation:DeepSNR is available open source in the GitHub repository (https://github.com/sirajulsalekin/DeepSNR). Supplementary information:Supplementary data are available at Bioinformatics online.
Project description:Accurate prediction of transcription factor binding sites (TFBSs) is a prerequisite for identifying cis-regulatory modules that underlie transcriptional regulatory circuits encoded in the genome. Here, we present a computational framework for detecting TFBSs, when multiple position weight matrices (PWMs) for a transcription factor are available. Grouping multiple PWMs of a transcription factor (TF) based on their sequence similarity improves the specificity of TFBS prediction, which was evaluated using multiple genome-wide ChIP-Seq data sets from 26 TFs. The Z-scores of the area under a receiver operating characteristic curve (AUC) values of 368 TFs were calculated and used to statistically identify co-occurring regulatory motifs in the TF bound ChIP loci. Motifs that are co-occurring along with the empirical bindings of E2F, JUN or MYC have been evaluated, in the basal or stimulated condition. Results prove our method can be useful to systematically identify the co-occurring motifs of the TF for the given conditions.
Project description:Despite explosive growth in genomic datasets, the methods for studying epigenomic mechanisms of gene regulation remain primitive. Here we present a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. Based on the first principles of statistical mechanics, this model considers the interactions between epigenomic modifications and a cis-regulatory module, which contains multiple binding sites arranged in any configurations. We compiled a comprehensive epigenomic dataset in mouse embryonic stem (mES) cells, including DNA methylation (MeDIP-seq and MRE-seq), DNA hydroxymethylation (5-hmC-seq), and histone modifications (ChIP-seq). We discovered correlations of transcription factors (TFs) for specific combinations of epigenomic modifications, which we term epigenomic motifs. Epigenomic motifs explained why some TFs appeared to have different DNA binding motifs derived from in vivo (ChIP-seq) and in vitro experiments. Theoretical analyses suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. ChIP-seq data suggested that epigenomic boost of binding affinities in weak TF binding sites can function in mES cells. We showed in theory that the epigenome should suppress the TF binding differences on SNP-containing binding sites in two people. Using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of personal differences in NF?B binding in SNP-containing binding sites, which may explain why some SNPs introduce much smaller personal variations on TF binding than other SNPs. In summary, this model presents a powerful approach to analyze the functions of epigenomic modifications. This model was implemented into an open source program APEG (Affinity Prediction by Epigenome and Genome, http://systemsbio.ucsd.edu/apeg).
Project description:BACKGROUND: ChIP-Seq is widely used to detect genomic segments bound by transcription factors (TF), either directly at DNA binding sites (BSs) or indirectly via other proteins. Currently, there are many software tools implementing different approaches to identify TFBSs within ChIP-Seq peaks. However, their use for the interpretation of ChIP-Seq data is usually complicated by the absence of direct experimental verification, making it difficult both to set a threshold to avoid recognition of too many false-positive BSs, and to compare the actual performance of different models. RESULTS: Using ChIP-Seq data for FoxA2 binding loci in mouse adult liver and human HepG2 cells we compared FoxA binding-site predictions for four computational models of two fundamental classes: pattern matching based on existing training set of experimentally confirmed TFBSs (oPWM and SiteGA) and de novo motif discovery (ChIPMunk and diChIPMunk). To properly select prediction thresholds for the models, we experimentally evaluated affinity of 64 predicted FoxA BSs using EMSA that allows safely distinguishing sequences able to bind TF. As a result we identified thousands of reliable FoxA BSs within ChIP-Seq loci from mouse liver and human HepG2 cells. It was found that the performance of conventional position weight matrix (PWM) models was inferior with the highest false positive rate. On the contrary, the best recognition efficiency was achieved by the combination of SiteGA & diChIPMunk/ChIPMunk models, properly identifying FoxA BSs in up to 90% of loci for both mouse and human ChIP-Seq datasets. CONCLUSIONS: The experimental study of TF binding to oligonucleotides corresponding to predicted sites increases the reliability of computational methods for TFBS-recognition in ChIP-Seq data analysis. Regarding ChIP-Seq data interpretation, basic PWMs have inferior TFBS recognition quality compared to the more sophisticated SiteGA and de novo motif discovery methods. A combination of models from different principles allowed identification of proper TFBSs.
Project description:BACKGROUND: Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the resolutions of assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a PSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release do not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an alignment algorithm tailored to TFBSs. RESULTS: We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes position dependence. Results on 189 TFs of 5 species in the TRANSFAC database showed that our method significantly outperformed ClustalW2 and MEME. We further compared a PSSM method dependent on LASAGNA to an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed that our method is significantly more precise at fixed recall rates. Finally, we described LASAGNA-ChIP, a more sophisticated version for ChIP (Chromatin immunoprecipitation) experiments. Under the one-per-sequence model, it showed comparable performance with MEME in discovering motifs in ChIP-seq peak sequences. CONCLUSIONS: We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search and visualization called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. The webtool is available at http://biogrid.engr.uconn.edu/lasagna_search/.