Inferring functional transcription factor-gene binding pairs by integrating transcription factor binding data with transcription factor knockout data.
ABSTRACT: BACKGROUND: Chromatin immunoprecipitation (ChIP) experiments are now the most comprehensive experimental approaches for mapping the binding of transcription factors (TFs) to their target genes. However, ChIP data alone is insufficient for identifying functional binding target genes of TFs for two reasons. First, there is an inherent high false positive/negative rate in ChIP-chip or ChIP-seq experiments. Second, binding signals in the ChIP data do not necessarily imply functionality. METHODS: It is known that ChIP-chip data and TF knockout (TFKO) data reveal complementary information on gene regulation. While ChIP-chip data can provide TF-gene binding pairs, TFKO data can provide TF-gene regulation pairs. Therefore, we propose a novel network approach for identifying functional TF-gene binding pairs by integrating the ChIP-chip data with the TFKO data. In our method, a TF-gene binding pair from the ChIP-chip data is regarded to be functional if it also has high confident curated TFKO TF-gene regulatory relation or deduced hypostatic TF-gene regulatory relation. RESULTS AND CONCLUSIONS: We first validated our method on a gathered ground truth set. Then we applied our method to the ChIP-chip data to identify functional TF-gene binding pairs. The biological significance of our identified functional TF-gene binding pairs was shown by assessing their functional enrichment, the prevalence of protein-protein interaction, and expression coherence. Our results outperformed the results of three existing methods across all measures. And our identified functional targets of TFs also showed statistical significance over the randomly assigned TF-gene pairs. We also showed that our method is dataset independent and can apply to ChIP-seq data and the E. coli genome. Finally, we provided an example showing the biological applicability of our notion.
Project description:Transcriptional regulation of gene expression in eukaryotes is usually accomplished by cooperative transcription factors (TFs). Computational identification of cooperative TF pairs has become a hot research topic and many algorithms have been proposed in the literature. A typical algorithm for predicting cooperative TF pairs has two steps. (Step 1) Define the targets of each TF under study. (Step 2) Design a measure for calculating the cooperativity of a TF pair based on the targets of these two TFs. While different algorithms have distinct sophisticated cooperativity measures, the targets of a TF are usually defined using ChIP-chip data. However, there is an inherent weakness in using ChIP-chip data to define the targets of a TF. ChIP-chip analysis can only identify the binding targets of a TF but it cannot distinguish the true regulatory from the binding but non-regulatory targets of a TF.This work is the first study which aims to investigate whether the performance of computational identification of cooperative TF pairs could be improved by using a more biologically relevant way to define the targets of a TF. For this purpose, we propose four simple algorithms, all of which consist of two steps. (Step 1) Define the targets of a TF using (i) ChIP-chip data in the first algorithm, (ii) TF binding data in the second algorithm, (iii) TF perturbation data in the third algorithm, and (iv) the intersection of TF binding and TF perturbation data in the fourth algorithm. Compared with the first three algorithms, the fourth algorithm uses a more biologically relevant way to define the targets of a TF. (Step 2) Measure the cooperativity of a TF pair by the statistical significance of the overlap of the targets of these two TFs using the hypergeometric test. By adopting four existing performance indices, we show that the fourth proposed algorithm (PA4) significantly out performs the other three proposed algorithms. This suggests that the computational identification of cooperative TF pairs is indeed improved when using a more biologically relevant way to define the targets of a TF. Strikingly, the prediction results of our simple PA4 are more biologically meaningful than those of the 12 existing sophisticated algorithms in the literature, all of which used ChIP-chip data to define the targets of a TF. This suggests that properly defining the targets of a TF may be more important than designing sophisticated cooperativity measures. In addition, our PA4 has the power to predict several experimentally validated cooperative TF pairs, which have not been successfully predicted by any existing algorithms in the literature.This study shows that the performance of computational identification of cooperative TF pairs could be improved by using a more biologically relevant way to define the targets of a TF. The main contribution of this study is not to propose another new algorithm but to provide a new thinking for the research of computational identification of cooperative TF pairs. Researchers should put more effort on properly defining the targets of a TF (i.e. Step 1) rather than totally focus on designing sophisticated cooperativity measures (i.e. Step 2). The lists of TF target genes, the Matlab codes and the prediction results of the four proposed algorithms could be downloaded from our companion website http://cosbi3.ee.ncku.edu.tw/TFI/.
Project description:BACKGROUND:How transcription factors (TFs) interact with cis-regulatory sequences and interact with each other is a fundamental, but not well understood, aspect of gene regulation. METHODOLOGY/PRINCIPAL FINDINGS:We present a computational method to address this question, relying on the established biophysical principles. This method, STAP (sequence to affinity prediction), takes into account all combinations and configurations of strong and weak binding sites to analyze large scale transcription factor (TF)-DNA binding data to discover cooperative interactions among TFs, infer sequence rules of interaction and predict TF target genes in new conditions with no TF-DNA binding data. The distinctions between STAP and other statistical approaches for analyzing cis-regulatory sequences include the utility of physical principles and the treatment of the DNA binding data as quantitative representation of binding strengths. Applying this method to the ChIP-seq data of 12 TFs in mouse embryonic stem (ES) cells, we found that the strength of TF-DNA binding could be significantly modulated by cooperative interactions among TFs with adjacent binding sites. However, further analysis on five putatively interacting TF pairs suggests that such interactions may be relatively insensitive to the distance and orientation of binding sites. Testing a set of putative Nanog motifs, STAP showed that a novel Nanog motif could better explain the ChIP-seq data than previously published ones. We then experimentally tested and verified the new Nanog motif. A series of comparisons showed that STAP has more predictive power than several state-of-the-art methods for cis-regulatory sequence analysis. We took advantage of this power to study the evolution of TF-target relationship in Drosophila. By learning the TF-DNA interaction models from the ChIP-chip data of D. melanogaster (Mel) and applying them to the genome of D. pseudoobscura (Pse), we found that only about half of the sequences strongly bound by TFs in Mel have high binding affinities in Pse. We show that prediction of functional TF targets from ChIP-chip data can be improved by using the conservation of STAP predicted affinities as an additional filter. CONCLUSIONS/SIGNIFICANCE:STAP is an effective method to analyze binding site arrangements, TF cooperativity, and TF target genes from genome-wide TF-DNA binding data.
Project description:Characterization of the genomic distances over which transcription factor (TF) binding influences gene expression is important for inferring target genes from TF chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Here we systematically examine the relationship between thousands of TF and histone modification ChIP-seq data sets with thousands of gene expression profiles. We develop a model for integrating these data, which reveals two classes of TFs with distinct ranges of regulatory influence, chromatin-binding preferences, and auto-regulatory properties. We find that the regulatory range of the same TF bound within different topologically associating domains (TADs) depend on intrinsic TAD properties such as local gene density and G/C content, but also on the TAD chromatin states. Our results suggest that considering TF type, binding distance to gene locus, as well as chromatin context is important in identifying implicated TFs from GWAS SNPs.
Project description:Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) allows researchers to determine the genome-wide binding locations of individual transcription factors (TFs) at high resolution. This information can be interrogated to study various aspects of TF behaviour, including the mechanisms that control TF binding. Physical interaction between TFs comprises one important aspect of TF binding in eukaryotes, mediating tissue-specific gene expression. We have developed an algorithm, spaced motif analysis (SpaMo), which is able to infer physical interactions between the given TF and TFs bound at neighbouring sites at the DNA interface. The algorithm predicts TF interactions in half of the ChIP-seq data sets we test, with the majority of these predictions supported by direct evidence from the literature or evidence of homodimerization. High resolution motif spacing information obtained by this method can facilitate an improved understanding of individual TF complex structures. SpaMo can assist researchers in extracting maximum information relating to binding mechanisms from their TF ChIP-seq data. SpaMo is available for download and interactive use as part of the MEME Suite (http://meme.nbcr.net).
Project description:We present One Hand Clapping (OHC), a method for the detection of condition-specific interactions between transcription factors (TFs) from genome-wide gene activity measurements. OHC is based on a mapping between transcription factors and their target genes. Given a single case-control experiment, it uses a linear regression model to assess whether the common targets of two arbitrary TFs behave differently than expected from the genes targeted by only one of the TFs. When applied to osmotic stress data in S. cerevisiae, OHC produces consistent results across three types of expression measurements: gene expression microarray data, RNA Polymerase II ChIP-chip binding data and messenger RNA synthesis rates. Among the eight novel, condition-specific TF pairs, we validate the interaction between Gcn4p and Arr1p experimentally. We apply OHC to a large gene activity dataset in S. cerevisiae and provide a compendium of condition-specific TF interactions.
Project description:One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly play an important role in determining gene expression outputs, yet the regulatory logic underlying functional transcription factor binding is poorly understood. Many studies have focused on characterizing the genomic locations of TF binding, yet it is unclear to what extent TF binding at any specific locus has functional consequences with respect to gene expression output. To evaluate the context of functional TF binding we knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line. We then identified genes whose expression was affected by the knockdowns. We intersected the gene expression data with transcription factor binding data (based on ChIP-seq and DNase-seq) within 10 kb of the transcription start sites of expressed genes. This combination of data allowed us to infer functional TF binding. Using this approach, we found that only a small subset of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. We found that functional TF binding is enriched in regulatory elements that harbor a large number of TF binding sites, at sites with predicted higher binding affinity, and at sites that are enriched in genomic regions annotated as "active enhancers."
Project description:BACKGROUND:Current technologies for understanding the transcriptional reprogramming in cells include the transcription factor (TF) chromatin immunoprecipitation (ChIP) experiments and the TF knockout experiments. The ChIP experiments show the binding targets of TFs against which the antibody directs while the knockout techniques find the regulatory gene targets of the knocked-out TFs. However, it was shown that these two complementary results contain few common targets. Researchers have used the concept of TF functional redundancy to explain the low overlap between these two techniques. But the detailed molecular mechanisms behind TF functional redundancy remain unknown. Without knowing the possible molecular mechanisms, it is hard for biologists to fully unravel the cause of TF functional redundancy. RESULTS:To mine out the molecular mechanisms, a novel algorithm to extract TF regulatory modules that help explain the observed TF functional redundancy effect was devised and proposed in this research. The method first searched for candidate TF sets from the TF binding data. Then based on these candidate sets the method utilized the modified Steiner Tree construction algorithm to construct the possible TF regulatory modules from protein-protein interaction data and finally filtered out the noise-induced results by using confidence tests. The mined-out regulatory modules were shown to correlate to the concept of functional redundancy and provided testable hypotheses of the molecular mechanisms behind functional redundancy. And the biological significance of the mined-out results was demonstrated in three different biological aspects: ontology enrichment, protein interaction prevalence and expression coherence. About 23.5% of the mined-out TF regulatory modules were literature-verified. Finally, the biological applicability of the proposed method was shown in one detailed example of a verified TF regulatory module for pheromone response and filamentous growth in yeast. CONCLUSION:In this research, a novel method that mined out the potential TF regulatory modules which elucidate the functional redundancy observed among TFs is proposed. The extracted TF regulatory modules not only correlate the molecular mechanisms to the observed functional redundancy among TFs, but also show biological significance in inferring TF functional binding target genes. The results provide testable hypotheses for biologists to further design subsequent research and experiments.
Project description:BACKGROUND:Interactions among transcription factors (TFs) and histone modifications (HMs) play an important role in the precise regulation of gene expression. The context specificity of those interactions and further its dynamics in normal and disease remains largely unknown. Recent development in genomics technology enables transcription profiling by RNA-seq and protein's binding profiling by ChIP-seq. Integrative analysis of the two types of data allows us to investigate TFs and HMs interactions both from the genome co-localization and downstream target gene expression. RESULTS:We propose a integrative pipeline to explore the co-localization of 55 TFs and 11 HMs and its dynamics in human GM12878 and K562 by matched ChIP-seq and RNA-seq data from ENCODE. We classify TFs and HMs into three types based on their binding enrichment around transcription start site (TSS). Then a set of statistical indexes are proposed to characterize the TF-TF and TF-HM co-localizations. We found that Rad21, SMC3, and CTCF co-localized across five cell lines. High resolution Hi-C data in GM12878 shows that they associate most of the Hi-C peak loci with a specific CTCF-motif "anchor" and supports that CTCF, SMC3, and RAD2 co-localization serves important role in 3D chromatin structure. Meanwhile, 17 TF-TF pairs are highly dynamic between GM12878 and K562. We then build SVM models to correlate high and low expression level of target genes with TF binding and HM strength. We found that H3k9ac, H3k27ac, and three TFs (ELF1, TAF1, and POL2) are predictive with the accuracy about 85~92%. CONCLUSION:We propose a pipeline to analyze the co-localization of TF and HM and their dynamics across cell lines from ChIP-seq, and investigate their regulatory potency by RNA-seq. The integrative analysis of two level data reveals new insight for the cooperation of TFs and HMs and is helpful in understanding cell line specificity of TF/HM interactions.
Project description:Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered as the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and predicts the impact of SNPs on TF binding.
Project description:In the development of the Drosophila embryo, gene expression is directed by the sequence-specific interactions of a large network of protein transcription factors (TFs) and DNA cis-regulatory binding sites. Once the identity of the typically 8-10bp binding sites for any given TF has been determined by one of several experimental procedures, the sequences can be represented in a position weight matrix (PWM) and used to predict the location of additional TF binding sites elsewhere in the genome. Often, alignments of large (>200bp) genomic fragments that have been experimentally determined to bind the TF of interest in Chromatin Immunoprecipitation (ChIP) studies are trimmed under the assumption that the majority of the binding sites are located near the center of all the aligned fragments. In this study, ChIP/chip datasets are analyzed using the corresponding PWMs for the well-studied TFs; CAUDAL, HUNCHBACK, KNIRPS and KRUPPEL, to determine the distribution of predicted binding sites. All four TFs are critical regulators of gene expression along the anterio-posterior axis in early Drosophila development. For all four TFs, the ChIP peaks contain multiple binding sites that are broadly distributed across the genomic region represented by the peak, regardless of the prediction stringency criteria used. This result suggests that ChIP peak trimming may exclude functional binding sites from subsequent analyses.