Fine-tuning enhancer models to predict transcriptional targets across multiple genomes.
ABSTRACT: Networks of regulatory relations between transcription factors (TF) and their target genes (TG), implemented through TF binding sites (TFBS), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes. Particularly, for CRM-modelling we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM-prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.
Project description: The position-weight matrix (PWM) is a useful representation of a transcription factor binding site (TFBS) sequence pattern because the PWM can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual nucleotide positions, the PWMs for some TFs poorly discriminate binding sites from non-binding-sites that have similar sequence content. Since the local three-dimensional DNA structure ('shape') is a determinant of TF binding specificity and since DNA shape has a significant sequence-dependence, we combined DNA shape-derived features into a TF-generalized regulatory score and tested whether the score could improve PWM-based discrimination of TFBS from non-binding-sites. We compared a traditional PWM model to a model that combines the PWM with a DNA shape feature-based regulatory potential score, for accuracy in detecting binding sites for 75 vertebrate transcription factors. The PWM+shape model was more accurate than the PWM-only model for 45% of TFs tested, with no significant loss of accuracy for the remaining TFs. The shape-based model is available as an open-source R package that is archived on the GitHub software repository at https://email@example.com. Supplementary data are available at Bioinformatics online.
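The position-independence assumption mentioned above can be made concrete with a minimal Python sketch. It is not the authors' R package and contains no shape features; the function names, the pseudocount of 1 and the uniform background of 0.25 are assumptions chosen for illustration. Each column of the PWM is scored separately and the per-position log-odds are simply summed, which is exactly why dependencies between positions cannot be captured.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Estimate a log2-odds PWM from equal-length example binding sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        # One independent column of log-odds scores per position.
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def score(pwm, seq):
    """Sum per-position log-odds; positions are treated as independent."""
    return sum(col[base] for col, base in zip(pwm, seq))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in a sequence."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((score(pwm, sequence[i:i + width]), i) for i in windows)


if __name__ == "__main__":
    example_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
    pwm = build_pwm(example_sites)
    print(best_hit(pwm, "GGGTATAATCCC"))
```

Because the score is a plain sum over columns, two sequences with the same per-position composition receive the same score regardless of how their nucleotides co-vary, which is the limitation that additional features such as DNA shape are meant to address.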
Project description: BACKGROUND: Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions. RESULTS: We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI. CONCLUSION: Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
Project description:While transcription factors (TFs) are known to regulate the expression of their target genes (TGs), only a weak correlation of expression between TFs and their TGs has generally been observed. As lack of correlation could be caused by additional layers of regulation, the overall correlation distribution may hide the presence of a subset of regulatory TF-TG pairs with tight expression coupling. Using reported regulatory pairs in the plant Arabidopsis thaliana along with comprehensive gene expression information and testing a wide array of molecular features, we aimed to discern the molecular determinants of high expression correlation of TFs and their TGs. TF-family assignment, stress-response process involvement, short genomic distances of the TF-binding sites to the transcription start site of their TGs, few required protein-protein-interaction connections to establish physical interactions between the TF and polymerase-II, unambiguous TF-binding motifs, increased numbers of miRNA target-sites in TF-mRNAs, and a young evolutionary age of TGs were found particularly indicative of high TF-TG correlation. The modulating roles of post-transcriptional, post-translational processes, and epigenetic factors have been characterized as well. Our study reveals that regulatory pairs with high expression coupling are associated with specific molecular determinants.
Project description: Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs while at the same time significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in a stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Project description: Nannochloropsis spp. are a group of oleaginous microalgae that harbor an expanded array of lipid-synthesis related genes, yet how they are transcriptionally regulated remains unknown. Here a phylogenomic approach was employed to identify and functionally annotate the transcriptional factors (TFs) and TF binding-sites (TFBSs) in N. oceanica IMET1. Among 36 microalgae and higher plants genomes, a two-fold reduction in the number of TF families plus a seven-fold decrease of average family-size in Nannochloropsis, Rhodophyta and Chlorophyta were observed. The degree of similarity in TF-family profiles is indicative of the phylogenetic relationship among the species, suggesting co-evolution of TF-family profiles and species. Furthermore, comparative analysis of six Nannochloropsis genomes revealed 68 "most-conserved" TFBS motifs, 11 of which are predicted to be related to lipid accumulation or photosynthesis. Mapping the IMET1 TFs and TFBS motifs to the reference plant TF-"TFBS motif" relationships in TRANSFAC enabled the prediction of 78 TF-"TFBS motif" interaction pairs, which consisted of 34 TFs (with 11 TFs potentially involved in the TAG biosynthesis pathway), 30 TFBS motifs and 2,368 regulatory connections between TFs and target genes. Our results form the basis of further experiments to validate and engineer the regulatory network of Nannochloropsis spp. for enhanced biofuel production.
Project description:Differentially evolved responses to various stress conditions in plants are controlled by complex regulatory circuits of transcriptional activators, and repressors, such as transcription factors (TFs). To understand the general and condition-specific activities of the TFs and their regulatory relationships with the target genes (TGs), we have used a homogeneous stress gene expression dataset generated on ten natural ecotypes of the model plant Arabidopsis thaliana, during five single and six combined stress conditions. Knowledge-based profiles of binding sites for 25 stress-responsive TF families (187 TFs) were generated and tested for their enrichment in the regulatory regions of the associated TGs. Condition-dependent regulatory sub-networks have shed light on the differential utilization of the underlying network topology, by stress-specific regulators and multifunctional regulators. The multifunctional regulators maintain the core stress response processes while the transient regulators confer the specificity to certain conditions. Clustering patterns of transcription factor binding sites (TFBS) have reflected the combinatorial nature of transcriptional regulation, and suggested the putative role of the homotypic clusters of TFBS towards maintaining transcriptional robustness against cis-regulatory mutations to facilitate the preservation of stress response processes. The Gene Ontology enrichment analysis of the TGs reflected sequential regulation of stress response mechanisms in plants.
Project description: BACKGROUND: Gene regulation by transcription factors (TF) is species, tissue and time specific. To better understand how the genetic code controls gene expression in bovine muscle we associated gene expression data from developing Longissimus thoracis et lumborum skeletal muscle with bovine promoter sequence information. RESULTS: We created a highly conserved genome-wide promoter landscape comprising 87,408 interactions relating 333 TFs with their 9,242 predicted target genes (TGs). We discovered that the complete set of predicted TGs share an average of 2.75 predicted TF binding sites (TFBSs) and that the average co-expression between a TF and its predicted TGs is higher than the average co-expression between the same TF and all genes. Conversely, pairs of TFs sharing predicted TGs showed a co-expression correlation higher than pairs of TFs not sharing TGs. Finally, we exploited the co-occurrence of predicted TFBS in the context of muscle-derived functionally-coherent modules including cell cycle, mitochondria, immune system, fat metabolism, muscle/glycolysis, and ribosome. Our findings enabled us to reverse engineer a regulatory network of core processes, and correctly identified the involvement of E2F1, GATA2 and NFKB1 in the regulation of cell cycle, fat, and muscle/glycolysis, respectively. CONCLUSION: The pivotal implication of our research is two-fold: (1) there exists a robust genome-wide expression signal between TFs and their predicted TGs in cattle muscle consistent with the extent of promoter sharing; and (2) this signal can be exploited to recover the cellular mechanisms underpinning transcription regulation of muscle structure and development in bovine. Our study represents the first genome-wide report linking tissue specific co-expression to co-regulation in a non-model vertebrate.
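The comparison described above, between a TF's average co-expression with its predicted TGs and its average co-expression with all genes, can be sketched as follows. This is an illustrative Python sketch only: the abstract does not state which correlation measure was used, so Pearson correlation is an assumption, and the function names and the toy expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / math.sqrt(var_x * var_y)


def mean_coexpression(expr, tf, genes):
    """Average correlation between a TF and a set of genes (TF excluded)."""
    values = [pearson(expr[tf], expr[g]) for g in genes if g != tf]
    return sum(values) / len(values)


def target_vs_background(expr, tf, targets):
    """Return (mean over predicted targets, mean over all genes)."""
    return (mean_coexpression(expr, tf, targets),
            mean_coexpression(expr, tf, list(expr)))


if __name__ == "__main__":
    # Hypothetical toy profiles: gene -> expression across four samples.
    expr = {
        "TF": [1, 2, 3, 4],
        "G1": [2, 4, 6, 8],
        "G2": [1, 3, 5, 7.5],
        "G3": [4, 3, 2, 1],
        "G4": [2, 1, 2, 1],
    }
    print(target_vs_background(expr, "TF", ["G1", "G2"]))
```

In a real analysis the difference between the two means would need a significance test, for example against randomly drawn gene sets of the same size; the sketch only shows the quantity being compared.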
Project description:Transcription factors (TFs) play a fundamental role in coordinating biological processes in response to stimuli. Consequently, we often seek to determine the key TFs and their regulated target genes (TGs) amidst gene expression data. This requires a knowledge-base of TF-TG interactions, which would enable us to determine the topology of the transcriptional network and predict novel regulatory interactions. To address this, we generated an Open-access Repository of Transcriptional Interactions, ORTI, by integrating available TF-TG interaction databases. These databases rely on different types of experimental evidence, including low-throughput assays, high-throughput screens, and bioinformatics predictions. We have subsequently categorised TF-TG interactions in ORTI according to the quality of this evidence. To demonstrate its capabilities, we applied ORTI to gene expression data and identified modulated TFs using an enrichment analysis. Combining this with pairwise TF-TG interactions enabled us to visualise temporal regulation of a transcriptional network. Additionally, ORTI enables the prediction of novel TF-TG interactions, based on how well candidate genes co-express with known TGs of the target TF. By filtering out known TF-TG interactions that are unlikely to occur within the experimental context, this analysis predicts context-specific TF-TG interactions. We show that this can be applied to experimental designs of varying complexities. In conclusion, ORTI is a rich and publicly available database of experimentally validated mammalian transcriptional interactions which is accompanied with tools that can identify and predict transcriptional interactions, serving as a useful resource for unravelling the topology of transcriptional networks.
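One common way to run the kind of enrichment analysis mentioned above is a one-sided hypergeometric test on the overlap between a TF's known targets and the set of differentially expressed genes. The abstract does not say which test ORTI uses, so the Python sketch below is an assumption for illustration, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_targets, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_universe: all genes considered
    n_targets:  known targets of the TF within the universe
    n_selected: differentially expressed genes
    n_overlap:  differentially expressed genes that are known targets
    """
    max_overlap = min(n_targets, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_targets, k) * comb(n_universe - n_targets, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 100 genes, 10 targets, 10 selected, 5 shared.
    print(hypergeom_enrichment_p(100, 10, 10, 5))
```

A small p-value suggests that the TF's targets are over-represented among the modulated genes; when many TFs are tested, the p-values should be corrected for multiple testing.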
Project description: MicroRNAs (miRNAs) are small non-coding RNAs that regulate genes at the post-transcriptional level in a spatiotemporal manner. Several miRNAs are identified as prognostic and diagnostic markers in many human cancers. Estimation of the temporal activities of the miRNAs is an important step towards understanding the complex interactions of these important regulatory elements with transcription factors (TFs) and target genes (TGs). However, current research on miRNA activities excludes network dynamics from the studies, disregarding the important element of time in the regulatory network analysis. In the current study, we combined experimentally verified miRNA-TG interactions with breast cancer microarray TG expression data to identify key miRNAs and compute their temporal activity using network component analysis (NCA). The computed activities showed that miRNAs were regulated in a time dependent manner. Our results allowed constructing a synergistic network of miRNAs using the computed miRNA activities and their shared regulation of TGs. We further extended this network by incorporating miRNA-TG, miRNA-TF, TF-miRNA and TF-TG regulations in the context of breast cancer. Our integrated network identified several miRNAs known to be involved in breast cancer regulation and revealed several novel miRNAs. Our further analysis detected substantial involvement of the miRNAs miR-324, miR-93, miR-615 and miR-1 in breast cancer, which was not known previously. Next, combining our integrated networks with functional annotation of differentially expressed genes resulted in new sub-networks. These sub-networks allowed us to identify the key miRNAs and their interactions with TFs and TGs of several biological processes involved in breast cancer. The identified markers are validated for their potential as prognostic markers for breast cancer through survival analysis. Our dynamical analysis of the miRNA interactions greatly helps to discover new network based markers, and is highly applicable (but not limited) to cancer research.
Project description: The evolution of regulatory networks in Bacteria has largely been explained at macroevolutionary scales through lateral gene transfer and gene duplication. Transcription factors (TF) have been found to be less conserved across species than their target genes (TG). This would be expected if TFs accumulate mutations faster than TGs. This hypothesis is supported by several lab evolution studies which found TFs, especially global regulators, to be frequently mutated. Despite these studies, the contribution of point mutations in TFs to the evolution of regulatory networks is poorly understood. We tested if TFs show greater genetic variation than their TGs using whole-genome sequencing data from a large collection of Escherichia coli isolates. TFs were less diverse than their TGs across natural isolates, with TFs of large regulons being more conserved. In contrast, TFs showed higher mutation frequency in adaptive laboratory evolution experiments. However, over long-term laboratory evolution spanning 60 000 generations, mutation frequency in TFs gradually declined after a rapid initial burst. Extrapolating the dynamics of genetic variation from long-term laboratory evolution to natural populations, we propose that point mutations, conferring large-scale gene expression changes, may drive the early stages of adaptation but gene regulation is subjected to stronger purifying selection post adaptation.