Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data.
ABSTRACT: A key step in the regulation of gene expression is the sequence-specific binding of transcription factors (TFs) to their DNA recognition sites. However, elucidating TF binding site (TFBS) motifs in higher eukaryotes has been challenging, even when employing cross-species sequence conservation. We hypothesized that, for human and mouse, many orthologous genes expressed in a similarly tissue-specific manner in both human and mouse gene expression data are likely to be co-regulated by orthologous TFs that bind to DNA sequence motifs present within noncoding sequence conserved between these genomes. We performed automated motif searching and merging across four different motif finding algorithms, followed by filtering of the resulting motifs for those that contain blocks of information content. Applying this motif finding strategy to conserved noncoding regions surrounding co-expressed tissue-specific human genes allowed us to discover both previously known, and many novel candidate, regulatory DNA motifs in all 18 tissue-specific expression clusters that we examined. For previously known TFBS motifs, we observed that if a TF was expressed in the specified tissue of interest, then in most cases we identified a motif that matched its TRANSFAC motif; conversely, of all those discovered motifs that matched TRANSFAC motifs, most of the corresponding TF transcripts were expressed in the tissue(s) corresponding to the expression cluster for which the motif was found. Our results indicate that integrating the results from multiple motif finding tools identifies, and ranks highly, more known and novel motifs than does the use of just one of these tools. In addition, we believe that our simultaneous enrichment strategies helped to identify likely human cis regulatory elements. A number of the discovered motifs may correspond to novel binding site motifs for as yet uncharacterized tissue-specific TFs.
We expect this strategy to be useful for identifying motifs in other metazoan genomes.
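The abstract mentions filtering motifs for those that contain blocks of information content but does not give the criterion. The following is a minimal Python sketch of one plausible reading, assuming a uniform A/C/G/T background; the function names, the pseudocount, and the thresholds `min_bits` and `min_run` are illustrative assumptions, not the authors' values.

```python
import math

def column_information(counts, pseudocount=0.5):
    """Information content (bits) of one motif column.

    `counts` holds the A, C, G, T counts for the column. A uniform
    background is assumed, so the maximum is 2 bits.
    """
    total = sum(counts) + 4 * pseudocount
    ic = 2.0
    for c in counts:
        p = (c + pseudocount) / total
        ic += p * math.log2(p)
    return ic

def has_information_block(matrix, min_bits=1.0, min_run=3):
    """True if `matrix` has at least `min_run` consecutive informative columns.

    `matrix` is a list of per-position [A, C, G, T] count lists. Both
    thresholds are hypothetical defaults chosen for illustration.
    """
    run = 0
    for column in matrix:
        if column_information(column) >= min_bits:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False
```

A motif with three adjacent well-conserved positions passes this filter, while one whose conserved positions are interleaved with uninformative ones does not.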
Project description:Nannochloropsis spp. are a group of oleaginous microalgae that harbor an expanded array of lipid-synthesis related genes, yet how they are transcriptionally regulated remains unknown. Here a phylogenomic approach was employed to identify and functionally annotate the transcription factors (TFs) and TF binding sites (TFBSs) in N. oceanica IMET1. Among 36 microalgal and higher plant genomes, a two-fold reduction in the number of TF families plus a seven-fold decrease in average family size were observed in Nannochloropsis, Rhodophyta and Chlorophyta. The degree of similarity in TF-family profiles is indicative of the phylogenetic relationship among the species, suggesting co-evolution of TF-family profiles and species. Furthermore, comparative analysis of six Nannochloropsis genomes revealed 68 "most-conserved" TFBS motifs, 11 of which were predicted to be related to lipid accumulation or photosynthesis. Mapping the IMET1 TFs and TFBS motifs to the reference plant TF-"TFBS motif" relationships in TRANSFAC enabled the prediction of 78 TF-"TFBS motif" interaction pairs, which consisted of 34 TFs (with 11 TFs potentially involved in the TAG biosynthesis pathway), 30 TFBS motifs and 2,368 regulatory connections between TFs and target genes. Our results form the basis of further experiments to validate and engineer the regulatory network of Nannochloropsis spp. for enhanced biofuel production.
Project description:BACKGROUND: Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the resolutions of assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a PSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release do not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an alignment algorithm tailored to TFBSs. RESULTS: We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes position dependence. Results on 189 TFs of 5 species in the TRANSFAC database showed that our method significantly outperformed ClustalW2 and MEME. We further compared a PSSM method dependent on LASAGNA to an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed that our method is significantly more precise at fixed recall rates. Finally, we described LASAGNA-ChIP, a more sophisticated version for ChIP (Chromatin immunoprecipitation) experiments. Under the one-per-sequence model, it showed comparable performance with MEME in discovering motifs in ChIP-seq peak sequences. CONCLUSIONS: We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search and visualization called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. The webtool is available at http://biogrid.engr.uconn.edu/lasagna_search/.
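To make the role of the PSSM concrete, here is a minimal Python sketch of building a log-odds matrix from binding sites and scanning a sequence with it. It assumes the sites are already aligned and of equal length, which is exactly the step LASAGNA is designed to supply for variable-length sites; it is not the LASAGNA algorithm. The function names, pseudocount, and uniform background are illustrative assumptions.

```python
import math

BASES = "ACGT"

def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length, already aligned sites.

    Returns one dict per position mapping each base to
    log2(observed frequency / background frequency).
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("sites must be aligned to the same length")
    pssm = []
    for i in range(length):
        column = [site[i] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pssm

def scan(sequence, pssm):
    """Score every window of `sequence` (forward strand only)."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

def best_hit(sequence, pssm):
    """Return (start, score) of the highest-scoring window."""
    return max(scan(sequence, pssm), key=lambda hit: hit[1])
```

For example, a matrix built from the toy sites `TGACTCA` and `TGAGTCA` places its best hit on the embedded `TGACTCA` in `CCCCTGACTCAGGGG`. A real search would also score the reverse complement and apply a score threshold.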
Project description:Protein-DNA binding between transcription factors (TFs) and transcription factor binding sites (TFBSs) plays an essential role in transcriptional regulation. Over the past decades, significant efforts have been made to study the principles of protein-DNA binding. However, it is generally considered that there are no simple one-to-one rules between amino acids and nucleotides, and many methods impose complicated features beyond sequence patterns. Protein-DNA binding arises from associated amino acid and nucleotide sequence pairs, which determine many functional characteristics. Therefore, it is desirable to investigate associated sequence patterns between TFs and TFBSs. With increasing computational power, the availability of massive experimental databases on DNA and proteins, and mature data mining techniques, we propose a framework to discover associated TF-TFBS binding sequence patterns from TRANSFAC in the most explicit and interpretable form. The framework is based on association rule mining with the Apriori algorithm. The patterns found are evaluated by quantitative measurements at several levels on TRANSFAC. With further independent verification from the literature, the Protein Data Bank and homology modeling, there is strong evidence that the patterns discovered reveal real TF-TFBS binding across different TFs and TFBSs, which can drive further work to better understand TF-TFBS binding.
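The support and confidence measures at the heart of association rule mining can be sketched as below. This is a simplified Python illustration restricted to rules of the form "DNA k-mer implies amino acid k-mer"; it omits the level-wise candidate generation of the full Apriori algorithm and is not the authors' framework. The record format, k-mer lengths, and thresholds are assumptions.

```python
from collections import Counter
from itertools import product

def kmers(sequence, k):
    """Set of distinct k-mers in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def mine_pair_rules(records, k_dna=3, k_aa=2, min_support=0.5, min_confidence=0.8):
    """Find (DNA k-mer -> amino acid k-mer) rules across TF-TFBS records.

    `records` is a list of (protein_sequence, binding_site_sequence) pairs.
    support    = fraction of records containing both k-mers
    confidence = fraction of records with the DNA k-mer that also
                 contain the amino acid k-mer
    """
    n = len(records)
    dna_counts = Counter()
    pair_counts = Counter()
    for protein, site in records:
        dna = kmers(site, k_dna)
        aa = kmers(protein, k_aa)
        dna_counts.update(dna)
        pair_counts.update(product(dna, aa))
    rules = []
    for (dna, aa), count in pair_counts.items():
        support = count / n
        confidence = count / dna_counts[dna]
        if support >= min_support and confidence >= min_confidence:
            rules.append((dna, aa, support, confidence))
    # Strongest rules first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))
```

On made-up records such as `("RKQ", "TGAC")`, `("RKA", "TGAT")` and `("MLS", "CCCC")`, only the pair `TGA` and `RK` co-occurs often enough to be reported.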
Project description:The functions of non-B DNA structures are poorly understood, though several bioinformatics studies predict a role for the G-quadruplex DNA structure in transcription. Earlier, using transcriptome profiling, we found evidence of widespread G-quadruplex-mediated gene regulation. Herein, we asked whether potential G-quadruplex (PG4) motifs associate with transcription factors (TFs). This was analyzed using 220 position weight matrices [designated as transcription factor binding sites (TFBS)], representing 187 unique TFs, in >75,000 genes in human, chimpanzee, mouse and rat. Results show that binding sites of nine TFs, including those of AP-2, SP1, MAZ and VDR, occurred significantly often within 100 bases of the PG4 motif (P < 1.24E-10). PG4-TFBS combinations were conserved in 'orthologously' related promoters across all four organisms and were associated with >850 genes in each genome. Remarkably, seven of the nine TFs were zinc-finger binding proteins, indicating a novel characteristic of PG4 motifs. To test these findings, transcriptome profiles from human cell lines treated with G-quadruplex-specific molecules were used; 66 genes that were significantly differentially expressed across both cell types also harbored conserved PG4 motifs along with one or more of the nine TFBS. In addition, genes regulated by PG4-TFBS combinations were found to be co-regulated in human tissues, further emphasizing the regulatory significance of the associations.
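The proximity analysis described above can be illustrated with a short Python sketch. The abstract does not state how PG4 motifs were defined, so the regular expression below uses a commonly used definition (four runs of at least three guanines separated by loops of 1 to 7 bases) as an assumption. The function names are hypothetical, and only the G-rich forward strand is searched.

```python
import re

# Assumed PG4 definition: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ (forward strand only).
PG4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_pg4(sequence):
    """Return half-open (start, end) intervals of non-overlapping PG4 matches."""
    return [(m.start(), m.end()) for m in PG4_PATTERN.finditer(sequence.upper())]

def gap(a, b):
    """Bases separating two half-open intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))

def tfbs_near_pg4(pg4_sites, tfbs_sites, max_gap=100):
    """Keep TFBS intervals lying within `max_gap` bases of any PG4 interval."""
    return [
        tfbs for tfbs in tfbs_sites
        if any(gap(tfbs, pg4) <= max_gap for pg4 in pg4_sites)
    ]
```

Assessing whether such co-occurrence is more frequent than expected, as the abstract reports, would additionally require a background model and a statistical test, which this sketch leaves out.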
Project description:BACKGROUND: Chromatin plays a critical role in regulating the binding of transcription factors (TFs) to their canonical transcription factor binding sites (TFBS). Recent studies in vertebrates show that many TFs preferentially bind to genomic regions that are well bound by nucleosomes in vitro. Co-occurring secondary motifs are sometimes correlated with functional TFBS. RESULTS: We used logistic regression to evaluate how well the propensity for nucleosome binding and the co-occurrence of a secondary motif identify which canonical motifs are bound in vivo. We used ChIP-seq data for three transcription factors binding to their canonical motifs: c-Jun binding the AP-1 motif (TGA(C)/(G)TCA), GR (glucocorticoid receptor) binding the GR motif (G-ACA---(T)/(C)GT-C), and Hoxa2 (homeobox a2) binding the Pbx (Pre-B-cell leukemia homeobox) motif (TGATTGAT). For each canonical TFBS in the mouse genome, we calculated intrinsic nucleosome occupancy scores (INOS) for its surrounding 150 bp of DNA and examined the relationship with in vivo TF binding. In mouse mammary 3134 cells, c-Jun and GR proteins preferentially bound regions calculated to be well bound by nucleosomes in vitro, with the canonical AP-1 and GR motifs themselves contributing to the high INOS. Functional GR motifs are enriched for AP-1 motifs if they are within a nucleosome-sized 150-bp region. GR and Hoxa2 also bind motifs with low INOS, perhaps indicating a different mechanism of action. CONCLUSION: Our analysis quantified the contribution of INOS and co-occurring sequence to the identification of functional canonical motifs in the genome. This analysis revealed an inherent competition between some TFs and nucleosomes for binding canonical TFBS. GR and c-Jun cooperate if they are within 150 bp of each other. Binding of Hoxa2 and a fraction of GR to motifs with low INOS values suggests that they are not in competition with nucleosomes and may function using different mechanisms.
Project description:DNA sequences bound by a transcription factor (TF) are presumed to contain sequence elements that reflect its DNA binding preferences and its downstream-regulatory effects. Experimentally identified TF binding sites (TFBSs) are usually similar enough to be summarized by a 'consensus' motif, representative of the TF DNA binding specificity. Studies have shown that groups of nucleotide TFBS variants (subtypes) can contribute to distinct modes of downstream regulation by the TF via differential recruitment of cofactors. A TF(A) may bind to TFBS subtypes a(1) or a(2) depending on whether it associates with cofactors TF(B) or TF(C), respectively. While some approaches can discover motif pairs (dyads), none address the problem of identifying 'variants' of dyads. TFs are key components of multiple regulatory pathways targeting different sets of genes perhaps with different binding preferences. Identifying the discriminating TF-DNA associations that lead to the differential downstream regulation is thus essential. We present DiSCo (Discovery of Subtypes and Cofactors), a novel approach for identifying variants of dyad motifs (and their respective target sequence sets) that are instrumental for differential downstream regulation. Using both simulated and experimental datasets, we demonstrate how current motif discovery can be successfully leveraged to address this question.
Project description:Several recent studies have portrayed DNA methylation as a new player in the recruitment of transcription factors (TFs) within chromatin, highlighting a need to connect TF binding sites (TFBS) with their respective DNA methylation profiles. However, current TFBS databases are restricted to DNA binding motif sequences. Here, we present MethMotif, a two-dimensional TFBS database that records TFBS position weight matrices along with cell type specific CpG methylation information computed from a combination of ChIP-seq and whole genome bisulfite sequencing datasets. Integrating TFBS motifs with TFBS DNA methylation better portrays the features of DNA loci recognised by TFs. In particular, we found that DNA methylation patterns within TFBS can be cell specific (e.g. MAFF). Furthermore, for a given TF, different DNA methylation profiles are associated with different DNA binding motifs (e.g. REST). To date, the MethMotif database records over 500 TFBSs computed from over 2000 ChIP-seq datasets in 11 different cell types. The MethMotif portal is accessible through an open source web interface (https://bioinfo-csi.nus.edu.sg/methmotif) that allows users to intuitively explore the entire dataset and perform both single and batch queries.
Project description:Myelodysplastic syndromes have increased in frequency and incidence in the American population, but patient prognosis has not significantly improved over the last decade. Such improvements could be realized if biomarkers for accurate diagnosis and prognostic stratification were successfully identified. In this study, we propose a method that associates two state-of-the-art array technologies, the single nucleotide polymorphism (SNP) array and the gene expression array, with gene motifs considered transcription factor-binding sites (TFBS). We are particularly interested in SNP-containing motifs introduced as TFBS by genetic variation and mutation. The potential regulation by SNP-containing motifs takes effect only when certain mutations occur. These motifs can be identified from a group of co-expressed genes with copy number variation. We used a sliding window to identify motif candidates near SNPs on gene sequences. The candidates were filtered by coarse thresholding and fine statistical testing. Using the regression-based LARS-EN algorithm and a level-wise sequence combination procedure, we identified 28 SNP-containing motifs as candidate TFBS. We confirmed 21 of the 28 motifs with ChIP-chip fragments in the TRANSFAC database. Another six motifs were validated by TRANSFAC via searching binding fragments on co-regulated genes. The identified motifs and the genes in which they are located can be considered potential biomarkers for myelodysplastic syndromes. Thus, our proposed method, a novel strategy for associating two data categories, is capable of integrating information from different sources to identify reliable candidate regulatory SNP-containing motifs introduced by genetic variation and mutation.
Project description:Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to a large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs, while at the same time significantly improves prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in a stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Project description:Differential binding of transcription factors (TFs) at cis-regulatory loci drives the differentiation and function of diverse cellular lineages. Understanding the regulatory interactions that underlie cell fate decisions requires characterizing TF binding sites (TFBS) across multiple cell types and conditions. Techniques such as ChIP-seq can reveal genome-wide patterns of TF binding, but typically require laborious and costly experiments for each TF-cell-type (TFCT) condition of interest. Chromosomal accessibility assays can connect accessible chromatin in one cell type to many TFs through sequence motif mapping. Such methods, however, rarely take into account that the genomic context preferred by each factor differs from TF to TF, and from cell type to cell type. To address the differences in TF behaviors, we developed Mocap, a method that integrates chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and other factors in an ensemble of TFCT-specific classifiers. We show that integration of genomic features such as CpG islands improves TFBS prediction for some TFCT conditions. Further, we describe a method for mapping new TFCT conditions, for which no ChIP-seq data exist, onto our ensemble of classifiers and show that our cross-sample TFBS prediction method outperforms several previously described methods.