Dataset Information

Stability selection for regression-based models of transcription factor-DNA binding specificity

ABSTRACT: Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity. Four protein binding microarray (PBM) experiments of human transcription factors were performed.
Briefly, the PBMs involved binding GST-tagged transcription factors c-Myc, Max, and Mad2 (Mxi1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (or negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occur at the same position within the probes on the array. "Unbound" probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
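
The abstract describes models whose features are individual bases, dinucleotides, and trinucleotides at specific positions within or near the binding site. A minimal Python sketch of such a positional encoding follows, assuming 0-based positions and k up to 3. The function names are hypothetical, and this record does not specify the encoding the authors actually used.

```python
ALPHABET = "ACGT"


def positional_kmer_features(seq, max_k=3):
    """List the (k, position, k-mer) features present in one probe sequence.

    Assumption: a feature is a k-mer (k = 1..max_k) tied to its 0-based
    start position. K-mers containing characters other than A/C/G/T are skipped.
    """
    seq = seq.upper()
    features = []
    for k in range(1, max_k + 1):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if set(kmer) <= set(ALPHABET):
                features.append((k, pos, kmer))
    return features


def feature_space_size(length, max_k=3):
    """Count all possible positional k-mer features for a probe of this length."""
    return sum((length - k + 1) * 4 ** k for k in range(1, max_k + 1))


if __name__ == "__main__":
    print(positional_kmer_features("ACG"))
    print(feature_space_size(36))  # size of the candidate feature space for a 36-bp probe
```

Under these assumptions a 36-bp probe has a few thousand candidate features, which illustrates why the abstract pairs such models with feature selection.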

ORGANISM(S): Homo sapiens

SUBMITTER: Raluca Gordan

PROVIDER: E-GEOD-47026 | biostudies-arrayexpress |

REPOSITORIES: biostudies-arrayexpress

Similar Datasets

Project description: DNA sequence is a major determinant of the binding specificity of transcription factors (TFs) for their genomic targets. However, eukaryotic cells often express, at the same time, TFs with highly similar DNA binding motifs but distinct in vivo targets. Currently, it is not well understood how TFs with seemingly identical DNA motifs achieve unique specificities in vivo. Here, we used custom protein binding microarrays to analyze TF specificity for putative binding sites in their genomic sequence context. Using yeast TFs Cbf1 and Tye7 as our case study, we found that binding sites of these bHLH TFs (i.e., E-boxes) are bound differently in vitro and in vivo, depending on their genomic context. Computational analyses suggest that nucleotides outside E-box binding sites contribute to specificity by influencing the 3D structure of DNA binding sites. Thus, local shape of target sites might play a widespread role in achieving regulatory specificity within TF families. Three protein binding microarray (PBM) experiments of Saccharomyces cerevisiae transcription factors were performed. Briefly, the PBMs involved binding GST-tagged yeast transcription factors Cbf1 and Tye7 to double-stranded 44K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 30-bp genomic sequences: 1) ChIP-chip bound probes, 2) ChIP-chip unbound probes, and 3) negative control probes. ChIP-chip bound probes corresponded to genomic regions bound in vivo by Cbf1 or Tye7 (ChIP-chip P < 0.005 in rich medium (YPD) (Harbison et al., Nature 2004, PMID 15343339)) that contained at least two consecutive 8-mers with universal PBM E-score > 0.35 (Zhu et al., Genome Research 2009, PMID 19158363). All putative binding sites occurred at the same position within the probes on the array.
"ChIP-chip unbound" probes corresponded to genomic regions with ChIP-chip P > 0.5 and at least two consecutive 8-mers at a more stringent universal PBM E-score threshold of 0.4. Negative control probes corresponded to S. cerevisiae intergenic regions with a maximum 8-mer E-score < 0.3. We also designed probes that contain, within constant flanking regions, all 10-bp sequences that occur within the "ChIP-chip bound" probes and contain the E-box CACGTG, but are flanked by synthetic rather than native genomic sequence. Each DNA sequence represented on the array is present in 4 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
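
The probe-selection rule above (at least two consecutive 8-mers whose E-score exceeds a threshold) can be sketched in Python as follows. This is an illustration only: it assumes "consecutive" means 8-mers at adjacent start positions, and that E-scores are supplied as a plain dictionary keyed by the 8-mer exactly as it appears in the sequence. The function name and the toy scores in the usage note are hypothetical.

```python
def has_consecutive_high_kmers(seq, escores, threshold, k=8, run=2):
    """Return True if `seq` has `run` k-mers at adjacent start positions
    whose E-scores all exceed `threshold`.

    Assumptions: `escores` maps k-mer strings to E-scores, and k-mers
    missing from the table are treated as below threshold.
    """
    streak = 0
    for start in range(len(seq) - k + 1):
        score = escores.get(seq[start:start + k], float("-inf"))
        if score > threshold:
            streak += 1
            if streak >= run:
                return True
        else:
            streak = 0
    return False
```

For a quick check with a toy table of 3-mers, `has_consecutive_high_kmers("ACGTA", {"ACG": 0.45, "CGT": 0.41, "GTA": 0.10}, 0.4, k=3)` finds the adjacent pair ACG and CGT. A real E-score table may store only one orientation of each 8-mer, in which case the lookup would also need to try the reverse complement.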

Project description: Accurate predictions of the DNA binding specificities of transcription factors (TFs) are necessary for understanding gene regulatory mechanisms. Traditionally, predictive models are built based on nucleotide sequence features. Here, we employed three-dimensional DNA shape information obtained on a high-throughput basis to integrate intuitive DNA structural features into the modeling of TF binding specificities using support vector regression. We performed quantitative predictions of DNA binding specificities, using the DREAM5 dataset for 65 mouse TFs and genomic-context protein binding microarray data for three human basic helix-loop-helix TFs. DNA shape-augmented models compared favorably with sequence-based models for these predictions. Although both k-mer and DNA shape features encoded the interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space compared to k-mer use. Finally, analyzing the weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not obvious from the nucleotide sequence. Three genomic-context protein binding microarray (gcPBM) experiments of human transcription factors were performed. Briefly, the gcPBMs involved binding his-tagged transcription factors c-Myc, Max, and Mad1 (Mxd1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (or negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occur at the same position within the probes on the array.
"Unbound" probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the gcPBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
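
The set of "all nnCACGTGnn 10-mers" mentioned above is small enough to enumerate directly: with four free positions there are 4^4 = 256 sequences. A short Python sketch follows; the function name is hypothetical, and the constant flanking regions and the 18 selected 12-mers are not given in this record, so they are not modeled.

```python
from itertools import product


def ebox_kmers(core="CACGTG", flank=2, bases="ACGT"):
    """Enumerate every sequence of the form n{flank} + core + n{flank}."""
    sequences = []
    for left in product(bases, repeat=flank):
        for right in product(bases, repeat=flank):
            sequences.append("".join(left) + core + "".join(right))
    return sequences


if __name__ == "__main__":
    tenmers = ebox_kmers()
    print(len(tenmers))   # number of nnCACGTGnn 10-mers
    print(tenmers[:3])    # first few sequences in enumeration order
```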

Project description: Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)-DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF-DNA binding preferences. We used high-throughput protein-DNA binding assays to measure the binding levels and free energies of binding for several human TFs to tens of thousands of short DNA sequences with varying repeat symmetries. Based on statistical mechanics modeling, we identify a new protein-DNA binding mechanism induced by DNA sequence symmetry in the absence of specific base-pair recognition, and experimentally demonstrate that this mechanism indeed governs protein-DNA binding preferences. Four custom protein binding microarray (PBM) experiments of human transcription factors were performed. Briefly, the PBMs involved binding his-tagged transcription factors c-Myc, Max, and Mad1 (Mxd1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for GTCACGTGAC DNA binding sites flanked by repetitive DNA elements with different symmetries and correlation length scales. The arrays represent three categories of 36-bp sequences: 1) 28800 probes centered at a GTCACGTGAC site and flanked by repetitive elements (probe names starting with Ariel_); 2) unbound probes (or negative controls); and 3) bound probes, which correspond to randomly selected genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) and which contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). Each DNA sequence represented on the array is present in 6 replicate spots.
We report the gcPBM signal intensity for each spot (raw files) as well as the median intensity over the 6 replicate spots (normalized data). The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
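
The relationship between the per-spot raw values and the per-sequence median described above can be illustrated with a small Python sketch. It assumes the raw data are available as (probe name, intensity) pairs; the function name and input layout are hypothetical, and any normalization applied before taking the median is not described in this record and is not reproduced here.

```python
from collections import defaultdict
from statistics import median


def median_by_probe(spots):
    """Collapse replicate spots to one median intensity per probe.

    `spots` is an iterable of (probe_name, intensity) pairs, with one
    pair per replicate spot (6 per sequence on the arrays described above).
    """
    grouped = defaultdict(list)
    for probe_name, intensity in spots:
        grouped[probe_name].append(intensity)
    return {name: median(values) for name, values in grouped.items()}
```

With an even number of replicates, as here, `statistics.median` returns the mean of the two middle values.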

Project description: Transcription factors (TFs) play a central role in regulating gene expression by interacting with cis regulatory DNA elements associated with their target genes. Recent surveys have examined the DNA binding specificities of most Saccharomyces cerevisiae transcription factors but a comprehensive evaluation of their data has been lacking. Results: We analyzed in vitro and in vivo TF-DNA binding data reported in previous large-scale studies to generate a comprehensive, curated resource of DNA binding specificity data for all characterized S. cerevisiae transcription factors. Our collection comprises DNA binding site motifs and comprehensive in vitro DNA binding specificity data for all possible 8 bp sequences. Included in this database is DNA binding specificity data for 27 TFs independently generated by PBM analysis in this current study. Investigation of the DNA binding specificities within the basic leucine zipper (bZIP) and VHR transcription factor families revealed unexpected plasticity in TF-DNA recognition: intriguingly, the VHR transcription factors, newly characterized by protein binding microarrays in this study, recognize bZIP-like DNA motifs, while the bZIP transcription factor Hac1 recognizes a motif highly similar to the canonical E-box motif of basic helix-loop-helix (bHLH) transcription factors. We identified several transcription factors with distinct primary and secondary motifs, which might be associated with different regulatory functions. Finally, integrated analysis of in vivo transcription factor binding data with protein binding microarray data lends further support for indirect DNA binding in vivo by sequence-specific transcription factors. 27 protein binding microarray (PBM) experiments of Saccharomyces cerevisiae transcription factors were performed. Briefly, the PBMs involved binding GST-tagged yeast transcription factors to double-stranded 44K Agilent microarrays in order to determine their sequence preferences.
The method is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473). A key feature is that the microarrays are composed of de Bruijn sequences that contain each 10-base sequence once and only once, providing an evenly balanced sequence distribution. Individual de Bruijn sequences have different properties, including representation of gapped patterns. The array probe sequences on the custom array design used in this study were reported previously in Berger et al., Cell 2008 (PMID 18585359) and are available via an academic research use license. Here we provide the data transformed into median signal intensities (after normalization and detrending of the original array data) for all 32,896 8-base sequences, Z-scores for these intensities, and E-scores. E-scores are a modified version of AUC and describe how well each 8-mer ranks the intensities of the spots. In general, the E-scores are slightly more reproducible than Z-scores, but contain less information about relative binding affinity. Additional experimental details are found in Berger et al., Nature Biotechnology 2006, Gordan et al., Genome Biology (in press), and the accompanying Supplementary information.
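
The figure of 32,896 8-base sequences matches the number of distinct 8-mers when a sequence and its reverse complement are counted once, which is natural for double-stranded probes: (4^8 + 4^4) / 2 = 32,896, where 4^4 = 256 is the number of 8-mers that equal their own reverse complement. The Python sketch below checks this by enumeration. The function names are hypothetical, and choosing the lexicographically smaller strand as the representative is an assumption made for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """Return the set of k-mers with each reverse-complement pair merged.

    Each pair is represented by its lexicographically smaller member.
    """
    canonical = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        canonical.add(min(kmer, reverse_complement(kmer)))
    return canonical


if __name__ == "__main__":
    print(len(canonical_kmers(8)))
```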

Project description: The Myc-Max heterodimer is a DNA binding protein that regulates expression of a large number of genes. Genome occupancy of Myc-Max is thought to be driven by E-boxes (CACGTG or variants) to which the heterodimer binds in vitro. By analyzing ChIP-Seq datasets, we demonstrated that the positions occupied by Myc-Max across the human genome correlate with the RNA polymerase II (Pol II) transcription machinery better than with E-boxes. Metagene analyses showed that in promoter regions, Myc was uniformly positioned about 100 bp upstream of essentially all promoter proximal paused polymerases with Max about 10 bp upstream of Myc. We re-evaluated the DNA binding properties of full length Myc-Max proteins using electrophoretic mobility shift assays (EMSA) and protein-binding microarrays (PBM). EMSA results demonstrated Myc-Max heterodimers have high affinity for both E-box containing and non-specific DNA. Quantification of the relative affinities of Myc-Max for all possible 8-mers using PBM assays showed that sequences surrounding core 6-mers significantly affect binding. Compared with the in vitro sequence preferences, Myc-Max genomic occupancy measured by ChIP-Seq was largely, although not completely, independent of sequence specificity. Our results suggest that the transcription machinery and associated promoter accessibility play an important role in genomic occupancy of Myc. Two protein binding microarray (PBM) experiments were performed: one for the heterodimer of the human transcription factors c-Myc and Max, and one for the Max-Max homodimer. Briefly, 4x44K arrays (Agilent Technologies; AmadID 015681) containing the 'all 10-mer' universal PBM design were used. Arrays were incubated with a PBS buffer-based protein mixture of either 10 nM His-tagged Myc-Max heterodimer or 10 nM His-tagged Max-Max homodimer, 2% milk, 200 ng/µL BSA, 50 ng/µL Salmon Testes DNA, and 0.02% TX-100.
Bound protein was tagged with 10 ng/µL anti-His antibody conjugated to Alexa 488 (Qiagen; 35310) in PBS with 2% milk. Data were analyzed to obtain fluorescence intensities for all 8-mers. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).

Project description:A subfamily of Drosophila homeodomain (HD) transcription factors (TFs) controls the identities of individual muscle founder cells (FCs). However, the molecular mechanisms by which these TFs generate unique FC genetic programs remain unknown. To investigate this problem, we first applied genome-wide mRNA expression profiling to identify genes that are activated or repressed by the muscle HD TFs Slouch (Slou) and Muscle segment homeobox (Msh). Next, we used protein binding microarrays to define the sequences that are bound by Slou, Msh and other HD TFs having mesodermal expression. These studies revealed that a large class of HDs, including Slou and Msh, predominantly recognize TAAT core sequences but that each HD also binds to unique sites that deviate from this canonical motif. To better understand the regulatory specificity of an individual FC identity HD, we evaluated the functions of atypical binding sites that are preferentially bound by Slou relative to other HDs within muscle enhancers that are either activated or repressed by this TF. These studies showed that Slou regulates the activities of particular myoblast enhancers through Slou-preferred sequences, whereas swapping these sequences for sites that are capable of binding to multiple HD family members does not support the normal regulatory functions of Slou. Moreover, atypical Slou binding sites are overrepresented in putative enhancers associated with additional Slou-responsive FC genes. Collectively, these studies provide new insights into the roles of individual HD TFs in determining cellular identity, and suggest that the diversity of HD binding preferences can confer regulatory specificity. 10 Protein binding microarray (PBM) experiments of Drosophila transcription factors were performed. Briefly, the PBMs involved binding GST-tagged fly transcription factors to double-stranded 44K Agilent microarrays in order to determine their sequence preferences. 
The method is described in Berger et al., Nature Biotechnology 2006 (PMID: 16998473). A key feature is that the microarrays are composed of de Bruijn sequences that contain each 10-base sequence once and only once, providing an evenly balanced sequence distribution. Individual de Bruijn sequences have different properties, including representation of gapped patterns. The array probe sequences on the custom array design used in this study were reported previously in Berger et al., Cell 2008 (PMID: 18585359) and are available via an academic research use license. Here we provide the data transformed into median signal intensities (after normalization and detrending of the original array data) for all 32,896 8-base sequences, Z-scores for these intensities, and E-scores. E-scores are a modified version of AUC, and describe how well each 8-mer ranks the intensities of the spots. 'Keep fraction' (kf) parameter setting of 0.9 was used to calculate E-scores. In general the E-scores are slightly more reproducible than Z-scores, but contain less information about relative binding affinity. Additional experimental details are found in Berger et al., Nature Biotechnology 2006 (PMID: 16998473), and the accompanying Supplementary information.