Dataset Information

Evaluation of methods for modeling transcription factor sequence specificity

ABSTRACT: Genomic analyses often involve scanning for potential transcription-factor (TF) binding sites using models of the sequence specificity of DNA binding proteins. Many approaches have been developed to model and learn a protein’s binding specificity by representing sequence motifs, including the gaps and dependencies between binding-site residues, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For 9 TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro–derived motifs performed similarly to motifs derived from in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices learned by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases. In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.

ORGANISM(S): synthetic construct

PROVIDER: GSE42864 | GEO | 2012/12/13

SECONDARY ACCESSION(S): PRJNA183675

REPOSITORIES: GEO

ACCESS DATA

Dataset's files

Source:

			Action	DRS
		Other

Items per page:

1 - 1 of 1

Similar Datasets

Project description:Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TF with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity. Four protein binding microarray (PBM) experiments of human transcription factors were performed. Briefly, the PBMs involved binding GST-tagged transcription factors c-Myc, Max, and Mad2(Mxi1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. Briefly, we represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (or negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 celld (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occurr at the same position within the probes on the array. M-bM-^@M-^\UnboundM-bM-^@M-^] probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).

Project description:DNA sequence is a major determinant of the binding specificity of transcription factors (TFs) for their genomic targets. However, eukaryotic cells often express, at the same time, TFs with highly similar DNA binding motifs but distinct in vivo targets. Currently, it is not well understood how TFs with seemingly identical DNA motifs achieve unique specificities in vivo. Here, we used custom protein binding microarrays to analyze TF specificity for putative binding sites in their genomic sequence context. Using yeast TFs Cbf1 and Tye7 as our case study, we found that binding sites of these bHLH TFs (i.e., E-boxes) are bound differently in vitro and in vivo, depending on their genomic context. Computational analyses suggest that nucleotides outside E-box binding sites contribute to specificity by influencing the 3D structure of DNA binding sites. Thus, local shape of target sites might play a widespread role in achieving regulatory specificity within TF families. Three protein binding microarray (PBM) experiments of Saccharomyces cerevisiae transcription factors were performed. Briefly, the PBMs involved binding GST-tagged yeast transcription factors Cbf1 and Tye7 to double-stranded 44K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. Briefly, we represent three categories of 30-bp genomic sequences: 1) ChIP-chip bound probes, 2) ChIP-chip unbound probes, and 3) negative control probes. ChIP-chip bound probes corresponded to genomic regions bound in vivo by Cbf1 or Tye7 (ChIP-chip P < 0.005 in rich medium (YPD) (Harbison et al., Nature 2004, PMID 15343339)) contained at least two consecutive 8-mers with universal PBM E-score > 0.35 (Zhu et al., Genome Research 2009, PMID 19158363). All putative binding sites occurred at the same position within the probes on the array. M-bM-^@M-^\ChIP-chip unboundM-bM-^@M-^] probes corresponded to genomic regions with ChIP-chip P > 0.5 and at least two consecutive 8-mers at a more stringent universal PBM E-score threshold of 0.4. Negative control probes corresponded to S. cerevisiae intergenic regions with a maximum 8-mer E-score < 0.3. We also designed probes that contain, within constant flanking regions, all 10-bp sequences that occur within the M-bM-^@M-^\ChIP-chip boundM-bM-^@M-^] probes and contain the E-box CACGTG, but are flanked by synthetic rather than native genomic sequence. Each DNA sequence represented on the array is present in 4 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).

Dataset Information

Evaluation of methods for modeling transcription factor sequence specificity

Dataset's files

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets