High-throughput functional testing of ENCODE segmentation predictions.
ABSTRACT: The histone modification state of genomic regions is hypothesized to reflect the regulatory activity of the underlying genomic DNA. Based on this hypothesis, the ENCODE Project Consortium measured the status of multiple histone modifications across the genome in several cell types and used these data to segment the genome into regions with different predicted regulatory activities. We measured the cis-regulatory activity of more than 2000 of these predictions in the K562 leukemia cell line. We tested genomic segments predicted to be Enhancers, Weak Enhancers, or Repressed elements in K562 cells, along with other sequences predicted to be Enhancers specific to the H1 human embryonic stem cell line (H1-hESC). Both Enhancer and Weak Enhancer sequences in K562 cells were more active than negative controls, although surprisingly, Weak Enhancer segmentations drove expression higher than did Enhancer segmentations. Lower levels of the covalent histone modifications H3K36me3 and H3K27ac, thought to mark active enhancers and transcribed gene bodies, associate with higher expression and partly explain the higher activity of Weak Enhancers over Enhancer predictions. While DNase I hypersensitivity (HS) is a good predictor of active sequences in our assay, transcription factor (TF) binding models need to be included in order to accurately identify highly expressed sequences. Overall, our results show that a significant fraction (-26%) of the ENCODE enhancer predictions have regulatory activity, suggesting that histone modification states can reflect the cis-regulatory activity of sequences in the genome, but that specific sequence preferences, such as TF-binding sites, are the causal determinants of cis-regulatory activity.
Project description:Experimentally characterized enhancer regions have previously been shown to display specific patterns of enrichment for several different histone modifications. We modelled these enhancer chromatin profiles in the human genome and used them to guide the search for novel enhancers derived from transposable element (TE) sequences. To do this, a computational approach was taken to analyze the genome-wide histone modification landscape characterized by the ENCODE project in two human hematopoietic cell types, GM12878 and K562. We predicted the locations of 2,107 and 1,448 TE-derived enhancers in the GM12878 and K562 cell lines respectively. A vast majority of these putative enhancers are unique to each cell line; only 3.5% of the TE-derived enhancers are shared between the two. We evaluated the functional effect of TE-derived enhancers by associating them with the cell-type specific expression of nearby genes, and found that the number of TE-derived enhancers is strongly positively correlated with the expression of nearby genes in each cell line. Furthermore, genes that are differentially expressed between the two cell lines also possess a divergent number of TE-derived enhancers in their vicinity. As such, genes that are up-regulated in the GM12878 cell line and down-regulated in K562 have significantly more TE-derived enhancers in their vicinity in the GM12878 cell line and vice versa. These data indicate that human TE-derived sequences are likely to be involved in regulating cell-type specific gene expression on a broad scale and suggest that the enhancer activity of TE-derived sequences is mediated by epigenetic regulatory mechanisms.
Project description:Alu elements are primate-specific repeats and represent the most abundant type of transposable elements (TE) in the human genome. Genome-wide analysis of the enrichment of histone post-translational modifications suggests that human Alu sequences could function as transcriptional enhancers; however, no functional experiments have evaluated the role of Alu sequences in the control of transcription in situ. The present study analyses the regulatory activity of a human Alu sequence from the AluSx family located in the second intron of the long intergenic non-coding RNA Linc00441, found in divergent orientation to the RB1 gene. We observed that the Alu sequence acts as an enhancer element based on reporter gene assays while CRISPR-Cas9 deletions of the Alu sequence in K562 cells resulted in a marked transcriptional upregulation of Linc00441 and a decrease in proliferation. Our results suggest that an intragenic Alu sequence with enhancer activity can act as a transcriptional attenuator of its host lincRNA.
Project description:Regulated gene expression controls organismal development, and variation in regulatory patterns has been implicated in complex traits. Thus accurate prediction of enhancers is important for further understanding of these processes. Genome-wide measurement of epigenetic features, such as histone modifications and occupancy by transcription factors, is improving enhancer predictions, but the contribution of these features to prediction accuracy is not known. Given the importance of the hematopoietic transcription factor TAL1 for erythroid gene activation, we predicted candidate enhancers based on genomic occupancy by TAL1 and measured their activity. Contributions of multiple features to enhancer prediction were evaluated based on the results of these and other studies.TAL1-bound DNA segments were active enhancers at a high rate both in transient transfections of cultured cells (39 of 79, or 56%) and transgenic mice (43 of 66, or 65%). The level of binding signal for TAL1 or GATA1 did not help distinguish TAL1-bound DNA segments as active versus inactive enhancers, nor did the density of regulation-related histone modifications. A meta-analysis of results from this and other studies (273 tested predicted enhancers) showed that the presence of TAL1, GATA1, EP300, SMAD1, H3K4 methylation, H3K27ac, and CAGE tags at DNase hypersensitive sites gave the most accurate predictors of enhancer activity, with a success rate over 80% and a median threefold increase in activity. Chromatin accessibility assays and the histone modifications H3K4me1 and H3K27ac were sensitive for finding enhancers, but they have high false positive rates unless transcription factor occupancy is also included.Occupancy by key transcription factors such as TAL1, GATA1, SMAD1, and EP300, along with evidence of transcription, improves the accuracy of enhancer predictions based on epigenetic features.
Project description:BACKGROUND:Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. RESULTS:We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% of the tissues, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning distal non-coding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity in the mouse liver, whereas a control set of five sequences was completely negative. CONCLUSIONS:We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific signatures that can be learned and used for the de novo prediction of enhancers.
Project description:Studies of regulatory activity and gene expression have revealed an intriguing dichotomy: There is substantial turnover in the regulatory activity of orthologous sequences between species; however, the expression level of orthologous genes is largely conserved. Understanding how distal regulatory elements, for example, enhancers, evolve and function is critical, as alterations in gene expression levels can drive the development of both complex disease and functional divergence between species. In this study, we investigated determinants of the conservation of regulatory enhancer activity for orthologous sequences across mammalian evolution. Using liver enhancers identified from genome-wide histone modification profiles in ten diverse mammalian species, we compared orthologous sequences that exhibited regulatory activity in all species (conserved-activity enhancers) to shared sequences active only in a single species (species-specific-activity enhancers). Conserved-activity enhancers have greater regulatory potential than species-specific-activity enhancers, as quantified by both the density and diversity of transcription factor binding motifs. Consistent with their greater regulatory potential, conserved-activity enhancers have greater regulatory activity in humans than species-specific-activity enhancers: They are active across more cellular contexts, and they regulate more genes than species-specific-activity enhancers. Furthermore, the genes regulated by conserved-activity enhancers are expressed in more tissues and are less tolerant of loss-of-function mutations than those targeted by species-specific-activity enhancers. These consistent results across various stages of gene regulation demonstrate that conserved-activity enhancers are more pleiotropic than their species-specific-activity counterparts. This suggests that pleiotropy is associated with the conservation of regulatory across mammalian evolution.
Project description:Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from the gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to the functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigation of epigenomic marks has indicated that enhancer elements could be enriched for certain epigenomic marks, such as, combinatorial patterns of histone modifications.Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, in this paper, we use recent state-of-the-art Deep Learning methods and develop a deep neural network (DNN)-based architecture, called EP-DNN, to predict the presence and types of enhancers in the human genome. It uses as features, the expression levels of the histone modifications at the peaks of the functional sites as well as in its adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and TSS and random non-DHS sites as non-enhancers. We perform EP-DNN predictions to quantify the validation rate for different levels of confidence in the predictions and also perform comparisons against two state-of-the-art computational models for enhancer predictions, DEEP-ENCODE and RFECS.We find that EP-DNN has superior accuracy and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task. This analysis indicates that the important histone modifications were distinct for different cell types, with some overlaps, e.g., H3K27ac was important in cell type H1 but less so in HeLa S3, while H3K4me1 was relatively important in all four cell types. We finally use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN.In this paper, we developed EP-DNN, which has high accuracy of prediction, with validation rates above 90 % for the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. Then, we developed a method to analyze a trained DNN and determine which histone modifications are important, and within that, which features proximal or distal to the enhancer site, are important.
Project description:BACKGROUND:Mammalian-wide interspersed repeats (MIRs) are the most ancient family of transposable elements (TEs) in the human genome. The deep conservation of MIRs initially suggested the possibility that they had been exapted to play functional roles for their host genomes. MIRs also happen to be the only TEs whose presence in-and-around human genes is positively correlated to tissue-specific gene expression. Similar associations of enhancer prevalence within genes and tissue-specific expression, along with MIRs' previous implication as providing regulatory sequences, suggested a possible link between MIRs and enhancers. RESULTS:To test the possibility that MIRs contribute functional enhancers to the human genome, we evaluated the relationship between MIRs and human tissue-specific enhancers in terms of genomic location, chromatin environment, regulatory function, and mechanistic attributes. This analysis revealed MIRs to be highly concentrated in enhancers of the K562 and HeLa human cell-types. Significantly more enhancers were found to be linked to MIRs than would be expected by chance, and putative MIR-derived enhancers are characterized by a chromatin environment highly similar to that of canonical enhancers. MIR-derived enhancers show strong associations with gene expression levels, tissue-specific gene expression and tissue-specific cellular functions, including a number of biological processes related to erythropoiesis. MIR-derived enhancers were found to be a rich source of transcription factor binding sites, underscoring one possible mechanistic route for the element sequences co-option as enhancers. There is also tentative evidence to suggest that MIR-enhancer function is related to the transcriptional activity of non-coding RNAs. CONCLUSIONS:Taken together, these data reveal enhancers to be an important cis-regulatory platform from which MIRs can exercise a regulatory function in the human genome and help to resolve a long-standing conundrum as to the reason for MIRs' deep evolutionary conservation.
Project description:Single-cell screens enable high-throughput functional assessment of enhancers in their endogenous genomic context. However, the design of current studies limits their application to identifying the primary gene targets of enhancers. Here, we improve the experimental and computational parameters of single-cell enhancer screens to identify the secondary gene targets of enhancers. Our analysis of >500 putative enhancers in K562 cells reveals an interwoven enhancer-driven gene regulatory network. We find that enhancers from distinct genomic loci converge to modulate the expression of common sub-modules, including the ?- and ?-globin loci, by directly regulating transcription factors. Our analysis suggests that several genetic variants associated with myeloid blood cell traits alter the activity of a distal enhancer of MYB (?140 kb away), with downstream consequences on hemoglobin genes expression and cell state. These data have implications for the understanding of enhancer-associated traits and emphasize the flexibility of controlling transcriptional systems by modifying enhancer activity.
Project description:BACKGROUND:Enhancers are DNA regulatory elements that influence gene expression. There is substantial diversity in enhancers' activity patterns: some enhancers drive expression in a single cellular context, while others are active across many. Sequence characteristics, such as transcription factor (TF) binding motifs, influence the activity patterns of regulatory sequences; however, the regulatory logic through which specific sequences drive enhancer activity patterns is poorly understood. Recent analysis of Drosophila enhancers suggested that short dinucleotide repeat motifs (DRMs) are general enhancer sequence features that drive broad regulatory activity. However, it is not known whether the regulatory role of DRMs is conserved across species. RESULTS:We performed a comprehensive analysis of the relationship between short DNA sequence patterns, including DRMs, and human enhancer activity in 38,538 enhancers across 411 different contexts. In a machine-learning framework, the occurrence patterns of short sequence motifs accurately predicted broadly active human enhancers. However, DRMs alone were weakly predictive of broad enhancer activity in humans and showed different enrichment patterns than in Drosophila. In general, GC-rich sequence motifs were significantly associated with broad enhancer activity, and consistent with this enrichment, broadly active human TFs recognize GC-rich motifs. CONCLUSIONS:Our results reveal the importance of specific sequence motifs in broadly active human enhancers, demonstrate the lack of evolutionary conservation of the role of DRMs, and provide a computational framework for investigating the logic of enhancer sequences.
Project description:Transcriptional enhancers are generally known to regulate gene transcription from afar. Their activation involves a series of changes in chromatin marks and recruitment of protein factors. These enhancers may also occur inside genes, but how many may be active in human cells and their effects on the regulation of the host gene remains unclear.We describe a novel semi-supervised method based on the relative enrichment of chromatin signals between 2 conditions to predict active enhancers. We applied this method to the tumoral K562 and the normal GM12878 cell lines to predict enhancers that are differentially active in one cell type. These predictions show enhancer-like properties according to positional distribution, correlation with gene expression and production of enhancer RNAs. Using this model, we predict 10,365 and 9777 intragenic active enhancers in K562 and GM12878, respectively, and relate the differential activation of these enhancers to expression and splicing differences of the host genes.We propose that the activation or silencing of intragenic transcriptional enhancers modulate the regulation of the host gene by means of a local change of the chromatin and the recruitment of enhancer-related factors that may interact with the RNA directly or through the interaction with RNA binding proteins. Predicted enhancers are available at http://regulatorygenomics.upf.edu/Projects/enhancers.html .