Opening up the blackbox: an interpretable deep neural network-based classifier for cell-type specific enhancer predictions.
ABSTRACT: Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from the gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to the functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigation of epigenomic marks has indicated that enhancer elements could be enriched for certain epigenomic marks, such as combinatorial patterns of histone modifications. Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, in this paper, we use recent state-of-the-art Deep Learning methods and develop a deep neural network (DNN)-based architecture, called EP-DNN, to predict the presence and types of enhancers in the human genome. It uses as features the expression levels of the histone modifications at the peaks of the functional sites as well as in their adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and TSS and random non-DHS sites as non-enhancers. We perform EP-DNN predictions to quantify the validation rate for different levels of confidence in the predictions and also perform comparisons against two state-of-the-art computational models for enhancer predictions, DEEP-ENCODE and RFECS. We find that EP-DNN has superior accuracy and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task.
This analysis indicates that the important histone modifications were distinct for different cell types, with some overlaps, e.g., H3K27ac was important in cell type H1 but less so in HeLa S3, while H3K4me1 was relatively important in all four cell types. We finally use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN. In this paper, we developed EP-DNN, which has high accuracy of prediction, with validation rates above 90% for the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. Then, we developed a method to analyze a trained DNN and determine which histone modifications are important, and within that, which features, proximal or distal to the enhancer site, are important.
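The abstract does not say how feature importance is computed for the trained network. As a rough illustration of the general idea only, the sketch below uses permutation importance, a generic technique swapped in here and not confirmed as the authors' method. It shuffles one input column at a time and measures the drop in accuracy. The function name, the toy data, and the stand-in classifier `toy_predict` are all hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    `predict` is any callable mapping a feature matrix to class labels;
    here it stands in for a trained classifier (an assumption).
    """
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the label.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

# Toy data: three hypothetical features; only feature 0 determines the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

def toy_predict(features):
    # Hypothetical stand-in for a trained model that relies on feature 0 only.
    return (features[:, 0] > 0).astype(int)

scores = permutation_importance(toy_predict, X, y)
print(scores)
```

Features whose shuffling barely changes accuracy are candidates for removal, which matches the abstract's stated use of importance scores to shrink the input set before retraining.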
Project description:We present EP-DNN, a protocol for predicting enhancers based on chromatin features, in different cell types. Specifically, we use a deep neural network (DNN)-based architecture to extract enhancer signatures in a representative human embryonic stem cell type (H1) and a differentiated lung cell type (IMR90). We train EP-DNN using p300 binding sites, as enhancers, and TSS and random non-DHS sites, as non-enhancers. We perform same-cell and cross-cell predictions to quantify the validation rate and compare against two state-of-the-art methods, DEEP-ENCODE and RFECS. We find that EP-DNN has superior accuracy with a validation rate of 91.6%, relative to 85.3% for DEEP-ENCODE and 85.5% for RFECS, for a given number of enhancer predictions and also scales better for a larger number of enhancer predictions. Moreover, our H1 → IMR90 predictions turn out to be more accurate than IMR90 → IMR90, potentially because H1 exhibits a richer signature set and our EP-DNN model is expressive enough to extract these subtleties. Our work shows how to leverage the full expressivity of deep learning models, using multiple hidden layers, while avoiding overfitting on the training data. We also lay the foundation for exploration of cross-cell enhancer predictions, potentially reducing the need for expensive experimentation.
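The abstract reports validation rates "for a given number of enhancer predictions" without defining the metric. A minimal sketch, assuming the validation rate is the fraction of the top-k highest-scoring predictions that are supported by independent evidence, could look like this. The function name, the scores, and the `validated` flags are invented for illustration.

```python
import numpy as np

def validation_rate(scores, validated, top_k):
    """Fraction of the top_k highest-scoring predictions that are validated.

    Assumed definition; the source abstract does not spell out the metric.
    """
    order = np.argsort(scores)[::-1][:top_k]  # indices of the top_k scores
    return float(np.mean(validated[order]))

# Hypothetical prediction scores and validation flags for six candidate sites.
scores = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.10])
validated = np.array([True, True, False, True, False, False])

rate_top3 = validation_rate(scores, validated, 3)
rate_top4 = validation_rate(scores, validated, 4)
print(rate_top3, rate_top4)
```

Sweeping `top_k` gives a curve of validation rate against number of predictions, which is one way to compare how methods scale as more enhancers are called.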
Project description:Recent epigenomic studies have predicted thousands of potential enhancers in the human genome. However, there has not been systematic characterization of target promoters for these potential enhancers. Using H3K4me2 as a mark for active enhancers, we identified genome-wide EP interactions in human CD4(+) T cells. Among the 6 520 long-distance chromatin interactions, we identify 2 067 enhancers that interact with 1 619 promoters and enhance their expression. These enhancers exist in accessible chromatin regions and are associated with various histone modifications and polymerase II binding. The promoters with interacting enhancers are expressed at higher levels than those without interacting enhancers, and their expression levels are positively correlated with the number of interacting enhancers. Interestingly, interacting promoters are co-expressed in a tissue-specific manner. We also find that chromosomes are organized into multiple levels of interacting domains. Our results define a global view of EP interactions and provide a data set to further understand mechanisms of enhancer targeting and long-range chromatin organization. The Gene Expression Omnibus accession number for the raw and analyzed chromatin interaction data is GSE32677.
Project description:Background: The data deluge can leverage sophisticated ML techniques for functionally annotating the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model's complexity. In our system AIKYATAN, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given their histone modifications' combinatorial signatures. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes). Results: We develop a suite of ML models, under the banner AIKYATAN, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, that deep learning approaches have a computational advantage. Plus, convolutional neural networks (CNN) provide the best-in-class accuracy, superior to the vanilla variant. With the human embryonic cell line H1, CNN achieves an accuracy of 97.9% and an order of magnitude lower runtime than the kernel SVM. Running on a GPU, the training time is sped up 21x and 30x (over CPU) for DNN and CNN, respectively. Finally, our CNN model enjoys superior prediction performance vis-à-vis the competition. Specifically, AIKYATAN-CNN achieved 40% higher validation rate versus CSIANN and the same accuracy as RFECS. Conclusions: Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can scale with increasing data volumes and diversity. In addition, a subset of these datasets have image-like properties and benefit from spatial pooling of features.
Our AIKYATAN suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.
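To make "spatial pooling of features" concrete, here is a toy sketch of a single 1D convolution filter followed by ReLU and max pooling over genomic bins. It is not the AIKYATAN architecture. The input layout (bins by histone marks), the filter weights, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def conv1d_relu_maxpool(signal, kernel, pool=2):
    """Apply one 1D filter along the bin axis, then ReLU, then max pooling.

    signal: array of shape (bins, marks), e.g. binned histone-mark levels
    kernel: array of shape (width, marks)
    """
    width = kernel.shape[0]
    n = signal.shape[0] - width + 1  # "valid" output length
    conv = np.array([np.sum(signal[i:i + width] * kernel) for i in range(n)])
    act = np.maximum(conv, 0.0)      # ReLU activation
    usable = (n // pool) * pool      # drop a trailing remainder, if any
    return act[:usable].reshape(-1, pool).max(axis=1)

# Hypothetical input: 8 bins, 2 marks; mark 0 has a peak, mark 1 is flat.
signal = np.zeros((8, 2))
signal[:, 0] = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])  # sums mark 0 over 3 bins

pooled = conv1d_relu_maxpool(signal, kernel, pool=2)
print(pooled)
```

Pooling keeps the strongest local response in each pair of positions, so a peak produces a similar output even if it shifts by a bin.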
Project description:The histone modification state of genomic regions is hypothesized to reflect the regulatory activity of the underlying genomic DNA. Based on this hypothesis, the ENCODE Project Consortium measured the status of multiple histone modifications across the genome in several cell types and used these data to segment the genome into regions with different predicted regulatory activities. We measured the cis-regulatory activity of more than 2000 of these predictions in the K562 leukemia cell line. We tested genomic segments predicted to be Enhancers, Weak Enhancers, or Repressed elements in K562 cells, along with other sequences predicted to be Enhancers specific to the H1 human embryonic stem cell line (H1-hESC). Both Enhancer and Weak Enhancer sequences in K562 cells were more active than negative controls, although surprisingly, Weak Enhancer segmentations drove expression higher than did Enhancer segmentations. Lower levels of the covalent histone modifications H3K36me3 and H3K27ac, thought to mark active enhancers and transcribed gene bodies, associate with higher expression and partly explain the higher activity of Weak Enhancers over Enhancer predictions. While DNase I hypersensitivity (HS) is a good predictor of active sequences in our assay, transcription factor (TF) binding models need to be included in order to accurately identify highly expressed sequences. Overall, our results show that a significant fraction (~26%) of the ENCODE enhancer predictions have regulatory activity, suggesting that histone modification states can reflect the cis-regulatory activity of sequences in the genome, but that specific sequence preferences, such as TF-binding sites, are the causal determinants of cis-regulatory activity.
Project description:Enhancer mapping has been greatly facilitated by various genomic marks associated with it. However, little is available in our toolbox to link enhancers with their target promoters, hampering mechanistic understanding of enhancer-promoter (EP) interaction. We develop and characterize multiple genomic features for distinguishing true EP pairs from noninteracting pairs. We integrate these features into a probabilistic predictor for EP interactions. Multiple validation experiments demonstrate a significant improvement over state-of-the-art approaches. Systematic analyses of EP interactions across 12 cell types reveal several global features of EP interactions: (i) a larger fraction of EP interactions are cell type specific than enhancers; (ii) promoters controlled by multiple enhancers have higher tissue specificity, but the regulating enhancers are less conserved; (iii) cohesin plays a role in mediating tissue-specific EP interactions via chromatin looping in a CTCF-independent manner. Our approach presents a systematic and effective strategy to decipher the mechanisms underlying EP communication.
Project description:Combinatorial histone modification is an important epigenetic mechanism for regulating chromatin state and gene expression. Given the rapid accumulation of genome-wide histone modification maps, there is a pressing need for computational methods capable of joint analysis of multiple maps to reveal combinatorial modification patterns. We present the Semi-Supervised Coherent and Shifted Bicluster Identification algorithm (SS-CoSBI). It uses prior knowledge of combinatorial histone modifications to guide the biclustering process. Specifically, co-occurrence frequencies of histone modifications characterized by mass spectrometry are used as probabilistic priors to adjust the similarity measure in the biclustering process. Using a high-quality set of transcriptional enhancers and associated histone marks, we demonstrate that SS-CoSBI outperforms its predecessor by finding histone modification and genomic locus biclusters with higher enrichment of enhancers. We apply SS-CoSBI to identify multiple cell-type-specific combinatorial histone modification states associated with human enhancers. We show enhancer histone modification states are correlated with the expression of nearby genes. Further, we find that enhancers with the histone mark H3K4me1 have higher levels of DNA methylation and decreased expression of nearby genes, suggesting a functional interplay between H3K4me1 and DNA methylation that can modulate enhancer activities. The analysis presented here provides a systematic characterization of combinatorial histone codes of enhancers across three human cell types using a novel semi-supervised biclustering algorithm. As epigenomic maps accumulate, SS-CoSBI will become increasingly useful for understanding combinatorial chromatin modifications by taking advantage of existing knowledge. SS-CoSBI is implemented in C. The source code is freely available at http://www.healthcare.uiowa.edu/labs/tan/SS-CoSBI.gz.
Project description:Histone H3K4me1/2 methyltransferases MLL3/MLL4 and H3K27 acetyltransferases CBP/p300 are major enhancer epigenomic writers. To understand how these epigenomic writers orchestrate enhancer landscapes in cell differentiation, we have profiled genomic binding of MLL4, CBP, lineage-determining transcription factors (EBF2, C/EBPβ, C/EBPα, PPARγ), coactivator MED1, RNA polymerase II, as well as epigenome (H3K4me1/2/3, H3K9me2, H3K27me3, H3K36me3, H3K27ac), transcriptome and chromatin opening during adipogenesis of immortalized preadipocytes derived from mouse brown adipose tissue (BAT). We show that MLL4 and CBP drive the dynamic enhancer epigenome, which correlates with the dynamic transcriptome. MLL3/MLL4 are required for CBP/p300 binding on enhancers activated during adipogenesis. Further, we show that MLL4 and CBP identify super-enhancers (SEs) of adipogenesis and that MLL3/MLL4 are required for SE formation. Finally, in brown adipocytes differentiated in culture, MLL4 identifies primed SEs of genes fully activated in BAT such as Ucp1. Comparison of MLL4-defined SEs in brown and white adipogenesis identifies brown-specific SE-associated genes that could be involved in BAT functions. These results establish MLL3/MLL4 and CBP/p300 as master enhancer epigenomic writers and suggest that enhancer-priming by MLL3/MLL4 followed by enhancer-activation by CBP/p300 sequentially shape dynamic enhancer landscapes during cell differentiation. Our data also provide a rich resource for understanding epigenomic regulation of brown adipogenesis.
Project description:Short non-coding transcripts can be transcribed from distant-acting transcriptional enhancer loci, but the prevalence of such enhancer RNAs (eRNAs) within the transcriptome, and the association of eRNA expression with tissue-specific enhancer activity in vivo remain poorly understood. Here, we investigated the expression dynamics of tissue-specific non-coding RNAs in embryonic mouse tissues via deep RNA sequencing. Overall, approximately 80% of validated in vivo enhancers show tissue-specific RNA expression that correlates with tissue-specific enhancer activity. Globally, we identified thousands of tissue-specifically transcribed non-coding regions (TSTRs) displaying various genomic hallmarks of bona fide enhancers. In transgenic mouse reporter assays, over half of tested TSTRs functioned as enhancers with reproducible activity in the predicted tissue. Together, our results demonstrate that tissue-specific eRNA expression is a common feature of in vivo enhancers, as well as a major source of extragenic transcription, and that eRNA expression signatures can be used to predict tissue-specific enhancers independent of known epigenomic enhancer marks.
Project description:Epigenomic signatures from histone marks and transcription factor (TF)-binding sites have been used to annotate putative gene regulatory regions. However, a direct comparison of these diverse annotations is missing, and it is unclear how genetic variation within these annotations affects gene expression. Here, we compare five widely used annotations of active regulatory elements that represent high densities of one or more relevant epigenomic marks ("super" and "typical" (nonsuper) enhancers, stretch enhancers, high-occupancy target (HOT) regions, and broad domains) across the four matched human cell types for which they are available. We observe that stretch and super enhancers cover cell type-specific enhancer "chromatin states," whereas HOT regions and broad domains comprise more ubiquitous promoter states. Expression quantitative trait loci (eQTL) in stretch enhancers have significantly smaller effect sizes compared to those in HOT regions. Strikingly, chromatin accessibility QTL in stretch enhancers have significantly larger effect sizes compared to those in HOT regions. These observations suggest that stretch enhancers could harbor genetically primed chromatin to enable changes in TF binding, possibly to drive cell type-specific responses to environmental stimuli. Our results suggest that current eQTL studies are relatively underpowered or could lack the appropriate environmental context to detect genetic effects in the most cell type-specific "regulatory annotations," which likely contributes to infrequent colocalization of eQTL with genome-wide association study signals.
Project description:Identifying enhancers regulating gene expression remains an important and challenging task. While recent sequencing-based methods provide epigenomic characteristics that correlate well with enhancer activity, it remains onerous to comprehensively identify all enhancers across development. Here we introduce a computational framework to identify tissue-specific enhancers evolving under purifying selection. First, we incorporate high-confidence binding site predictions with target gene functional enrichment analysis to identify transcription factors (TFs) likely functioning in a particular context. We then search the genome for clusters of binding sites for these TFs, overcoming previous constraints associated with biased manual curation of TFs or enhancers. Applying our method to the placenta, we find 33 known and implicate 17 novel TFs in placental function, and discover 2,216 putative placenta enhancers. Using luciferase reporter assays, 31/36 (86%) tested candidates drive activity in placental cells. Our predictions agree well with recent epigenomic data in human and mouse, yet over half our loci, including 7/8 (87%) tested regions, are novel. Finally, we establish that our method is generalizable by applying it to 5 additional tissues: heart, pancreas, blood vessel, bone marrow, and liver.
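The abstract mentions searching the genome for clusters of TF binding sites but gives no algorithmic detail. One simple way to picture such a search, offered as an assumption and not as the authors' framework, is a sliding window over sorted site positions that reports regions holding at least a minimum number of sites. The function name, window size, and threshold below are hypothetical.

```python
def find_site_clusters(positions, window=500, min_sites=3):
    """Return (start, end) spans where at least min_sites binding sites
    fall within `window` base pairs of each other.

    Overlapping qualifying spans are merged, so a reported cluster can be
    wider than `window`. Parameter values are illustrative assumptions.
    """
    positions = sorted(positions)
    clusters = []
    start = 0
    for end in range(len(positions)):
        # Shrink the window from the left until it spans at most `window` bp.
        while positions[end] - positions[start] > window:
            start += 1
        if end - start + 1 >= min_sites:
            span = (positions[start], positions[end])
            if clusters and span[0] <= clusters[-1][1]:
                # Extend the previous cluster when the spans overlap.
                clusters[-1] = (clusters[-1][0], span[1])
            else:
                clusters.append(span)
    return clusters

# Hypothetical predicted binding-site coordinates on one chromosome.
sites = [100, 150, 300, 5000, 9000, 9100, 9200, 9450]
clusters = find_site_clusters(sites, window=500, min_sites=3)
print(clusters)
```

The two-pointer scan runs in linear time after sorting, which matters when the input is a genome-wide list of predicted sites.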