EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm.
ABSTRACT: We present EP-DNN, a protocol for predicting enhancers based on chromatin features, in different cell types. Specifically, we use a deep neural network (DNN)-based architecture to extract enhancer signatures in a representative human embryonic stem cell type (H1) and a differentiated lung cell type (IMR90). We train EP-DNN using p300 binding sites, as enhancers, and TSS and random non-DHS sites, as non-enhancers. We perform same-cell and cross-cell predictions to quantify the validation rate and compare against two state-of-the-art methods, DEEP-ENCODE and RFECS. We find that EP-DNN has superior accuracy with a validation rate of 91.6%, relative to 85.3% for DEEP-ENCODE and 85.5% for RFECS, for a given number of enhancer predictions and also scales better for a larger number of enhancer predictions. Moreover, our H1 ? IMR90 predictions turn out to be more accurate than IMR90 ? IMR90, potentially because H1 exhibits a richer signature set and our EP-DNN model is expressive enough to extract these subtleties. Our work shows how to leverage the full expressivity of deep learning models, using multiple hidden layers, while avoiding overfitting on the training data. We also lay the foundation for exploration of cross-cell enhancer predictions, potentially reducing the need for expensive experimentation.
Project description:Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from the gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to the functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigation of epigenomic marks has indicated that enhancer elements could be enriched for certain epigenomic marks, such as, combinatorial patterns of histone modifications.Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, in this paper, we use recent state-of-the-art Deep Learning methods and develop a deep neural network (DNN)-based architecture, called EP-DNN, to predict the presence and types of enhancers in the human genome. It uses as features, the expression levels of the histone modifications at the peaks of the functional sites as well as in its adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and TSS and random non-DHS sites as non-enhancers. We perform EP-DNN predictions to quantify the validation rate for different levels of confidence in the predictions and also perform comparisons against two state-of-the-art computational models for enhancer predictions, DEEP-ENCODE and RFECS.We find that EP-DNN has superior accuracy and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task. This analysis indicates that the important histone modifications were distinct for different cell types, with some overlaps, e.g., H3K27ac was important in cell type H1 but less so in HeLa S3, while H3K4me1 was relatively important in all four cell types. We finally use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN.In this paper, we developed EP-DNN, which has high accuracy of prediction, with validation rates above 90 % for the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. Then, we developed a method to analyze a trained DNN and determine which histone modifications are important, and within that, which features proximal or distal to the enhancer site, are important.
Project description:Transcriptional enhancers play critical roles in regulation of gene expression, but their identification in the eukaryotic genome has been challenging. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of cell types or chromatin marks have previously been investigated for this purpose, leaving the question unanswered whether there exists an optimal set of histone modifications for enhancer prediction in different cell types. Here, we address this issue by exploring genome-wide profiles of 24 histone modifications in two distinct human cell types, embryonic stem cells and lung fibroblasts. We developed a Random-Forest based algorithm, RFECS (Random Forest based Enhancer identification from Chromatin States) to integrate histone modification profiles for identification of enhancers, and used it to identify enhancers in a number of cell-types. We show that RFECS not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify the most informative and robust set of three chromatin marks for enhancer prediction.
Project description:<b>Background:</b>The data deluge can leverage sophisticated ML techniques for functionally annotating the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model's complexity. In our system AIKYATAN, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given their histone modifications' combinatorial signatures. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes).<br><br><b>Results:</b>We develop a suite of ML models, under the banner AIKYATAN, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, deep learning approaches have a computational advantage. Plus, convolutional neural networks (CNN) provide the best-in-class accuracy, superior to the vanilla variant. With the human embryonic cell line H1, CNN achieves an accuracy of 97.9% and an order of magnitude lower runtime than the kernel SVM. Running on a GPU, the training time is sped up 21x and 30x (over CPU) for DNN and CNN, respectively. Finally, our CNN model enjoys superior prediction performance vis-'a-vis the competition. Specifically, AIKYATAN-CNN achieved 40% higher validation rate versus CSIANN and the same accuracy as RFECS.<br><br><b>Conclusions:</b>Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can scale with increasing data volumes and diversity. In addition, a subset of these datasets have image-like properties and benefit from spatial pooling of features. Our AIKYATAN suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.
Project description:The histone modification state of genomic regions is hypothesized to reflect the regulatory activity of the underlying genomic DNA. Based on this hypothesis, the ENCODE Project Consortium measured the status of multiple histone modifications across the genome in several cell types and used these data to segment the genome into regions with different predicted regulatory activities. We measured the cis-regulatory activity of more than 2000 of these predictions in the K562 leukemia cell line. We tested genomic segments predicted to be Enhancers, Weak Enhancers, or Repressed elements in K562 cells, along with other sequences predicted to be Enhancers specific to the H1 human embryonic stem cell line (H1-hESC). Both Enhancer and Weak Enhancer sequences in K562 cells were more active than negative controls, although surprisingly, Weak Enhancer segmentations drove expression higher than did Enhancer segmentations. Lower levels of the covalent histone modifications H3K36me3 and H3K27ac, thought to mark active enhancers and transcribed gene bodies, associate with higher expression and partly explain the higher activity of Weak Enhancers over Enhancer predictions. While DNase I hypersensitivity (HS) is a good predictor of active sequences in our assay, transcription factor (TF) binding models need to be included in order to accurately identify highly expressed sequences. Overall, our results show that a significant fraction (-26%) of the ENCODE enhancer predictions have regulatory activity, suggesting that histone modification states can reflect the cis-regulatory activity of sequences in the genome, but that specific sequence preferences, such as TF-binding sites, are the causal determinants of cis-regulatory activity.
Project description:BACKGROUND:The identification of enhancers is a challenging task. Various types of epigenetic information including histone modification have been utilized in the construction of enhancer prediction models based on a diverse panel of machine learning schemes. However, DNA methylation profiles generated from the whole genome bisulfite sequencing (WGBS) have not been fully explored for their potential in enhancer prediction despite the fact that low methylated regions (LMRs) have been implied to be distal active regulatory regions. METHOD:In this work, we propose a prediction framework, LMethyR-SVM, using LMRs identified from cell-type-specific WGBS DNA methylation profiles and a weighted support vector machine learning framework. In LMethyR-SVM, the set of cell-type-specific LMRs is further divided into three sets: reliable positive, like positive and likely negative, according to their resemblance to a small set of experimentally validated enhancers in the VISTA database based on an estimated non-parametric density distribution. Then, the prediction model is obtained by solving a weighted support vector machine. RESULTS:We demonstrate the performance of LMethyR-SVM by using the WGBS DNA methylation profiles derived from the human embryonic stem cell type (H1) and the fetal lung fibroblast cell type (IMR90). The predicted enhancers are highly conserved with a reasonable validation rate based on a set of commonly used positive markers including transcription factors, p300 binding and DNase-I hypersensitive sites. In addition, we show evidence that the large fraction of the LMethyR-SVM predicted enhancers are not predicted by ChromHMM in H1 cell type and they are more enriched for the FANTOM5 enhancers. CONCLUSION:Our work suggests that low methylated regions detected from the WGBS data are useful as complementary resources to histone modification marks in developing models for the prediction of cell-type-specific enhancers.
Project description:Enhancer mapping has been greatly facilitated by various genomic marks associated with it. However, little is available in our toolbox to link enhancers with their target promoters, hampering mechanistic understanding of enhancer-promoter (EP) interaction. We develop and characterize multiple genomic features for distinguishing true EP pairs from noninteracting pairs. We integrate these features into a probabilistic predictor for EP interactions. Multiple validation experiments demonstrate a significant improvement over state-of-the-art approaches. Systematic analyses of EP interactions across 12 cell types reveal several global features of EP interactions: (i) a larger fraction of EP interactions are cell type specific than enhancers; (ii) promoters controlled by multiple enhancers have higher tissue specificity, but the regulating enhancers are less conserved; (iii) cohesin plays a role in mediating tissue-specific EP interactions via chromatin looping in a CTCF-independent manner. Our approach presents a systematic and effective strategy to decipher the mechanisms underlying EP communication.
Project description:Deciphering the genomic regulatory code of enhancers is a key challenge in biology because this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of noncoding genome variation and empower the generation of cell type-specific drivers for gene therapy. Here, we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study owing to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We show the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of <i>IRF4</i> Next, we exploit DeepMEL to analyze enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species, where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimize candidate enhancers and to prioritize enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.
Project description:Transcriptional enhancers are non-coding segments of DNA that play a central role in the spatiotemporal regulation of gene expression programs. However, systematically and precisely predicting enhancers remain a major challenge. Although existing methods have achieved some success in enhancer prediction, they still suffer from many issues. We developed a deep learning-based algorithmic framework named PEDLA (https://github.com/wenjiegroup/PEDLA), which can directly learn an enhancer predictor from massively heterogeneous data and generalize in ways that are mostly consistent across various cell types/tissues. We first trained PEDLA with 1,114-dimensional heterogeneous features in H1 cells, and demonstrated that PEDLA framework integrates diverse heterogeneous features and gives state-of-the-art performance relative to five existing methods for enhancer prediction. We further extended PEDLA to iteratively learn from 22 training cell types/tissues. Our results showed that PEDLA manifested superior performance consistency in both training and independent test sets. On average, PEDLA achieved 95.0% accuracy and a 96.8% geometric mean (GM) of sensitivity and specificity across 22 training cell types/tissues, as well as 95.7% accuracy and a 96.8% GM across 20 independent test cell types/tissues. Together, our work illustrates the power of harnessing state-of-the-art deep learning techniques to consistently identify regulatory elements at a genome-wide scale from massively heterogeneous data across diverse cell types/tissues.
Project description:Enhancers physically interact with transcriptional promoters, looping over distances that can span multiple regulatory elements. Given that enhancer-promoter (EP) interactions generally occur via common protein complexes, it is unclear whether EP pairing is predominantly deterministic or proximity guided. Here, we present cross-organismic evidence suggesting that most EP pairs are compatible, largely determined by physical proximity rather than specific interactions. By reanalyzing transcriptome datasets, we find that the transcription of gene neighbors is correlated over distances that scale with genome size. We experimentally show that nonspecific EP interactions can explain such correlation, and that EP distance acts as a scaling factor for the transcriptional influence of an enhancer. We propose that enhancer sharing is commonplace among eukaryotes, and that EP distance is an important layer of information in gene regulation.
Project description:Deciphering the code of cis-regulatory element (CRE) is one of the core issues of today's biology. Enhancers are distal CREs and play significant roles in gene transcriptional regulation. Although identifications of enhancer locations across the whole genome [discriminative enhancer predictions (DEP)] is necessary, it is more important to predict in which specific cell or tissue types, they will be activated and functional [tissue-specific enhancer predictions (TSEP)]. Although existing deep learning models achieved great successes in DEP, they cannot be directly employed in TSEP because a specific cell or tissue type only has a limited number of available enhancer samples for training. Here, we first adopted a reported deep learning architecture and then developed a novel training strategy named "pretraining-retraining strategy" (PRS) for TSEP by decomposing the whole training process into two successive stages: a pretraining stage is designed to train with the whole enhancer data for performing DEP, and a retraining strategy is then designed to train with tissue-specific enhancer samples based on the trained pretraining model for making TSEP. As a result, PRS is found to be valid for DEP with an AUC of 0.922 and a GM (geometric mean) of 0.696, when testing on a larger-scale FANTOM5 enhancer dataset via a five-fold cross-validation. Interestingly, based on the trained pretraining model, a new finding is that only additional twenty epochs are needed to complete the retraining process on testing 23 specific tissues or cell lines. For TSEP tasks, PRS achieved a mean GM of 0.806 which is significantly higher than 0.528 of gkm-SVM, an existing mainstream method for CRE predictions. Notably, PRS is further proven superior to other two state-of-the-art methods: DEEP and BiRen. In summary, PRS has employed useful ideas from the domain of transfer learning and is a reliable method for TSEPs.