Local epigenomic state cannot discriminate interacting and non-interacting enhancer-promoter pairs with high accuracy.
ABSTRACT: We report an experimental design issue in recent machine learning formulations of the enhancer-promoter interaction problem, arising from the fact that many enhancer-promoter pairs share features. Cross-fold validation schemes that do not place these feature-sharing enhancer-promoter pairs together in a single test set report high accuracy, which in fact reflects high training set accuracy and a failure to properly evaluate generalization performance. Cross-fold validation schemes that properly segregate pairs with shared features show a markedly reduced ability to predict enhancer-promoter interactions from epigenomic state. Parameter scans with multiple models indicate that local epigenomic features of individual pairs of enhancers and promoters cannot distinguish those pairs that interact from those that do not with high accuracy, suggesting that additional information is required to predict enhancer-promoter interactions.
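The segregation of feature-sharing pairs described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure. It assumes that "sharing features" means sharing an enhancer or a promoter identifier, and the function names `group_pairs` and `grouped_folds` are hypothetical. Pairs linked through a common enhancer or promoter, directly or through a chain of other pairs, are merged into one group with a union-find structure, and each group is then assigned whole to a single fold.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Label each (enhancer, promoter) pair with a connected-component id.

    Assumption: two pairs belong together when they share an enhancer or
    a promoter, directly or through a chain of other pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Tag identifiers so an enhancer and a promoter with the same name
    # are never confused with each other.
    for enh, prom in pairs:
        union(("E", enh), ("P", prom))
    return [find(("E", enh)) for enh, _ in pairs]


def grouped_folds(pairs, n_folds):
    """Assign whole groups to folds: largest group first, into the smallest fold."""
    members = defaultdict(list)
    for idx, group in enumerate(group_pairs(pairs)):
        members[group].append(idx)
    folds = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: -len(members[g])):
        min(folds, key=len).extend(members[group])
    return folds


if __name__ == "__main__":
    # Hypothetical toy data: e1 and p2 are each shared by two pairs.
    toy_pairs = [("e1", "p1"), ("e1", "p2"), ("e2", "p2"),
                 ("e3", "p3"), ("e4", "p3"), ("e5", "p4")]
    for fold_id, fold in enumerate(grouped_folds(toy_pairs, n_folds=2)):
        print(fold_id, [toy_pairs[i] for i in fold])
```

With folds built this way, no enhancer or promoter appears on both sides of a train/test split, so test accuracy is not inflated by features already seen during training. Fold sizes may be uneven when a few groups are very large.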
Project description:Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions is useful for understanding disease regulation and for discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as using only epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs into training, tuning, and test data, which has recently been pointed out to be problematic: because the same or overlapping enhancers (and promoters) appear in multiple enhancer-promoter pairs in EPI data, such random splitting does not produce independent training, tuning, and test data, resulting in model over-fitting and over-estimation of predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
Project description:We present a novel partner-specific protein-protein interaction site prediction method called PAIRpred. Unlike most existing machine learning binding site prediction methods, PAIRpred uses information from both proteins in a protein complex to predict pairs of interacting residues from the two proteins. PAIRpred captures sequence and structure information about residue pairs through pairwise kernels that are used for training a support vector machine classifier. As a result, PAIRpred presents a more detailed model of protein binding, and offers state of the art accuracy in predicting binding sites at the protein level as well as inter-protein residue contacts at the complex level. We demonstrate PAIRpred's performance on Docking Benchmark 4.0 and recent CAPRI targets. We present a detailed performance analysis outlining the contribution of different sequence and structure features, together with a comparison to a variety of existing interface prediction techniques. We have also studied the impact of binding-associated conformational change on prediction accuracy and found PAIRpred to be more robust to such structural changes than existing schemes. As an illustration of the potential applications of PAIRpred, we provide a case study in which PAIRpred is used to analyze the nature and specificity of the interface in the interaction of human ISG15 protein with NS1 protein from influenza A virus. Python code for PAIRpred is available at http://combi.cs.colostate.edu/supplements/pairpred/.
Project description:Discriminating the gene target of a distal regulatory element from other nearby transcribed genes is a challenging problem with the potential to illuminate the causal underpinnings of complex diseases. We present TargetFinder, a computational method that reconstructs regulatory landscapes from diverse features along the genome. The resulting models accurately predict individual enhancer-promoter interactions across multiple cell lines with a false discovery rate up to 15 times smaller than that obtained using the closest gene. By evaluating the genomic features driving this accuracy, we uncover interactions between structural proteins, transcription factors, epigenetic modifications, and transcription that together distinguish interacting from non-interacting enhancer-promoter pairs. Most of this signature is not proximal to the enhancers and promoters but instead decorates the looping DNA. We conclude that complex but consistent combinations of marks on the one-dimensional genome encode the three-dimensional structure of fine-scale regulatory interactions.
Project description:Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from the gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to the functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigation of epigenomic marks has indicated that enhancer elements could be enriched for certain epigenomic marks, such as combinatorial patterns of histone modifications. Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, in this paper, we use recent state-of-the-art Deep Learning methods and develop a deep neural network (DNN)-based architecture, called EP-DNN, to predict the presence and types of enhancers in the human genome. It uses as features the expression levels of the histone modifications at the peaks of the functional sites as well as in their adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and TSS and random non-DHS sites as non-enhancers. We perform EP-DNN predictions to quantify the validation rate for different levels of confidence in the predictions and also perform comparisons against two state-of-the-art computational models for enhancer predictions, DEEP-ENCODE and RFECS. We find that EP-DNN has superior accuracy and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task.
This analysis indicates that the important histone modifications were distinct for different cell types, with some overlaps, e.g., H3K27ac was important in cell type H1 but less so in HeLa S3, while H3K4me1 was relatively important in all four cell types. We finally use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN. In this paper, we developed EP-DNN, which has high accuracy of prediction, with validation rates above 90% for the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. Then, we developed a method to analyze a trained DNN and determine which histone modifications are important, and within that, which features, proximal or distal to the enhancer site, are important.
Project description:BACKGROUND:Glioma stem cells (GSCs) are a subpopulation of stem-like cells that contribute to glioblastoma (GBM) aggressiveness, recurrence, and resistance to radiation and chemotherapy. Therapeutically targeting the GSC population may improve patient survival, but unique vulnerabilities need to be identified. RESULTS:We isolate GSCs from well-characterized GBM patient-derived xenografts (PDX), characterize their stemness properties using immunofluorescence staining, profile their epigenome including 5mC, 5hmC, 5fC/5caC, and two enhancer marks, and define their transcriptome. Fetal brain-derived neural stem/progenitor cells are used as a comparison to define potential unique and common molecular features between these different brain-derived cells with stem properties. Our integrative study reveals that abnormal expression of ten-eleven-translocation (TET) family members correlates with global levels of 5mC and 5fC/5caC and may be responsible for the distinct levels of these marks between glioma and neural stem cells. Heterogenous transcriptome and epigenome signatures among GSCs converge on several genes and pathways, including DNA damage response and cell proliferation, which are highly correlated with TET expression. Distinct enhancer landscapes are also strongly associated with differential gene regulation between glioma and neural stem cells; they exhibit unique co-localization patterns with DNA epigenetic mark switching events. Upon differentiation, glioma and neural stem cells exhibit distinct responses with regard to TET expression and DNA mark changes in the genome and GSCs fail to properly remodel their epigenome. CONCLUSIONS:Our integrative epigenomic and transcriptomic characterization reveals fundamentally distinct yet potentially targetable biologic features of GSCs that result from their distinct epigenomic landscapes.
Project description:Mammalian gene regulation is often mediated by distal enhancer elements, in particular for tissue-specific and developmental genes. Computational identification of enhancers is difficult because they do not exhibit a clear location preference relative to their target gene and also because they lack clearly distinguishing genomic features. This represents a major challenge in deciphering transcriptional regulation. Recent ChIP-seq based genome-wide investigations of epigenomic modifications have revealed that enhancers are often enriched for certain epigenomic marks. Here we utilize the epigenomic data in human heart tissue along with validated human heart enhancers to develop a Support Vector Machine (SVM) model of cardiac enhancers. Cross-validation classification accuracy of our model was 84% and 92% on positive and negative sets respectively, with ROC AUC = 0.92. More importantly, while P300 binding has been used as the gold standard for enhancers, our model can distinguish P300-bound validated enhancers from other P300-bound regions that failed to exhibit enhancer activity in transgenic mice. While GWAS studies reveal polymorphic regions associated with certain phenotypes, they do not immediately provide causality. Next, we hypothesized that genomic regions containing a GWAS SNP associated with a cardiac phenotype might contain another SNP in a cardiac enhancer, which presumably mediates the phenotype. Starting with a comprehensive set of SNPs associated with cardiac phenotypes in GWAS studies, we scored other SNPs in LD with the GWAS SNP according to their probability of being an enhancer and chose the one with the best score in the LD as the enhancer. We found that our predicted enhancers are enriched for known cardiac transcriptional regulator motifs and are likely to regulate the nearby gene. Importantly, these tendencies are more favorable for the predicted enhancers compared with an approach that uses P300 binding as a marker of enhancer activity.
Project description:More than 98% of the human genome does not encode proteins, and the vast majority of the noncoding regions have not been well studied. Some of these regions contain enhancers and functional non-coding RNAs. Previous research suggested that enhancer transcripts could be potent independent indicators of enhancer activity, and some enhancer lncRNAs (elncRNAs) have been proven to play critical roles in gene regulation. Here, we identified enhancer-promoter interactions from high-throughput chromosome conformation capture (Hi-C) data. We found that elncRNAs were highly enriched surrounding chromatin loop anchors. Additionally, the interaction frequency of elncRNA-associated enhancer-promoter pairs was significantly higher than the interaction frequency of other enhancer-promoter pairs, suggesting that elncRNAs may reinforce the interactions between enhancers and promoters. We also found that elncRNA expression levels were positively correlated with the interaction frequency of enhancer-promoter pairs. The promoters interacting with elncRNA-associated enhancers were rich in RNA polymerase II and YY1 transcription factor binding sites. We clustered enhancer-promoter pairs into different groups to reflect the different ways in which elncRNAs could influence enhancer-promoter pairs. Interestingly, G-quadruplexes were found to potentially mediate some enhancer-promoter interaction pairs, and the interaction frequency of these pairs was significantly higher than that of other enhancer-promoter pairs. We also found that the G-quadruplexes on enhancers were highly related to the expression of elncRNAs. G-quadruplexes located in the promoters of elncRNAs led to high expression of elncRNAs, whereas G-quadruplexes located in the gene bodies of elncRNAs generally resulted in low expression of elncRNAs.
Project description:The human epigenome has been experimentally characterized by thousands of measurements for every basepair in the human genome. We propose a deep neural network tensor factorization method, Avocado, that compresses this epigenomic data into a dense, information-rich representation. We use this learned representation to impute epigenomic data more accurately than previous methods, and we show that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, replication timing, and an element of 3D chromatin architecture.
Project description:Epigenomic modifications are instrumental for transcriptional regulation, but comprehensive reference epigenomes remain unexplored in rice. Here, we develop an enhanced chromatin immunoprecipitation (eChIP) approach for plants, and generate genome-wide profiling of five histone modifications and RNA polymerase II occupancy with it. By integrating chromatin accessibility, DNA methylation, and transcriptome datasets, we construct comprehensive epigenome landscapes across various tissues in 20 representative rice varieties. Approximately 81.8% of rice genomes are annotated with different epigenomic properties. Refinement of promoter regions using open chromatin and H3K4me3-marked regions provides insight into transcriptional regulation. We identify extensive enhancer-like promoters with potential enhancer function on transcriptional regulation through chromatin interactions. Active and repressive histone modifications and the predicted enhancers vary largely across tissues, whereas inactive chromatin states are relatively stable. Together, these datasets constitute a valuable resource for functional element annotation in rice and indicate the central role of epigenomic information in understanding transcriptional regulation.
Project description:BACKGROUND:Co-localized combinations of histone modifications ("chromatin states") have been shown to correlate with promoter and enhancer activity. Changes in chromatin states over multiple time points ("chromatin state trajectories") have previously been analyzed at promoters and enhancers separately. With the advent of time series Hi-C data it is now possible to connect promoters and enhancers and to analyze chromatin state trajectories at promoter-enhancer pairs. RESULTS:We present TimelessFlex, a framework for investigating chromatin state trajectories at promoters and enhancers and at promoter-enhancer pairs based on Hi-C information. TimelessFlex extends our previous approach Timeless, a Bayesian network for clustering multiple histone modification data sets at promoter and enhancer feature regions. We utilize time series ATAC-seq data measuring open chromatin to define promoters and enhancer candidates. We developed an expectation-maximization algorithm to assign promoters and enhancers to each other based on Hi-C interactions and jointly cluster their feature regions into paired chromatin state trajectories. We find jointly clustered promoter-enhancer pairs showing the same activation patterns on both sides but with a stronger trend at the enhancer side. While the promoter side remains accessible across the time series, the enhancer side becomes dynamically more open towards the gene activation time point. Promoter cluster patterns show strong correlations with gene expression signals, whereas Hi-C signals get only slightly stronger towards activation. The code of the framework is available at https://github.com/henriettemiko/TimelessFlex . CONCLUSIONS:TimelessFlex clusters time series histone modifications at promoter-enhancer pairs based on Hi-C and it can identify distinct chromatin states at promoter and enhancer feature regions and their changes over time.