Prediction of enhancer-promoter interactions via natural language processing.
ABSTRACT: BACKGROUND: Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important for deciphering gene regulation, cell differentiation and disease mechanisms. Currently, it is challenging to distinguish true interactions from nearby non-interacting pairs, since traditional experimental methods are limited by low resolution or low throughput. RESULTS: We propose a novel computational framework, EP2vec, to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method from natural language processing. Then, we train a classifier to predict EPIs from the learned representations in a supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841 to 0.933 on different datasets, outperforming existing methods. We demonstrate the robustness of sequence embedding features through sensitivity analysis. In addition, we identify motifs that represent cell line-specific information by analyzing the learned sequence embedding features with an attention mechanism. Finally, we show that even better performance, with F1 scores of 0.889 to 0.940, can be achieved by combining sequence embedding features with experimental features. CONCLUSIONS: EP2vec sheds light on feature extraction for DNA sequences of arbitrary length and provides a powerful approach for EPI identification.
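The abstract does not say how a DNA sequence is turned into "words" before an NLP embedding method is applied. One common choice is to split it into overlapping k-mers. The sketch below illustrates that idea only; the function name `kmer_tokens` and the values of `k` and `stride` are assumptions, not settings taken from the paper.

```python
def kmer_tokens(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mer 'words'.

    k and stride are illustrative defaults (assumptions), not values
    reported in the abstract.
    """
    seq = seq.upper()
    # Slide a window of width k along the sequence in steps of `stride`.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# A toy "enhancer" sequence becomes a sentence of k-mer words that an
# unsupervised document-embedding model could consume.
enhancer = "ACGTACGTAG"
words = kmer_tokens(enhancer, k=6, stride=1)
print(words)
```

Each enhancer or promoter then plays the role of a document, and its fixed-length embedding can be passed to any supervised classifier.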
Project description:In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to downstream machine learning tasks in protein bioinformatics. Here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
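To make the BPE idea behind PPE concrete, the following is a minimal sketch of plain, deterministic BPE-style merge learning on amino-acid strings. It deliberately omits the sampling framework that the abstract describes as PPE's modification, and the helper names (`learn_merges`, `apply_merge`, `segment`) and the toy sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Learn BPE-style merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(s) for s in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(toks, best) for toks in corpus]
    return merges


def segment(sequence, merges):
    """Apply learned merges, in order, to an unseen sequence."""
    toks = list(sequence)
    for pair in merges:
        toks = apply_merge(toks, pair)
    return toks


merges = learn_merges(["RGDRGD", "RGDA"], num_merges=2)
print(merges)
print(segment("ARGDK", merges))
```

The merges are learned once on a training set and then replayed on unseen sequences, which mirrors the learn-then-apply workflow the abstract describes.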
Project description:BACKGROUND: Enhancer-promoter interactions (EPIs) play key roles in transcriptional regulation and disease progression. Although several computational methods have been developed to predict such interactions, their performance is unsatisfactory when the training and testing data come from different cell lines. It is still unclear to what extent cross-cell-line prediction can be made from sequence-level information. RESULTS: In this work, we present a novel Sequence-based method (called SEPT) that predicts enhancer-promoter interactions in a new cell line by using cross-cell-line information and Transfer learning. SEPT first learns features of enhancers and promoters from DNA sequences with a convolutional neural network (CNN), then uses a gradient reversal layer from transfer learning to suppress cell line-specific features while retaining the features associated with EPIs. When the locations of enhancers and promoters are provided in a new cell line, SEPT can recognize EPIs in this new cell line based on labeled data from other cell lines. The experimental results show that SEPT effectively learns the latent important EPI-related features shared between cell lines and achieves the best prediction performance in terms of AUC (the area under the receiver operating characteristic curve). CONCLUSIONS: SEPT is an effective method for predicting EPIs in a new cell line. The domain adversarial architecture of transfer learning used in SEPT can learn the latent EPI-related features shared among cell lines from all other existing labeled data. SEPT is expected to be of interest to researchers concerned with biological interaction prediction.
Project description:Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
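The leakage problem described above can be illustrated with a toy example. The abstract does not state exactly how the split was corrected; holding out whole chromosomes, as below, is one common way to keep shared enhancers and promoters from appearing in both training and test data, and is an assumption here. The record fields and helper names are hypothetical.

```python
def split_by_chromosome(pairs, test_chroms):
    """Send every pair on a held-out chromosome to the test set."""
    train = [p for p in pairs if p["chrom"] not in test_chroms]
    test = [p for p in pairs if p["chrom"] in test_chroms]
    return train, test


def shared_elements(a, b):
    """Enhancer or promoter IDs that occur in both sets of pairs."""
    def elems(ps):
        return {p["enhancer"] for p in ps} | {p["promoter"] for p in ps}
    return elems(a) & elems(b)


# Toy data: enhancer e1 and promoter p3 each take part in two pairs.
pairs = [
    {"chrom": "chr1", "enhancer": "e1", "promoter": "p1", "label": 1},
    {"chrom": "chr1", "enhancer": "e1", "promoter": "p2", "label": 0},
    {"chrom": "chr2", "enhancer": "e2", "promoter": "p3", "label": 1},
    {"chrom": "chr2", "enhancer": "e3", "promoter": "p3", "label": 0},
]

# A pair-level split (here: alternating pairs) leaks elements across sets.
naive_train, naive_test = pairs[::2], pairs[1::2]
print(shared_elements(naive_train, naive_test))

# A chromosome-level split shares no enhancer or promoter between sets.
train, test = split_by_chromosome(pairs, {"chr2"})
print(shared_elements(train, test))
```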
Project description:Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
Project description:Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent to which sequence motif representations are learned by first-layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs (assembling partial features into whole features in deeper layers) tend to learn distributed representations, i.e., partial motifs. On the other hand, CNNs designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e., whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.
Project description:BACKGROUND: There has been increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights into patterns of care and health outcomes. EHRs contain structured data such as diagnostic codes and laboratory tests, as well as unstructured free-text data in the form of clinical notes, which provide more detail about the condition and treatment of patients. METHODS: In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns the relationships between them by integrating clinical notes and sets of accompanying medical codes and by defining joint contexts for each observed word and medical code. RESULTS: In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our method and a baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit. CONCLUSIONS: The jointly learned representations of medical concepts and words capture not only similarity between codes or between words, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for constructing patient features.
Project description:Protein-protein interactions are closely related to protein function and drug discovery. Hence, accurately identifying protein-protein interactions will help us to understand the underlying molecular mechanisms and significantly facilitate drug discovery. However, the majority of existing computational methods for protein-protein interaction prediction focus on feature extraction and feature combination, and there have been limited gains from state-of-the-art models. In this work, a new residue representation method named Res2vec is designed for protein sequence representation. Residue representations obtained by Res2vec describe residue-residue interactions from raw sequence more precisely and supply more effective inputs for the downstream deep learning model. Combining effective feature embedding with powerful deep learning techniques, our method provides a general computational pipeline to infer protein-protein interactions, even when protein structure is entirely unknown. The proposed method, DeepFE-PPI, is evaluated on the S. cerevisiae and human datasets. The experimental results show that DeepFE-PPI achieves 94.78% (accuracy), 92.99% (recall), 96.45% (precision), 89.62% (Matthews correlation coefficient, MCC) on the S. cerevisiae dataset and 98.71% (accuracy), 98.54% (recall), 98.77% (precision), 97.43% (MCC) on the human dataset. In addition, we evaluate the performance of DeepFE-PPI on five independent species datasets, and all the results are superior to those of existing methods. The comparisons show that DeepFE-PPI, which combines a novel residue representation method with a deep learning classification framework, can predict protein-protein interactions with an acceptable level of accuracy. The code, along with instructions to reproduce this work, is available from https://github.com/xal2019/DeepFE-PPI.
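For readers less familiar with the four metrics reported above, here is a small sketch of how accuracy, recall, precision and MCC follow from a binary confusion matrix. The counts used in the example are made up for illustration and are not taken from the DeepFE-PPI experiments; the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, recall, precision and MCC from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Illustrative counts only (not from the paper).
print(binary_metrics(tp=90, fp=5, tn=95, fn=10))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are imbalanced.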
Project description:BACKGROUND: The Bacteria Biotope (BB) task is a biomedical relation extraction (RE) task that aims to study the interaction between bacteria and their locations. This task is considered to pertain to fundamental knowledge in applied microbiology. Some previous investigations applied feature-based models; others presented deep-learning-based models, such as convolutional and recurrent neural networks, used with the shortest dependency paths (SDPs). Although SDPs contain valuable and concise information, some crucial information required to define bacterial location relationships is often neglected. Moreover, the traditional word embeddings used in previous studies may suffer from word ambiguity across linguistic contexts. RESULTS: Here, we present a deep learning model for biomedical RE. The model incorporates feature combinations of SDPs and full sentences with various attention mechanisms. We also used pre-trained contextual representations based on domain-specific vocabularies. To assess the model's robustness, we report the mean F1 score over many models trained with different random seeds. The experiments were conducted on the standard BB corpus in BioNLP-ST'16. Our experimental results revealed that the model performed better (in terms of both maximum and average F1 scores; 60.77% and 57.63%, respectively) than other existing models. CONCLUSIONS: We demonstrated that our proposed contributions to this task can be used to extract rich lexical, syntactic, and semantic features that effectively boost the model's performance. Moreover, we analyzed the trade-off between precision and recall to choose the proper cut-off for use in real-world applications.
Project description:Learning accurate drug representations is essential for tasks such as computational drug repositioning and prediction of drug side effects. A drug hierarchy is a valuable source that encodes knowledge of relations among drugs in a tree-like structure, where drugs that act on the same organs, treat the same disease, or bind to the same biological target are grouped together. However, its utility in learning drug representations has not yet been explored, and currently described drug representations cannot place novel molecules in a drug hierarchy. Here, we develop a semi-supervised drug embedding that incorporates two sources of information: (1) the underlying chemical grammar inferred from chemical structures of drugs and drug-like molecules (unsupervised) and (2) hierarchical relations encoded in an expert-crafted hierarchy of approved drugs (supervised). We use the Variational Auto-Encoder (VAE) framework to encode the chemical structures of molecules and use the drug-drug similarity information obtained from the hierarchy to induce the clustering of drugs in hyperbolic space. Hyperbolic space is amenable to encoding hierarchical relations. Both quantitative and qualitative results support that the learned drug embedding can accurately reproduce the chemical structure and recapitulate the hierarchical relations among drugs. Furthermore, our approach can infer the pharmacological properties of novel molecules by retrieving similar drugs from the embedding space. We demonstrate that our drug embedding can predict new uses and discover new side effects of existing drugs. We show that it significantly outperforms comparison methods in both tasks.
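The abstract says only that drugs are clustered in "hyperbolic space" and does not name a specific model of it. As an illustration of why such a space suits tree-like hierarchies, the sketch below computes distances in the Poincaré ball, one standard model; choosing this model here is an assumption, and the function name and example points are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball.

    Uses the Poincare ball formula
        d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


root = (0.0, 0.0)      # near the origin: a general, high-level node
leaf_a = (0.9, 0.0)    # near the boundary: specific nodes
leaf_b = (0.0, 0.9)

# Two leaves end up much farther from each other than either is from the root.
print(poincare_distance(root, leaf_a))
print(poincare_distance(leaf_a, leaf_b))
```

Distances grow rapidly toward the boundary of the ball, so many mutually distant leaves can sit around a shared ancestor, which is the property that makes hyperbolic embeddings a natural fit for hierarchies.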
Project description:Protein-protein interaction network data provides valuable information that infers direct links between genes and their biological roles. This information brings a fundamental hypothesis for protein function prediction that interacting proteins tend to have similar functions. With the help of recently-developed network embedding feature generation methods and deep maxout neural networks, it is possible to extract functional representations that encode direct links between protein-protein interactions information and protein function. Our novel method, STRING2GO, successfully adopts deep maxout neural networks to learn functional representations simultaneously encoding both protein-protein interactions and functional predictive information. The experimental results show that STRING2GO outperforms other protein-protein interaction network-based prediction methods and one benchmark method adopted in a recent large scale protein function prediction competition.