Project description:The RNA polymerase II core promoter is the site of convergence of the signals that lead to the initiation of transcription. Here, we perform a comparative analysis of the downstream core promoter region (DPR) in Drosophila and humans by using machine learning. These studies revealed a distinct human-specific version of the DPR and led to the use of the machine learning models for the identification of synthetic extreme DPR motifs with specificity for human transcription factors relative to Drosophila factors, and vice versa. More generally, machine learning models could be analogously used to design synthetic promoter elements with customized functional properties.
Project description:The RNA polymerase II (Pol II) core promoter is the strategic site of convergence of the signals that lead to the initiation of DNA transcription, but the downstream core promoter in humans has been difficult to understand. Here we analyse the human Pol II core promoter and use machine learning to generate predictive models for the downstream core promoter region (DPR) and the TATA box. We developed a method termed HARPE (high-throughput analysis of randomized promoter elements) to create hundreds of thousands of DPR (or TATA box) variants, each with known transcriptional strength. We then analysed the HARPE data by support vector regression (SVR) to provide comprehensive models for the sequence motifs, and found that the SVR-based approach is more effective than a consensus-based method for predicting transcriptional activity. These results show that the DPR is a functionally important core promoter element that is widely used in human promoters. Notably, there appears to be a duality between the DPR and the TATA box, as many promoters contain one or the other element. More broadly, these findings show that functional DNA motifs can be identified by machine learning analysis of a comprehensive set of sequence variants.
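A minimal sketch of the two sequence representations contrasted in the description above: a consensus match count, and a one-hot feature vector of the kind a regression model such as SVR could be trained on. The example sequence, the consensus string, and the function names are illustrative assumptions and are not taken from the HARPE study; no model is fitted here.

```python
# Illustrative only: sequences and helper names are hypothetical,
# not the real DPR motif or the study's code.
BASES = "ACGT"


def one_hot(seq):
    """Flatten a DNA sequence into a 4 * len(seq) vector of 0.0/1.0 features."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def consensus_matches(seq, consensus):
    """Count positions where seq agrees with the consensus ('N' matches any base)."""
    if len(seq) != len(consensus):
        raise ValueError("sequence and consensus must have equal length")
    return sum(
        1
        for s, c in zip(seq.upper(), consensus.upper())
        if c == "N" or s == c
    )


variant = "AGTCGT"    # made-up 6-nt promoter variant
consensus = "AGNCGT"  # made-up consensus string

features = one_hot(variant)                    # input a regression model could use
score = consensus_matches(variant, consensus)  # consensus-based baseline score
```

A consensus count collapses each variant to a single integer, whereas the one-hot vector keeps every position and base available to the model, which is one plausible reason a learned model can predict activity more finely than a consensus.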
Project description:The initiator (Inr) is the starting point for the transcription of many genes. Here, we generated highly predictive machine learning models of the human Inr region, and determined that the Inr is present in about 60% of focused human promoters, identified a novel TATA-specific Inr, and detected the overlapping but functionally distinct TCT motif. Quantitative genome-wide analyses revealed a strict and synergistic interaction between the Inr and DPR, an inverse relationship between the TATA and DPR, a flexible and sometimes independent function of the TATA box in relation to the Inr, and different properties of the TCT motif in humans versus Drosophila.
Project description:Selective inhibitors are essential for targeted therapeutics and for probing enzyme functions in various biological systems. The two main challenges in identifying such inhibitors lie in the extensive experimental effort required, including the generation of large libraries, and in tailoring the selectivity of inhibitors to enzymes with homologous structures. To address these challenges, machine learning (ML) is being used to improve protein design by training on targeted libraries and identifying key interface mutations that enhance affinity and specificity. However, such ML-based methods are limited by inaccurate energy calculations and difficulties in predicting the structural impacts of multiple mutations. Here, we present an ML-based method that leverages high-throughput screening (HTS) data to streamline the design of selective inhibitors. To demonstrate its utility, we applied our new method to finding inhibitors of matrix metalloproteinases (MMPs), a family of homologous enzymes involved in both physiological and pathological processes. By training ML models on binding data for three MMPs (MMP-1, MMP-3, and MMP-9), we successfully designed a novel N-TIMP2 variant with a differential specificity profile, namely, high affinity for MMP-9, moderate affinity for MMP-3, and low affinity for MMP-1. Our experimental validation showed that this novel variant exhibited a significant specificity shift and enhanced selectivity compared to wild-type N-TIMP2. Through molecular modeling and energy minimization, we obtained structural insights into the variant’s enhanced selectivity. Our findings highlight the power of ML-based methods to reduce experimental workloads, facilitate the rational design of selective inhibitors, and advance the understanding of specific inhibitor-enzyme interactions in homologous enzyme systems.
Project description:Gene expression profiles were generated from RNA isolated from 199 primary breast cancer patients. Samples 1-176 were also used in another study (GEO Series GSE22820) and form the training set in this study; samples 200-222 form the validation set. The data were used to build a machine learning classifier that predicts estrogen receptor (ER) status using only three gene features.
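The sample-number split described above can be checked with a few lines, followed by a toy three-feature classifier. This is a sketch under stated assumptions: the description does not say which classifier or which three genes were used, so the nearest-centroid rule, the expression values, and all names below are hypothetical stand-ins.

```python
# Sample-number ranges as stated in the description.
train_ids = list(range(1, 177))    # samples 1-176 (training set)
valid_ids = list(range(200, 223))  # samples 200-222 (validation set)
total = len(train_ids) + len(valid_ids)  # should equal the 199 patients


def centroid(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict(x, centroids):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))


# Made-up expression values for three hypothetical genes.
train = {
    "ER+": [[8.0, 7.5, 6.0], [9.0, 8.0, 6.5]],
    "ER-": [[3.0, 2.5, 5.0], [2.0, 3.0, 4.5]],
}
centroids = {label: centroid(rows) for label, rows in train.items()}
call = predict([8.5, 7.0, 6.2], centroids)
```

With so few features, a simple rule like this is easy to inspect and to re-apply to the held-out samples, which is the main appeal of a three-gene classifier.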
Project description:Human induced pluripotent stem cells (iPSCs) were established as artificial embryonic stem cells (ESCs) to avoid immune rejection and ethical issues in regenerative medicine, and for biological research. Comparative analyses in previous studies revealed no hot spot that distinguishes iPSCs from ESCs. Here, we established a learning model with Jubatus, a machine learning platform, using linear-model classification to distinguish human iPSCs from ESCs based on DNA methylation profiles. We found that linear-model classification is most suitable for analyzing human iPSCs when the number of lines is, in practice, 10 to 100. The learning models discriminated ESCs and iPSCs with accuracies of ≥85.71% and ≥90.91%, respectively. In addition, the epigenetic signature of iPSCs was identified by component analysis of the learning models. The iPSC-specific fluctuated methylation regions were abundant on chromosomes 7, 8, 12, and 22. The method can be used with comprehensive data and can be widely applied to many aspects of molecular biology research.
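A minimal sketch of linear-model classification on methylation profiles, with the learned weights inspected afterwards as a rough analogue of reading a signature from the model's components. The description names Jubatus with a linear model but does not specify the algorithm, so the plain perceptron below is a swapped-in stand-in, and the methylation values and function names are invented for illustration.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(samples, labels, epochs=20):
    """Fit weights w and bias b; labels are +1 (iPSC) or -1 (ESC)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Update only when the sample is misclassified or on the boundary.
            if y * (dot(w, x) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def classify(x, w, b):
    return 1 if dot(w, x) + b > 0 else -1


# Made-up methylation levels (0 to 1) at three hypothetical regions.
samples = [
    [0.9, 0.2, 0.5],  # iPSC
    [0.8, 0.1, 0.5],  # iPSC
    [0.2, 0.9, 0.5],  # ESC
    [0.1, 0.8, 0.5],  # ESC
]
labels = [1, 1, -1, -1]

w, b = train_perceptron(samples, labels)

# Rank regions by absolute weight: large weights mark regions that the
# linear model relies on to separate the two cell types.
ranked = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
```

In this toy data the third region has the same value in every sample, so it ends up with a near-zero weight and ranks last, while the two regions that differ between cell types carry the signal.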