Project description: The RNA polymerase II (Pol II) core promoter is the strategic site of convergence of the signals that lead to the initiation of DNA transcription, but the downstream core promoter in humans has been difficult to understand. Here we analyse the human Pol II core promoter and use machine learning to generate predictive models for the downstream core promoter region (DPR) and the TATA box. We developed a method termed HARPE (high-throughput analysis of randomized promoter elements) to create hundreds of thousands of DPR (or TATA box) variants, each with known transcriptional strength. We then analysed the HARPE data by support vector regression (SVR) to provide comprehensive models for the sequence motifs, and found that the SVR-based approach is more effective than a consensus-based method for predicting transcriptional activity. These results show that the DPR is a functionally important core promoter element that is widely used in human promoters. Notably, there appears to be a duality between the DPR and the TATA box, as many promoters contain one or the other element. More broadly, these findings show that functional DNA motifs can be identified by machine learning analysis of a comprehensive set of sequence variants.
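The contrast between a consensus-based score and a learned model can be made concrete with a small sketch. The description does not say how sequences were encoded for the SVR, so the one-hot encoding below is an assumption, and the function names `one_hot` and `consensus_matches` are hypothetical. A consensus score reduces each variant to a single match count, whereas a one-hot vector keeps every position and base available as a separate feature for a regression model such as SVR.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[int]:
    """Flatten a DNA sequence into a binary vector of length 4 * len(seq).

    Assumed encoding (not stated in the description): each base becomes
    four 0/1 entries in the order A, C, G, T.
    """
    vec: List[int] = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def consensus_matches(seq: str, consensus: str) -> int:
    """Count positions that agree with a consensus; 'N' matches any base."""
    if len(seq) != len(consensus):
        raise ValueError("sequence and consensus must have equal length")
    return sum(
        1
        for s, c in zip(seq.upper(), consensus.upper())
        if c == "N" or s == c
    )


if __name__ == "__main__":
    # Illustrative sequences only; these are not HARPE data.
    print(one_hot("AC"))
    print(consensus_matches("GATAAA", "TATAAA"))
```

Vectors produced this way, paired with measured transcriptional strengths, are the kind of input a support vector regressor could be trained on; the consensus count is shown only as the simpler baseline it would be compared against.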
Project description: Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Samples 200-222 form a validation set. These data are used to train a machine learning classifier for estrogen receptor (ER) status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description: We evaluated how well several supervised machine learning methods (decision tree, partial least squares discriminant analysis (PLSDA), support vector machine and random forest) classify endometriosis samples versus control samples when trained on both transcriptomics and methylomics data. The assessment considered two ways of improving classification performance: (a) the effect of three different normalization techniques, and (b) the effect of differential analysis using the generalized linear model (GLM). We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
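As an illustration of one of the normalization steps named above, here is a minimal sketch of quantile normalization in plain Python. It is not the study's implementation: the function name `quantile_normalize` is hypothetical, the input layout (one list of feature values per sample) is an assumption, and ties are broken by position rather than averaged.

```python
from typing import List


def quantile_normalize(samples: List[List[float]]) -> List[List[float]]:
    """Give every sample the same value distribution.

    Each inner list holds one sample's values across the same features.
    The value at rank r in a sample is replaced by the mean of the
    rank-r values over all samples. Ties are broken by position, which
    is a simplification.
    """
    n_features = len(samples[0])
    if any(len(s) != n_features for s in samples):
        raise ValueError("all samples must have the same number of features")

    # Mean of the r-th smallest value across samples, for each rank r.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(col[r] for col in sorted_samples) / len(samples)
        for r in range(n_features)
    ]

    normalized = []
    for s in samples:
        # Feature indices ordered from smallest to largest value.
        order = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    # Two toy samples with three features each (made-up numbers).
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After this step every sample contains the same set of values, only in a different feature order, which removes sample-wide shifts before a classifier is trained.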
Project description: The RNA polymerase II core promoter is the site of convergence of the signals that lead to the initiation of transcription. Here, we perform a comparative analysis of the downstream core promoter region (DPR) in Drosophila and humans by using machine learning. These studies revealed a distinct human-specific version of the DPR and led to the use of the machine learning models for the identification of synthetic extreme DPR motifs with specificity for human transcription factors relative to Drosophila factors, and vice versa. More generally, machine learning models could be analogously used to design synthetic promoter elements with customized functional properties.
Project description: Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Samples 200-222 form a validation set. These data are used to train a machine learning classifier for estrogen receptor (ER) status.
Project description: Small non-coding RNAs can be secreted through a variety of mechanisms, including exosomal sorting, in small extracellular vesicles, and within lipoprotein complexes. However, the mechanisms that govern their sorting and secretion are still not well understood. In this study, we present ExoGRU, a machine learning model that predicts small RNA secretion probabilities from primary RNA sequence. We experimentally validated the performance of this model through ExoGRU-guided mutagenesis and synthetic RNA sequence analysis, and confirmed that primary RNA sequence is a major determinant in small RNA secretion. Additionally, we used ExoGRU to reveal cis and trans factors that underlie small RNA secretion, including known and novel RNA-binding proteins (RBPs), e.g., YBX1, HNRNPA2B1, and RBM24. We also developed a novel technique called exoCLIP, which reveals the RNA interactome of RBPs within the cell-free space. We used exoCLIP to reveal the RNA interactome of HNRNPA2B1 and RBM24 in extracellular vesicles. Together, our results demonstrate the power of machine learning in revealing novel biological mechanisms. In addition to providing deeper insight into complex processes such as small RNA secretion, this knowledge can be leveraged in therapeutic and synthetic biology applications.