Project description:Transcriptional enhancers play critical roles in regulation of gene expression, but their identification has remained a challenge. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of histone modifications have previously been investigated for this purpose, leaving the questions answered whether there exist an optimal set of histone modifications that could improve the enhancer prediction. Here, we address this issue by exploring a rich dataset produced by the human Epigenome Roadmap Project. Specifically, we examined genome-wide profiles of 24 histone modifications in human embryonic stem cells and fibroblasts, and developed a Random-Forest based algorithm to integrate histone modification profiles for identification of enhancers.As a training set, we used histone modification profiles at genome-wide binding sites of p300 in the two cell types identified using ChIP-seq. We show that this algorithm not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify an optimal set of three chromatin marks for enhancer prediction.
Project description:Transcriptional enhancers play critical roles in regulation of gene expression, but their identification has remained a challenge. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of histone modifications have previously been investigated for this purpose, leaving the questions answered whether there exist an optimal set of histone modifications that could improve the enhancer prediction. Here, we address this issue by exploring a rich dataset produced by the human Epigenome Roadmap Project. Specifically, we examined genome-wide profiles of 24 histone modifications in human embryonic stem cells and fibroblasts, and developed a Random-Forest based algorithm to integrate histone modification profiles for identification of enhancers.As a training set, we used histone modification profiles at genome-wide binding sites of p300 in the two cell types identified using ChIP-seq. We show that this algorithm not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify an optimal set of three chromatin marks for enhancer prediction. ChIP-Seq Analysis of p300 in hESC H1 and IMR90 cells. Sequencing was done on the Illumina Genome Analyzer II platform for the H1 data and Illumina HiSeq for IMR90.Data was mapped to hg18 using Bowtie.
Project description:Using a public reference data set of 82 unique entities, 382 nanopore-sequenced brain tumor samples were classified based on their methylation status through an ad hoc random forest algorithm. As a measure of confidence, score recalibration was performed and platform-specific thresholds were defined.
Project description:Immunotherapy has improved the prognosis of patients with advanced non-small cell lung
cancer (NSCLC), but only a small subset of patients achieved clinical benefit. The purpose of our study was to integrate multidimensional data using a machine learning method to predict the therapeutic efficacy of immune checkpoint inhibitors (ICIs) monotherapy in patients with advanced NSCLC.The authors retrospectively enrolled 112 patients with stage IIIB-IV NSCLC receiving ICIs monotherapy. The random forest (RF) algorithm was used to establish efficacy prediction models based on five different input datasets, including precontrast computed tomography (CT) radiomic data, postcontrast CT radiomic data, combination of the two CT radiomic data, clinical data, and a combination of radiomic and clinical data. The 5-fold cross-validation was used to train and test the random forest classifier. The performance of the models was assessed according to the area under the curve (AUC) in the receiver operating characteristic (ROC) curve. Among these models(RF MLP LR XGBoost), our reproduced onnx models have better performance, especially for random forest. The response variable with a value (1/0) indicates the (efficacy/inefficacy) of PD-1/PD-L1 monotherapy in patients with advanced NSCLC
Project description:This is a Random Forest algorithm-based machine learning model to predict lncRNAs from coding mRNAs in plant transcriptomic data. The model assigns 1 for coding sequences and 2 for long non-coding sequences. The prediction is performed using a combination of Open Reading Frame (ORF) based, Sequence-based and Codon-bias features. Users need to download the curated ONNX model and also need to convert the sequences into feature matrix as mentioned in PLIT paper (Deshpande et al. 2019) to make predictions on sequences from Zea Mays sequence data.
2023-05-22 | BIOMD0000001067 | BioModels
Project description:A Random Forest Algorithm Based on a cgMLST Scheme to Predict hvKP
Project description:We examined published microarray data from 104 acute lymphoblastic leukaemia patient specimens, that represent six different subgroups defined by cytogenetic features and immunophenotypes. Using the decision-tree based supervised learning algorithm Random Forest (RF), we determined a small set of genes for optimal subgroup distinction and subsequently validated their predictive power in an independent cohort of 68 specimens that were assessed using Affymetrix HG-U133A arrays.
Project description:A Random Forest model is developed to incorporate tumor mutation data within the context of the biological process known as leukocyte proliferation regulation. This model aims to predict a patient's response to anti-PD1 treatment.
The authors conducted experiments using four different types of classifiers: Random Forest, Gradient Boosting, Feed Forward Neural Network, and Long Short-Term Memory (LSTM) recurrent neural network. Among these classifiers, the Random Forest algorithm yielded the best predictive performance when modeling gene mutation data associated with the 'leukocyte proliferation regulation' biological process. Hence, this curated version of the model focuses on the Random Forest model trained specifically on the 'Leukocyte Proliferation Regulation' process.
In this model, a value of '0' is assigned to NonResponders, while a value of '1' is assigned to Responders. Please note that to obtain predictions, users should provide mutation data containing only the genes corresponding to the 'GO_REGULATION_OF_LEUKOCYTE_PROLIFERATION' process keyword, as specified in the 'GO_test_genes_dict_intersection' dictionary.
Project description:Top-down mass spectrometry (MS) is a powerful tool for identification and comprehensive characterization of proteoforms arising from alternative splicing, sequence variation, and post-translational modifications. While the technique is powerful, it suffered from the complex dataset generated from top-down MS experiments, which requires sequential data processing steps for data interpretation. Deconvolution of the complex isotopic distribution that arises from naturally occurring isotopes is a critical step in the data processing process. Multiple algorithms are currently available to deconvolute top-down mass spectra; however, each algorithm generates different deconvoluted peak lists with varied accuracy comparing to true positive annotations. In this study, we have designed a machine learning strategy that can process and combine the peak lists from different deconvolution results. By optimizing clustering results, deconvolution results from THRASH, TopFD, MS-Deconv, and SNAP algorithms were combined into consensus peak lists at various thresholds using either a simple voting ensemble method or a random forest machine learning algorithm. The random forest model outperformed the single best algorithm. This machine learning strategy could enhance the accuracy and confidence in protein identification during database search by accelerating detection of true positive peaks while filtering out false positive peaks. Thus, this method showed promises in enhancing proteoform identification and characterization for high-throughput data analysis in top-down proteomics.