Project description: With the rapidly evolving SARS-CoV-2 variants of concern, there is an urgent need to discover further treatments for coronavirus disease (COVID-19). Drug repurposing is one of the most rapid strategies for addressing this need, and numerous compounds have already been selected for in vitro testing by several groups. These efforts have produced a growing database of molecules with in vitro activity against the virus. Machine learning models can assist drug discovery by predicting the most promising compounds from previously published data. Herein, we implemented several machine learning methods to develop predictive models from recent SARS-CoV-2 in vitro inhibition data and used them to prioritize additional FDA-approved compounds, selected from our in-house compound library, for in vitro testing. From the compounds predicted with a Bayesian machine learning model, lumefantrine, an antimalarial, was selected for testing; it showed limited antiviral activity in cell-based assays while demonstrating binding (Kd = 259 nM) to the spike protein by microscale thermophoresis. Several other compounds we prioritized have since been tested by others and were also found to be active in vitro. This combined machine learning and in vitro testing approach can be expanded to virtually screen available molecules with predicted activity against the SARS-CoV-2 reference WIV04 strain and circulating variants of concern. In the process of this work, we created multiple iterations of machine learning models that can serve as a prioritization tool for SARS-CoV-2 antiviral drug discovery programs. The latest model for SARS-CoV-2, covering over 500 compounds, is freely available at www.assaycentral.org.
Project description: The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop prediction models and decision support systems to address these challenges. To date, there is no widely accepted clinical prediction benchmark related to the ED based on large-scale public EHRs. An open-source benchmark data platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. Based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we created a benchmark dataset and proposed three clinical prediction benchmarks. This study provides future researchers with insights, suggestions, and protocols for managing data and developing predictive tools for emergency care.
Project description: Predictive modeling is a central technique in neuroimaging for identifying brain-behavior relationships and testing their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice, yet it remains pervasive in machine learning. Understanding its effects on neuroimaging predictive models can clarify how leakage affects the existing literature. Here, we investigate the effects of five forms of leakage (involving feature selection, covariate correction, and dependence between subjects) on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
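The feature-selection form of leakage described in this entry can be reproduced with a minimal synthetic sketch. All specifics below (data sizes, a correlation-based feature screen, and a nearest-centroid stand-in classifier) are illustrative assumptions, not the study's actual pipeline. With labels that are pure noise, screening features on the full dataset before splitting still inflates test accuracy well above chance:

```python
import numpy as np

n, p, k = 80, 2000, 10  # few subjects, many noise features (hypothetical sizes)

def top_k_by_corr(X, y, k):
    # rank features by absolute covariance with the (centered) label
    score = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.argsort(score)[-k:]

def nearest_centroid_acc(X_tr, y_tr, X_te, y_te):
    # minimal stand-in classifier: assign each test point to the closer class mean
    m0, m1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    pred = (np.linalg.norm(X_te - m1, axis=1) <
            np.linalg.norm(X_te - m0, axis=1)).astype(int)
    return float((pred == y_te).mean())

def experiment(seed):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = rng.integers(0, 2, n)  # labels independent of X: true accuracy ~ 0.5
    tr, te = np.arange(n // 2), np.arange(n // 2, n)
    # LEAKY: features screened on the FULL dataset, test labels included
    leak = top_k_by_corr(X, y, k)
    # CORRECT: features screened on the training half only
    clean = top_k_by_corr(X[tr], y[tr], k)
    return (nearest_centroid_acc(X[tr][:, leak], y[tr], X[te][:, leak], y[te]),
            nearest_centroid_acc(X[tr][:, clean], y[tr], X[te][:, clean], y[te]))

accs = np.array([experiment(s) for s in range(20)])
print("leaky mean acc:", accs[:, 0].mean(), "clean mean acc:", accs[:, 1].mean())
```

Averaged over seeds, the leaky protocol scores well above 0.5 while the correct protocol hovers near chance, which is the inflation effect the study quantifies at scale.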
Project description: Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study (GEO Series GSE22820) and form the training set in this study; samples 200-222 form a validation set. These data are used to train a machine learning classifier for estrogen receptor (ER) status. RNA was isolated from 199 primary breast cancer patients, and a machine learning classifier was built to predict ER status using only three gene features.
Project description: Antigenic peptides (APs), also known as T-cell epitopes (TCEs), represent the immunogenic segments of pathogens capable of inducing an immune response, making them potential candidates for epitope-based vaccine (EBV) design. Traditional wet-lab methods for identifying TCEs are expensive, challenging, and time-consuming. Alternatively, computational approaches employing machine learning (ML) techniques offer a faster and more cost-effective solution. In this study, we present a robust XGBoost ML model for predicting TCEs of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus as potential vaccine candidates. Peptide sequences comprising TCEs and non-TCEs retrieved from the Immune Epitope Database (IEDB) were subjected to a feature extraction process to derive their physicochemical properties for model training. Upon evaluation on a test dataset, the model achieved an impressive accuracy of 97.6%, outperforming other ML classifiers. Employing five-fold cross-validation, a mean accuracy of 97.58% was recorded, indicating consistent performance across all iterations. While the predicted epitopes show promise as vaccine candidates for SARS-CoV-2, further scientific examination through in vivo and in vitro studies is essential to validate their suitability.
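The five-fold cross-validation protocol used for evaluation above can be sketched generically. This is an illustration on synthetic "physicochemical" features with a simple nearest-centroid classifier standing in for the study's XGBoost model, so the data and accuracy here are assumptions, not the paper's results:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 8                 # hypothetical peptides x physicochemical features
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = (X @ w > 0).astype(int)   # synthetic, linearly separable labels

def nearest_centroid(X_tr, y_tr, X_te):
    # stand-in classifier: predict the class with the closer training mean
    m0, m1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    return (np.linalg.norm(X_te - m1, axis=1) <
            np.linalg.norm(X_te - m0, axis=1)).astype(int)

def five_fold_cv(X, y, k=5):
    # shuffle indices, split into k folds, hold each fold out once
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = nearest_centroid(X[tr], y[tr], X[te])
        accs.append(float((pred == y[te]).mean()))
    return np.array(accs)

accs = five_fold_cv(X, y)
print("fold accuracies:", accs, "mean:", accs.mean())
```

Reporting the mean and spread across the five folds, as the study does, indicates whether performance is consistent rather than an artifact of one lucky split.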
Project description: Purpose: Mouse efficacy studies are a critical hurdle in advancing translational research on potential therapeutic compounds for many diseases. Although mouse liver microsomal (MLM) stability studies are not a perfect surrogate for in vivo studies of metabolic clearance, they are the initial model system used to assess metabolic stability. Consequently, we explored the development of machine learning models that can enhance the probability of identifying compounds possessing MLM stability. Methods: Published assays of MLM half-life values were identified in PubChem, reformatted, and curated to create a training set of 894 unique small molecules. These data were used to construct machine learning models assessed with internal cross-validation, external tests with a published set of antitubercular compounds, and independent validation with an additional diverse set of 571 compounds (PubChem data on percent metabolism). Results: "Pruning" out the moderately unstable / moderately stable compounds from the training set produced models with superior predictive power. Bayesian models displayed the best predictive power for identifying compounds with a half-life ≥1 h. Conclusions: Our results suggest the pruning strategy may be of general benefit for improving test set enrichment and providing machine learning models with enhanced predictive value for the MLM stability of small organic molecules. This study represents the most exhaustive study to date using machine learning approaches with MLM data from public sources.
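The pruning idea can be sketched in a few lines: binarize compounds at the paper's half-life cutoff of 1 h, then drop the ambiguous middle band before training. The band edges (0.5-1.5 h) and the synthetic half-lives below are illustrative assumptions only; the study does not specify these exact boundaries here:

```python
import numpy as np

rng = np.random.default_rng(4)
half_life_h = rng.exponential(1.0, 200)  # hypothetical MLM half-lives in hours

# binarize at the 1 h stability cutoff used in the study
labels = (half_life_h >= 1.0).astype(int)          # 1 = stable, 0 = unstable

# prune the ambiguous middle band (assumed 0.5-1.5 h) before model training
keep = (half_life_h < 0.5) | (half_life_h >= 1.5)
X_kept, y_kept = half_life_h[keep], labels[keep]

print("before pruning:", len(half_life_h), "after pruning:", len(y_kept))
```

The intuition is that borderline compounds carry noisy labels near the cutoff; removing them gives the classifier cleaner class boundaries, at the cost of a smaller training set.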
Project description: Background and objectives: Alzheimer disease (AD) has a polygenic architecture, for which genome-wide association studies (GWAS) have helped elucidate sequence variants (SVs) influencing susceptibility. Polygenic risk score (PRS) approaches show promise for generating summary measures of inherited risk for clinical AD based on the effects of APOE and other GWAS hits. However, existing PRS approaches, based on traditional regression models, explain only modest variation in AD dementia risk and AD-related endophenotypes. We hypothesized that machine learning (ML) models of polygenic risk (ML-PRS) could outperform standard regression-based PRS methods and therefore have the potential for greater clinical utility. Methods: We analyzed combined data from the Mayo Clinic Study of Aging (n = 1,791) and the Alzheimer's Disease Neuroimaging Initiative (n = 864). An AD PRS was computed for each participant using the top common SVs obtained from a large AD dementia GWAS. In parallel, ML models were trained using those SV genotypes, with amyloid PET burden as the primary outcome. Secondary outcomes included amyloid PET positivity and clinical diagnosis (cognitively unimpaired vs impaired). We compared performance between ML-PRS and standard PRS across 100 training sessions with different data splits. In each session, data were split into 80% training and 20% testing, and five-fold cross-validation was used within the training set to ensure the best model was produced for testing. We also applied permutation importance techniques to assess which genetic factors contributed most to outcome prediction. Results: ML-PRS models outperformed the AD PRS (r2 = 0.28 vs r2 = 0.24 in the test set) in explaining variation in amyloid PET burden. Among ML approaches, methods accounting for nonlinear genetic influences were superior to linear methods.
ML-PRS models were also more accurate when predicting amyloid PET positivity (area under the curve [AUC] = 0.80 vs AUC = 0.63) and the presence of cognitive impairment (AUC = 0.75 vs AUC = 0.54) compared with the standard PRS. Discussion: We found that ML-PRS approaches improved upon standard PRS for prediction of AD endophenotypes, partly owing to improved accounting for nonlinear effects of genetic susceptibility alleles. Further adaptations of the ML-PRS framework could help close the gap of remaining unexplained heritability for AD and thereby facilitate more accurate presymptomatic and early-stage risk stratification for clinical decision-making.
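The standard additive PRS that the ML-PRS models are compared against is simply a weighted sum of risk-allele counts. A minimal sketch, with hypothetical genotypes and effect sizes (not values from the study):

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_snps = 5, 4
# genotype matrix: risk-allele counts (0, 1, or 2), one row per participant
G = rng.integers(0, 3, (n_subjects, n_snps))
# hypothetical per-variant GWAS effect sizes (log-odds scale)
beta = np.array([0.40, 0.12, -0.05, 0.22])

prs = G @ beta  # standard additive PRS: weighted sum of allele counts
print(prs)
```

An ML-PRS instead feeds the same genotype matrix G into a trained (possibly nonlinear) model of the outcome rather than fixing the weights a priori, which is where the reported gains from modeling nonlinear genetic influences come from.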
Project description: The SARS-CoV-2 pandemic remains a major global issue. The health community is struggling to protect the public, as the spread revives time and again in successive waves, and even vaccination does not appear to prevent it fully. Timely, accurate identification of infected people is therefore essential to controlling the spread. To date, polymerase chain reaction (PCR) and rapid antigen tests have been widely used for this identification, each with its own drawbacks; false negative cases are a particular concern. To avoid these problems, this study uses machine learning techniques to build a higher-accuracy classification model that separates COVID-19 cases from non-COVID individuals. Transcriptome data from SARS-CoV-2 patients and controls were used in this stratification, with three different feature selection algorithms and seven classification models. Differentially expressed genes (DEGs) between these two groups were also studied and used in the classification. Results show that mutual information (or DEGs) combined with naïve Bayes (or SVM) gives the best accuracy (0.98 ± 0.04) among these methods. Supplementary information: The online version contains supplementary material available at 10.1007/s42979-023-01703-6.
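Mutual-information feature selection, one of the selection algorithms mentioned, ranks genes by the empirical mutual information between their (discretized) expression and the class label. A minimal sketch on synthetic binary data; the study's actual discretization and estimator choices are not specified here:

```python
import numpy as np

def mutual_information(x, y):
    # plug-in estimate of MI (in nats) between two discrete arrays
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.default_rng(5)
y = rng.integers(0, 2, 500)                 # synthetic COVID / non-COVID labels
flip = rng.random(500) < 0.1                # 10% label-tracking noise
informative = np.where(flip, 1 - y, y)      # "gene" that mostly follows the label
uninformative = rng.integers(0, 2, 500)     # "gene" independent of the label

print(mutual_information(informative, y), mutual_information(uninformative, y))
```

Ranking all genes by this score and keeping the top ones is the selection step; the informative gene scores far higher than the independent one.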
Project description: Accurately labeling large datasets is important for biomedical machine learning yet challenging, and modern data augmentation methods may introduce noise into training data that degrades machine learning model performance. Existing approaches to handling noisy training data typically rely on strict modeling assumptions, specific classification models, and well-curated datasets. To address these limitations, we propose a novel reliability-based training-data-cleaning method employing inductive conformal prediction (ICP). The method uses a small set of well-curated training data and leverages ICP-calculated reliability metrics to selectively correct mislabeled data and outliers within vast quantities of noisy training data. Its efficacy is validated across three classification tasks with distinct modalities: filtering drug-induced liver injury (DILI) literature with free-text titles and abstracts, predicting ICU admission of COVID-19 patients from CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels via label permutation. Our training-data-cleaning method significantly enhanced the downstream classification performance (paired t-tests, p ≤ 0.05 across 30 random train/test partitions): significant accuracy enhancement in 86 out of 96 DILI experiments (up to an 11.4% increase, from 0.812 to 0.905), significant AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to a 23.8% increase, from 0.597 to 0.739, for AUROC, and a 69.8% increase, from 0.183 to 0.311, for AUPRC), and significant accuracy and macro-average F1-score improvements in 47 out of 48 RNA-sequencing experiments (up to a 74.6% increase, from 0.351 to 0.613, for accuracy, and an 89.0% increase, from 0.267 to 0.505, for F1-score). The improvement can be both statistically and clinically significant for information retrieval, disease diagnosis, and prognosis.
The method offers the potential to substantially boost classification performance in biomedical machine learning tasks without requiring an excessive volume of well-curated training data or the strong data-distribution and modeling assumptions of existing semi-supervised learning methods.
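The core ICP reliability idea can be sketched on synthetic data: calibrate nonconformity scores on a small well-curated set, then flag candidate labels whose conformal p-value is low. The two-Gaussian data and centroid-distance nonconformity measure below are illustrative assumptions, not the paper's actual scores or modalities:

```python
import numpy as np

rng = np.random.default_rng(3)

# two well-separated Gaussian classes stand in for the well-curated data
def sample(n, mu, label):
    return rng.standard_normal((n, 2)) + mu, np.full(n, label)

X0, y0 = sample(60, np.array([-3.0, 0.0]), 0)
X1, y1 = sample(60, np.array([+3.0, 0.0]), 1)
X = np.vstack([X0, X1]); y = np.concatenate([y0, y1])

# inductive split: proper-training half fits the model, calibration half scores it
perm = rng.permutation(len(y))
tr, cal = perm[:60], perm[60:]
mu = {c: X[tr][y[tr] == c].mean(0) for c in (0, 1)}

def nonconformity(x, c):
    # higher = example conforms less to class c (assumed score, not the paper's)
    return np.linalg.norm(x - mu[c]) - np.linalg.norm(x - mu[1 - c])

cal_scores = np.array([nonconformity(X[i], y[i]) for i in cal])

def p_value(x, label):
    # conformal p-value: how typical this (x, label) pair is vs. calibration
    a = nonconformity(x, label)
    return (np.sum(cal_scores >= a) + 1) / (len(cal_scores) + 1)

# the same point scored under its plausible label vs. a permuted (wrong) label
x_new = np.array([-3.0, 0.5])
print(p_value(x_new, 0), p_value(x_new, 1))
```

The wrong label receives a much smaller p-value, so thresholding these p-values is one way to flag suspected mislabels or outliers in a large noisy pool for correction.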
Project description: Cytotoxicity, usually represented by cell viability, is a crucial parameter for evaluating drug safety in vitro. Accurate prediction of cell viability/cytotoxicity could accelerate early-stage drug development. In this study, by applying machine learning algorithms to cellular transcriptome and cell viability data, highly accurate prediction models of 50% and 80% cell viability were developed, with AUROCs of 0.90 and 0.84, respectively; the models also performed well on diverse cell lines. Because the feature genes they employ are characterized, the models can be interpreted, and the mechanisms of bioactive compounds with narrow therapeutic indices can be analyzed. In summary, the models established in this study can predict cytotoxicity highly accurately across cell lines and can be used to efficiently screen substances for safety. Moreover, the cytotoxicity signature genes from the interpretability analysis are valuable for studying mechanisms of action, especially for substances with narrow therapeutic indices.
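AUROC, the headline metric above, can be computed without any library via the rank-based (Mann-Whitney) formulation. A small self-contained sketch on made-up scores; this version assumes no tied scores:

```python
import numpy as np

def auroc(scores, labels):
    # AUROC = U / (n_pos * n_neg), where U is the Mann-Whitney statistic
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks of the scores
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2])  # one mis-ranked pair
print(auroc(scores, labels))  # 8 of 9 positive/negative pairs ordered correctly
```

Equivalently, AUROC is the probability that a randomly chosen positive outranks a randomly chosen negative, which is why it is a natural summary for viability-threshold classifiers like the 50% and 80% models here.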