Project description:Cytotoxicity, usually represented by cell viability, is a crucial parameter for evaluating drug safety in vitro. Accurate prediction of cell viability/cytotoxicity could accelerate early-stage drug development. In this study, machine learning algorithms were applied to cellular transcriptome and cell viability data to develop highly accurate prediction models for 50% and 80% cell viability, which achieved AUROCs of 0.90 and 0.84, respectively, and also performed well on diverse cell lines. Characterization of the Feature Genes employed makes the models interpretable and allows the mechanisms of bioactive compounds with narrow therapeutic indices to be analyzed. In summary, the models established in this study can predict cytotoxicity with high accuracy across cell lines and can be used to screen efficiently for high-safety substances. Moreover, the Cytotoxicity Signature genes obtained from the interpretability analysis are valuable for studying mechanisms of action, especially for substances with narrow therapeutic indices.
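To make the two reported endpoints concrete, the following is a minimal sketch of how a viability measurement could be binarized at a 50% or 80% cutoff and how an AUROC is then computed from model scores. The function names, the toy data, and the convention that viability below the cutoff is the positive (cytotoxic) class are assumptions made for illustration; they are not taken from the study's pipeline.

```python
def binarize_viability(viability_pct, cutoff):
    """Label a sample 1 (cytotoxic) if viability falls below the cutoff, else 0.

    Assumption: "below the cutoff" is the positive class.
    """
    return [1 if v < cutoff else 0 for v in viability_pct]


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): measured viability in percent and a model's toxicity score.
viability = [95.0, 85.0, 70.0, 60.0, 40.0, 20.0]
scores = [0.10, 0.45, 0.40, 0.35, 0.80, 0.90]

labels_50 = binarize_viability(viability, 50.0)
labels_80 = binarize_viability(viability, 80.0)
auroc_50 = auroc(labels_50, scores)
auroc_80 = auroc(labels_80, scores)
```

The pairwise definition is quadratic in the number of samples, which is fine for a sketch; a rank-based formulation gives the same value on larger datasets.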
Project description:Deep learning, aided by the availability of large datasets, has led to substantial advances across many disciplines. However, many scientific problems of practical interest lack sufficiently large datasets amenable to deep learning. Prediction of antibody viscosity is one such problem, where deep learning methods have not yet been explored due to the relative scarcity of relevant training data. In this work, we overcome this limitation using a biophysically meaningful representation that enables us to develop generalizable models even with limited training data. We present PfAbNet-viscosity, a 3D convolutional neural network architecture, to predict the high-concentration viscosity of therapeutic antibodies. We show that, with the electrostatic potential surface of the antibody variable region as the only input to the network, models trained on as few as a couple dozen data points can generalize with high accuracy. Our feature attribution analysis shows that PfAbNet-viscosity has learned key biophysical drivers of viscosity. The applicability of our approach to other biological systems is discussed.
Project description:Seizure prediction may improve the quality of life of patients suffering from drug-resistant epilepsy, which affects about 30% of all patients with epilepsy. Determining the pre-ictal period, a transitional stage between normal brain activity and seizure, is a critical step. Past approaches failed to attain real-world applicability due to a lack of generalization capacity. More recent deep learning techniques may outperform traditional classifiers and handle time dependencies. However, despite existing efforts to provide interpretable insights, clinicians may not be willing to make high-stakes decisions based on them. Furthermore, a drawback of the usual seizure prediction pipeline is its modularity and the significant independence between its stages. An alternative could be a search algorithm that, while considering the synergy between pipeline stages, fine-tunes the selection of a reduced set of features that are widely used in the literature and computationally efficient. With extracranial recordings from 19 patients suffering from temporal-lobe seizures, we developed a patient-specific evolutionary optimization strategy that aims to generate the optimal set of features for seizure prediction with a logistic regression classifier. It was tested prospectively on a total of 49 seizures and 710 h of continuous recording and, as assessed with a surrogate predictor, performed above chance for 32% of patients. These results support the hypothesis that the pre-ictal period can be identified without loss of interpretability, which may help in understanding the brain dynamics leading to seizures and improve prediction algorithms.
Project description:Insufficient sleep is associated with cardiometabolic disease and poor health. However, few studies have assessed its determinants in a nationally representative sample. Data from the 2009 Behavioral Risk Factor Surveillance System were used (N = 323,047 adults). Insufficient sleep was assessed as insufficient rest/sleep over 30 days. This was evaluated relative to sociodemographics (age, sex, race/ethnicity, marital status, region), socioeconomics (education, income, employment, insurance), health behaviors (diet, exercise, smoking, alcohol), and health/functioning (emotional support, BMI, mental/physical health). Overall, insufficient sleep was associated with being female, White or Black/African-American, unemployed, without health insurance, and not married; with lower age, income, education, and physical activity; with worse diet and overall health; and with greater household size, alcohol use, and smoking. These factors should be considered as risk factors for insufficient sleep.
Project description:Microbiome biomarker discovery for patient diagnosis, prognosis, and risk evaluation is attracting broad interest. Selected groups of microbial features provide signatures that characterize host disease states such as cancer or cardio-metabolic diseases. Yet, the current predictive models stemming from machine learning still behave as black boxes and seldom generalize well. Their interpretation is challenging for physicians and biologists, which makes them difficult to trust and use routinely in the physician-patient decision-making process. Novel methods that provide interpretability and biological insight are needed. Here, we introduce "predomics", an original machine learning approach inspired by microbial ecosystem interactions that is tailored for metagenomics data. It discovers accurate predictive signatures and provides unprecedented interpretability. The decision provided by the predictive model is based on a simple, yet powerful score computed by adding, subtracting, or dividing cumulative abundance of microbiome measurements. Tested on >100 datasets, we demonstrate that predomics models are simple and highly interpretable. Even with such simplicity, they are at least as accurate as state-of-the-art methods. The family of best models, discovered during the learning process, offers the ability to distil biological information and to decipher the predictability signatures of the studied condition. In a proof-of-concept experiment, we successfully predicted body corpulence and metabolic improvement after bariatric surgery using pre-surgery microbiome data. Predomics is a new algorithm that helps in providing reliable and trustworthy diagnostic decisions in the microbiome field. Predomics is in accord with societal and legal requirements that plead for an explainable artificial intelligence approach in the medical field.
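The scoring idea described above (adding, subtracting, or dividing cumulative abundances) can be sketched in a few lines. This is an illustrative sketch only, not the predomics implementation: the function names, the placeholder species, the abundance values, and the thresholds are all assumptions.

```python
def cumulative_abundance(sample, features):
    """Sum the abundances of the selected features (missing ones count as 0)."""
    return sum(sample.get(f, 0.0) for f in features)


def difference_score(sample, plus, minus):
    """Cumulative abundance of one feature group minus that of the other."""
    return cumulative_abundance(sample, plus) - cumulative_abundance(sample, minus)


def ratio_score(sample, numerator, denominator, eps=1e-9):
    """Ratio of two cumulative abundances; eps guards against division by zero."""
    return cumulative_abundance(sample, numerator) / (
        cumulative_abundance(sample, denominator) + eps
    )


def predict(score, threshold):
    """Assign class 1 when the score exceeds the threshold, else class 0."""
    return 1 if score > threshold else 0


# Hypothetical relative abundances for one sample; species names are placeholders.
sample = {"species_a": 0.30, "species_b": 0.10, "species_c": 0.25, "species_d": 0.05}
plus_group = ["species_a", "species_b"]
minus_group = ["species_c", "species_d"]

diff = difference_score(sample, plus_group, minus_group)  # about 0.40 - 0.30
ratio = ratio_score(sample, plus_group, minus_group)      # about 0.40 / 0.30
label = predict(diff, threshold=0.0)
```

Because the score is a plain sum, difference, or ratio over a short list of named features, the decision for any sample can be read directly from the feature lists and the threshold.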
Project description:Natural products have long been a rich source of diverse and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, often resulting in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy to screen hundreds of thousands of microbial genomes to identify their potential to produce novel natural products. Adenylation domains (A-domains) play a key role in the biosynthesis of NRPs and NRP-PKs by recruiting substrates to incrementally build the final structure. We propose MASPR, a machine learning method that leverages protein language models for accurate and interpretable predictions of A-domain substrate specificities. MASPR demonstrates superior accuracy and generalization over existing methods and is capable of predicting substrates not present in its training data, i.e., zero-shot classification. We use MASPR to develop Seq2Hybrid, an efficient algorithm to predict the structure of hybrid NRP-PK natural products from microbial genomes. Using Seq2Hybrid, we propose putative biosynthetic gene clusters for the orphan natural products Octaminomycin A, Dityromycin, SW-163B, and JBIR-39.
Project description:Machine learning-based risk prediction models have the potential to improve patient outcomes by assessing risk more accurately than clinicians. Significant additional value lies in these models providing feedback about the factors that amplify an individual patient's risk. Identification of risk factors enables more informed decisions on interventions to mitigate or ameliorate modifiable factors. For these reasons, risk prediction models must be explainable and grounded in medical knowledge. Current machine learning-based risk prediction models are frequently 'black-box' models whose inner workings cannot be understood easily, making it difficult to define risk drivers. Since machine learning models follow patterns in the data rather than looking for medically relevant relationships, possible risk factors identified by these models do not necessarily translate into actionable insights for clinicians. Here, we use the example of risk assessment for postoperative complications to demonstrate how explainable and medically grounded risk prediction models can be developed. Pre- and postoperative risk prediction models are trained on clinically relevant inputs extracted from electronic medical record data. We show that these models have predictive performance similar to that of models that incorporate a wider range of inputs, and we explain the models' decision-making process by visualizing how different model inputs and their values affect the models' predictions.
Project description:We present an analytical framework aimed at predicting the local brain activity in uncontrolled experimental conditions based on multimodal recordings of participants' behavior, and its application to a corpus of participants having conversations with another human or a conversational humanoid robot. The framework consists in extracting high-level features from the raw behavioral recordings and applying a dynamic prediction of binarized fMRI-recorded local brain activity using these behavioral features. The objective is to identify behavioral features required for this prediction, and their relative weights, depending on the brain area under investigation and the experimental condition. In order to validate our framework, we use a corpus of uncontrolled conversations of participants with a human or a robotic agent, focusing on brain regions involved in speech processing, and more generally in social interactions. The framework not only predicts local brain activity significantly better than random, it also quantifies the weights of behavioral features required for this prediction, depending on the brain area under investigation and on the nature of the conversational partner. In the left Superior Temporal Sulcus, perceived speech is the most important behavioral feature for predicting brain activity, regardless of the agent, while several features, which differ between the human and robot interlocutors, contribute to the prediction in regions involved in social cognition, such as the TemporoParietal Junction. This framework therefore allows us to study how multiple behavioral signals from different modalities are integrated in individual brain regions during complex social interactions.
Project description:Background: Cardiac Resynchronization Therapy (CRT) is a widely used, device-based therapy for patients with left ventricle (LV) failure. Unfortunately, many patients do not benefit from CRT, so there is potential value in identifying this group of non-responders before CRT implementation. Past studies suggest that predicting CRT response will require diverse variables, including demographic, biomarker, and LV function data. Accordingly, the objective of this study was to integrate diverse variable types into a machine learning algorithm for predicting individual patient responses to CRT. Methods: We built an ensemble classification algorithm using previously acquired data from the SMART-AV CRT clinical trial (n = 794 patients). We used five-fold stratified cross-validation on 80% of the patients (n = 635) to train the model with variables collected at 0 months (before initiating CRT), and the remaining 20% of the patients (n = 159) were used as a hold-out test set for model validation. To improve model interpretability, we quantified feature importance values using SHapley Additive exPlanations (SHAP) analysis and used Local Interpretable Model-agnostic Explanations (LIME) to explain patient-specific predictions. Results: Our classification algorithm incorporated 26 patient demographic and medical history variables, 12 biomarker variables, and 18 LV functional variables, which yielded correct prediction of CRT response in 71% of patients. Additional patient stratification to identify the subgroups with the highest or lowest likelihood of response showed 96% accuracy with 22 correct predictions out of 23 patients in the highest and lowest responder groups. Conclusion: Computationally integrating general patient characteristics, comorbidities, therapy history, circulating biomarkers, and LV function data available before CRT intervention can improve the prediction of individual patient responses.
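The data partitioning described in the Methods (a stratified 80/20 hold-out split followed by five-fold stratified cross-validation on the training portion) can be sketched with the standard library alone. This is a sketch under stated assumptions, not the study's code: the helper names are hypothetical, and the split of 794 patients into 500 responders and 294 non-responders is invented purely so the example has two classes.

```python
import random


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, k, seed=0):
    """Deal each class's shuffled indices round-robin into k folds."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Assumption: the 500/294 class balance is made up; only n = 794 is from the text.
labels = [1] * 500 + [0] * 294

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
folds = stratified_folds(train_idx, labels, k=5)
fold_sizes = [len(fold) for fold in folds]

for fold in folds:
    held_out = set(fold)
    fit_idx = [i for i in train_idx if i not in held_out]
    # A model would be fitted on fit_idx and scored on the held-out fold here.
```

With these made-up class counts the split yields 635 training and 159 test patients, matching the sizes quoted above. The hold-out indices are never passed to the fold construction, so the test set stays untouched during cross-validation.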