Project description:Metabolic alterations play a crucial role in glioma development and progression and can be detected even before the appearance of the fatal phenotype. We have compared the circulating metabolic fingerprints of glioma patients versus healthy controls, for the first time, in a quest to identify a panel of small, dysregulated metabolites with potential to serve as a predictive and/or diagnostic marker in the clinical settings. High-resolution magic angle spinning nuclear magnetic resonance spectroscopy (HRMAS-NMR) was used for untargeted metabolomics and data acquisition followed by a machine learning (ML) approach for the analyses of large metabolic datasets. Cross-validation of ML predicted NMR spectral features was done by statistical methods (Wilcoxon-test) using JMP-pro16 software. Alanine was identified as the most critical metabolite with potential to detect glioma with precision of 1.0, recall of 0.96, and F1 measure of 0.98. The top 10 metabolites identified for glioma detection included alanine, glutamine, valine, methionine, N-acetylaspartate (NAA), γ-aminobutyric acid (GABA), serine, α-glucose, lactate, and arginine. We achieved 100% accuracy for the detection of glioma using ML algorithms, extra tree classifier, and random forest, and 98% accuracy with logistic regression. Classification of glioma in low and high grades was done with 86% accuracy using logistic regression model, and with 83% and 79% accuracy using extra tree classifier and random forest, respectively. The predictive accuracy of our ML model is superior to any of the previously reported algorithms, used in tissue- or liquid biopsy-based metabolic studies. The identified top metabolites can be targeted to develop early diagnostic methods as well as to plan personalized treatment strategies.
Project description:IntroductionDespite the availability of several pre-processing software, poor peak integration remains a prevalent problem in untargeted metabolomics data generated using liquid chromatography high-resolution mass spectrometry (LC-MS). As a result, the output of these pre-processing software may retain incorrectly calculated metabolite abundances that can perpetuate in downstream analyses.ObjectivesTo address this problem, we propose a computational methodology that combines machine learning and peak quality metrics to filter out low quality peaks.MethodsSpecifically, we comprehensively and systematically compared the performance of 24 different classifiers generated by combining eight classification algorithms and three sets of peak quality metrics on the task of distinguishing reliably integrated peaks from poorly integrated ones. These classifiers were compared to using a residual standard deviation (RSD) cut-off in pooled quality-control (QC) samples, which aims to remove peaks with analytical error.ResultsThe best performing classifier was found to be a combination of the AdaBoost algorithm and a set of 11 peak quality metrics previously explored in untargeted metabolomics and proteomics studies. As a complementary approach, applying our framework to peaks retained after filtering by 30% RSD across pooled QC samples was able to further distinguish poorly integrated peaks that were not removed from filtering alone. An R implementation of these classifiers and the overall computational approach is available as the MetaClean package at https://CRAN.R-project.org/package=MetaClean .ConclusionOur work represents an important step forward in developing an automated tool for filtering out unreliable peak integrations in untargeted LC-MS metabolomics data.
Project description:Clinical outcomes for patients with COVID-19 are heterogeneous and there is interest in defining subgroups for prognostic modeling and development of treatment algorithms. We obtained 28 demographic and laboratory variables in patients admitted to hospital with COVID-19. These comprised a training cohort (n = 6099) and two validation cohorts during the first and second waves of the pandemic (n = 996; n = 1011). Uniform manifold approximation and projection (UMAP) dimension reduction and Gaussian mixture model (GMM) analysis was used to define patient clusters. 29 clusters were defined in the training cohort and associated with markedly different mortality rates, which were predictive within confirmation datasets. Deconvolution of clinical features within clusters identified unexpected relationships between variables. Integration of large datasets using UMAP-assisted clustering can therefore identify patient subgroups with prognostic information and uncovers unexpected interactions between clinical variables. This application of machine learning represents a powerful approach for delineating disease pathogenesis and potential therapeutic interventions.
Project description:Humans are continuously exposed to a variety of toxicants and chemicals which is exacerbated during and after environmental catastrophes such as floods, earthquakes, and hurricanes. The hazardous chemical mixtures generated during these events threaten the health and safety of humans and other living organisms. This necessitates the development of rapid decision-making tools to facilitate mitigating the adverse effects of exposure on the key modulators of the endocrine system, such as the estrogen receptor alpha (ERα), for example. The mechanistic stages of the estrogenic transcriptional activity can be measured with high content/high throughput microscopy-based biosensor assays at the single-cell level, which generates millions of object-based minable data points. By combining computational modeling and experimental analysis, we built a highly accurate data-driven classification framework to assess the endocrine disrupting potential of environmental compounds. The effects of these compounds on the ERα pathway are predicted as being receptor agonists or antagonists using the principal component analysis (PCA) projections of high throughput, high content image analysis descriptors. The framework also combines rigorous preprocessing steps and nonlinear machine learning algorithms, such as the Support Vector Machines and Random Forest classifiers, to develop highly accurate mathematical representations of the separation between ERα agonists and antagonists. The results show that Support Vector Machines classify the unseen chemicals correctly with more than 96% accuracy using the proposed framework, where the preprocessing and the PCA steps play a key role in suppressing experimental noise and unraveling hidden patterns in the dataset.
Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Sample numbers 200-222 form a validation set. This data is used to model a machine learning classifier for Estrogen Receptor Status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description:MotivationUntargeted metabolomics by mass spectrometry is the method of choice for unbiased analysis of molecules in complex samples of biological, clinical or environmental relevance. The exceptional versatility and sensitivity of modern high-resolution instruments allows profiling of thousands of known and unknown molecules in parallel. Inter-batch differences constitute a common and unresolved problem in untargeted metabolomics, and hinder the analysis of multi-batch studies or the intercomparison of experiments.ResultsWe present a new method, Regularized Adversarial Learning Preserving Similarity (RALPS), for the normalization of multi-batch untargeted metabolomics data. RALPS builds on deep adversarial learning with a three-term loss function that mitigates batch effects while preserving biological identity, spectral properties and coefficients of variation. Using two large metabolomics datasets, we showcase the superior performance of RALPS as compared with six state-of-the-art methods for batch correction. Further, we demonstrate that RALPS scales well, is robust, deals with missing values and can handle different experimental designs.Availability and implementationhttps://github.com/zamboni-lab/RALPS.Supplementary informationSupplementary data are available at Bioinformatics online.