ABSTRACT: Metabolomics is the systematic study of the small-molecule profiles of biological samples produced by specific cellular processes. The high-throughput technologies used in metabolomic investigations generate datasets in which variables are strongly correlated and redundancy is present. Discovering the hidden information is a challenge, and suitable approaches for data analysis must be employed. Projection to latent structures regression (PLS) has successfully solved a large number of problems, from multivariate calibration to classification, becoming a basic tool of metabolomics. PLS2 is the most widely used implementation of PLS. Despite its success, PLS2 shows some limitations when so-called 'structured noise' affects the data. Suitable methods have recently been introduced to address these limitations. In this study, a comprehensive and up-to-date presentation of PLS2 focused on metabolomics is provided. After a brief discussion of the mathematical framework of PLS2, the post-transformation procedure is introduced as a basic tool for model interpretation. Orthogonally-constrained PLS2 is presented as a strategy to include constraints in the model according to the experimental design. Two experimental datasets are investigated to show how PLS2 and its improvements work in practice.
Project description:Premature leaf senescence (PLS), which has a significant impact on yield, is caused by various underlying mechanisms. Glycosyltransferases, which function in glycosyl transfer from activated nucleotides to aglycones, are involved in diverse biological processes, but their roles in rice leaf senescence remain elusive. Here, we isolated and characterized a leaf senescence-related gene from the Premature Leaf Senescent mutant (pls2). The mutant phenotype began with leaf yellowing at tillering and resulted in PLS during the reproductive stage. Leaf senescence was associated with an increase in hydrogen peroxide (H2O2) content accompanied by pronounced decreases in net photosynthetic rate, stomatal conductance, and transpiration rate. Map-based cloning revealed that a mutation in LOC_Os03g15840 (PLS2), a putative glycosyltransferase-encoding gene, was responsible for the defective phenotype. PLS2 expression was detected in all tissues surveyed, but predominantly in leaf mesophyll cells. The PLS2 protein localized to the endoplasmic reticulum. The pls2 mutant accumulated higher levels of sucrose, together with decreased expression of sucrose-metabolizing genes, compared with the wild type. These data suggest that the PLS2 allele is essential for normal leaf senescence and that its mutation results in PLS.
Project description:Notwithstanding its mitosporic nature, an improbable morpho-transformation state, i.e., sclerotial development (SD), is vaguely known in Aspergillus oryzae. Although SD is an intriguing phenomenon governing mold development and stress response, the effects of exogenous factors engendering it, especially the volatile organic compounds (VOCs) mediated interactions (VMI) pervasive in microbial niches, have remained largely unexplored. Herein, we examined the effects of intra-species VMI on SD in A. oryzae RIB 40, followed by comprehensive analyses of associated growth rates, pH alterations, biochemical phenotypes, and exometabolomes. We cultivated A. oryzae RIB 40 (S1VMI: KACC 44967) opposite a non-SD partner strain, A. oryzae (S2: KCCM 60345), conditioning VMI in a specially designed "twin plate assembly." Notably, SD in S1VMI was delayed relative to its non-conditioned control (S1) cultivated without the partner strain (S2) in the twin plate. Selectively evaluating A. oryzae RIB 40 (S1VMI vs. S1) for altered phenotypes concomitant with SD, we observed a marked disparity in the corresponding growth rates (S1VMI < S1, 7 days), media pH (S1VMI > S1, 7 days), and biochemical characteristics, viz., protease (S1VMI > S1, 7 days), amylase (S1VMI > S1, 3-7 days), and antioxidant (S1VMI > S1, 7 days) levels. The partial least squares-discriminant analysis (PLS-DA) of gas chromatography-time of flight-mass spectrometry (GC-TOF-MS) datasets for primary metabolites exhibited a clustered pattern (PLS1, 22.04%; PLS2, 11.36%), with the 7-day incubated S1VMI extracts showing a higher abundance of amino acids, sugars, and sugar alcohols and lower levels of organic acids and fatty acids, relative to S1. Intriguingly, the higher amino acid and sugar alcohol levels were positively correlated with antioxidant activity, likely impeding SD in S1VMI.
Further, the PLS-DA (PLS1, 18.11%; PLS2, 15.02%) based on liquid chromatography-mass spectrometry (LC-MS) datasets exhibited a notable disparity for post-SD (9-11 days) sample extracts, with higher oxylipin and 13-desoxypaxilline levels in S1VMI relative to S1, intertwining Aspergillus morphogenesis and secondary metabolism. The analysis of VOCs for the 7-day incubated samples displayed considerably higher accumulation of C-8 compounds in the headspace of twin-plate experimental sets (S1VMI:S2) compared to those in non-conditioned controls (S1 and S2, each without its respective partner strain), potentially triggering altered morpho-transformation and concurring biochemical as well as metabolic states in molds.
Project description:BACKGROUND: Supervised classification methods have been used for many years for feature selection in metabolomics and other omics studies. We developed a novel primal-dual based classification method (PD-CR) that can perform classification with rejection and feature selection on high-dimensional datasets. PD-CR projects data onto a low-dimensional space and performs classification by minimizing an appropriate quadratic cost. It simultaneously optimizes the selected features and the prediction accuracy with a new tailored, constrained primal-dual method. The primal-dual framework is general enough to encompass various robust losses and to allow for convergence analysis. Here, we compare PD-CR to three commonly used methods: partial least squares discriminant analysis (PLS-DA), random forests and support vector machines (SVM). We analyzed two metabolomics datasets: one urinary metabolomics dataset concerning lung cancer patients and healthy controls; and a metabolomics dataset obtained from frozen glial tumor samples with mutated isocitrate dehydrogenase (IDH) or wild-type IDH. RESULTS: PD-CR was more accurate than PLS-DA, random forests and SVM for classification on the two metabolomics datasets. It also selected biologically relevant metabolites. PD-CR has the advantage of providing a confidence score for each prediction, which can be used to perform classification with rejection. This substantially reduces the false discovery rate. CONCLUSION: PD-CR is an accurate method for classification of metabolomics datasets which can outperform PLS-DA, random forests and SVM while selecting biologically relevant features. Furthermore, the confidence score provided with PD-CR can be used to perform classification with rejection and reduce the false discovery rate.
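The idea of classification with rejection can be illustrated with a minimal sketch. It does not reproduce PD-CR: the per-class scores, the threshold value and the helper names `classify_with_rejection` and `accuracy_and_coverage` are assumptions made for illustration only. A prediction is kept only when its top score reaches the threshold; otherwise the sample is rejected.

```python
from typing import List, Optional, Sequence, Tuple


def classify_with_rejection(
    scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    threshold: float,
) -> List[Optional[str]]:
    """Return the top-scoring label per sample, or None when the
    top score falls below the (hypothetical) confidence threshold."""
    decisions: List[Optional[str]] = []
    for row in scores:
        best = max(range(len(row)), key=lambda k: row[k])
        decisions.append(labels[best] if row[best] >= threshold else None)
    return decisions


def accuracy_and_coverage(
    decisions: Sequence[Optional[str]], truth: Sequence[str]
) -> Tuple[float, float]:
    """Accuracy on the accepted samples and the fraction of samples accepted."""
    kept = [(d, t) for d, t in zip(decisions, truth) if d is not None]
    coverage = len(kept) / len(truth)
    accuracy = sum(d == t for d, t in kept) / len(kept) if kept else float("nan")
    return accuracy, coverage


# Illustrative scores for four samples and two classes (made-up values).
scores = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80], [0.48, 0.52]]
truth = ["tumor", "control", "control", "tumor"]

decisions = classify_with_rejection(scores, ["tumor", "control"], threshold=0.7)
print(decisions)                                # ['tumor', None, 'control', None]
print(accuracy_and_coverage(decisions, truth))  # (1.0, 0.5)
```

In this toy case the two low-confidence samples are exactly the ones that would have been misclassified, so rejecting them raises accuracy on the accepted samples at the cost of coverage.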
Project description:Partial least squares (PLS) is one of the most commonly used supervised modelling approaches for analysing multivariate metabolomics data. PLS is typically employed as either a regression model (PLS-R) or a classification model (PLS-DA). However, in metabolomics studies it is common to investigate multiple, potentially interacting, factors simultaneously following a specific experimental design. Such data often cannot be considered as a "pure" regression or a classification problem. Nevertheless, these data have often still been treated as a regression or classification problem and this could lead to ambiguous results. In this study, we investigated the feasibility of designing a hybrid target matrix Y that better reflects the experimental design than simple regression or binary class membership coding commonly used in PLS modelling. The new design of Y coding was based on the same principle used by structural modelling in machine learning techniques. Two real metabolomics datasets were used as examples to illustrate how the new Y coding can improve the interpretability of the PLS model compared to classic regression/classification coding.
Project description:Because cerebrospinal fluid (CSF) is the biofluid which interacts most closely with the central nervous system (CNS), it holds promise as a reporter of neurological disease, for example multiple sclerosis (MScl). To characterize the metabolomics profile of neuroinflammatory aspects of this disease we studied an animal model of MScl, experimental autoimmune/allergic encephalomyelitis (EAE). Because CSF also exchanges metabolites with blood via the blood-brain barrier, malfunctions occurring in the CNS may be reflected in the biochemical composition of blood plasma. The combination of blood plasma and CSF provides more complete information about the disease. Both biofluids can be studied by use of NMR spectroscopy. It is then necessary to perform combined analysis of the two different datasets. Mid-level data fusion was therefore applied to blood plasma and CSF datasets. First, relevant information was extracted from each biofluid dataset by use of linear support vector machine recursive feature elimination. The selected variables from each dataset were concatenated for joint analysis by partial least squares discriminant analysis (PLS-DA). The combined metabolomics information from plasma and CSF enables more efficient and reliable discrimination of the onset of EAE. Second, we introduced hierarchical model fusion, in which previously developed PLS-DA models are hierarchically combined. We show that this approach enables neuroinflamed rats (even on the day of onset) to be distinguished from either healthy or peripherally inflamed rats. Moreover, progression of EAE can be investigated because the model separates the onset and peak of the disease.
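The concatenation step of mid-level data fusion described above can be sketched as follows. The feature-selection step itself (linear SVM recursive feature elimination in the study) is not implemented here; the selected column indices, the toy matrices and the helper names `select_columns` and `fuse_mid_level` are hypothetical and stand in for its output.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def select_columns(block: Matrix, keep: Sequence[int]) -> Matrix:
    """Keep only the chosen variables (columns) of one data block."""
    return [[row[j] for j in keep] for row in block]


def fuse_mid_level(blocks: Sequence[Matrix], selections: Sequence[Sequence[int]]) -> Matrix:
    """Reduce each block to its selected variables, then concatenate
    the reduced blocks sample by sample (row-wise)."""
    if len(blocks) != len(selections):
        raise ValueError("one selection is required per block")
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks must describe the same samples")
    reduced = [select_columns(b, keep) for b, keep in zip(blocks, selections)]
    return [[value for r in reduced for value in r[i]] for i in range(n_samples)]


# Two samples measured in two biofluids (made-up intensities).
plasma = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
csf = [[7.0, 8.0], [9.0, 10.0]]

# Hypothetical output of a feature-selection step: column indices to keep.
fused = fuse_mid_level([plasma, csf], selections=[[0, 2], [1]])
print(fused)  # [[1.0, 3.0, 8.0], [4.0, 6.0, 10.0]]
```

The fused matrix would then be passed to a single classifier such as PLS-DA; rows must refer to the same animals in the same order in every block.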
Project description:Metabolomic studies with a time-series design are widely used for discovery and validation of biomarkers. In such studies, changes of metabolic profiles over time under different conditions (e.g., control and intervention) are compared, and metabolites responding differently between the conditions are identified as putative biomarkers. To incorporate time-series information into the variable (biomarker) selection in partial least squares regression (PLS) models, we created PLS models with different combinations of bilinear/trilinear X and group/time response dummy Y. In total, five PLS models were evaluated on two real datasets, and also on simulated datasets with varying characteristics (number of subjects, number of variables, inter-individual variability, intra-individual variability and number of time points). Variables showing specific temporal patterns observed visually and determined statistically were labelled as discriminating variables. Bootstrapped-VIP scores were calculated for variable selection, and the variable selection performance of the five PLS models was assessed based on their capacity to correctly select the discriminating variables. The results showed that the bilinear PLS model with group × time response as dummy Y provided the highest recall (true positive rate) of 83-95% with high precision, independent of most characteristics of the datasets. Trilinear PLS models tend to select a small number of variables with high precision but a relatively high false negative rate (lower power). They are also less affected by noise compared to bilinear PLS models. In datasets with high inter-individual variability, bilinear PLS models tend to provide higher recall while trilinear models tend to provide higher precision. Overall, we recommend bilinear PLS with group × time response Y for variable selection applications in metabolomics intervention time-series studies.
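Two ingredients of the evaluation above lend themselves to a small sketch: a group × time dummy response and the precision/recall of a variable selection. The exact coding used in the study is not given in the abstract, so the one-column-per-cell coding below is an assumption, and the function names, sample labels and metabolite names are hypothetical.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def group_time_dummy(
    groups: Sequence[Hashable], times: Sequence[Hashable]
) -> Tuple[List[tuple], List[List[int]]]:
    """Build a dummy Y with one column per observed (group, time) cell.
    Each row has a single 1 in the column of its own cell."""
    levels = sorted(set(zip(groups, times)))
    column = {level: k for k, level in enumerate(levels)}
    Y = [[0] * len(levels) for _ in groups]
    for i, cell in enumerate(zip(groups, times)):
        Y[i][column[cell]] = 1
    return levels, Y


def precision_recall(selected: Iterable[str], discriminating: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of a selected variable set against the known
    discriminating variables."""
    selected, discriminating = set(selected), set(discriminating)
    true_positives = len(selected & discriminating)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(discriminating) if discriminating else 0.0
    return precision, recall


groups = ["control", "control", "intervention", "intervention"]
times = [0, 1, 0, 1]
levels, Y = group_time_dummy(groups, times)
print(levels)  # [('control', 0), ('control', 1), ('intervention', 0), ('intervention', 1)]
print(Y)       # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

precision, recall = precision_recall(
    selected={"m1", "m2", "m3", "m7"},
    discriminating={"m1", "m2", "m3", "m4", "m5", "m6"},
)
print(precision, recall)  # 0.75 0.5
```

Here three of the four selected variables are truly discriminating (precision 0.75), but only three of the six discriminating variables were found (recall 0.5), which is the trade-off the abstract describes for the trilinear models.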
Project description:BACKGROUND: Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems is currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. DESCRIPTION: Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g., metabolite, species, organ/biofluid etc.). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. CONCLUSIONS: MetabolomeExpress (https://www.metabolome-express.org) provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications.
Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.
Project description:Metabolomics datasets are commonly acquired by either mass spectrometry (MS) or nuclear magnetic resonance spectroscopy (NMR), despite their fundamental complementarity. In fact, combining MS and NMR datasets greatly improves the coverage of the metabolome and enhances the accuracy of metabolite identification, providing a detailed and high-throughput analysis of metabolic changes due to disease, drug treatment, or a variety of other environmental stimuli. Ideally, a single metabolomics sample would be simultaneously used for both MS and NMR analyses, minimizing the potential for variability between the two datasets. This necessitates the optimization of sample preparation, data collection and data handling protocols to effectively integrate direct-infusion MS data with one-dimensional (1D) ¹H NMR spectra. To achieve this goal, we report for the first time the optimization of (i) metabolomics sample preparation for dual analysis by NMR and MS, (ii) high throughput, positive-ion direct infusion electrospray ionization mass spectrometry (DI-ESI-MS) for the analysis of complex metabolite mixtures, and (iii) data handling protocols to simultaneously analyze DI-ESI-MS and 1D ¹H NMR spectral data using multiblock bilinear factorizations, namely multiblock principal component analysis (MB-PCA) and multiblock partial least squares (MB-PLS). Finally, we demonstrate the combined use of backscaled loadings, accurate mass measurements and tandem MS experiments to identify metabolites significantly contributing to class separation in MB-PLS-DA scores. We show that integration of NMR and DI-ESI-MS datasets yields a substantial improvement in the analysis of neurotoxin involvement in dopaminergic cell death.
Project description:Missing values exist widely in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the choice of method can significantly affect subsequent data analyses. Typically, there are three types of missing values: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis was used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed the best for MCAR/MAR and QRILC was the favored one for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web-tool for the application of missing value imputation in metabolomics ( https://metabolomics.cc.hawaii.edu/software/MetImp/ ).
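As a rough illustration of how an imputation method can be scored, the sketch below applies half-minimum (HM) imputation to one metabolite and computes an NRMSE against known true values. The NRMSE form used here, the root of the mean squared error divided by the population variance of the true values over all entries, is an assumption; the abstract does not state the exact normalization. The function names and data are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def half_min_impute(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with half of the smallest observed value."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = min(observed) / 2.0
    return [fill if v is None else v for v in column]


def nrmse(imputed: Sequence[float], truth: Sequence[float]) -> float:
    """sqrt(mean squared error / population variance of the true values).
    Assumed normalization; other definitions exist."""
    n = len(truth)
    mse = sum((i - t) ** 2 for i, t in zip(imputed, truth)) / n
    mean = sum(truth) / n
    variance = sum((t - mean) ** 2 for t in truth) / n
    return math.sqrt(mse / variance)


# One metabolite across four samples; the lowest value is left-censored.
truth = [1.0, 4.0, 6.0, 9.0]
with_missing = [None, 4.0, 6.0, 9.0]

imputed = half_min_impute(with_missing)
print(imputed)  # [2.0, 4.0, 6.0, 9.0]
score = nrmse(imputed, truth)
print(round(score, 4))
```

Lower NRMSE means the imputed values sit closer to the truth; ranking methods by this score across datasets gives the sum-of-ranks comparison mentioned above.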
Project description:Data analysis for metabolomics is undergoing rapid progress thanks to the proliferation of novel tools and the standardization of existing workflows. As untargeted metabolomics datasets and experiments continue to increase in size and complexity, standardized workflows are often not sufficiently sophisticated. In addition, the ground truth for untargeted metabolomics experiments is intrinsically unknown and the performance of tools is difficult to evaluate. Here, the problem of dynamic multi-class metabolomics experiments was investigated using a simulated dataset with a known ground truth. This simulated dataset was used to evaluate the performance of tinderesting, a new and intuitive tool based on gathering expert knowledge to be used in machine learning. The results were compared to EDGE, a statistical method for time series data. This paper presents three novel outcomes. The first is a way to simulate dynamic metabolomics data with a known ground truth based on ordinary differential equations. This method is made available through the MetaboLouise R package. Second, the EDGE tool, originally developed for genomics data analysis, is shown to perform well in analyzing dynamic case vs. control metabolomics data. Third, the tinderesting method is introduced to analyse more complex dynamic metabolomics experiments. This tool consists of a Shiny app for collecting expert knowledge, which in turn is used to train a machine learning model to emulate the decision process of the expert. This approach does not replace traditional data analysis workflows for metabolomics, but can provide additional information, improved performance or easier interpretation of results. The advantage is that the tool is agnostic to the complexity of the experiment, and thus is easier to use in advanced setups. All code for the presented analysis, as well as MetaboLouise and tinderesting, is freely available.