speaq 2.0: A complete workflow for high-throughput 1D NMR spectra processing and quantification.
ABSTRACT: Nuclear Magnetic Resonance (NMR) spectroscopy is, together with liquid chromatography-mass spectrometry (LC-MS), the most established platform for metabolomics. In contrast to LC-MS, however, NMR data are still predominantly processed with commercial software, and the processing remains tedious and dependent on user intervention. As a follow-up to speaq, a previously released workflow for NMR spectral alignment and quantitation, we present speaq 2.0. This completely revised framework for the automated analysis of 1D NMR spectra uses wavelets to summarize the raw spectra efficiently, with minimal information loss and minimal user interaction. The tool offers a fast and easy workflow that starts with the common approach of peak picking, followed by peak grouping, thus avoiding the binning step. This yields a matrix of features, samples, and peak values that can be conveniently processed either with the included multivariate statistical functions or with many other recently developed methods for NMR data analysis. speaq 2.0 facilitates robust, high-throughput metabolomics based on 1D NMR and is also compatible with other NMR frameworks and complementary LC-MS workflows. The methods are benchmarked on a simulated dataset and two publicly available datasets. speaq 2.0 is distributed through the existing speaq R package to provide a complete solution for NMR data processing. The package and the code for the presented case studies are freely available on CRAN (https://cran.r-project.org/package=speaq) and GitHub (https://github.com/beirnaert/speaq).
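The core step of the workflow above, wavelet-based peak picking on a raw 1D spectrum, can be sketched in a few lines. This is an illustrative Python sketch using SciPy's continuous-wavelet-transform peak finder on simulated data, not speaq 2.0 itself (which is an R package); the peak positions, heights, and widths are invented for the example.

```python
import numpy as np
from scipy import signal

# Simulate a 1D NMR-like spectrum: three Gaussian peaks plus noise
# (hypothetical data for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 2000)
spectrum = np.zeros_like(x)
for center, height, width in [(2.0, 5.0, 0.05), (5.0, 3.0, 0.08), (7.5, 4.0, 0.05)]:
    spectrum += height * np.exp(-((x - center) ** 2) / (2 * width ** 2))
spectrum += rng.normal(0, 0.05, x.size)

# Continuous-wavelet-transform peak picking: the same general idea
# speaq 2.0 uses to summarize raw spectra without a binning step.
peak_idx = signal.find_peaks_cwt(spectrum, widths=np.arange(5, 40))
peak_ppm = x[peak_idx]
```

Grouping the picked peaks across samples (by position) would then yield the features-by-samples matrix the abstract describes.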
Project description: Global metabolomics based on high-resolution liquid chromatography mass spectrometry (LC-MS) has been increasingly employed in recent large-scale multi-omics studies. Processing and interpretation of these complex metabolomics datasets have become a key challenge in current computational metabolomics. Here, we introduce MetaboAnalystR 2.0 for comprehensive LC-MS data processing, statistical analysis, and functional interpretation. Compared to the previous version, this new release seamlessly integrates XCMS and CAMERA to support raw spectral processing and peak annotation, and also features high-performance implementations of mummichog and GSEA approaches for predictions of pathway activities. The application and utility of the MetaboAnalystR 2.0 workflow were demonstrated using a synthetic benchmark dataset and a clinical dataset. In summary, MetaboAnalystR 2.0 offers a unified and flexible workflow that enables end-to-end analysis of LC-MS metabolomics data within the open-source R environment.
Project description: Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of biologically non-relevant features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces TidyMS, a package for the Python programming language for preprocessing LC-MS data for quality control (QC) procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are shown with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies in addition to NIST SRM 1950 Metabolites in Frozen Human Plasma. The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.
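One representative curation rule of the kind described above, dropping features whose relative standard deviation (RSD) across pooled-QC injections exceeds a threshold, can be sketched with pandas. The feature table, sample names, and the 20% cut-off below are hypothetical; this illustrates the concept, not the TidyMS API.

```python
import pandas as pd

# Hypothetical feature table: rows = samples, columns = features
# (each feature is a retention time, m/z pair in practice).
data = pd.DataFrame(
    {
        "F1": [100, 102, 98, 250, 300],  # stable across QC injections -> keep
        "F2": [50, 10, 90, 40, 70],      # unstable across QC injections -> drop
    },
    index=["QC1", "QC2", "QC3", "S1", "S2"],
)

# Percent RSD per feature, computed on the QC injections only.
qc = data.loc[["QC1", "QC2", "QC3"]]
rsd = qc.std(ddof=1) / qc.mean() * 100

# Keep only features below a 20 % RSD threshold (assumed cut-off).
curated = data.loc[:, rsd < 20]
```

The retained matrix (`curated`) is the kind of cleaned table the abstract says is handed on to statistical analysis.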
Project description: Unidentified peaks remain a major problem in untargeted metabolomics by LC-MS/MS. Confidence in peak annotations increases by combining MS/MS matching and retention time. Here we show how retention times can be predicted from molecular structures. Two large, publicly available data sets were used for model training in machine learning: the Fiehn hydrophilic interaction liquid chromatography (HILIC) data set of 981 primary metabolites and biogenic amines, and the RIKEN plant specialized metabolome annotation (PlaSMA) database of 852 secondary metabolites that uses reversed-phase liquid chromatography (RPLC). Five different machine learning algorithms have been integrated into the Retip R package: the random forest, Bayesian-regularized neural network, XGBoost, light gradient-boosting machine (LightGBM), and Keras algorithms for building the retention time prediction models. A complete workflow for retention time prediction was developed in R. It can be freely downloaded from the GitHub repository (https://www.retip.app). Keras outperformed the other machine learning algorithms in the test set with minimum overfitting, verified by small error differences between training, test, and validation sets. Keras yielded a mean absolute error of 0.78 min for HILIC and 0.57 min for RPLC. Retip is integrated into the mass spectrometry software tools MS-DIAL and MS-FINDER, allowing a complete compound annotation workflow. In a test application on mouse blood plasma samples, we found a 68% reduction in the number of candidate structures when searching all isomers in the MS-FINDER compound identification software. Retention time prediction increases the identification rate in liquid chromatography and subsequently leads to an improved biological interpretation of metabolomics data.
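The train-and-evaluate pattern behind retention time prediction can be sketched with scikit-learn. Retip itself works in R on real molecular descriptors; in this sketch the two "descriptors" and the retention times are synthetic, so only the modeling pattern (fit a regressor on descriptors, report test-set mean absolute error in minutes) carries over.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for molecular descriptors: retention time is made a
# noisy function of two descriptor-like values (entirely invented data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 2))
rt = 2.0 + 8.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.3, 500)

# Hold out a test set, fit a random forest (one of the five Retip
# algorithms), and score by mean absolute error.
X_train, X_test, y_train, y_test = train_test_split(
    X, rt, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
```

Comparing this test MAE against training MAE is how overfitting is checked, as the abstract describes for Keras.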
Project description: Liquid chromatography (LC) separation combined with electrochemical coulometric array detection (EC) is a sensitive, reproducible, and robust technique that can detect hundreds of redox-active metabolites down to the level of femtograms on column, making it ideal for metabolomics profiling. EC detection cannot, however, structurally characterize unknown metabolites that comprise these profiles. Several aspects of LC-EC methods prevent a direct transfer to other structurally informative analytical methods, such as LC-MS and NMR. These include system limits of detection, buffer requirements, and detection mechanisms. To address these limitations, we developed a workflow based on the concentration of plasma, metabolite extraction, and offline LC-UV fractionation. Pooled human plasma was used to provide sufficient material necessary for multiple sample concentrations and platform analyses. Offline parallel LC-EC and LC-MS methods were established that correlated standard metabolites between the LC-EC profiling method and the mass spectrometer. Peak retention times (RT) from the LC-MS and LC-EC system were linearly related (r² = 0.99); thus, LC-MS RTs could be directly predicted from the LC-EC signals. Subsequent offline microcoil-NMR analysis of these collected fractions was used to confirm LC-MS characterizations by providing complementary, structural data. This work provides a validated workflow that is transferrable across multiple platforms and provides the unambiguous structural identifications necessary to move primary mathematically driven LC-EC biomarker discovery into biological and clinical utility.
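Because the two systems' retention times are linearly related (r² = 0.99), mapping an LC-EC peak to its expected LC-MS retention time reduces to a least-squares line fit. The paired retention times below are invented for illustration; only the fit-then-predict pattern reflects the workflow described above.

```python
import numpy as np

# Hypothetical paired retention times (minutes) for standards measured
# on both the LC-EC and LC-MS systems.
rt_ec = np.array([1.2, 3.5, 5.1, 7.8, 10.4, 12.9])
rt_ms = np.array([1.5, 4.0, 5.7, 8.6, 11.3, 13.9])

# Fit RT_MS = a * RT_EC + b by least squares.
a, b = np.polyfit(rt_ec, rt_ms, 1)
r = np.corrcoef(rt_ec, rt_ms)[0, 1]

# Predict the LC-MS retention time of an LC-EC peak eluting at 6.0 min.
predicted = a * 6.0 + b
```

With such a mapping, any LC-EC signal of interest can be matched to the corresponding region of the LC-MS run for structural follow-up.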
Project description: MOTIVATION: Modern analytical techniques such as LC-MS, GC-MS and NMR are increasingly being used to study the underlying dynamics of biological systems by tracking changes in metabolite levels over time. Such techniques are capable of providing information on large numbers of metabolites simultaneously, a feature that is exploited in non-targeted studies. However, since the dynamics of specific metabolites are unlikely to be known a priori, this presents an initial subjective challenge as to where the focus of the investigation should be. Whilst a number of feed-forward software tools are available for manipulation of metabolomic data, no tool centralizes on clustering, and focus is typically directed by a workflow that is chosen in advance. RESULTS: We present an interactive approach to time-course analyses and a complementary implementation in a software package, MetaboClust. This is presented through the analysis of two LC-MS time-course case studies on plants (Medicago truncatula and Alopecurus myosuroides). We demonstrate a dynamic, user-centric workflow to clustering with intrinsic visual feedback at all stages of analysis. The software is used to apply data correction, generate the time-profiles, perform exploratory statistical analysis and assign tentative metabolite identifications. Clustering is used to group metabolites in an unbiased manner, allowing pathway analysis to score metabolic pathways, based on their overlap with clusters showing interesting trends.
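The unbiased grouping step described above amounts to clustering metabolite time profiles. Below is a minimal sketch with scikit-learn's k-means on toy profiles (three rising, three falling; all values are hypothetical, and MetaboClust itself is an interactive tool rather than a script):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy time profiles for six metabolites over five time points:
# rows 0-2 rise over time, rows 3-5 fall (invented data).
profiles = np.array([
    [0.1, 0.3, 0.5, 0.8, 1.0],
    [0.0, 0.2, 0.6, 0.7, 0.9],
    [0.2, 0.4, 0.5, 0.9, 1.0],
    [1.0, 0.8, 0.5, 0.3, 0.1],
    [0.9, 0.7, 0.6, 0.2, 0.0],
    [1.0, 0.9, 0.4, 0.4, 0.2],
])

# Cluster the profiles into two trend groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
```

Each resulting cluster can then be overlapped with pathway membership to score pathways, as the abstract describes.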
Project description: Liquid chromatography-coulometric array detection (LC-EC) is a sensitive, quantitative, and robust metabolomics profiling tool that complements the commonly used mass spectrometry (MS) and nuclear magnetic resonance (NMR)-based approaches. However, LC-EC provides little structural information. We recently demonstrated a workflow for the structural characterization of metabolites detected by LC-EC profiling combined with LC-electrospray ionization (ESI)-MS and microNMR. This methodology is now extended to include (i) gas chromatography (GC)-electron ionization (EI)-MS analysis to fill structural gaps left by LC-ESI-MS and NMR and (ii) secondary fractionation of LC-collected fractions containing multiple coeluting analytes. GC-EI-MS spectra have more informative fragment ions that are reproducible for database searches. Secondary fractionation provides enhanced metabolite characterization by reducing spectral overlap in NMR and ion suppression in LC-ESI-MS. The need for these additional methods in the analysis of the broad chemical classes and concentration ranges found in plasma is illustrated with discussion of four specific examples: (i) characterization of compounds for which one or more of the detectors is insensitive (e.g., positional isomers in LC-MS, the direct detection of carboxylic groups and sulfonic groups in ¹H NMR, or nonvolatile species in GC-MS), (ii) detection of labile compounds, (iii) resolution of closely eluting and/or coeluting compounds, and (iv) the capability to harness structural similarities common in many biologically related, LC-EC-detectable compounds.
Project description: Metabolomics and proteomics, like other omics domains, usually face a data mining challenge in providing an understandable output to advance in biomarker discovery and precision medicine. Often, statistical analysis is one of the most difficult challenges and it is critical in the subsequent biological interpretation of the results. Because of this, combined with the computational programming skills needed for this type of analysis, several bioinformatic tools aimed at simplifying metabolomics and proteomics data analysis have emerged. However, sometimes the analysis is still limited to a few hidebound statistical methods and to data sets with limited flexibility. POMAShiny is a web-based tool that provides a structured, flexible and user-friendly workflow for the visualization, exploration and statistical analysis of metabolomics and proteomics data. This tool integrates several statistical methods, some of them widely used in other types of omics, and it is based on the POMA R/Bioconductor package, which increases the reproducibility and flexibility of analyses outside the web environment. POMAShiny and POMA are both freely available at https://github.com/nutrimetabolomics/POMAShiny and https://github.com/nutrimetabolomics/POMA, respectively.
Project description: INTRODUCTION: The field of metabolomics has expanded greatly over the past two decades, both as an experimental science with applications in many areas, as well as in regards to data standards and bioinformatics software tools. The diversity of experimental designs and instrumental technologies used for metabolomics has led to the need for distinct data analysis methods and the development of many software tools. OBJECTIVES: To compile a comprehensive list of the most widely used freely available software and tools that are used primarily in metabolomics. METHODS: The most widely used tools were selected for inclusion in the review by either ≥ 50 citations on Web of Science (as of 08/09/16) or the use of the tool being reported in the recent Metabolomics Society survey. Tools were then categorised by the type of instrumental data (i.e. LC-MS, GC-MS or NMR) and the functionality (i.e. pre- and post-processing, statistical analysis, workflow and other functions) they are designed for. RESULTS: A comprehensive list of the most used tools was compiled. Each tool is discussed within the context of its application domain and in relation to comparable tools of the same domain. An extended list including additional tools is available at https://github.com/RASpicer/MetabolomicsTools which is classified and searchable via a simple controlled vocabulary. CONCLUSION: This review presents the most widely used tools for metabolomics analysis, categorised based on their main functionality. As future work, we suggest a direct comparison of tools' abilities to perform specific data analysis tasks e.g. peak picking.
Project description: BACKGROUND: Advances in high-throughput methods have brought new challenges for biological data analysis, often requiring many interdependent steps applied to a large number of samples. To address this challenge, workflow management systems, such as Watchdog, have been developed to support scientists in the (semi-)automated execution of large analysis workflows. IMPLEMENTATION: Here, we present Watchdog 2.0, which implements new developments for module creation, reusability, and documentation and for reproducibility of analyses and workflow execution. Developments include a graphical user interface for semi-automatic module creation from software help pages, sharing repositories for modules and workflows, and a standardized module documentation format. The latter allows generation of a customized reference book of public and user-specific modules. Furthermore, extensive logging of workflow execution, module and software versions, and explicit support for package managers and container virtualization now ensures reproducibility of results. A step-by-step analysis protocol generated from the log file may, e.g., serve as a draft of a manuscript methods section. Finally, 2 new execution modes were implemented. One allows resuming workflow execution after interruption or modification without rerunning successfully executed tasks not affected by changes. The second one allows detaching and reattaching to workflow execution on a local computer while tasks continue running on computer clusters. CONCLUSIONS: Watchdog 2.0 provides several new developments that we believe to be of benefit for large-scale bioinformatics analysis and that are not completely covered by other competing workflow management systems. The software itself, module and workflow repositories, and comprehensive documentation are freely available at https://www.bio.ifi.lmu.de/watchdog.
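The first new execution mode, resuming after interruption without rerunning finished tasks, boils down to consulting a log of completed task IDs before executing each task. A minimal Python sketch of that idea follows (task names and log format are invented; Watchdog's actual logging and execution engine differ):

```python
import os
import tempfile

def load_completed(path):
    """Return the set of task IDs recorded as finished in the log."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

def run_workflow(tasks, log_path):
    """Run tasks in order, skipping any already recorded as done."""
    done = load_completed(log_path)
    executed = []
    with open(log_path, "a") as fh:
        for task_id, action in tasks:
            if task_id in done:
                continue  # finished in a previous run: do not rerun
            action()
            fh.write(task_id + "\n")
            executed.append(task_id)
    return executed

# First run executes everything; a rerun after "interruption" skips all.
log = os.path.join(tempfile.mkdtemp(), "completed.log")
tasks = [("align", lambda: None), ("quantify", lambda: None)]
first = run_workflow(tasks, log)
second = run_workflow(tasks, log)
```

Invalidating log entries for tasks whose inputs changed would extend this to the "or modification" case the abstract mentions.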