MetAssign: probabilistic annotation of metabolites from LC-MS data using a Bayesian clustering approach.
ABSTRACT: The use of liquid chromatography coupled to mass spectrometry has enabled the high-throughput profiling of the metabolite composition of biological samples. However, the large amount of data obtained can be difficult to analyse and often requires computational processing to understand which metabolites are present in a sample. This article looks at the dual problem of annotating peaks in a sample with a metabolite, together with putatively annotating whether a metabolite is present in the sample. The starting point of the approach is a Bayesian clustering of peaks into groups, each corresponding to putative adducts and isotopes of a single metabolite.The Bayesian modelling introduced here combines information from the mass-to-charge ratio, retention time and intensity of each peak, together with a model of the inter-peak dependency structure, to increase the accuracy of peak annotation. The results inherently contain a quantitative estimate of confidence in the peak annotations and allow an accurate trade-off between precision and recall. Extensive validation experiments using authentic chemical standards show that this system is able to produce more accurate putative identifications than other state-of-the-art systems, while at the same time giving a probabilistic measure of confidence in the annotations.The software has been implemented as part of the mzMatch metabolomics analysis pipeline, which is available for download at http://mzmatch.sourceforge.net/.
Project description:Stable isotope-labelling experiments have recently gained increasing popularity in metabolomics studies, providing unique insights into the dynamics of metabolic fluxes, beyond the steady-state information gathered by routine mass spectrometry. However, most liquid chromatography-mass spectrometry data analysis software lacks features that enable automated annotation and relative quantification of labelled metabolite peaks. Here, we describe mzMatch-ISO, a new extension to the metabolomics analysis pipeline mzMatch.R.Targeted and untargeted isotope profiling using mzMatch-ISO provides a convenient visual summary of the quality and quantity of labelling for every metabolite through four types of diagnostic plots that show (i) the chromatograms of the isotope peaks of each compound in each sample group; (ii) the ratio of mono-isotopic and labelled peaks indicating the fraction of labelling; (iii) the average peak area of mono-isotopic and labelled peaks in each sample group; and (iv) the trend in the relative amount of labelling in a predetermined isotopomer. To aid further statistical analyses, the values used for generating these plots are also provided as a tab-delimited file. We demonstrate the power and versatility of mzMatch-ISO by analysing a (13)C-labelled metabolome dataset from trypanosomal parasites.mzMatch.R and mzMatch-ISO are available free of charge from http://mzmatch.sourceforge.net and can be used on Linux and Windows platforms running the latest version of R.email@example.com.Supplementary data are available at Bioinformatics online.
Project description:Untargeted metabolomics can detect more than 10?000 peaks in a single LC-MS run. The correspondence between these peaks and metabolites, however, remains unclear. Here, we introduce a Peak Annotation and Verification Engine (PAVE) for annotating untargeted microbial metabolomics data. The workflow involves growing cells in 13C and 15N isotope-labeled media to identify peaks from biological compounds and their carbon and nitrogen atom counts. Improved deisotoping and deadducting are enabled by algorithms that integrate positive mode, negative mode, and labeling data. To distinguish metabolites and their fragments, PAVE experimentally measures the response of each peak to weak in-source collision induced dissociation, which increases the peak intensity for fragments while decreasing it for their parent ions. The molecular formulas of the putative metabolites are then assigned based on database searching using both m/ z and C/N atom counts. Application of this procedure to Saccharomyces cerevisiae and Escherichia coli revealed that more than 80% of peaks do not label, i.e., are environmental contaminants. More than 70% of the biological peaks are isotopic variants, adducts, fragments, or mass spectrometry artifacts yielding ?2000 apparent metabolites across the two organisms. About 650 match to a known metabolite formula based on m/ z and C/N atom counts, with 220 assigned structures based on MS/MS and/or retention time to match to authenticated standards. Thus, PAVE enables systematic annotation of LC-MS metabolomics data with only ?4% of peaks annotated as apparent metabolites.
Project description:NMR-based metabolomics is a powerful tool to comprehensively monitor chemical processes in biological systems. Key to its success is the accurate and complete metabolite identification and quantification. Due to the inherent complexity of most metabolic mixtures, NMR peak overlap can make data analysis of 1D or even 2D NMR spectra challenging, especially for the 1H spectral region from 3.2-4.5 ppm that is dominated by carbohydrates and their derivatives. To address this problem, we present an effective method for carbohydrate signal removal in complex metabolomics samples by oxidation via the addition of sodium periodate (NaIO4). In an optional step, reaction products can be removed with hydrazide beads. The treated samples show substantially simplified 1D and 2D NMR spectra with their carbohydrate peaks removed, whereas noncarbohydrate peaks remain mostly unaffected. This allows the unrestricted detection of those metabolites that are otherwise obscured by carbohydrate signals. The method was first tested for metabolite model mixtures and then applied to urine and serum samples. It revealed a significant number of noncarbohydrates that were made unambiguously observable and identifiable by this method. The proposed protocol is simple and it is suitable for high-throughput sample treatment for the comprehensive metabolite identification in a broad range of samples.
Project description:Efficient and accurate quantitation of metabolites from LC-MS data has become an important topic. Here we present an automated tool, called iMet-Q (intelligent Metabolomic Quantitation), for label-free metabolomics quantitation from high-throughput MS1 data. By performing peak detection and peak alignment, iMet-Q provides a summary of quantitation results and reports ion abundance at both replicate level and sample level. Furthermore, it gives the charge states and isotope ratios of detected metabolite peaks to facilitate metabolite identification. An in-house standard mixture and a public Arabidopsis metabolome data set were analyzed by iMet-Q. Three public quantitation tools, including XCMS, MetAlign, and MZmine 2, were used for performance comparison. From the mixture data set, seven standard metabolites were detected by the four quantitation tools, for which iMet-Q had a smaller quantitation error of 12% in both profile and centroid data sets. Our tool also correctly determined the charge states of seven standard metabolites. By searching the mass values for those standard metabolites against Human Metabolome Database, we obtained a total of 183 metabolite candidates. With the isotope ratios calculated by iMet-Q, 49% (89 out of 183) metabolite candidates were filtered out. From the public Arabidopsis data set reported with two internal standards and 167 elucidated metabolites, iMet-Q detected all of the peaks corresponding to the internal standards and 167 metabolites. Meanwhile, our tool had small abundance variation (? 0.19) when quantifying the two internal standards and had higher abundance correlation (? 0.92) when quantifying the 167 metabolites. iMet-Q provides user-friendly interfaces and is publicly available for download at http://ms.iis.sinica.edu.tw/comics/Software_iMet-Q.html.
Project description:BACKGROUND: Mass spectrometry-based metabolomic analysis depends upon the identification of spectral peaks by their mass and retention time. Statistical analysis that follows the identification currently relies on one main peak of each compound. However, a compound present in the sample typically produces several spectral peaks due to its isotopic properties and the ionization process of the mass spectrometer device. In this work, we investigate the extent to which these additional peaks can be used to increase the statistical strength of differential analysis. RESULTS: We present a Bayesian approach for integrating data of multiple detected peaks that come from one compound. We demonstrate the approach through a simulated experiment and validate it on ultra performance liquid chromatography-mass spectrometry (UPLC-MS) experiments for metabolomics and lipidomics. Peaks that are likely to be associated with one compound can be clustered by the similarity of their chromatographic shape. Changes of concentration between sample groups can be inferred more accurately when multiple peaks are available. CONCLUSIONS: When the sample-size is limited, the proposed multi-peak approach improves the accuracy at inferring covariate effects. An R implementation and data are available at http://research.ics.aalto.fi/mi/software/peakANOVA/.
Project description:Quantitative one-dimensional (1D) (1)H NMR spectroscopy is a useful tool for determining metabolite concentrations because of the direct proportionality of signal intensity to the quantity of analyte. However, severe signal overlap in 1D (1)H NMR spectra of complex metabolite mixtures hinders accurate quantification. Extension of 1D (1)H to 2D (1)H-(13)C HSQC leads to the dispersion of peaks along the (13)C dimension and greatly alleviates peak overlapping. Although peaks are better resolved in 2D (1)H-(13)C HSQC than in 1D (1)H NMR spectra, the simple proportionality of cross peaks to the quantity of individual metabolites is lost by resonance-specific signal attenuation during the coherence transfer periods. As a result, peaks for individual metabolites usually are quantified by reference to calibration data collected from samples of known concentration. We show here that data from a series of HSQC spectra acquired with incremented repetition times (the time between the end of the first (1)H excitation pulse to the beginning of data acquisition) can be extrapolated back to zero time to yield a time-zero 2D (1)H-(13)C HSQC spectrum (HSQC(0)) in which signal intensities are proportional to concentrations of individual metabolites. Relative concentrations determined from cross peak intensities can be converted to absolute concentrations by reference to an internal standard of known concentration. Clustering of the HSQC(0) cross peaks by their normalized intensities identifies those corresponding to metabolites present at a given concentration, and this information can assist in assigning these peaks to specific compounds. The concentration measurement for an individual metabolite can be improved by averaging the intensities of multiple, nonoverlapping cross peaks assigned to that metabolite.
Project description:Liquid chromatography coupled to mass spectrometry (LCMS) is widely used in metabolomics due to its sensitivity, reproducibility, speed and versatility. Metabolites are detected as peaks which are characterised by mass-over-charge ratio (m/z) and retention time (rt), and one of the most critical but also the most challenging tasks in metabolomics is to annotate the large number of peaks detected in biological samples. Accurate m/z measurements enable the prediction of molecular formulae which provide clues to the chemical identity of peaks, but often a number of metabolites have identical molecular formulae. Chromatographic behaviour, reflecting the physicochemical properties of metabolites, should also provide structural information. However, the variation in rt between analytical runs, and the complicating factors underlying the observed time shifts, make the use of such information for peak annotation a non-trivial task. To this end, we conducted Quantitative Structure-Retention Relationship (QSRR) modelling between the calculated molecular descriptors (MDs) and the experimental retention times (rts) of 93 authentic compounds analysed using hydrophilic interaction liquid chromatography (HILIC) coupled to high resolution MS. A predictive QSRR model based on Random Forests algorithm outperformed a Multiple Linear Regression based model, and achieved a high correlation between predicted rts and experimental rts (Pearson's correlation coefficient = 0.97), with mean and median absolute error of 0.52 min and 0.34 min (corresponding to 5.1 and 3.2 % error), respectively. We demonstrate that rt prediction with the precision achieved enables the systematic utilisation of rts for annotating unknown peaks detected in a metabolomics study. The application of the QSRR model with the strategy we outlined enhanced the peak annotation process by reducing the number of false positives resulting from database queries by matching accurate mass alone, and enriching the reference library. The predicted rts were validated using either authentic compounds or ion fragmentation patterns.
Project description:Comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC × GC-MS) is a powerful technique which has gained increasing attention over the last two decades. The GC × GC-MS provides much increased separation capacity, chemical selectivity and sensitivity for complex sample analysis and brings more accurate information about compound retention times and mass spectra. Despite these advantages, the retention times of the resolved peaks on the two-dimensional gas chromatographic columns are always shifted due to experimental variations, introducing difficulty in the data processing for metabolomics analysis. Therefore, the retention time variation must be adjusted in order to compare multiple metabolic profiles obtained from different conditions.We developed novel peak alignment algorithms for both homogeneous (acquired under the identical experimental conditions) and heterogeneous (acquired under the different experimental conditions) GC × GC-MS data using modified Smith-Waterman local alignment algorithms along with mass spectral similarity. Compared with literature reported algorithms, the proposed algorithms eliminated the detection of landmark peaks and the usage of retention time transformation. Furthermore, an automated peak alignment software package was established by implementing a likelihood function for optimal peak alignment.The proposed Smith-Waterman local alignment-based algorithms are capable of aligning both the homogeneous and heterogeneous data of multiple GC × GC-MS experiments without the transformation of retention times and the selection of landmark peaks. An optimal version of the SW-based algorithms was also established based on the associated likelihood function for the automatic peak alignment. The proposed alignment algorithms outperform the literature reported alignment method by analyzing the experiment data of a mixture of compound standards and a metabolite extract of mouse plasma with spiked-in compound standards.
Project description:We introduce a web-based tool, Peak Annotation and Visualization (PAVIS), for annotating and visualizing ChIP-seq peak data. PAVIS is designed with non-bioinformaticians in mind and presents a straightforward user interface to facilitate biological interpretation of ChIP-seq peak or other genomic enrichment data. PAVIS, through association with annotation, provides relevant genomic context for each peak, such as peak location relative to genomic features including transcription start site, intron, exon or 5'/3'-untranslated region. PAVIS reports the relative enrichment P-values of peaks in these functionally distinct categories, and provides a summary plot of the relative proportion of peaks in each category. PAVIS, unlike many other resources, provides a peak-oriented annotation and visualization system, allowing dynamic visualization of tens to hundreds of loci from one or more ChIP-seq experiments, simultaneously. PAVIS enables rapid, and easy examination and cross-comparison of the genomic context and potential functions of the underlying genomic elements, thus supporting downstream hypothesis generation.
Project description:A set of data preprocessing algorithms for peak detection and peak list alignment are reported for analysis of liquid chromatography-mass spectrometry (LC-MS)-based metabolomics data. For spectrum deconvolution, peak picking is achieved at the selected ion chromatogram (XIC) level. To estimate and remove the noise in XICs, each XIC is first segmented into several peak groups based on the continuity of scan number, and the noise level is estimated by all the XIC signals, except the regions potentially with presence of metabolite ion peaks. After removing noise, the peaks of molecular ions are detected using both the first and the second derivatives, followed by an efficient exponentially modified Gaussian-based peak deconvolution method for peak fitting. A two-stage alignment algorithm is also developed, where the retention times of all peaks are first transferred into the z-score domain and the peaks are aligned based on the measure of their mixture scores after retention time correction using a partial linear regression. Analysis of a set of spike-in LC-MS data from three groups of samples containing 16 metabolite standards mixed with metabolite extract from mouse livers demonstrates that the developed data preprocessing method performs better than two of the existing popular data analysis packages, MZmine2.6 and XCMS(2), for peak picking, peak list alignment, and quantification.