iMatch2: compound identification using retention index for analysis of gas chromatography-mass spectrometry data.
ABSTRACT: We developed a method, iMatch2, for compound identification using retention indices (RI) in the NIST11 library. A three-way ANOVA test and a Kruskal-Wallis test demonstrate that the column class and the temperature program type defined by the NIST library are the dominant factors affecting the magnitude of the retention index, while the retention index data type does not cause a significant difference. The developed linear regression transformation for merging retention indices with different data types, but the same column class and temperature program type, reduces the standard deviation of the retention index by up to 8% compared to the simple union approach used in the original iMatch. As for outlier detection methods to remove retention indices that differ greatly from the remaining data of the same compound, the Tietjen-Moore test and the generalized extreme studentized deviate test are the strictest methods, while methods such as Dixon's test, the Thompson tau approach, and Grubbs' test are more conservative. To improve the accuracy of the retention index window, the concept of a compound-specific retention index window is introduced for compounds with a large number of retention indices in the NIST11 library, while the retention index window is calculated from empirical distributions for compounds with a small number of retention indices. Analysis of the experimental data of a mixture of compound standards and the metabolite extract from mouse liver shows significant improvement in the retention index quality of the NIST11 library and in the new data analysis methods.
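The merging step described above can be illustrated with a minimal sketch. It assumes that, within one column class and temperature program type, some compounds have retention indices recorded under two data types, and that a straight line fitted to those pairs maps one data type onto the other before the values are pooled. The function names, the pairing scheme, and all numbers are hypothetical and are not taken from iMatch2.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def merge_ri(values_a, values_b, intercept, slope):
    """Map data-type-A values onto the data-type-B scale, then pool them."""
    return [intercept + slope * v for v in values_a] + list(values_b)

# Hypothetical paired RIs for compounds measured under both data types.
paired_a = [800.0, 1000.0, 1200.0, 1400.0]
paired_b = [805.0, 1004.0, 1203.0, 1402.0]
intercept, slope = fit_line(paired_a, paired_b)

# Hypothetical compound with RIs recorded under both data types.
ri_a = [1098.0, 1100.0]
ri_b = [1103.0, 1104.0]
union = ri_a + ri_b                              # simple union, no transformation
merged = merge_ri(ri_a, ri_b, intercept, slope)  # regression-based merge

print("std of simple union:", stdev(union))
print("std after regression merge:", stdev(merged))
```

In this toy data set the systematic offset between the two data types inflates the spread of the simple union, and the fitted line removes most of it. Whether a comparable reduction appears on real library data depends on how strongly the data types actually differ.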
Project description:Prediction of gas chromatographic retention indices based on compound structure is an important task for analytical chemistry. The predicted retention indices can be used as a reference in a mass spectrometry library search despite the fact that their accuracy is worse in comparison with the experimental reference ones. In the last few years, deep learning was applied for this task. The use of deep learning drastically improved the accuracy of retention index prediction for non-polar stationary phases. In this work, we demonstrate for the first time the use of deep learning for retention index prediction on polar (e.g., polyethylene glycol, DB-WAX) and mid-polar (e.g., DB-624, DB-210, DB-1701, OV-17) stationary phases. The achieved accuracy lies in the range of 16-50 in terms of the mean absolute error for several stationary phases and test data sets. We also demonstrate that our approach can be directly applied to the prediction of the second dimension retention times (GC × GC) if a large enough data set is available. The achieved accuracy is considerably better compared with the previous results obtained using linear quantitative structure-retention relationships and ACD ChromGenius software. The source code and pre-trained models are available online.
Project description:Comprehensive two-dimensional gas chromatography mass spectrometry (GC × GC-MS) has been widely used for analysis of volatile compounds. However, the second dimension retention index (I) of each compound is not widely used to aid compound identification owing to the limited accuracy of I calculation. We report a surface fitting approach to the calculation of I using n-alkanes (C7-C30) as references, where the second dimension retention time (2tR) and the second dimension column temperature (2Te) form the X-Y plane and I is the Z-axis, together defining the I surface. Compared to the conventional approach for calculating I using isovolatility curves, the surface fitting approach eliminates the construction of isovolatility curves for the reference compounds and gives better reproducibility. The goodness of fit of the proposed surface reached R2 = 0.9999 and RMSE = 6.1 retention index units (iu). Ten-fold cross validation demonstrated that the surface fitting approach has good predictive ability, with average R2 = 0.9999 and RMSE = 6.6 iu. The developed method was also applied to calculate the second dimension retention indices of compound standards in two commercial mixtures, MegaMix A and MegaMix B. The mean standard deviation of the calculated I was only 1.6 iu for compounds in MegaMix A and 3.4 iu for compounds in MegaMix B. Compared with the literature results, the small standard deviation of the retention indices calculated by surface fitting shows that the surface fitting method has less measurement variability than the conventional isovolatility curve approach.
Project description:A method was developed to employ National Institute of Standards and Technology (NIST) 2008 retention index database information for molecular retention matching via constructing a set of empirical distribution functions (DFs) of the absolute retention index deviation to its mean value. The effects of different experimental parameters on the molecules' retention indices were first assessed. The column class, the column type, and the data type have significant effects on the retention index values acquired on capillary columns. However, the normal alkane retention index (I(norm)) with the ramp condition is similar to the linear retention index (I(T)), while the I(norm) with the isothermal condition is similar to the Kováts retention index (I). As for the I(norm) with the complex condition, these data should be treated as an additional group, because the mean I(norm) value of the polar column is significantly different from the I(T). Based on this analysis, nine DFs were generated from the grouped retention index data. The DF information was further implemented into a software program called iMatch. The performance of iMatch was evaluated using experimental data of a mixture of standards and metabolite extract of rat plasma with spiked-in standards. About 19% of the molecules identified by ChromaTOF were filtered out by iMatch from the identification list of electron ionization (EI) mass spectral matching, while all of the spiked-in standards were preserved. The analysis results demonstrate that using the retention index values, via constructing a set of DFs, can improve the spectral matching-based identifications by reducing a significant portion of false-positives.
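The distribution-function idea above can be sketched as follows. This is a minimal illustration for a single group of retention index data, assuming the empirical DF is built from the absolute deviation of each library value from its compound mean and that a candidate is kept when its deviation falls inside the window read off that DF at a chosen confidence level. The function names, the confidence level, and the data are hypothetical and do not reproduce the iMatch implementation.

```python
from statistics import mean

def absolute_deviations(ri_by_compound):
    """Collect |RI - compound mean| over all compounds with at least 2 values."""
    deviations = []
    for values in ri_by_compound.values():
        if len(values) < 2:
            continue  # a single value carries no deviation information
        center = mean(values)
        deviations.extend(abs(v - center) for v in values)
    return sorted(deviations)

def empirical_cdf(sorted_devs, d):
    """Fraction of library deviations that are <= d."""
    return sum(1 for x in sorted_devs if x <= d) / len(sorted_devs)

def window_at(sorted_devs, confidence):
    """Smallest observed deviation whose empirical CDF reaches `confidence`."""
    for d in sorted_devs:
        if empirical_cdf(sorted_devs, d) >= confidence:
            return d
    return sorted_devs[-1]

def keep_candidate(observed_ri, library_ris, window):
    """Keep a spectral match only if its RI lies within the window."""
    return abs(observed_ri - mean(library_ris)) <= window

# Hypothetical library entries for one column class / data type group.
group = {"A": [1000.0, 1004.0, 1008.0], "B": [1500.0, 1502.0], "C": [1200.0]}
devs = absolute_deviations(group)
window = window_at(devs, confidence=0.8)

print("window:", window)
print(keep_candidate(1006.0, group["A"], window))
print(keep_candidate(1010.0, group["A"], window))
```

In a full workflow one such DF would be built per data group (the abstract reports nine), and the filter would be applied to the candidate list produced by spectral matching.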
Project description:We report a compound identification method (SimMR), which simultaneously evaluates the mass spectrum similarity and the retention index distance using an empirical mixture score function, for the analysis of GC-MS data. The performance of the developed SimMR method was compared to that of two existing compound identification strategies. One is the mass spectrum matching method without incorporation of retention index information (SM). The other is the method that sequentially evaluates the mass spectrum similarity and retention index distance (SeqMR). For comparison purposes, we used the NIST/EPA/NIH Mass Spectral Library 2005. Our study demonstrates that SimMR performs the best among the three compound identification methods, by improving the overall identification accuracy up to 1.53% and 4.81% compared to SeqMR and SM, respectively.
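The difference between sequential and simultaneous evaluation can be shown with a small sketch. The abstract does not give the form of the empirical mixture score, so the weighted sum below, its `weight` and `ri_scale` parameters, and the candidate data are purely hypothetical stand-ins chosen for illustration. Spectrum similarities are taken as precomputed inputs.

```python
import math

def seq_mr(candidates, ri_obs, ri_window):
    """Sequential: drop candidates outside the RI window, then rank by similarity."""
    kept = [c for c in candidates if abs(c["ri"] - ri_obs) <= ri_window]
    return sorted(kept, key=lambda c: c["similarity"], reverse=True)

def sim_mr(candidates, ri_obs, weight=0.8, ri_scale=30.0):
    """Simultaneous: rank by one score mixing similarity and RI distance.

    The score form is an assumption for this sketch, not the published function.
    """
    def score(c):
        ri_term = math.exp(-abs(c["ri"] - ri_obs) / ri_scale)
        return weight * c["similarity"] + (1.0 - weight) * ri_term
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates: precomputed spectrum similarity and library RI.
candidates = [
    {"name": "X", "similarity": 0.95, "ri": 1250.0},
    {"name": "Y", "similarity": 0.93, "ri": 1102.0},
    {"name": "Z", "similarity": 0.60, "ri": 1100.0},
]
ri_obs = 1100.0

print([c["name"] for c in seq_mr(candidates, ri_obs, ri_window=20.0)])
print([c["name"] for c in sim_mr(candidates, ri_obs)])
```

The sequential variant discards candidate X outright, whereas the simultaneous variant keeps it in the ranking with a penalty, so a candidate with an inaccurate library RI is demoted rather than lost.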
Project description:Retention index (RI) is useful for metabolite identification. However, when RI is integrated with mass spectral similarity for metabolite identification, conflicting RI threshold settings are reported in the literature. In this study, a large-scale test dataset of 5844 compounds with both mass spectra and RI information was created from the National Institute of Standards and Technology (NIST) repetitive mass spectra (MS) and RI library. Three MS similarity measures, namely the NIST composite measure, the real part of the Discrete Fourier Transform (DFT.R), and the detail of the Discrete Wavelet Transform (DWT.D), were used to investigate the accuracy of compound identification using the test dataset. To imitate real identification experiments, the NIST MS main library was employed as the reference library and the test dataset was used as search data. Our study shows that the optimal RI thresholds are 22, 15, and 15 i.u. for the NIST composite, DFT.R, and DWT.D measures, respectively, when RI and mass spectral similarity are integrated for compound identification. Compared to mass spectrum matching alone, using both RI and mass spectral matching improves the identification accuracy by 1.7%, 3.5%, and 3.5% for the three mass spectral similarity measures, respectively. It is concluded that the improvement that RI matching brings to compound identification depends heavily on the mass spectral similarity measure and on the accuracy of the RI data.
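How an optimal RI threshold can be located is sketched below. The sketch assumes a test set in which each query has a known identity, an experimental RI, and a list of library candidates with precomputed similarity scores; the threshold that maximizes top-hit accuracy over a grid is then selected. The data layout, the grid, and all values are hypothetical, and the similarity scores stand in for whichever measure is used.

```python
def top_hit(candidates, ri_query, threshold):
    """Best-scoring candidate whose library RI is within `threshold` of the query RI."""
    kept = [c for c in candidates if abs(c[2] - ri_query) <= threshold]
    if not kept:
        return None
    return max(kept, key=lambda c: c[1])[0]

def accuracy(queries, threshold):
    """Fraction of queries whose top hit equals the known identity."""
    hits = sum(
        1 for q in queries
        if top_hit(q["candidates"], q["ri"], threshold) == q["truth"]
    )
    return hits / len(queries)

def best_threshold(queries, grid):
    """Threshold from `grid` with the highest accuracy (first one wins ties)."""
    return max(grid, key=lambda t: accuracy(queries, t))

# Hypothetical test set; candidates are (name, similarity, library RI).
queries = [
    {"truth": "a", "ri": 1000.0,
     "candidates": [("a", 0.90, 1010.0), ("b", 0.95, 1100.0)]},
    {"truth": "c", "ri": 1500.0,
     "candidates": [("c", 0.92, 1530.0), ("d", 0.85, 1505.0)]},
    {"truth": "e", "ri": 1200.0,
     "candidates": [("e", 0.99, 1203.0), ("f", 0.50, 1300.0)]},
]
grid = [5.0, 20.0, 50.0, 200.0]

for t in grid:
    print(t, accuracy(queries, t))
print("best:", best_threshold(queries, grid))
```

A threshold that is too tight removes true matches whose library RI is slightly off, while one that is too loose lets high-similarity false positives through, which is why the optimum can differ between similarity measures.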
Project description:BACKGROUND:Volatile compounds comprise diverse chemical groups with wide-ranging sources and functions. These compounds originate from major pathways of secondary metabolism in many organisms and play essential roles in chemical ecology in both plant and animal kingdoms. In past decades, sampling methods and instrumentation for the analysis of complex volatile mixtures have improved; however, design and implementation of database tools to process and store the complex datasets have lagged behind. DESCRIPTION:The volatile compound BinBase (vocBinBase) is an automated peak annotation and database system developed for the analysis of GC-TOF-MS data derived from complex volatile mixtures. The vocBinBase DB is an extension of the previously reported metabolite BinBase software developed to track and identify derivatized metabolites. The BinBase algorithm uses deconvoluted spectra and peak metadata (retention index, unique ion, spectral similarity, peak signal-to-noise ratio, and peak purity) from the Leco ChromaTOF software, and annotates peaks using a multi-tiered filtering system with stringent thresholds. The vocBinBase algorithm assigns the identity of compounds existing in the database. Volatile compound assignments are supported by the Adams mass spectral-retention index library, which contains over 2,000 plant-derived volatile compounds. Novel molecules that are not found within vocBinBase are automatically added using strict mass spectral and experimental criteria. Users obtain fully annotated data sheets with quantitative information for all volatile compounds for studies that may consist of thousands of chromatograms. The vocBinBase database may also be queried across different studies, comprising currently 1,537 unique mass spectra generated from 1.7 million deconvoluted mass spectra of 3,435 samples (18 species). 
Mass spectra with retention indices and volatile profiles are available as a free download under the CC-BY agreement (http://vocbinbase.fiehnlab.ucdavis.edu). CONCLUSIONS:The BinBase database algorithms have been successfully modified to allow for tracking and identification of volatile compounds in complex mixtures. The database is capable of annotating large datasets (hundreds to thousands of samples) and is well-suited for between-study comparisons such as chemotaxonomy investigations. This novel volatile compound database tool is applicable to research fields spanning chemical ecology to human health. The BinBase source code is freely available at http://binbase.sourceforge.net/ under the LGPL 2.0 license agreement.
Project description:One of the major obstacles in metabolomics is the identification of unknown metabolites. We tested constraints for reidentifying the correct structures of 29 known metabolite peaks from GCT Premier accurate mass chemical ionization GC-TOF mass spectrometry data without any use of mass spectral libraries. Correct elemental formulas were retrieved within the top-3 hits for most molecular ion adducts using the "Seven Golden Rules" algorithm. An average of 514 potential structures per formula was downloaded from the PubChem chemical database and derivatized in silico using the ChemAxon software package. After chemical curation, Kovats retention indices (RI) were predicted for up to 747 potential structures per formula using the NIST MS group contribution algorithm and corrected for the contribution of trimethylsilyl groups using the Fiehnlib RI library. When matching the range of predicted RI values against the experimentally determined peak retention, all but three incorrect formulas were excluded. For all remaining isomeric structures, accurate mass electron ionization spectra were predicted using the MassFrontier software and scored against experimental spectra. Using a mass error window of 10 ppm for fragment ions, 89% of all isomeric structures were removed, and the correct structure was reported within the top-5 hits in 73% of the cases.
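The 10 ppm mass error window mentioned above can be made concrete with a short sketch. The ppm formula is the standard relative mass error; the `matched_fraction` helper, which scores a predicted spectrum by the share of its fragment ions found in the measured peak list, is a hypothetical simplification and not the scoring used by MassFrontier. The m/z values are illustrative only.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6

def within_window(measured_mz, theoretical_mz, tol_ppm=10.0):
    """True if the measured m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm

def matched_fraction(predicted_mzs, measured_mzs, tol_ppm=10.0):
    """Share of predicted fragment m/z values matched by any measured peak."""
    matched = sum(
        1 for p in predicted_mzs
        if any(within_window(m, p, tol_ppm) for m in measured_mzs)
    )
    return matched / len(predicted_mzs)

# Illustrative predicted fragments and measured peaks (not from the study).
predicted = [73.0468, 147.0656, 217.1075]
measured = [73.0471, 147.0690, 217.1080]

for p, m in zip(predicted, measured):
    print(p, m, round(ppm_error(m, p), 1))
print("matched fraction:", matched_fraction(predicted, measured))
```

Because the window is relative, the same 10 ppm tolerance corresponds to a smaller absolute m/z difference for light fragment ions than for heavy ones.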
Project description:Reverse phase high pressure liquid chromatography was employed in order to evaluate the lipophilicity of antioxidant compounds from different classes, such as phenolic acids, flavanones, flavanols, flavones, anthocyanins, stilbenes, xanthonoids, and proanthocyanidins. The retention time of each compound was measured using five different HPLC columns: RP18 (LiChroCART, Purosphere RP-18e), C8 (Zorbax, Eclipse XDBC8), C16-Amide (Discovery RP-Amide C16), CN100 (Saulentechnik, Lichrosphere), and pentafluorophenyl (Phenomenex, Kinetex PFP). The mobile phase consisted of methanol and water (0.1% formic acid) in different proportions. The measurements were conducted at two column temperatures: room temperature (22 °C) and, in order to mimic the environment of the human body, 37 °C. Furthermore, principal component analysis (PCA) was used to obtain new lipophilicity indices and holistic lipophilicity charts. Additionally, highly representative depictions of the chromatographic behavior of the investigated compounds and stationary phases at different temperatures were obtained using two new chemometric approaches, namely two-way joining cluster analysis and sum of ranking differences.
Project description:BACKGROUND:Scoliosis is an abnormal deviation of the spine and is an idiopathic disorder among children and adolescents. As a matter of fact, the distribution of loads on the patient's spine and the load-carrying capacity of the vertebral column are both random variables. Therefore, a probabilistic approach may be considered a sophisticated method for dealing with this problem. METHOD:Reliability analysis is a probability-based approach that considers the uncertainties in the load and resistance of the vertebral column. The main contribution of this paper is to compare the reliability levels of a normal and a scoliotic spine. To do so, numerical analyses associated with the inherent random parameters of the bones and the applied load are performed. Then, the reliability indices for all vertebrae and discs are determined. Accordingly, as the main innovation of this paper, the system reliability indices of the spinal column for both normal and damaged backbone systems are presented. RESULTS:Based on the required reliability index for normal spinal curvature, the target system reliability level for scoliosis disorder is proposed. CONCLUSION:Since the proposed target reliability index is based on the strength limit state of the vertebral column, it can be considered as a reliability level for any proposed treatment approach.
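The notion of a reliability index for a single component can be illustrated with a minimal sketch. It assumes the simplest textbook case, a limit state g = R - S with independent, normally distributed resistance R and load S; the abstract does not state which reliability method or distributions were used, and the means, standard deviations, and units below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def reliability_index(mean_r, sd_r, mean_s, sd_s):
    """Reliability index for g = R - S with independent normal R and S."""
    return (mean_r - mean_s) / sqrt(sd_r ** 2 + sd_s ** 2)

def failure_probability(beta):
    """Probability that g < 0 under the same normal assumption."""
    return NormalDist().cdf(-beta)

# Hypothetical vertebra: resistance and load in newtons.
beta_normal = reliability_index(mean_r=5000.0, sd_r=400.0, mean_s=3800.0, sd_s=300.0)
# Hypothetical case with a higher and more variable load on the same vertebra.
beta_loaded = reliability_index(mean_r=5000.0, sd_r=400.0, mean_s=4300.0, sd_s=300.0)

print(beta_normal, failure_probability(beta_normal))
print(beta_loaded, failure_probability(beta_loaded))
```

A larger index means a larger safety margin relative to the combined uncertainty, so comparing indices between two load cases gives a direct measure of how far one falls below the other.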