Wavelet- and Fourier-transform-based spectrum similarity approaches to compound identification in gas chromatography/mass spectrometry.
ABSTRACT: The high-throughput gas chromatography/mass spectrometry (GC/MS) technology offers a powerful means of analyzing a large number of chemical and biological samples. One of the important analyses of GC/MS data is compound identification. In this work, novel spectral similarity measures based on the discrete wavelet and Fourier transforms were proposed. The proposed methods are composite similarities that are composed of weighted intensities and wavelet/Fourier coefficients using cosine correlation. The performance of the proposed approaches along with the existing similarity measures was evaluated using the NIST Chemistry WebBook mass database maintained by the National Institute of Standards and Technology (NIST) as a library of reference spectra and repetitive mass spectral data as query spectra. The analysis results showed that the identification accuracies of the wavelet- and Fourier-transform-based methods were improved by 2.02% and 1.95%, respectively, compared to that of the weighted dot product (cosine correlation) and by 3.01% and 3.08%, respectively, compared to that of the composite similarity measure. The improved identification accuracy demonstrates that the proposed approaches outperformed the existing similarity measures in the literature.
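The composite measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight factors `a` and `b`, the equal mixing weight `alpha`, and the function names are all assumptions, and only the Fourier (real part) variant is shown.

```python
import numpy as np

def weighted_intensities(mz, intensity, a=0.5, b=3.0):
    """Scale each peak as intensity**a * mz**b (a, b are illustrative)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    return intensity ** a * mz ** b

def cosine(u, v):
    """Cosine correlation; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def composite_dft_similarity(mz, query, reference, a=0.5, b=3.0, alpha=0.5):
    """Mix the cosine of weighted intensities with the cosine of the
    real parts of their discrete Fourier transforms.

    alpha is a hypothetical mixing weight; the source does not state
    how the two terms are combined.
    """
    wq = weighted_intensities(mz, query, a, b)
    wr = weighted_intensities(mz, reference, a, b)
    fq = np.fft.fft(wq).real  # real part of the DFT coefficients
    fr = np.fft.fft(wr).real
    return alpha * cosine(wq, wr) + (1.0 - alpha) * cosine(fq, fr)
```

Both spectra are assumed to be binned onto the same m/z grid, so that position `i` in `query` and `reference` refers to the same m/z value.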
Project description:Retention index (RI) is useful for metabolite identification. However, when RI is integrated with mass spectral similarity for metabolite identification, conflicting RI threshold settings have been reported in the literature. In this study, a large-scale test dataset of 5844 compounds with both mass spectra and RI information was created from the National Institute of Standards and Technology (NIST) repetitive mass spectra (MS) and RI library. Three MS similarity measures, the NIST composite measure, the real part of the Discrete Fourier Transform (DFT.R), and the detail of the Discrete Wavelet Transform (DWT.D), were used to investigate the accuracy of compound identification using the test dataset. To imitate real identification experiments, the NIST MS main library was employed as the reference library and the test dataset was used as search data. Our study shows that the optimal RI thresholds are 22, 15, and 15 i.u. for the NIST composite, DFT.R, and DWT.D measures, respectively, when RI and mass spectral similarity are integrated for compound identification. Compared to mass spectrum matching alone, using both RI and mass spectral matching can improve the identification accuracy by 1.7%, 3.5%, and 3.5% for the three mass spectral similarity measures, respectively. It is concluded that the improvement that RI matching brings to compound identification depends heavily on the mass spectral similarity measure used and on the accuracy of the RI data.
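One simple way to combine an RI threshold with spectral matching is a hard RI window followed by ranking on spectral similarity. The sketch below assumes that scheme; the abstract does not say exactly how the two are integrated. The dictionary keys (`name`, `ri`, `spectrum`) and the plain cosine score are hypothetical stand-ins.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity between two equally binned spectra."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def identify(query_spectrum, query_ri, library, ri_threshold=15.0):
    """Return the name of the best spectral match among library entries
    whose RI lies within ri_threshold index units of the query RI.

    Returns None when no entry passes the RI window.
    """
    candidates = [e for e in library if abs(e["ri"] - query_ri) <= ri_threshold]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: cosine(query_spectrum, e["spectrum"]))
    return best["name"]
```

With a wide window the function reduces to spectrum matching alone; narrowing the window removes candidates whose spectra match well but whose retention behaviour does not.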
Project description:Compound identification is a key component of data analysis in the applications of gas chromatography-mass spectrometry (GC-MS). Currently, the most widely used compound identification method is mass spectrum matching, in which the dot product and its composite version are employed as spectral similarity measures. Several forms of transformation of fragment ion intensities have also been proposed to increase the accuracy of compound identification. In this study, we introduced partial and semipartial correlations as mass spectral similarity measures and applied them to identify compounds along with different transformations of peak intensity. Mixture versions of the proposed method were also developed to further improve the accuracy of compound identification. To demonstrate the performance of the proposed spectral similarity measures, the National Institute of Standards and Technology (NIST) mass spectral library and replicate spectral library were used as the reference library and the query spectra, respectively. Identification results showed that the mixture partial and semipartial correlations always outperform both the dot product and its composite measure. The mixture similarity with semipartial correlation has the highest accuracy of 84.6% in compound identification with a transformation of (0.53,1.3) for fragment ion intensity and m/z value, respectively.
Project description:Peak alignment is a critical procedure in mass spectrometry-based biomarker discovery in metabolomics. One of the peak alignment approaches for comprehensive two-dimensional gas chromatography mass spectrometry (GC×GC-MS) data is peak matching-based alignment. A key to peak matching-based alignment is the calculation of mass spectral similarity scores. Various mass spectral similarity measures have been developed, mainly for compound identification, but the effect of these spectral similarity measures on the performance of peak matching-based alignment remains unknown. Therefore, we selected five mass spectral similarity measures, cosine correlation, Pearson's correlation, Spearman's correlation, partial correlation, and part correlation, and examined their effects on peak alignment using two sets of experimental GC×GC-MS data. The results show that the spectral similarity measure does not significantly affect the alignment accuracy when analyzing data from less complex samples, while the partial correlation performs much better than the other spectral similarity measures when analyzing experimental data acquired from complex biological samples.
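Three of the five measures named above can be written in a few lines, which makes their differences concrete. This is an illustrative sketch with hypothetical function names; partial and part correlation are omitted because they also depend on the other spectra being conditioned on.

```python
import numpy as np

def cosine_correlation(u, v):
    """Cosine of the angle between two spectra (no mean-centering)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def pearson_correlation(u, v):
    """Pearson's r is the cosine correlation of mean-centered vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return cosine_correlation(u - u.mean(), v - v.mean())

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_correlation(u, v):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_correlation(average_ranks(u), average_ranks(v))
```

Cosine correlation responds to absolute intensity patterns, Pearson's to deviations from the mean intensity, and Spearman's only to the ordering of peak intensities.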
Project description:The Fourier representations (FRs) are indispensable mathematical formulations for modeling and analysis of physical phenomena and engineering systems. This study presents a new set of generalized Fourier representations (GFRs) and phase transforms (PTs). The PTs are special cases of the GFRs and true generalizations of the Hilbert transforms. In particular, the Fourier transform based kernel of the PT is derived and its various properties are discussed. The time derivative and integral, including fractional order, of a signal are obtained using the GFR. It is demonstrated that the general class of time-invariant and time-variant filtering operations, as well as analog and digital modulations, can be obtained from the proposed GFR. A narrowband Fourier representation for the time-frequency analysis of a signal is also presented using the GFR. A discrete cosine transform based implementation is proposed to avoid end artifacts due to discontinuities at both ends of a signal. A fractional delay in a discrete-time signal using the FR is introduced. The fast Fourier transform implementation of all the proposed representations is developed. Moreover, using the analytic wavelet transform, a wavelet phase transform (WPT) is proposed to obtain a desired phase shift in a signal under analysis. A wavelet quadrature transform (WQT) is also presented, which is a special case of the WPT with a phase shift of π/2 radians. Thus, a wavelet analytic signal representation is derived from the WQT. Theoretical analysis and numerical experiments are conducted to evaluate the effectiveness of the proposed methods.
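A constant phase shift applied through the FFT gives a feel for how a phase transform generalizes the Hilbert transform. The sketch below is one common construction and is only assumed to correspond to the PT in this abstract; the function name `phase_shift` is hypothetical.

```python
import numpy as np

def phase_shift(x, alpha):
    """Delay the phase of every frequency component of a real signal
    by alpha radians using the FFT.

    Positive-frequency bins are multiplied by exp(-1j*alpha) and
    negative-frequency bins by exp(+1j*alpha); the DC bin is unchanged.
    With alpha = pi/2 this reduces to the discrete Hilbert transform.
    """
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.fft(x)
    sign = np.sign(np.fft.fftfreq(len(x)))  # +1, -1, or 0 at DC
    shifted = spectrum * np.exp(-1j * alpha * sign)
    return np.fft.ifft(shifted).real
```

For a cosine with a whole number of cycles in the window, `phase_shift(x, np.pi / 2)` returns the matching sine, which is the quadrature relationship the abstract attributes to the π/2 case.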
Project description:We report a compound identification method (SimMR), which simultaneously evaluates the mass spectrum similarity and the retention index distance using an empirical mixture score function, for the analysis of GC-MS data. The performance of the developed SimMR method was compared to that of two existing compound identification strategies. One is the mass spectrum matching method without incorporation of retention index information (SM). The other is the method that sequentially evaluates the mass spectrum similarity and retention index distance (SeqMR). For comparison purposes, we used the NIST/EPA/NIH Mass Spectral Library 2005. Our study demonstrates that SimMR performs the best among the three compound identification methods, by improving the overall identification accuracy up to 1.53% and 4.81% compared to SeqMR and SM, respectively.
Project description:Many ecological experiments are based on the extraction and downstream analyses of microorganisms from different environmental samples. Due to its high throughput, cost-effectiveness and rapid performance, Matrix Assisted Laser Desorption/Ionization Mass Spectrometry with Time-of-Flight detector (MALDI-TOF MS), which has been proposed as a promising tool for bacterial identification and classification, could be advantageously used for dereplication of recurrent bacterial isolates. In this study, we compared whole-cell MALDI-TOF MS-based analyses of 49 bacterial cultures to two well-established bacterial identification and classification methods based on nearly complete 16S rRNA gene sequence analyses: a phylotype-based approach, using a closest type strain assignment, and a sequence similarity-based approach involving a 98.65% sequence similarity threshold, which has been found to best delineate bacterial species. Culture classification using reference-based MALDI-TOF MS was comparable to that yielded by phylotype assignment up to the genus level. At the species level, agreement between 16S rRNA gene analysis and MALDI-TOF MS was found to be limited, potentially indicating that spectral reference databases need to be improved. We also evaluated the mass spectral similarity technique for species-level delineation, which can be used independently of reference databases. We established optimal mass spectral similarity thresholds that group MALDI-TOF mass spectra of common environmental isolates analogously to phylotype- and sequence similarity-based approaches. When using a mass spectrum similarity approach, we recommend a mass range of 4-10 kDa for analysis, which is populated with stable mass signals and contains the majority of phylotype-determining peaks.
We show that a cosine similarity (CS) threshold of 0.79 differentiates mass spectra analogously to the 98.65% species-level delineation sequence similarity threshold, with corresponding precision and recall values of 0.70 and 0.73, respectively. When matched to species-level phylotype assignment, an optimal CS threshold of 0.92 was calculated, with associated precision and recall values of 0.83 and 0.64, respectively. Overall, our research indicates that a similarity-based MALDI-TOF MS approach can be routinely used for efficient dereplication of isolates for downstream analyses, with minimal loss of unique organisms. In addition, MALDI-TOF MS analysis has potential for further improvement, unlike 16S rRNA gene analysis, whose methodological limits have reached a plateau.
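Precision and recall for a cosine similarity threshold can be computed over all pairs of spectra, treating "same group" as the positive class. The sketch below assumes that pairwise definition, which the abstract does not spell out; the function name and the toy data layout are hypothetical.

```python
import numpy as np
from itertools import combinations

def pairwise_precision_recall(spectra, labels, threshold=0.79):
    """Score a cosine similarity threshold against reference labels.

    A pair is predicted to belong together when its cosine similarity
    is at least `threshold`, and is truly together when the reference
    labels (for example species assignments) are equal.
    """
    x = np.asarray(spectra, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-length rows
    similarity = x @ x.T
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        predicted = similarity[i, j] >= threshold
        actual = labels[i] == labels[j]
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `threshold` over a grid and inspecting the resulting precision and recall values is one way an optimal cut-off could be chosen. Each spectrum must contain at least one non-zero intensity for the normalization to be defined.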
Project description:Local field potential (LFP) oscillations are primarily shaped by the superposition of postsynaptic currents. Hippocampal LFP oscillations in the 25- to 50-Hz range ("slow γ") are proposed to support memory retrieval independent of other frequencies. However, θ harmonics extend up to 48 Hz, necessitating a study to determine whether these oscillations are fundamentally the same. We compared the spectral analysis methods of wavelet, ensemble empirical-mode decomposition (EEMD), and Fourier transform. EEMD, as previously applied, failed to account for the θ harmonics. Depending on analytical parameters selected, wavelet may convolve over high-order θ harmonics due to the variable time-frequency atoms, creating the appearance of a broad 25- to 50-Hz rhythm. As an illustration of this issue, wavelet and EEMD depicted slow γ in a synthetic dataset that only contained θ and its harmonics. Oscillatory transience cannot explain the difference in approaches as Fourier decomposition identifies ripples triggered to epochs of high-power, 120- to 250-Hz events. When Fourier is applied to high power, 25- to 50-Hz events, only θ harmonics are resolved. This analysis challenges the identification of the slow γ rhythm as a unique fundamental hippocampal oscillation. While there may be instances in which slow γ is present in the rat hippocampus, the analysis presented here shows that unless care is exerted in the application of EEMD and wavelet techniques, the results may be misleading, in this case misrepresenting θ harmonics. Moreover, it is necessary to reconsider the characteristics that define a fundamental hippocampal oscillation as well as theories based on multiple independent γ bands.
Project description:The challenge of detecting research topics in a specific research field has attracted attention from researchers in the bibliometrics community. In this study, to solve two problems of clustering papers, i.e., the influence of different distributions of citation links and involved textual features on similarity computation, the authors propose a hybrid self-optimized clustering model to detect research topics by extending the hybrid clustering model to identify "core documents". First, the Amsler network, consisting of bibliographic coupling and co-citation links, is created to calculate the citation-based similarity based on the cosine angle of papers. Second, the cosine similarity is also used to compute the text-based similarity, which consists of the textual statistical and topological features. Then, the cosine angle of the linear combination of citation- and text-based similarity is considered as the hybrid similarity. Finally, the Louvain method is applied to cluster papers, and the terms based on term frequency are used to label clusters. To test the performance of the proposed model, a dataset related to the data envelopment analysis field is used for comparison and analysis of clustering results. Based on the benchmark built, different clustering methods with different citation links or textual features are compared according to evaluation measures. The results show that the proposed model can obtain reasonable and effective clustering results, and the research topics of data envelopment analysis field are also analyzed based on the proposed model. As different features are considered in the proposed model compared with previous hybrid clustering models, the proposed clustering model can provide inspiration for further studies on topic identification by other researchers.
Project description:Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses such as library matching and molecular networking. Although weaknesses in the relationship between spectral similarity scores and the true structural similarities have been described, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm, Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data to derive abstract spectral embeddings that can be used to assess spectral similarities. Using data derived from GNPS MS/MS libraries including spectra for nearly 13,000 unique molecules, we show how Spec2Vec scores correlate better with structural similarity than cosine-based scores. We demonstrate the advantages of Spec2Vec in library matching and molecular networking. Spec2Vec is computationally more scalable, allowing structural analogue searches in large databases within seconds.
Project description:Compound identification in gas chromatography-mass spectrometry (GC-MS) is usually achieved by matching query spectra to spectra present in a reference library. Although several spectral similarity measures have been developed and compared using a small reference library, it remains unknown how the relationship between the spectral similarity measure and the size of the reference library affects the identification accuracy as well as the optimal weight factor. We used three reference libraries to investigate the dependency among the optimal weight factor, the spectral similarity measure, and the size of the reference library. Our study demonstrated that the optimal weight factor depends not only on the spectral similarity measure but also on the size of the reference library. The mixture semi-partial correlation measure outperforms all existing spectral similarity measures in all tested reference libraries, in spite of its computational expense. Furthermore, the accuracy of compound identification using a larger future reference library is estimated by varying the size of the reference library. A simulation study indicates that the mixture semi-partial correlation measure will have the best performance as the reference library grows in the future.