Project description:Machine learning has emerged as an invaluable tool in many research areas. In the present work, we harness this power to predict highly accurate molecular infrared spectra with unprecedented computational efficiency. To account for vibrational anharmonic and dynamical effects - typically neglected by conventional quantum chemistry approaches - we base our machine learning strategy on ab initio molecular dynamics simulations. While these simulations are usually extremely time-consuming even for small molecules, we overcome these limitations by leveraging the power of a variety of machine learning techniques, not only accelerating simulations by several orders of magnitude, but also greatly extending the size of systems that can be treated. To this end, we develop a molecular dipole moment model based on environment-dependent neural network charges and combine it with the neural network potential approach of Behler and Parrinello. Contrary to the prevalent big data philosophy, we are able to obtain very accurate machine learning models for the prediction of infrared spectra based on only a few hundred electronic structure reference points. This is made possible through the use of molecular forces during neural network potential training and the introduction of a fully automated sampling scheme. We demonstrate the power of our machine learning approach by applying it to model the infrared spectra of a methanol molecule, n-alkanes containing up to 200 atoms and the protonated alanine tripeptide, which at the same time represents the first application of machine learning techniques to simulate the dynamics of a peptide. In all of these case studies we find excellent agreement between the infrared spectra predicted via machine learning models and the respective theoretical and experimental spectra.
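As a rough illustration of how an infrared spectrum can be obtained from a molecular dynamics dipole trajectory, the sketch below takes the power spectrum of the dipole fluctuations, which by the Wiener-Khinchin theorem equals the Fourier transform of the dipole autocorrelation function. This is a minimal sketch, not the authors' implementation: the function name `ir_spectrum`, the Hann window, and the omission of frequency-dependent prefactors and quantum corrections are all assumptions made here for brevity.

```python
import numpy as np

def ir_spectrum(dipoles, dt_fs):
    """Return (wavenumbers in cm^-1, unnormalized intensity).

    dipoles: array of shape (n_steps, 3), dipole moment at each MD step.
    dt_fs:   MD time step in femtoseconds.
    Hypothetical helper; prefactors and quantum corrections are omitted.
    """
    dipoles = np.asarray(dipoles, dtype=float)
    n_steps = dipoles.shape[0]
    # Remove the static dipole so only the fluctuations contribute.
    fluct = dipoles - dipoles.mean(axis=0)
    # A Hann window reduces spectral leakage from the finite trajectory.
    window = np.hanning(n_steps)[:, None]
    # Power spectrum per Cartesian component, summed over x, y, z.
    power = np.abs(np.fft.rfft(fluct * window, axis=0)) ** 2
    intensity = power.sum(axis=1)
    # Convert frequency (cycles per fs) to wavenumber (cm^-1).
    freq_per_fs = np.fft.rfftfreq(n_steps, d=dt_fs)
    speed_of_light_cm_per_fs = 2.99792458e-5
    wavenumbers = freq_per_fs / speed_of_light_cm_per_fs
    return wavenumbers, intensity
```

In the approach described above, the dipole trajectory would come from the neural network charge model evaluated along a neural network potential simulation; here any array of shape (n_steps, 3) will do.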
Project description:Lung cancer patients with malignant pleural effusions (MPE) have a particularly poor prognosis. It is crucial to distinguish MPE from benign pleural effusion (BPE). The present study aims to develop a rapid, convenient and economical diagnostic method based on FTIR near-infrared spectroscopy (NIRS) combined with a machine learning strategy for clinical pleural effusion classification. NIRS spectra were recorded for 47 MPE samples and 35 BPE samples. The sample data were randomly divided into a training set (n = 62) and a test set (n = 20). Partial least squares, random forest, support vector machine (SVM), and gradient boosting machine models were trained, and their predictive performance was subsequently evaluated on the test set. Besides the whole spectra, features selected using the SVM recursive feature elimination algorithm were also investigated in modeling. Among those models, NIRS combined with SVM showed the best predictive performance (accuracy: 1.0, kappa: 1.0, and AUCROC: 1.0). SVM with the top 50 feature wavenumbers also displayed a high predictive performance (accuracy: 0.95, kappa: 0.89, AUCROC: 0.99). Our study revealed that the combination of NIRS and machine learning is an innovative, rapid, and convenient method for clinical pleural effusion classification, and is worth further evaluation.
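The accuracy and kappa figures quoted above are both derived from a confusion matrix. The sketch below shows one way to compute them; the function name `accuracy_and_kappa` and the example counts are hypothetical, since the abstract does not report the class split of the 20 test samples.

```python
def accuracy_and_kappa(confusion):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance from the row and column marginals.
    row_totals = [sum(confusion[i]) for i in range(n_classes)]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        # Degenerate case: every sample falls in a single class.
        return observed, 1.0
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Hypothetical 20-sample test set with one misclassified benign sample.
example = [[11, 0],   # true MPE: 11 predicted MPE, 0 predicted BPE
           [1, 8]]    # true BPE: 1 predicted MPE, 8 predicted BPE
acc, kappa = accuracy_and_kappa(example)
```

Because kappa discounts agreement expected by chance, it is lower than accuracy whenever the classifier makes errors, which is why the two metrics are reported side by side.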
Project description:Application of machine learning (ML) algorithms to spectroscopic data has great potential for uncovering hidden correlations between structural information and spectral features. Here, we apply ML algorithms to theoretically simulated infrared (IR) spectra to establish the structure-spectrum correlations in zeolites. Two hundred thirty different types of zeolite frameworks were considered in the study, and their theoretical IR spectra were used as the ML training set. A classification problem was solved to predict the presence or absence of possible tilings and secondary building units (SBUs). Several natural tilings and SBUs were predicted with an accuracy above 89%. A set of continuous descriptors was also suggested, and the corresponding regression problem was solved using the ExtraTrees algorithm. For the latter problem, additional IR spectra were computed for the structures with artificially modified cell parameters, expanding the database to 470 different spectra of zeolites. A prediction quality above or close to 90% was obtained for the average Si-O distances, Si-O-Si angles, and volume of TO4 tetrahedra. The obtained results provide new possibilities for utilization of infrared spectra as a quantitative tool for characterization of zeolites.
Project description:Accuracy of infrared (IR) models to measure soil particle-size distribution (PSD) depends on soil preparation, methodology (sedimentation, laser), settling times and relevant soil features. Compositional soil data may require isometric log-ratio (ilr) transformation to avoid numerical biases. Machine learning can relate numerous independent variables that may affect NIR spectra to assess particle-size distribution. Our objective was to reach high IRS prediction accuracy across a large range of PSD methods and soil properties. A total of 1298 soil samples from eastern Canada were IR-scanned. Spectra were processed by Stochastic Gradient Boosting (SGB) to predict sand, silt, clay and carbon. Slope and intercept of the log-log relationships between settling time and suspension density function (SDF) (R2 = 0.84-0.92) performed similarly to NIR spectra using either ilr-transformed (R2 = 0.81-0.93) or raw percentages (R2 = 0.76-0.94). Settling times of 0.67 min and 2 h were the most accurate for NIR predictions (R2 = 0.49-0.79). The NIR prediction of sand was more accurate for the sieving method (R2 = 0.66) than for the sedimentation method (R2 = 0.53). The NIR 2X gain was less accurate (R2 = 0.69-0.92) than 4X (R2 = 0.87-0.95). The MIR (R2 = 0.45-0.80) performed better than NIR (R2 = 0.40-0.71) spectra. Adding soil carbon, reconstituted bulk density, pH, red-green-blue color, oxalate and Mehlich3 extracts returned R2 values of 0.86-0.91 for texture prediction. In addition to slope and intercept of the SDF, 4X gain, method and pre-treatment classes, soil carbon and color appeared to be promising features for routine SGB-processed NIR particle-size analysis. Machine learning methods support cost-effective soil texture NIR analysis.
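To make the ilr transformation mentioned above concrete, the following sketch maps a sand, silt, clay composition to two unconstrained coordinates and back. The particular balances used here (sand versus silt, then sand and silt versus clay) are an assumption; the abstract does not state which basis was chosen, and the function names are hypothetical.

```python
import math

def ilr3(sand, silt, clay):
    """Isometric log-ratio coordinates of a three-part composition.

    Inputs may be percentages or fractions; they are closed to sum to 1.
    All parts must be strictly positive (zeros need separate handling).
    """
    total = sand + silt + clay
    x1, x2, x3 = sand / total, silt / total, clay / total
    # Balance 1: sand against silt.
    z1 = math.sqrt(1.0 / 2.0) * math.log(x1 / x2)
    # Balance 2: geometric mean of sand and silt against clay.
    z2 = math.sqrt(2.0 / 3.0) * math.log(math.sqrt(x1 * x2) / x3)
    return z1, z2

def ilr3_inverse(z1, z2):
    """Map ilr coordinates back to fractions that sum to 1."""
    # Rebuild centered log-ratio values from the orthonormal basis.
    clr_sand = z1 / math.sqrt(2.0) + z2 / math.sqrt(6.0)
    clr_silt = -z1 / math.sqrt(2.0) + z2 / math.sqrt(6.0)
    clr_clay = -2.0 * z2 / math.sqrt(6.0)
    parts = [math.exp(v) for v in (clr_sand, clr_silt, clr_clay)]
    total = sum(parts)
    return tuple(p / total for p in parts)
```

A model can then be trained on the two ilr coordinates, and its predictions converted back with `ilr3_inverse`, so that the predicted sand, silt and clay fractions are positive and sum to one by construction.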
Project description:Dynamic protein structures are crucial for deciphering their diverse biological functions. Two-dimensional infrared (2DIR) spectroscopy stands as an ideal tool for tracing rapid conformational evolutions in proteins. However, linking spectral characteristics to dynamic structures poses a formidable challenge. Here, we present a pretrained machine learning model based on 2DIR spectra analysis. This model has learned signal features from approximately 204,300 spectra to establish a "spectrum-structure" correlation, thereby tracing the dynamic conformations of proteins. It excels in accurately predicting the dynamic content changes of various secondary structures and demonstrates universal transferability on real folding trajectories spanning timescales from microseconds to milliseconds. Beyond exceptional predictive performance, the model offers attention-based spectral explanations of dynamic conformational changes. Our 2DIR-based pretrained model is anticipated to provide unique insights into the dynamic structural information of proteins in their native environments.
Project description:The integrity of extra virgin olive oil (EVOO) quality markers can be compromised owing to deceptive marketing practices, such as misleading geographical origin claims or counterfeit certification labels, i.e., protected designations of origin (PDO). Therefore, it is imperative to introduce ecofriendly, rapid, and economical analytical methods for authenticating EVOO, such as near-infrared (NIR) spectroscopy. Unlike traditional techniques such as chromatography, NIR spectra contain unresolved bands; hence, chemometric tools such as principal component analysis (PCA) are essential for extracting valuable information from them. Herein, PCA was employed to reduce the high dimensionality of the NIR spectra. The PCA factors were then integrated as explanatory variables in machine-learning classification models, enabling the classification of EVOO based on its geographical origin or PDO. Furthermore, the classification models were improved by incorporating agro-climatic data, resulting in a noticeable improvement in the accuracy and reliability of the results. These results were cross-validated by changing the calibration and validation subsamples in successive iterations and averaging the obtained ratios. The results were robust when the olive varieties differed. Consequently, our findings highlight the potential benefits of incorporating agro-climatic information with NIR spectral data in classification models.
Project description:The authors of this study developed the use of attenuated total reflectance Fourier transform infrared spectroscopy (ATR-FTIR) combined with machine learning as a point-of-care (POC) diagnostic platform, considering neonatal respiratory distress syndrome (nRDS), for which no POC currently exists, as an example. nRDS can be diagnosed by a ratio of less than 2.2 of two nRDS biomarkers, lecithin and sphingomyelin (L/S ratio), and in this study, ATR-FTIR spectra were recorded from L/S ratios of between 1.0 and 3.4, which were generated using purified reagents. The calibration of principal component (PCR) and partial least squares (PLSR) regression models was performed using 155 raw baselined and second derivative spectra prior to predicting the concentration of a further 104 spectra. A three-factor PLSR model of second derivative spectra best predicted L/S ratios across the full range (R2: 0.967; MSE: 0.014). The L/S ratios from 1.0 to 3.4 were predicted with a prediction interval of +0.29, -0.37 when using a second derivative spectra PLSR model and had a mean prediction interval of +0.26, -0.34 around the L/S 2.2 region. These results support the validity of combining ATR-FTIR with machine learning to develop a point-of-care device for detecting and quantifying any biomarker with an interpretable mid-infrared spectrum.
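One reason second derivative spectra are often used in models like the one above is that differentiation twice removes constant and linearly sloping baselines. The sketch below illustrates this with a plain central finite difference; it is an assumption for illustration only, as the abstract does not say how the derivatives were computed (smoothed derivative filters are a common alternative), and `second_derivative` is a hypothetical helper.

```python
import numpy as np

def second_derivative(absorbance, spacing):
    """Central finite-difference second derivative of a 1-D spectrum.

    absorbance: values sampled on an evenly spaced wavenumber grid.
    spacing:    grid spacing in cm^-1.
    Returns an array two points shorter than the input.
    """
    a = np.asarray(absorbance, dtype=float)
    return (a[2:] - 2.0 * a[1:-1] + a[:-2]) / spacing ** 2

# Synthetic band (hypothetical values) with and without a sloping baseline.
grid = np.arange(1000.0, 1100.0, 1.0)
band = np.exp(-0.5 * ((grid - 1050.0) / 5.0) ** 2)
with_baseline = band + 0.002 * grid + 0.3
same = np.allclose(second_derivative(band, 1.0),
                   second_derivative(with_baseline, 1.0))
```

Finite differences amplify noise, so real spectra are usually smoothed before or during differentiation.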
Project description:Significance: Fluorescence-guided surgery (FGS) provides specific real-time visualization of tumors, but intensity-based measurement of fluorescence is prone to errors. Multispectral imaging (MSI) in the short-wave infrared (SWIR) has the potential to improve tumor delineation by enabling machine-learning classification of pixels based on their spectral characteristics. Aim: Determine whether MSI can be applied to FGS and combined with machine learning to provide a robust method for tumor visualization. Approach: A multispectral SWIR fluorescence imaging device capable of collecting data from six spectral filters was constructed and deployed on neuroblastoma (NB) subcutaneous xenografts (n = 6) after the injection of a NB-specific NIR-I fluorescent probe (Dinutuximab-IRDye800). We constructed image cubes representing fluorescence collected from ∼850 to 1450 nm and compared the performance of seven learning-based methods for pixel-by-pixel classification, including linear discriminant analysis, k-nearest neighbor classification, and a neural network. Results: The spectra of tumor and non-tumor tissue were subtly different and conserved between individuals. In classification, a combined principal component analysis and k-nearest-neighbor approach with area-under-curve normalization performed best, achieving 97.5% per-pixel classification accuracy (97.1%, 93.5%, and 99.2% for tumor, non-tumor tissue and background, respectively). Conclusions: The development of dozens of new imaging agents provides a timely opportunity for multispectral SWIR imaging to revolutionize next-generation FGS.
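The best-performing pipeline named above (area-under-curve normalization, principal component analysis, then k-nearest-neighbor voting) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the study's code: the function names, the choice of k and the number of components are hypothetical, and class labels are assumed to be non-negative integers.

```python
import numpy as np

def auc_normalize(spectra, wavelengths):
    """Divide each pixel spectrum by its trapezoidal area under the curve."""
    spectra = np.asarray(spectra, dtype=float)
    widths = np.diff(np.asarray(wavelengths, dtype=float))
    area = (0.5 * (spectra[:, 1:] + spectra[:, :-1]) * widths).sum(axis=1)
    return spectra / area[:, None]

def pca_fit(x, n_components):
    """Return the feature mean and the leading principal directions."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def knn_predict(train_x, train_y, query_x, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    train_y = np.asarray(train_y, dtype=int)
    dists = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in train_y[nearest]])

def classify_pixels(train_spectra, train_labels, query_spectra,
                    wavelengths, n_components=2, k=3):
    """Normalize, project onto principal components, then vote."""
    train_n = auc_normalize(train_spectra, wavelengths)
    query_n = auc_normalize(query_spectra, wavelengths)
    mean, components = pca_fit(train_n, n_components)
    train_p = (train_n - mean) @ components.T
    query_p = (query_n - mean) @ components.T
    return knn_predict(train_p, train_labels, query_p, k=k)
```

Normalizing by area makes the classifier depend on the shape of each pixel's spectrum rather than its overall brightness, which addresses the weakness of intensity-based measurement noted above.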
Project description:Precision livestock farming technologies are used to monitor animal health and welfare parameters continuously and in real time in order to optimize nutrition and productivity and to detect health issues at an early stage. The possibility of predicting blood metabolites from milk samples obtained during routine milking by means of infrared spectroscopy has become increasingly attractive. We developed, for the first time, prediction equations for a set of blood metabolites using diverse machine learning methods and milk near-infrared spectra collected by the AfiLab instrument. Our dataset was obtained from 385 Holstein Friesian dairy cows. The stacking ensemble and the multi-layer feedforward artificial neural network outperformed the other machine learning methods tested, with a reduction in the root mean square error of between 3 and 6% for most blood parameters. We obtained moderate correlations (r) between the observed and predicted phenotypes for γ-glutamyl transferase (r = 0.58), alkaline phosphatase (0.54), haptoglobin (0.66), globulins (0.61), total reactive oxygen metabolites (0.60) and thiol groups (0.57). The AfiLab instrument has strong potential but may not yet be ready to predict the metabolic stress of dairy cows in practice. Further research is needed to identify methods that improve the accuracy of the prediction equations.
Project description:Neuromyelitis optica spectrum disorder (NMOSD) and multiple sclerosis (MS) are both autoimmune inflammatory and demyelinating diseases of the central nervous system. NMOSD is a highly disabling disease and rapid introduction of the appropriate treatment at the acute phase is crucial to prevent sequelae. Specific criteria were established in 2015 and provide keys to distinguish NMOSD from MS. One of the most reliable criteria for NMOSD diagnosis is the detection in the patient's serum of an antibody that attacks the water channel aquaporin-4 (AQP-4). Another target in NMOSD is myelin oligodendrocyte glycoprotein (MOG), delineating a new spectrum of diseases called MOG-associated diseases. Lastly, patients with NMOSD can be negative for both AQP-4 and MOG antibodies. At disease onset, NMOSD symptoms are very similar to MS symptoms from a clinical and radiological perspective. Thus, at the first episode, given the urgency of starting the anti-inflammatory treatment, there is an unmet need to differentiate NMOSD subtypes from MS. Here, we used Fourier transform infrared spectroscopy in combination with a machine learning algorithm with the aim of distinguishing the infrared signatures of sera of a first episode of NMOSD from those of a first episode of relapsing-remitting MS, as well as from those of healthy subjects and patients with chronic inflammatory demyelinating polyneuropathy. Our results showed that NMOSD patients were distinguished from MS patients and healthy subjects with a sensitivity of 100% and a specificity of 100%. We also discuss the distinction between the different NMOSD serostatuses. The coupling of infrared spectroscopy of sera to machine learning is a promising cost-effective, rapid and reliable differential diagnosis tool capable of helping to gain valuable time in patients' treatment.