Project description:Background: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to that of its close relative, from which it was originally derived, namely Principal Component Analysis (PCA). Results: We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set from 396 vaginal microbiome samples where the ground truth for the feature selection was available. All the 3D figures shown in this paper as well as the supplementary ones can be viewed interactively at http://biorg.cs.fiu.edu/plsda Conclusions: Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.
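As a rough illustration of the comparison described in the abstract above, the following sketch ranks features by the magnitude of their weight on the first PLS-DA component and on the first principal component. It is not the authors' experimental code: the synthetic data, all function names, and the restriction to the first component are assumptions made for illustration only.

```python
import random

def center(column):
    """Subtract the column mean from every value."""
    mean = sum(column) / len(column)
    return [v - mean for v in column]

def unit(vector):
    """Scale a vector to unit Euclidean length."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector]

def first_pls_weights(cols, y):
    """First PLS weight vector, proportional to X^T y on centered data."""
    yc = center(y)
    return unit([sum(a * b for a, b in zip(col, yc)) for col in cols])

def first_pca_loadings(cols, iters=200):
    """Leading eigenvector of X^T X by power iteration; class labels are never used."""
    p = len(cols)
    xtx = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(p)]
           for i in range(p)]
    v = unit([1.0] * p)
    for _ in range(iters):
        v = unit([sum(xtx[i][j] * v[j] for j in range(p)) for i in range(p)])
    return v

def rank_features(weights):
    """Feature indices ordered from largest to smallest absolute weight."""
    return sorted(range(len(weights)), key=lambda i: -abs(weights[i]))

# Hypothetical synthetic data: 60 samples, 6 features, only feature 0 carries class signal.
rng = random.Random(0)
n_samples, n_features = 60, 6
y = [0] * 30 + [1] * 30
cols = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_features)]
cols[0] = [x + 3.0 * label for x, label in zip(cols[0], y)]
cols = [center(col) for col in cols]

pls_rank = rank_features(first_pls_weights(cols, y))
pca_rank = rank_features(first_pca_loadings(cols))
print("PLS-DA ranking:", pls_rank)
print("PCA ranking:   ", pca_rank)
```

In this toy setting the class shift also makes feature 0 the highest-variance feature, so both rankings should place it first. If a noise feature had larger variance than the signal feature, the PCA ranking would follow the variance instead, which is one way the two methods can diverge.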
Project description:Classification studies are widely applied, e.g. in biomedical research to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (that is, the lowest possible classification error). Especially in gene expression experiments, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower dimensional space. The resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called 'power parameter' that is optimized to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and also with the ordinary PLS-DA using simulated and experimental datasets. For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA and for the extensions compared to PLS-DA. For simulated data with a very weak linear dependency and a low proportion of differentially expressed genes, PPLS-DA does not improve on PLS-DA, but our extension shows a lower prediction error. In contrast, for the data set with strong between-feature collinearity, a low proportion of differentially expressed genes, and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than for PLS-DA. Moreover, we compare these prediction results with results of support vector machines with linear kernel and linear discriminant analysis.
Project description:Metabolic fingerprinting studies rely on interpretations drawn from low-dimensional representations of spectral data generated by methods of multivariate analysis such as principal components analysis and projection to latent structures discriminant analysis. The growth of metabolic fingerprinting and chemometric analyses involving these low-dimensional scores plots necessitates the use of quantitative statistical measures to describe significant differences between experimental groups. Our updated version of the PCAtoTree software provides methods to reliably visualize and quantify separations in scores plots through dendrograms employing both nonparametric and parametric hypothesis testing to assess node significance, as well as scores plots identifying 95% confidence ellipsoids for all experimental groups.
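To make the idea of a 95% confidence region in a scores plot concrete, here is a minimal sketch for the two-dimensional case. It is not taken from the PCAtoTree software: the function name is hypothetical, the scores are assumed to be bivariate normal, and the region is scaled with a chi-square quantile (which has the closed form -2 ln(1 - level) for two degrees of freedom). Small groups are often handled with an F-based scaling instead.

```python
import math

def confidence_ellipse(points, level=0.95):
    """Return (center, semi_axes, angle) of a confidence ellipse for 2-D scores.

    Assumes bivariate normal scores and a chi-square scaling with 2 degrees
    of freedom; the angle (radians) is that of the major axis from the x-axis.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)

    # Eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2.0
    det = sxx * syy - sxy ** 2
    disc = math.sqrt(max(half_trace ** 2 - det, 0.0))
    lam_major, lam_minor = half_trace + disc, max(half_trace - disc, 0.0)

    k = -2.0 * math.log(1.0 - level)  # chi-square quantile, 2 degrees of freedom
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.sqrt(k * lam_major), math.sqrt(k * lam_minor)), angle

# Hypothetical scores for one experimental group.
scores = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
center_xy, semi_axes, angle = confidence_ellipse(scores)
print(center_xy, semi_axes, angle)
```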
Project description:The World Health Organization (WHO) declared the Omicron variant (B.1.1.529) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the pathogen responsible for the Coronavirus disease 2019 (COVID-19) pandemic, as a variant of concern on 26 November 2021. By this time, 42% of the world's population had received at least one dose of the vaccine against COVID-19. As of 1 October 2022, only 68% of the world population had received the first dose of the vaccine. Although the vaccination is incredibly protective against severe complications of the disease and death, the highly contagious Omicron variant, compared to the Delta variant (B.1.617.2), has led the whole world into more chaotic situations. Furthermore, the virus has a high mutation rate, and hence, the possibility of a new variant of concern in the future cannot be ruled out. To face such a challenging situation, paramount importance should be given to rapid diagnosis and isolation of the infected patient. Current diagnosis methods, including reverse transcription-polymerase chain reaction and rapid antigen tests, face significant burdens during a COVID-19 wave. However, studies have reported ultrarapid, reagent-free, cost-efficient, and non-destructive diagnosis methods based on chemometrics for COVID-19 and COVID-19 severity diagnosis. These studies used a smaller sample cohort to construct the diagnosis model and failed to discuss the robustness of the model. The current study systematically evaluated the robustness of the diagnosis models trained using smaller (real and augmented spectra) and larger (augmented spectra) datasets. The Monte Carlo cross-validation and permutation test results suggest that diagnosis using models trained by larger datasets was accurate and statistically significant (Q2 > 99% and AUROC = 100%).
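The two validation tools named in the abstract above, Monte Carlo cross-validation and a permutation test, can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic one-dimensional values rather than spectra, a nearest-class-mean rule stands in for the chemometric model, and all function names and parameter defaults are hypothetical.

```python
import random

def nearest_mean_accuracy(train, test):
    """Stand-in classifier: assign each test value to the class with the closer training mean."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    hits = sum(1 for x, y in test
               if min(means, key=lambda c: abs(x - means[c])) == y)
    return hits / len(test)

def monte_carlo_cv(data, n_splits=50, train_frac=0.7, seed=0):
    """Mean test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n_train = round(train_frac * len(data))
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        scores.append(nearest_mean_accuracy(shuffled[:n_train], shuffled[n_train:]))
    return sum(scores) / len(scores)

def permutation_test(data, n_perm=100, seed=1):
    """Compare the observed CV accuracy with accuracies obtained after shuffling the labels."""
    rng = random.Random(seed)
    observed = monte_carlo_cv(data)
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    at_least_as_good = 0
    for _ in range(n_perm):
        permuted = ys[:]
        rng.shuffle(permuted)
        if monte_carlo_cv(list(zip(xs, permuted))) >= observed:
            at_least_as_good += 1
    return observed, (at_least_as_good + 1) / (n_perm + 1)

# Hypothetical balanced data set: 20 samples per class, classes separated by 3 units.
data_rng = random.Random(42)
data = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(20)] + \
       [(data_rng.gauss(3.0, 1.0), 1) for _ in range(20)]
accuracy, p_value = permutation_test(data)
print(accuracy, p_value)
```

The balanced design matters here: with 20 samples per class and 28 training samples per split, every training set contains both classes, so the class means are always defined.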
Project description:Fast-growing trees like Capirona, Bolaina, and Pashaco have the potential to reduce forest degradation because of their ecological features, their economic importance in the Amazon Forest, and an industry based on wood-polymer composites. Therefore, a practical method to discriminate species (to avoid illegal logging) and determine chemical composition (for tree breeding programs) is needed. This study aimed to validate a model for the classification of wood species and a universal model for the rapid determination of cellulose, hemicellulose, and lignin using FTIR spectroscopy coupled with chemometrics. Our results showed that PLS-DA models for the classification of wood species (0.84 ≤ R2 ≤ 0.91, 0.12 ≤ RMSEP ≤ 0.20, accuracy, specificity, and sensitivity between 95.2 and 100%) were satisfactory with the full spectra, and that the differentiation among these species was based on IR peaks related to cellulose, lignin, and hemicellulose. Besides, the full spectra helped build a three-species universal PLS model to quantify the principal wood chemical components. The lignin (RPD = 2.27, [Formula: see text] = 0.84) and hemicellulose (RPD = 2.46, [Formula: see text] = 0.83) models showed good prediction, while the cellulose model (RPD = 3.43, [Formula: see text] = 0.91) was classified as efficient. This study showed that FTIR-ATR, together with chemometrics, is a reliable method to discriminate wood species and to determine the wood chemical composition in juvenile trees of Pashaco, Capirona, and Bolaina.
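The RMSEP and RPD figures quoted in the abstract above can be related with a short sketch. It assumes the common definition of RPD as the standard deviation of the reference values divided by RMSEP; the abstract does not state which variant the authors used, and the numbers below are made up for illustration.

```python
import math
import statistics

def rmsep(reference, predicted):
    """Root mean square error of prediction."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def rpd(reference, predicted):
    """Ratio of performance to deviation: sample SD of the reference values over RMSEP."""
    return statistics.stdev(reference) / rmsep(reference, predicted)

# Hypothetical cellulose contents (%) from a reference method and from a model.
reference = [40.0, 42.0, 44.0, 46.0, 48.0]
predicted = [41.0, 41.0, 45.0, 45.0, 49.0]
print(rmsep(reference, predicted), rpd(reference, predicted))
```

A larger RPD means the prediction error is small relative to the natural spread of the reference values, which is why the abstract uses it to grade the three component models.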
Project description:Partial Least Squares-Discriminant Analysis (PLS-DA) is a PLS regression method with a special binary 'dummy' y-variable and it is commonly used for classification purposes and biomarker selection in metabolomics studies. Several statistical approaches are currently in use to validate outcomes of PLS-DA analyses, e.g. double cross validation procedures or permutation testing. However, there is great inconsistency in the optimization and the assessment of performance of PLS-DA models due to the many different diagnostic statistics currently employed in metabolomics data analyses. In this paper, properties of four diagnostic statistics of PLS-DA, namely the number of misclassifications (NMC), the Area Under the Receiver Operating Characteristic (AUROC), Q(2) and Discriminant Q(2) (DQ(2)) are discussed. All four diagnostic statistics are used in the optimization and the performance assessment of PLS-DA models of three different-size metabolomics data sets obtained with two different types of analytical platforms and with different levels of known differences between two groups: control and case groups. Statistical significance of the obtained PLS-DA models was evaluated with permutation testing. PLS-DA models obtained with NMC and AUROC are more powerful in detecting very small differences between groups than models obtained with Q(2) and Discriminant Q(2) (DQ(2)). Reproducibility of the obtained PLS-DA model outcomes, model complexity and permutation test distributions are also investigated to explain this phenomenon. DQ(2) and Q(2) (in contrast to NMC and AUROC) prefer PLS-DA models with lower complexity and require a higher number of permutation tests and submodels to accurately estimate statistical significance of the model performance. NMC and AUROC seem more efficient and more reliable diagnostic statistics and should be recommended in two-group discrimination metabolomic studies.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11306-011-0330-3) contains supplementary material, which is available to authorized users.
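The four diagnostic statistics discussed in the abstract above can be computed from class labels and predicted y-values roughly as follows. This is a sketch, not the paper's code. It assumes the dummy variable is coded 0/1 with a 0.5 decision threshold, and that the predictions are cross-validated outputs of a PLS regression. It takes DQ(2) to be Q(2) with predictions that overshoot the class label on the correct side left unpenalized. The function names and example numbers are hypothetical.

```python
def nmc(y, y_hat, threshold=0.5):
    """Number of misclassifications at the given decision threshold."""
    return sum(1 for t, p in zip(y, y_hat) if (p > threshold) != (t == 1))

def auroc(y, y_hat):
    """Area under the ROC curve as the fraction of correctly ordered case/control pairs."""
    cases = [p for t, p in zip(y, y_hat) if t == 1]
    controls = [p for t, p in zip(y, y_hat) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in cases for b in controls)
    return wins / (len(cases) * len(controls))

def q2(y, y_hat, discriminant=False):
    """Q2 = 1 - PRESS / TSS; with discriminant=True, the DQ2 variant."""
    mean_y = sum(y) / len(y)
    tss = sum((t - mean_y) ** 2 for t in y)
    press = 0.0
    for t, p in zip(y, y_hat):
        # DQ2: predictions beyond the label on the correct side are not penalized.
        if discriminant and ((t == 1 and p > 1) or (t == 0 and p < 0)):
            continue
        press += (t - p) ** 2
    return 1.0 - press / tss

# Hypothetical labels (0 = control, 1 = case) and predicted y-values.
y = [0, 0, 0, 1, 1, 1]
y_hat = [-0.2, 0.1, 0.6, 0.4, 0.9, 1.3]
print(nmc(y, y_hat), auroc(y, y_hat), q2(y, y_hat), q2(y, y_hat, discriminant=True))
```

NMC and AUROC depend only on the side of the threshold or the ordering of the predictions, whereas Q(2) and DQ(2) depend on how far each prediction lies from its class label.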
Project description:In the dataset presented in this article, 168 rice samples comprising sixteen rice varieties (including Indica and Japonica subspecies) from a Portuguese Rice Breeding Program obtained from three different sites over four seasons, and 11 standard rice varieties from the International Rice Research Institute were characterised. The amylose concentration was evaluated using the iodine method, and the near infrared (NIR) spectra were determined. To assess the advantage of near infrared spectroscopy, different rice varieties and specific algorithms based on Matlab software such as Standard Normal Variate (SNV), Multiplicative Scatter Correction (MSC) and Savitzky-Golay filter were used for NIR spectra pre-processing.
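Of the pre-processing steps listed in the abstract above, Standard Normal Variate is the simplest to show: each spectrum is centered on its own mean and scaled by its own standard deviation. The sketch below is written in Python rather than the Matlab used for the dataset, and the function name and toy spectra are hypothetical.

```python
import statistics

def snv(spectrum):
    """Standard Normal Variate: center and scale one spectrum by its own mean and sample SD."""
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(value - mean) / sd for value in spectrum]

# Two hypothetical spectra with the same shape but different multiplicative scaling.
spectrum_a = [1.0, 2.0, 3.0, 4.0, 5.0]
spectrum_b = [10.0, 20.0, 30.0, 40.0, 50.0]
snv_a = snv(spectrum_a)
snv_b = snv(spectrum_b)
print(snv_a)
print(snv_b)
```

After the transform the two toy spectra coincide, which shows how SNV removes differences in offset and overall intensity between samples.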
Project description:Prostate-specific antigen (PSA) is the main biomarker for the screening of prostate cancer (PCa). Its high sensitivity (higher than 80%) is offset by its poor specificity (only 30%, with the European cut-off of 4 ng/mL). This generates a large number of unnecessary biopsies, involving both risks for the patients and costs for the national healthcare systems. Consequently, efforts were recently made to discover new biomarkers useful for PCa screening, including our proposal of interpreting a multi-parametric urinary steroidal profile with multivariate statistics. This approach has been expanded to investigate new putative biomarkers by the application of untargeted urinary metabolomics. Urine samples from 91 patients (43 affected by PCa; 48 by benign hyperplasia) were deconjugated, extracted in both basic and acidic conditions, derivatized with different reagents, and analyzed with different gas chromatographic columns. Three-dimensional data were obtained from full-scan electron impact mass spectra. The PARADISe software, coupled with NIST libraries, was employed for the computation of PARAFAC2 models, the extraction of the significant components (putative biomarkers), and the generation of a semiquantitative dataset. After variable selection, a partial least squares-discriminant analysis classification model was built, yielding promising performance. The selected biomarkers need further validation, possibly involving, yet again, a targeted approach.
Project description:Genomic approaches have provided detailed insight into chromosome architecture. However, commonly deployed techniques do not preserve connectivity-based information, leaving large-scale genome organization poorly characterized. Here, we developed CheC-PLS: a proximity-labeling technique that indelibly marks, and then decodes, protein-associated sites. CheC-PLS tethers dam methyltransferase to a protein of interest, followed by Nanopore sequencing to identify methylated bases - indicative of in vivo proximity - along reads >100 kb. As proof-of-concept we analyzed, in budding yeast, a cohesin-based meiotic backbone that organizes chromatin into an array of loops. Our data recapitulate previously obtained association patterns and, importantly, expose variability between cells. Single-read data reveal cohesin translocation on DNA and, by anchoring reads onto unique regions, we define the internal organization of the ribosomal DNA locus. Our versatile technique, which we also deployed on isolated nuclei with nanobodies, promises to illuminate diverse chromosomal processes by describing the in vivo conformations of single chromosomes.