An extension of PPLS-DA for classification and comparison to ordinary PLS-DA.
ABSTRACT: Classification studies are widely applied, e.g. in biomedical research to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (lowest classification error). In gene expression experiments in particular, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower dimensional space. The resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called 'power parameter' that is optimized to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and also with the ordinary PLS-DA using simulated and experimental data sets. For the investigated data sets with weak linear dependency between features/variables, neither PPLS-DA nor the extensions show an improvement over PLS-DA. For simulated data with a very weak linear dependency and a low proportion of differentially expressed genes, PPLS-DA does not improve on PLS-DA, but our extension shows a lower prediction error. In contrast, for the data set with strong between-feature collinearity, a low proportion of differentially expressed genes and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than for PLS-DA. Moreover, we compare these prediction results with results of support vector machines with linear kernel and linear discriminant analysis.
Project description:Detecting and predicting spoilage or infection by Penicillium expansum, for example with electronic nose systems, is crucial for effective apple storage. Based on acquired electronic nose signals, the sensors selected as sensitive to apple spoilage and the full sensor set were analyzed and compared in terms of recognition performance. Principal component analysis (PCA), principal component analysis-discriminant analysis (PCA-DA), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLS-DA) and K-nearest neighbor (KNN) were used to establish classification models for apples with different degrees of spoilage. PCA-DA gave the best prediction, with accuracies of 100% and 97.22% for the training set and prediction set, respectively. Synergy interval (SI), genetic algorithm (GA) and competitive adaptive reweighted sampling (CARS) were the three selection methods used to accurately and quickly extract appropriate feature variables for constructing a PLS model to predict plaque area. Among them, the PLS model with unique variables optimized by the CARS method gave the best prediction of the rotten area of the apple. The best results are as follows: Rc = 0.953, root mean square error of calibration (RMSEC) = 1.28, Rp = 0.972, root mean square error of prediction (RMSEP) = 1.01. The results demonstrated that the electronic nose has a potential application in the classification of rotten apples and the quantitative detection of spoilage area.
Project description:Partial Least Squares-Discriminant Analysis (PLS-DA) is a PLS regression method with a special binary 'dummy' y-variable, and it is commonly used for classification purposes and biomarker selection in metabolomics studies. Several statistical approaches are currently in use to validate outcomes of PLS-DA analyses, e.g. double cross validation procedures or permutation testing. However, there is great inconsistency in the optimization and the assessment of performance of PLS-DA models due to the many different diagnostic statistics currently employed in metabolomics data analyses. In this paper, properties of four diagnostic statistics of PLS-DA, namely the number of misclassifications (NMC), the Area Under the Receiver Operating Characteristic (AUROC), Q(2) and Discriminant Q(2) (DQ(2)) are discussed. All four diagnostic statistics are used in the optimization and the performance assessment of PLS-DA models of three different-size metabolomics data sets obtained with two different types of analytical platforms and with different levels of known differences between two groups: control and case groups. Statistical significance of the obtained PLS-DA models was evaluated with permutation testing. PLS-DA models obtained with NMC and AUROC are more powerful in detecting very small differences between groups than models obtained with Q(2) and Discriminant Q(2) (DQ(2)). Reproducibility of the obtained PLS-DA model outcomes, model complexity and permutation test distributions are also investigated to explain this phenomenon. DQ(2) and Q(2) (in contrast to NMC and AUROC) prefer PLS-DA models with lower complexity and require a higher number of permutation tests and submodels to accurately estimate statistical significance of the model performance. NMC and AUROC seem more efficient and more reliable diagnostic statistics and should be recommended in two-group discrimination metabolomic studies.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11306-011-0330-3) contains supplementary material, which is available to authorized users.
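The four diagnostic statistics named in this abstract can be illustrated with a minimal sketch. It assumes a two-group problem with a 0/1 dummy y and cross-validated PLS-DA predictions that are already available as a list of numbers. The function names, the 0.5 decision threshold, and the demo values are assumptions made for illustration, not taken from the paper. DQ(2) is written here in the commonly described form in which a prediction beyond its class label (below 0 for class 0, above 1 for class 1) contributes no error.

```python
def nmc(y, scores, threshold=0.5):
    """Number of misclassifications: a score above the threshold is read as class 1."""
    return sum((s > threshold) != (label == 1) for label, s in zip(y, scores))


def auroc(y, scores):
    """Area under the ROC curve as the fraction of (case, control) pairs ranked correctly."""
    pos = [s for label, s in zip(y, scores) if label == 1]
    neg = [s for label, s in zip(y, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def q2(y, scores, discriminant=False):
    """Q2 = 1 - PRESS / TSS on cross-validated predictions.

    With discriminant=True (DQ2, as assumed here), predictions that overshoot
    their class label are not counted in PRESS.
    """
    y_mean = sum(y) / len(y)
    tss = sum((label - y_mean) ** 2 for label in y)
    press = 0.0
    for label, s in zip(y, scores):
        overshoot = (label == 1 and s > 1) or (label == 0 and s < 0)
        if discriminant and overshoot:
            continue
        press += (label - s) ** 2
    return 1 - press / tss


# Hypothetical dummy labels and cross-validated predictions for four samples.
y_demo = [0, 0, 1, 1]
cv_scores_demo = [-0.2, 0.6, 0.4, 1.3]
print(nmc(y_demo, cv_scores_demo), auroc(y_demo, cv_scores_demo))
print(q2(y_demo, cv_scores_demo), q2(y_demo, cv_scores_demo, discriminant=True))
```

NMC and AUROC depend only on which side of the threshold a prediction falls or on how the predictions rank, whereas Q(2) and DQ(2) also depend on the numerical distance between each prediction and its 0/1 label.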
Project description:To address high-dimensional genomic data, most of the proposed prediction methods make use of genomic data alone without considering clinical data, which are often available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions. We consider here methods for classification purposes that simultaneously use both types of variables but apply dimensionality reduction only to the high-dimensional genomic ones. Using partial least squares (PLS), we propose some one-step approaches based on three extensions of the least squares (LS)-PLS method for logistic regression. A comparison of their prediction performances via a simulation and on real data sets from cancer studies is conducted. In general, those methods using only clinical data or only genomic data perform poorly. The advantage of using LS-PLS methods for classification and their performances are shown and then used to analyze clinical and genomic data. The corresponding prediction results are encouraging and stable regardless of the data set and/or number of selected features. These extensions have been implemented in the R package lsplsGlm to enhance their use.
Project description:BACKGROUND: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits. RESULTS: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework. CONCLUSIONS: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
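Sparse PLS approaches typically obtain variable selection by soft-thresholding a loading vector, so that small entries become exactly zero. The sketch below shows only that thresholding step. The function names, the penalty value, and the example loading vector are hypothetical, and this is not the mixOmics implementation.

```python
import math


def soft_threshold(weights, penalty):
    """Shrink each entry towards zero by `penalty`; entries smaller than it become 0."""
    return [math.copysign(max(abs(w) - penalty, 0.0), w) for w in weights]


def sparse_loading(weights, penalty):
    """Soft-threshold a loading vector, then rescale it to unit length."""
    shrunk = soft_threshold(weights, penalty)
    norm = math.sqrt(sum(w * w for w in shrunk))
    return shrunk if norm == 0 else [w / norm for w in shrunk]


# Hypothetical loading vector over three variables: the third is dropped.
loading = sparse_loading([0.9, -0.3, 0.05], penalty=0.1)
selected = [j for j, w in enumerate(loading) if w != 0]
print(loading, selected)
```

Variables whose loading is exactly zero do not contribute to the component, which is what makes the selected subset directly readable from the fitted model.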
Project description:Paris polyphylla, a traditional herb with a long history, has been widely used to treat diseases among multiple nationalities of China. Nevertheless, the quality of P. yunnanensis fluctuates among different geographical origins, so a fast and accurate classification method needs to be established. In our study, 462 P. yunnanensis rhizome and leaf samples from Kunming, Yuxi, Chuxiong, Dali, Lijiang, and Honghe were analyzed for geographical origin identification by Fourier transform mid infrared (FT-MIR) spectra, combined with partial least squares discriminant analysis (PLS-DA), random forest (RF), and hierarchical cluster analysis (HCA) methods. The obvious cluster tendency of the rhizome and leaf FT-MIR spectra was displayed by principal component analysis (PCA). The distribution of the variable importance for the projection (VIP) was more uniform than that of the important variables obtained by RF, while PLS-DA models obtained higher classification abilities. Hence, the PLS-DA model was more suitable than the RF model for classifying the different geographical origins of P. yunnanensis. Additionally, the clustering results of different geographical origins obtained by HCA dendrograms also proved the chemical information difference between rhizomes and leaves. The identification performances of the PLS-DA and RF models on the leaf FT-MIR matrices were better than those on the rhizome data sets. In addition, the model classification abilities of the combined data sets were higher than those of the individual matrices of rhizome and leaf spectra. Our study provides a reference for the rational utilization of resources, as well as for fast and accurate identification research on P. yunnanensis samples.
Project description:Linear discriminant analysis (LDA) is a classical statistical approach for dimensionality reduction and classification. In many cases, the projection direction of the classical and extended LDA methods is not considered optimal for special applications. Herein we combine the Partial Least Squares (PLS) method with the LDA algorithm and propose two improved methods, named LDA-PLS and ex-LDA-PLS, respectively. LDA-PLS amends the projection direction of LDA by using the information of PLS, while ex-LDA-PLS is an extension of LDA-PLS that combines the results of LDA-PLS and LDA, bringing the result closer to the optimal direction by means of an adjusting parameter. Comparative studies are provided between the proposed methods and other traditional dimension reduction methods such as principal component analysis (PCA), LDA and PLS-LDA on two data sets. Experimental results show that the proposed methods can achieve better classification performance.
Project description:Mahonia bealei (Fort.) Carr. (M. bealei) plays an important role in the treatment of many diseases. In the present study, a comprehensive method combining supercritical fluid chromatography (SFC) fingerprints and chemical pattern recognition (CPR) for quality evaluation of M. bealei was developed. Similarity analysis, hierarchical cluster analysis (HCA) and principal component analysis (PCA) were applied to classify and evaluate the samples of wild M. bealei, cultivated M. bealei and its substitutes according to the peak areas of 11 components, but an accurate classification could not be achieved. PLS-DA was then adopted to select the characteristic variables responsible for accurate classification, based on variable importance in projection (VIP) values. Six characteristic peaks with higher VIP values (?1) were selected for building the CPR model. Based on the six variables, the three types of samples were accurately classified into three related clusters. The model was further validated with test-set and prediction-set samples. The results indicated that the model was successfully established and that its predictive ability was satisfactory. The established model demonstrated that the developed SFC method coupled with PLS-DA has great potential for quality assessment of M. bealei.
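VIP-based variable selection, as used in this abstract, can be sketched for the simplest case of a single response. The example fits a PLS1 model with a NIPALS-style loop on mean-centered data and applies the usual VIP formula, in which each variable's squared normalized weight is weighted by the y-variance explained by its component. The function name `pls1_vip` and the toy data are assumptions, and a real analysis would normally scale the variables and choose the number of components by cross-validation.

```python
import math


def pls1_vip(X, y, n_components):
    """Return one VIP score per column of X for a single-response PLS model."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X residual
    f = [v - y_mean for v in y]                                # centered y residual
    weights, explained = [], []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            break  # nothing left in X that covaries with y
        w = [v / norm for v in w]                              # unit-length weight vector
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        q = sum(t[i] * f[i] for i in range(n)) / tt            # y loading
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        explained.append(q * q * tt)                           # y sum of squares explained
    total = sum(explained)
    return [
        math.sqrt(p * sum(s * w[j] ** 2 for s, w in zip(explained, weights)) / total)
        for j in range(p)
    ]


# Toy data: column 0 separates the two groups, columns 1 and 2 do not.
X_demo = [[0.1, 0.5, 0.2], [0.0, 0.4, 0.3], [0.2, 0.6, 0.1],
          [1.0, 0.5, 0.3], [0.9, 0.4, 0.2], [1.1, 0.6, 0.1]]
y_demo = [0, 0, 0, 1, 1, 1]
print(pls1_vip(X_demo, y_demo, n_components=2))
```

Because the squared VIP scores average to 1 across variables, a score around 1 is a common reference point for judging which variables matter more than average.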
Project description:Volatile metabolites are currently under investigation as potential biomarkers for the detection and identification of pathogenic microorganisms, including bacteria, fungi, and viruses. Unlike bacteria and fungi, which produce distinct volatile metabolic signatures associated with innate differences in both primary and secondary metabolic processes, viruses are wholly reliant on the metabolic machinery of infected cells for replication and propagation. In the present study, the ability of volatile metabolites to discriminate between respiratory cells infected and uninfected with virus, in vitro, was investigated. Two important respiratory viruses, namely respiratory syncytial virus (RSV) and influenza A virus (IAV), were evaluated. Data were analyzed using three different machine learning algorithms (random forest (RF), linear support vector machines (linear SVM), and partial least squares-discriminant analysis (PLS-DA)), with volatile metabolites identified from a training set used to predict sample classifications in a validation set. The discriminatory performances of RF, linear SVM, and PLS-DA were comparable for the comparison of IAV-infected versus uninfected cells, with area under the receiver operating characteristic curves (AUROCs) between 0.78 and 0.82, while RF and linear SVM demonstrated superior performance in the classification of RSV-infected versus uninfected cells (AUROCs between 0.80 and 0.84) relative to PLS-DA (0.61). A subset of discriminatory features were assigned putative compound identifications, with an overabundance of hydrocarbons observed in both RSV- and IAV-infected cell cultures relative to uninfected controls. This finding is consistent with increased oxidative stress, a process associated with viral infection of respiratory cells.
Project description:BACKGROUND: Small sample sizes combined with multiple correlated endpoints pose a major challenge in the statistical analysis of preclinical neurotrauma studies. The standard approach of applying univariate tests on individual response variables has the advantage of simplicity of interpretation, but it fails to account for the covariance/correlation in the data. In contrast, multivariate statistical techniques might more adequately capture the multi-dimensional pathophysiological pattern of neurotrauma and therefore provide increased sensitivity to detect treatment effects. RESULTS: We systematically evaluated the performance of univariate ANOVA, Welch's ANOVA and linear mixed effects models versus the multivariate techniques, ANOVA on principal component scores and MANOVA tests by manipulating factors such as sample and effect size, normality and homogeneity of variance in computer simulations. Linear mixed effects models demonstrated the highest power when variance between groups was equal or variance ratio was 1:2. In contrast, Welch's ANOVA outperformed the remaining methods with extreme variance heterogeneity. However, power only reached acceptable levels of 80% in the case of large simulated effect sizes and at least 20 measurements per group or moderate effects with at least 40 replicates per group. In addition, we evaluated the capacity of the ordination techniques, principal component analysis (PCA), redundancy analysis (RDA), linear discriminant analysis (LDA), and partial least squares discriminant analysis (PLS-DA) to capture patterns of treatment effects without formal hypothesis testing. While LDA suffered from a high false positive rate due to multicollinearity, PCA, RDA, and PLS-DA were robust and PLS-DA outperformed PCA and RDA in capturing a true treatment effect pattern.
CONCLUSIONS: Multivariate tests do not provide an appreciable increase in power compared to univariate techniques to detect group differences in preclinical studies. However, PLS-DA seems to be a useful ordination technique to explore treatment effect patterns without formal hypothesis testing.
Project description:Due to the existence of Lingzhi adulteration, there is a growing demand for species classification of medicinal mushrooms by various techniques. The objective of this study was to explore a rapid and reliable way to distinguish between different Lingzhi species and compare the influence of data pretreatment methods on the recognition results. To this end, 120 fresh fruiting bodies of Lingzhi were collected, and all of them were analyzed by attenuated total reflection-Fourier transform infrared spectroscopy (ATR-FTIR). Random forest (RF), support vector machine (SVM) and partial least squares discriminant analysis (PLS-DA) classification models were established for raw and pretreated second derivative (SD) spectral matrices to authenticate different Lingzhi species. The results of multivariate statistical analysis indicated that the SD preprocessing method displayed a higher classification ability, which may be attributed to the analysis of powder samples that requires removal of overlapping peaks and baseline shifts. Compared with RF, the results of the SVM and PLS-DA methods were more satisfying, and their accuracies for the test set were both 100%. Among SVM and PLS-DA, the training set and test set accuracy of PLS-DA were both 100%. In conclusion, ATR-FTIR spectroscopy data pretreated by SD combined with PLS-DA is a simple, rapid, non-destructive and relatively inexpensive method to discriminate between mushroom species and provide a good reference to quality assessment.