Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease.
ABSTRACT: Feature extraction (FE) is difficult, particularly if there are more features than samples, as small sample numbers often result in biased outcomes or overfitting. Furthermore, multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder for multiple classes than for the two-class problem. Developing unsupervised methods that are independent of sample classification would solve many of these problems. Two principal component analysis (PCA)-based FE methods were tested as sample classification-independent unsupervised FE methods: variational Bayes PCA (VBPCA), which was extended to perform unsupervised FE, and conventional PCA (CPCA)-based unsupervised FE. VBPCA- and CPCA-based unsupervised FE both performed well when applied to simulated data and to a posttraumatic stress disorder (PTSD)-mediated heart disease data set that had multiple categorical class observations in mRNA/microRNA expression of stressed mouse heart. A critical set of PTSD miRNAs/mRNAs was identified that showed aberrant expression between treatment and control samples and significant negative correlation with one another. Moreover, greater stability and biological feasibility than conventional supervised FE were also demonstrated. Based on the results obtained, in silico drug discovery was performed as translational validation of the methods. Our two proposed unsupervised FE methods (CPCA- and VBPCA-based) worked well on simulated data and outperformed two conventional supervised FE methods on a real data set. Thus, these two methods appear to be equivalent for FE on categorical multiclass data sets, with potential translational utility for in silico drug discovery.
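As a rough illustration of what label-free, PCA-based feature selection can look like, the following minimal sketch ranks features by their absolute score on one principal component. It is not the procedure of the study: the function name `pca_unsupervised_fe`, the "top absolute score" ranking rule, the choice of the first component, and the simulated data are all assumptions made for illustration, and the VBPCA variant is not shown.

```python
import numpy as np

def pca_unsupervised_fe(X, n_select, pc=0):
    """Rank features by |score| on one principal component, ignoring class labels.

    X has shape (n_features, n_samples). Returns the sorted indices of the
    n_select features with the largest absolute score on component `pc`,
    together with the sample loadings of that component.
    """
    # Center each feature across samples so components describe between-sample variation.
    Xc = X - X.mean(axis=1, keepdims=True)
    # SVD: columns of U live in feature space, rows of Vt are sample loadings.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, pc] * S[pc]              # one score per feature on the chosen component
    order = np.argsort(-np.abs(scores))    # largest absolute score first
    return np.sort(order[:n_select]), Vt[pc]

# Hypothetical simulated data: 200 features, 12 samples, 20 informative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
X[:20, 6:] += 6.0                          # features 0..19 are shifted in the last 6 samples

selected, sample_loading = pca_unsupervised_fe(X, n_select=20)
print(selected)
print(np.round(sample_loading, 2))
```

The class labels are never passed to the function; they are only implicit in how the simulated data were generated. In practice, which component to use would be decided by inspecting the sample loadings.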
Project description:The recently proposed principal component analysis (PCA) based unsupervised feature extraction (FE) has successfully been applied to various bioinformatics problems ranging from biomarker identification to the screening of disease causing genes using gene expression/epigenetic profiles. However, the conditions required for its successful use and the mechanisms by which it outperforms other supervised methods are unknown, because PCA based unsupervised FE has only been applied to challenging (i.e. not well known) problems. In this study, PCA based unsupervised FE was applied to an extensively studied organism, i.e., budding yeast. When applied to two gene expression profiles expected to be temporally periodic, yeast metabolic cycle (YMC) and yeast cell division cycle (YCDC), PCA based unsupervised FE outperformed simple but powerful conventional methods based on sinusoidal fitting in several respects: (i) feasible biological term enrichment without assuming periodicity for YMC; (ii) identification of periodic profiles whose period was half as long as the cell division cycle for YMC; and (iii) the identification of no more than 37 genes associated with the enrichment of biological terms related to cell division cycle for the integrated analysis of seven YCDC profiles, for which sinusoidal fittings failed. The explanation for the differences between the methods and the necessary conditions were determined by comparing PCA based unsupervised FE with fittings to various periodic (artificial, thus pre-defined) profiles. Furthermore, four popular unsupervised clustering algorithms applied to YMC were not as successful as PCA based unsupervised FE. PCA based unsupervised FE is a useful and effective unsupervised method to investigate YMC and YCDC. This study identified why the unsupervised method without pre-judged criteria outperformed supervised methods requiring human defined criteria.
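To make the comparison with sinusoidal fitting concrete, here is a minimal sketch of a least-squares sinusoidal fit at a fixed, pre-assumed period. The function name `fit_sinusoid`, the time grid, and the synthetic profile are assumptions for illustration and are not taken from the study. The example shows why a fit that assumes one period can miss a profile whose period is half as long.

```python
import numpy as np

def fit_sinusoid(t, y, period):
    """Least-squares fit of y ~ a*sin(w t) + b*cos(w t) + c for a fixed period.

    Returns the coefficients (a, b, c) and the fraction of variance explained.
    """
    w = 2.0 * np.pi / period
    # Design matrix: the model is linear in a, b, c once the period is fixed.
    A = np.column_stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return coef, r2

# Hypothetical noise-free profile with period 8, sampled at 32 time points.
t = np.arange(32, dtype=float)
y = 2.0 * np.sin(2.0 * np.pi * t / 8.0) + 0.5

coef_true, r2_true = fit_sinusoid(t, y, period=8.0)     # correct period assumed
coef_wrong, r2_wrong = fit_sinusoid(t, y, period=16.0)  # period twice too long
print(np.round(coef_true, 3), round(r2_true, 3))
print(np.round(coef_wrong, 3), round(r2_wrong, 3))
```

With the correct period the fit explains essentially all of the variance, while the fit at the doubled period explains essentially none, because the two frequencies are orthogonal on this sampling grid.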
Project description:In order to have a better understanding of unexplained heritability for complex diseases in conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, have recently become popular, as they enable evaluation of the joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). Past PCA-based approaches mostly make inherent assumptions about the genotype data and/or the risk effect model, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. Here, we derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We have evaluated the efficacy of SCPCA in comparison with a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA) on both genotype data simulated by HAPGEN2 and the genotype data of Crohn's Disease (CD) from the Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results have demonstrated the superiority of SCPCA over both SPCA and SLPCA due to its modeling explicitly designed for categorical SNP data as well as its flexibility on the risk effect model assumption.
Project description:Methods of multiblock bilinear factorizations have increased in popularity in chemistry and biology as recent increases in the availability of information-rich spectroscopic platforms have made collecting multiple spectroscopic observations per sample a practicable possibility. Of the existing multiblock methods, consensus PCA (CPCA-W) and multiblock PLS (MB-PLS) have been shown to bear desirable qualities for multivariate modeling, most notably their computability from single-block PCA and PLS factorizations. While MB-PLS is a powerful extension to the nonlinear iterative partial least squares (NIPALS) framework, it still spreads predictive information across multiple components when response-uncorrelated variation exists in the data. The OnPLS extension to O2PLS provides a means of simultaneously extracting predictive and uncorrelated variation from a set of matrices, but is more suited to unsupervised data discovery than regression. We describe the union of NIPALS MB-PLS with an orthogonal signal correction (OSC) filter, called MB-OPLS, and illustrate its equivalence to single-block OPLS for regression and discriminant analysis.
Project description:BACKGROUND: Extreme heat poses current and future risks to human health. Heat vulnerability indices (HVIs), commonly developed using principal components analysis (PCA), are mapped to identify populations vulnerable to extreme heat. Few studies critically assess implications of analytic choices made when employing this methodology for fine-scale vulnerability mapping. OBJECTIVE: We investigated sensitivity of HVIs created by applying PCA to input variables and whether training input variables on heat-health data produced HVIs with similar spatial vulnerability patterns for Detroit, Michigan, USA. METHODS: We acquired 2010 Census tract and block group level data, land cover data, daily ambient apparent temperature, and all-cause mortality during May-September, 2000-2009. We used PCA to construct HVIs using: a) "unsupervised"-PCA applied to variables selected a priori as risk factors for heat-related health outcomes; b) "supervised"-PCA applied only to variables significantly correlated with proportion of all-cause mortality occurring on extreme heat days (i.e., days with 2-d mean apparent temperature above month-specific 95th percentiles). RESULTS: Unsupervised and supervised HVIs yielded differing spatial vulnerability patterns, depending on selected land cover input variables. Supervised PCA explained 62% of variance in the input variables and was applied on half the variables used in the unsupervised method. Census tract-level supervised HVI values were positively associated with increased proportion of mortality occurring on extreme heat days; supervised PCA could not be applied to block group data. Unsupervised HVI values were not associated with extreme heat mortality for either tracts or block groups. DISCUSSION: HVIs calculated using PCA are sensitive to input data and scale. Supervised HVIs may provide marginally more specific indicators of heat vulnerability than unsupervised HVIs. PCA-derived HVIs address correlation among vulnerability indicators, although the resulting output requires careful contextual interpretation beyond generating epidemiological research questions. Methods with reliably stable outputs should be leveraged for prioritizing heat interventions. https://doi.org/10.1289/EHP4030.
Project description:We present results of our machine learning approach to the problem of classifying GC-MS data originating from wheat grains of different farming systems. The aim is to investigate the potential of learning algorithms to classify GC-MS data as coming from either conventionally grown or organically grown samples, taking different cultivars into account. The motivation of our work is rather obvious nowadays: increased demand for organic food in post-industrialized societies and the necessity to prove organic food authenticity. Our data set comprises up to 11 wheat cultivars that were cultivated in both farming systems, organic and conventional, over 3 years. More than 300 GC-MS measurements were recorded and subsequently processed and analyzed in the MeltDB 2.0 metabolomics analysis platform, which is briefly outlined in this paper. We further describe how unsupervised (t-SNE, PCA) and supervised (SVM) methods can be applied for sample visualization and classification. Our results clearly show that year has the greatest and wheat cultivar the second-greatest influence on the metabolic composition of a sample. We can also show that for a given year and cultivar, organic and conventional cultivation can be distinguished by machine-learning algorithms.
Project description:Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.
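The core computation behind contrastive PCA can be sketched as an eigendecomposition of the target covariance minus a scaled background covariance. The sketch below is a minimal NumPy version under stated assumptions, not the publicly available implementation mentioned above: the function name `contrastive_pca`, the fixed contrast strength `alpha=1.0`, and the synthetic data are chosen purely for illustration.

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Directions with high variance in `target` but low variance in `background`.

    Both inputs have shape (n_samples, n_features). Returns the top
    eigenvectors (as columns) and eigenvalues of C_target - alpha * C_background.
    """
    target_cov = np.cov(target, rowvar=False)
    background_cov = np.cov(background, rowvar=False)
    contrast = target_cov - alpha * background_cov
    evals, evecs = np.linalg.eigh(contrast)          # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:n_components]     # largest eigenvalues first
    return evecs[:, top], evals[top]

# Hypothetical data: feature 0 varies strongly in both datasets,
# feature 1 varies more only in the target dataset.
rng = np.random.default_rng(1)
n = 5000
target = rng.normal(size=(n, 3)) * np.array([5.0, 3.0, 1.0])
background = rng.normal(size=(n, 3)) * np.array([5.0, 1.0, 1.0])

# Ordinary PCA on the target alone picks the shared high-variance feature.
pca_evals, pca_evecs = np.linalg.eigh(np.cov(target, rowvar=False))
pca_dir = pca_evecs[:, -1]

cpca_dirs, cpca_evals = contrastive_pca(target, background, alpha=1.0, n_components=1)
cpca_dir = cpca_dirs[:, 0]
print(np.round(np.abs(pca_dir), 2), np.round(np.abs(cpca_dir), 2))
```

Here ordinary PCA aligns with feature 0, which dominates both datasets, whereas the contrastive direction aligns with feature 1, the structure enriched in the target. In practice `alpha` is a tuning parameter and several values are usually explored.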
Project description:Although single-cell RNA sequencing (scRNA-seq) is a newly invented and promising technology, the lack of information that labels individual cells makes it hard to interpret the gene expression obtained for each cell. Because of the insufficient information available, unsupervised clustering, for example, t-distributed stochastic neighbor embedding and uniform manifold approximation and projection, is usually employed to obtain low-dimensional embeddings that can help to understand cell-cell relationships. One possible drawback of this strategy is that the outcome is highly dependent upon the genes selected for clustering. To address this, many methods for unsupervised gene selection have been proposed. In this study, a tensor decomposition (TD)-based unsupervised feature extraction (FE) was applied to the integration of two scRNA-seq expression profiles that measure human and mouse midbrain development. TD-based unsupervised FE could select not only coincident genes between human and mouse but also biologically reliable genes. Coincidence between the two species as well as biological reliability of the selected genes is increased compared with that obtained using principal component analysis (PCA)-based FE applied to the same data set in the previous study. Since PCA-based unsupervised FE outperformed the other three popular unsupervised gene selection methods, highly variable genes, bimodal genes, and dpFeature, TD-based unsupervised FE can do so as well. In addition, 10 transcription factors (TFs) that might regulate the selected genes and might contribute to midbrain development were identified. These 10 TFs, BHLHE40, EGR1, GABPA, IRF3, PPARG, REST, RFX5, STAT3, TCF7L2, and ZBTB33, were previously reported to be related to brain functions and diseases. TD-based unsupervised FE is a promising method to integrate two scRNA-seq profiles effectively.
Project description:Essential Dynamics (ED) is a common application of principal component analysis (PCA) to extract biologically relevant motions from atomic trajectories of proteins. Covariance and correlation based PCA are two common approaches to determine PCA modes (eigenvectors) and their eigenvalues. Protein dynamics can be characterized in terms of Cartesian coordinates or internal distance pairs. In understanding protein dynamics, a comparison of trajectories taken from a set of proteins for similarity assessment provides insight into conserved mechanisms. Comprehensive software is needed to facilitate comparative analysis with user-friendly features that are rooted in best practices from multivariate statistics. We developed a Java based Essential Dynamics toolkit called JED to compare the ED from multiple protein trajectories. Trajectories from different simulations and different proteins can be pooled for comparative studies. JED implements Cartesian-based coordinates (cPCA) and internal distance pair coordinates (dpPCA) as options to construct covariance (Q) or correlation (R) matrices. Statistical methods are implemented for treating outliers, benchmarking sampling adequacy, characterizing the precision of Q and R, and reporting partial correlations. JED outputs results as text files that include transformed coordinates for aligned structures, several metrics that quantify protein mobility, PCA modes with their eigenvalues, and displacement vector (DV) projections onto the top principal modes. Pymol scripts together with PDB files allow movies of individual Q- and R-cPCA modes to be visualized, along with the essential dynamics occurring within user-selected time scales. Subspaces defined by the top eigenvectors are compared using several statistical metrics to quantify similarity/overlap of high dimensional vector spaces. Free energy landscapes can be generated for both cPCA and dpPCA. JED offers a convenient toolkit that encourages best practices in applying multivariate statistics methods to perform comparative studies of essential dynamics over multiple proteins. For each protein, Cartesian coordinates or internal distance pairs can be employed over the entire structure or user-selected parts to quantify similarity/differences in mobility and correlations in dynamics to develop insight into protein structure/function relationships.
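The difference between covariance-based (Q) and correlation-based (R) PCA can be shown with a minimal sketch. This is Python rather than Java and is not JED's implementation: the function name `pca_modes` and the three synthetic variables, which stand in for trajectory coordinates, are assumptions for illustration.

```python
import numpy as np

def pca_modes(X, use_correlation=False):
    """PCA modes of X (n_frames, n_variables) from the covariance or correlation matrix.

    Returns eigenvalues in descending order and the matching eigenvectors as columns.
    """
    if use_correlation:
        M = np.corrcoef(X, rowvar=False)   # R: every variable rescaled to unit variance
    else:
        M = np.cov(X, rowvar=False)        # Q: variables keep their original scale
    evals, evecs = np.linalg.eigh(M)       # ascending order for symmetric matrices
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Hypothetical stand-in for trajectory data: variable 0 fluctuates 10x more than the rest.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 1.0])

evals_q, modes_q = pca_modes(X, use_correlation=False)
evals_r, modes_r = pca_modes(X, use_correlation=True)
print(np.round(evals_q, 2), np.round(np.abs(modes_q[:, 0]), 2))
print(np.round(evals_r, 2))
```

The top Q mode is dominated by the variable with the largest fluctuations, while the R eigenvalues sum to the number of variables because each variable is standardized first. This is why the two choices can emphasize different motions.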
Project description:Over the past few decades, the rise of multiple chronic conditions has become a major concern for clinicians. However, it is still not known precisely how multiple chronic conditions emerge among patients. We propose an unsupervised multi-level temporal Bayesian network to provide a compact representation of the relationship among emergence of multiple chronic conditions and patient level risk factors over time. To improve the efficiency of the learning process, we use an extension of the maximum weight spanning tree algorithm and a greedy search algorithm to study the structure of the proposed network in three stages, starting with learning the inter-relationship of comorbidities within each year, followed by learning the intra-relationship of comorbidity emergence between consecutive years, and finally learning the hierarchical relationship of comorbidities and patient level risk factors. We also use a longest path algorithm to identify the most likely sequence of comorbidities emerging from and/or leading to specific chronic conditions. Using a de-identified dataset of more than 250,000 patients receiving care from the U.S. Department of Veterans Affairs for a period of five years, we compare the performance of the proposed unsupervised Bayesian network with that of Bayesian networks developed based on supervised and semi-supervised learning approaches, as well as multivariate probit regression, multinomial logistic regression, and latent regression Markov mixture clustering, focusing on traumatic brain injury (TBI), post-traumatic stress disorder (PTSD), depression (Depr), substance abuse (SuAb), and back pain (BaPa). Our findings show that the unsupervised approach has noticeably accurate predictive performance that is comparable to the best performing semi-supervised and the second-best performing supervised approaches. These findings also revealed that the unsupervised approach has improved performance over multivariate probit regression, multinomial logistic regression, and latent regression Markov mixture clustering.
Project description:Purpose: The aim of this study was to compare the effects of supervised combined physical training and unsupervised physician-prescribed regular exercise on the functional capacity and quality of life of heart failure patients. Methods: This is a longitudinal prospective study of 28 consecutive patients with heart failure with reduced ejection fraction, randomly divided into two age- and gender-matched groups: a trained group (n = 17) and a nontrained group (n = 11). All patients underwent clinical evaluation, transthoracic echocardiography, the Cooper walk test, and a Quality of Life questionnaire before and after a 12-week study protocol. Categorical variables were expressed as proportions and compared with the chi-square test. Two-way ANOVA was performed to compare the continuous variables considering the cofactors group and time of intervention, and Pearson correlation tests were conducted for the associations in the same group. Results: No significant differences between groups were found at baseline. At the end of the protocol, there were improvements in the functional capacity and ejection fraction of the trained group in relation to the nontrained group (p < 0.05). There was a time and group interaction for improvement in the quality of life in the trained group. Conclusions: In patients with heart failure with reduced ejection fraction, supervised combined physical training improved exercise tolerance and quality of life compared with the unsupervised regular exercise prescribed in routine medical consultations. Left ventricular systolic function was improved with supervised physical training.