Project description:The large p small n problem remains a challenge for which no de facto standard method exists. In this study, we propose a tensor-decomposition (TD)-based unsupervised feature extraction (FE) formalism applied to multiomics datasets, in which the number of features is more than 100,000 whereas the number of samples is as small as about 100, hence constituting a typical large p small n problem. The proposed TD-based unsupervised FE outperformed three conventional supervised feature selection methods (random forest, categorical regression (also known as analysis of variance, or ANOVA), and penalized linear discriminant analysis) and two unsupervised methods (multiple non-negative matrix factorization and principal component analysis (PCA)-based unsupervised FE) when applied to synthetic datasets, and it outperformed the four methods other than PCA-based unsupervised FE when applied to multiomics datasets. The genes selected by TD-based unsupervised FE were enriched in genes known to be related to the tissues and transcription factors measured. TD-based unsupervised FE was demonstrated to be not only the superior feature selection method but also a method that can select biologically reliable genes. To our knowledge, this is the first study in which TD-based unsupervised FE has been successfully applied to the integration of this variety of multiomics measurements.
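The abstract above does not spell out the procedure, so the following is only a minimal sketch of how TD-based unsupervised FE might be run on a small synthetic tensor. It assumes a commonly described recipe: a higher-order SVD, P-values attributed to one feature singular vector under a Gaussian null (chi-squared with one degree of freedom), and Benjamini-Hochberg correction. The function names, the synthetic data, the choice of the first singular vector, and the 0.01 threshold are all illustrative assumptions, not details taken from the study.

```python
import math
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor so that rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd_factors(tensor):
    """Left singular vectors of every mode unfolding (HOSVD factor matrices)."""
    return [np.linalg.svd(unfold(tensor, m), full_matrices=False)[0]
            for m in range(tensor.ndim)]

def select_features(u, alpha=0.01):
    """Select features whose BH-adjusted P-value is below alpha.

    Assumption: the singular vector entries follow a Gaussian null, so
    (u / sd)^2 is chi-squared with one degree of freedom.
    """
    z2 = (u / u.std()) ** 2
    # Chi-squared (1 d.o.f.) upper tail: P(X > x) = erfc(sqrt(x / 2))
    p = np.array([math.erfc(math.sqrt(v / 2.0)) for v in z2])
    n = len(p)
    order = np.argsort(p)
    # Benjamini-Hochberg step-up adjustment
    ranked = p[order] * n / np.arange(1, n + 1)
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adj_sorted, 1.0)
    return np.flatnonzero(adjusted < alpha)

# Synthetic tensor: features x samples x omics layers (hypothetical sizes)
rng = np.random.default_rng(0)
n_features, n_samples, n_omics = 1000, 20, 3
x = rng.normal(size=(n_features, n_samples, n_omics))

# Plant a two-group sample pattern in the first 10 features of every layer
groups = np.repeat([1.0, -1.0], n_samples // 2)
x[:10] += 5.0 * groups[None, :, None]

u_feat, u_samp, u_omics = hosvd_factors(x)
selected = select_features(u_feat[:, 0])
print(selected)
```

In a real analysis the singular vector used for selection would be chosen by inspecting the sample-mode vectors for the pattern of interest; the sketch takes the first one only because the planted signal dominates the synthetic data.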
Project description:Analysis of single-cell multiomics datasets is a novel topic and is considerably challenging because such datasets contain a large number of features with numerous missing values. In this study, we implemented a recently proposed tensor-decomposition (TD)-based unsupervised feature extraction (FE) technique to address this difficult problem. The technique can successfully integrate single-cell multiomics data composed of gene expression, DNA methylation, and accessibility. Although the last two have large dimensions, as many as ten million, and contain only a few percent of nonzero values, TD-based unsupervised FE can integrate the three omics datasets without filling in missing values. Together with UMAP, which is used frequently when embedding single-cell measurements into two-dimensional space, TD-based unsupervised FE can produce a two-dimensional embedding coincident with classification when integrating single-cell omics datasets. Genes selected by TD-based unsupervised FE are also significantly related to reasonable biological roles.
Project description:Although single-cell RNA sequencing (scRNA-seq) is a new and promising technology, the gene expression obtained for each cell is hard to interpret because information that labels individual cells is lacking. Because so little information is available, unsupervised methods, for example, t-distributed stochastic neighbor embedding and uniform manifold approximation and projection, are usually employed to obtain low-dimensional embeddings that help clarify cell-cell relationships. One possible drawback of this strategy is that the outcome is highly dependent upon the genes selected for clustering. To address this dependence, many methods for unsupervised gene selection have been proposed. In this study, a tensor decomposition (TD)-based unsupervised feature extraction (FE) was applied to the integration of two scRNA-seq expression profiles that measure human and mouse midbrain development. TD-based unsupervised FE could select not only coincident genes between human and mouse but also biologically reliable genes. Coincidence between the two species as well as biological reliability of the selected genes is increased compared with that obtained using principal component analysis (PCA)-based FE applied to the same data set in the previous study. Since PCA-based unsupervised FE outperformed three other popular unsupervised gene selection methods (highly variable genes, bimodal genes, and dpFeature), TD-based unsupervised FE can be expected to do so as well. In addition, 10 transcription factors (TFs) that might regulate the selected genes and might contribute to midbrain development were identified. These 10 TFs, BHLHE40, EGR1, GABPA, IRF3, PPARG, REST, RFX5, STAT3, TCF7L2, and ZBTB33, were previously reported to be related to brain functions and diseases. TD-based unsupervised FE is a promising method to integrate two scRNA-seq profiles effectively.
Project description:In the current era of big data, the amount of data available is continuously increasing. Both the number and types of samples, or features, are on the rise. The mixing of distinct features often makes interpretation more difficult. However, separate analysis of individual types requires subsequent integration. A tensor is a useful framework to deal with distinct types of features in an integrated manner without mixing them. On the other hand, tensor data is not easy to obtain since it requires the measurement of huge numbers of combinations of distinct features; if there are m kinds of features, each of which has N dimensions, the number of measurements needed is as many as N^m, which is often too large to measure. In this paper, I propose a new method in which a tensor is generated from individual features without combinatorial measurements, and the generated tensor is then decomposed back into matrices, with which unsupervised feature extraction is performed. In order to demonstrate the usefulness of the proposed strategy, it was applied to synthetic data, as well as three omics datasets. It outperformed other matrix-based methodologies.
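To make the counting argument concrete, here is a small sketch that compares the N^m combinatorial measurements with the m x N values needed when each kind of feature is measured on its own, and then builds a tensor from two individually measured matrices. The construction used, an entrywise product over a shared sample axis, is one possible way to generate such a tensor and is an assumption here; the abstract does not state the exact rule. All function names and sizes are hypothetical.

```python
import numpy as np

def measurements_needed(n_dims, m_kinds):
    """Return (combinatorial count N^m, individual count m * N)."""
    return n_dims ** m_kinds, n_dims * m_kinds

def product_tensor(a, b):
    """Assumed construction: T[i, k, j] = a[i, j] * b[k, j].

    a: features of kind 1 x samples, b: features of kind 2 x samples,
    with both matrices measured on the same samples j.
    """
    return np.einsum("ij,kj->ikj", a, b)

def mode_factor(tensor, mode):
    """Decompose back to a matrix: singular vectors of one mode unfolding."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    u, s, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u, s

full, individual = measurements_needed(100, 3)
print(full, individual)  # 1000000 300

rng = np.random.default_rng(1)
a = rng.normal(size=(50, 8))   # 50 features of kind 1, 8 samples
b = rng.normal(size=(40, 8))   # 40 features of kind 2, same 8 samples
t = product_tensor(a, b)       # no combinatorial measurement required
u_a, s_a = mode_factor(t, 0)   # factor matrix attributed to kind-1 features
print(t.shape, u_a.shape)
```

The columns of `u_a` play the role of the matrices that the tensor is decomposed back into; feature selection would then operate on one of those columns.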
Project description:BACKGROUND:Although post-traumatic stress disorder (PTSD) is primarily a mental disorder, it can cause additional symptoms that do not seem to be directly related to the central nervous system, which PTSD is assumed to directly affect. PTSD-mediated heart diseases are among such secondary disorders. In spite of the significant correlations between PTSD and heart diseases, spatial separation between the heart and brain (where PTSD is primarily active) prevents researchers from elucidating the mechanisms that bridge the two disorders. Our purpose was to identify genes linking PTSD and heart diseases. METHODS:In this study, gene expression profiles of various murine tissues observed under various types of stress or without stress were analyzed in an integrated manner using tensor decomposition (TD). RESULTS:Based upon the obtained features, ∼ 400 genes were identified as candidate genes that may mediate heart diseases associated with PTSD. Various gene enrichment analyses supported the biological reliability of the identified genes. Ten genes encoding protein-, DNA-, or mRNA-interacting proteins (ILF2, ILF3, ESR1, ESR2, RAD21, HTT, ATF2, NR3C1, TP53, and TP63) were found to be likely to regulate expression of most of these ∼ 400 genes and therefore are candidate primary genes that cause PTSD-mediated heart diseases. Approximately 400 genes in the heart were also found to be strongly affected by various drugs whose known adverse effects are related to heart diseases and/or fear memory conditioning; these data support the reliability of our findings. CONCLUSIONS:TD-based unsupervised feature extraction turned out to be a useful method for gene selection and successfully identified possible genes causing PTSD-mediated heart diseases.
Project description:Tensor decomposition- and principal component analysis-based unsupervised feature extraction were proposed almost 5 and 10 years ago, respectively. Although these methods have been successfully applied to a wide range of genome analyses, including drug repositioning, biomarker identification, and identification of disease-causing genes, some fundamental problems have been identified: the number of genes identified was too small to assume that there were no false negatives, and the histogram of P values derived was not fully coincident with the null hypothesis that principal component and singular value vectors follow the Gaussian distribution. Optimizing the standard deviation such that the histogram of P values is as coincident as possible with the null hypothesis results in an increase in the number and biological reliability of the selected genes. Our contribution is an improvement of these methods that allows them to select biologically more reasonable differentially expressed genes than state-of-the-art methods. Those methods must empirically assume negative binomial distributions and a dispersion relation in order to select more highly expressed genes in preference to less expressed ones, whereas the proposed methods achieve this without either assumption.
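The idea of tuning the standard deviation until the P-value histogram agrees with the null hypothesis can be illustrated with a simplified sketch. It scans candidate standard deviations, computes P-values under a Gaussian null, and keeps the candidate whose histogram is flattest once the smallest-P bin is set aside. The flatness criterion, the bin count, the candidate grid, and the synthetic vector are all assumptions made for illustration and are not taken from the study.

```python
import math
import numpy as np

def p_values(u, sigma):
    """Chi-squared (1 d.o.f.) upper-tail P-values for (u / sigma)^2."""
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u])

def flatness(p, n_bins=20):
    """Spread of histogram counts, ignoring the smallest-P bin.

    Assumption: genuinely significant features pile up in the first bin,
    so only the remaining bins are expected to be uniform under the null.
    """
    counts, _ = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    return counts[1:].std()

def optimize_sigma(u, candidates):
    """Pick the candidate sigma that yields the flattest P-value histogram."""
    scores = [flatness(p_values(u, s)) for s in candidates]
    return candidates[int(np.argmin(scores))]

# Synthetic singular vector: mostly null entries plus a few strong outliers
rng = np.random.default_rng(2)
u = rng.normal(scale=0.01, size=5000)
u[:10] = 0.3

naive_sigma = u.std()  # inflated by the outliers
best_sigma = optimize_sigma(u, np.linspace(0.004, 0.04, 37))
print(naive_sigma, best_sigma)
```

Because the outliers inflate the naive estimate, P-values computed with it are too conservative; the optimized value sits closer to the spread of the null entries, which is why more features can pass a significance threshold.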
Project description:Gene expression profiles of tissues treated with drugs have recently been used to infer clinical outcomes. Although this method is often successful from the application point of view, gene expression altered by drugs is rarely analyzed in detail, because of the extremely large number of genes involved. Here, we applied tensor decomposition (TD)-based unsupervised feature extraction (FE) to the gene expression profiles of 24 mouse tissues treated with 15 drugs. TD-based unsupervised FE enabled identification of the common effects of the 15 drugs, including an interesting universal feature: these drugs affect genes in a gene-group-wide manner that depends on three tissue types (neuronal, muscular, and gastroenterological). For each tissue group, TD-based unsupervised FE enabled identification of a few tens to a few hundreds of genes affected by the drug treatment. These genes are distinctly expressed between drug treatments and controls as well as between tissues in individual tissue groups and other tissues. We also validated the assignment of genes to individual tissue groups using multiple enrichment analyses. We conclude that TD-based unsupervised FE is a promising method for integrated analysis of gene expression profiles from multiple tissues treated with multiple drugs in a completely unsupervised manner.
Project description:Although hypoxia is a critical factor that can drive the progression of various diseases, the mechanism underlying hypoxia itself remains unclear. Recently, m6A has been proposed as an important factor driving hypoxia. Despite successful analyses, potential genes were not selected with statistical significance but were selected based solely on fold changes. Because the number of genes is large while the number of samples is small, it was impossible to select genes using conventional feature selection methods with statistical significance. In this study, we applied the recently proposed principal component analysis (PCA), tensor decomposition (TD), and kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) to a hypoxia data set. We found that PCA, TD, and KTD-based unsupervised FE could successfully identify a limited number of genes associated with altered gene expression and m6A profiles, as well as the enrichment of hypoxia-related biological terms, with improved statistical significance.
Project description:BACKGROUND:Although in silico drug discovery is necessary for drug development, the two major strategies, the structure-based and ligand-based approaches, have not been completely successful. Currently, the third approach, inference of drug candidates from gene expression profiles obtained from cells treated with the compounds under study, requires the use of a training dataset. Here, the purpose was to develop a new approach that does not require any pre-existing knowledge about drug-protein interactions; instead, these interactions are inferred by means of an integrated approach using gene expression profiles obtained from cells treated with the analysed compounds and the existing data describing gene-gene interactions. RESULTS:In the present study, using tensor decomposition-based unsupervised feature extraction, which represents an extension of the recently proposed principal-component analysis-based feature extraction, gene sets and compounds with a significant dose-dependent activity were screened without any training datasets. Next, after these results were combined with the data showing perturbations in single-gene expression profiles, genes targeted by the analysed compounds were inferred. The set of target genes thus identified was shown to significantly overlap with known target genes of the compounds under study. CONCLUSIONS:The method is specifically designed for large-scale datasets (including hundreds of treatments with compounds), not for conventional small-scale datasets. The obtained results indicate that two compounds that have not been extensively studied, WZ-3105 and CGP-60474, represent promising drug candidates targeting multiple cancers, including melanoma, adenocarcinoma, liver carcinoma, and breast, colon, and prostate cancers, which were analysed in this in silico study.
Project description:BACKGROUND:COVID-19 is a critical pandemic that has affected human communities worldwide, and there is an urgent need to develop effective drugs. Although there are a large number of candidate drug compounds that may be useful for treating COVID-19, the evaluation of these drugs is time-consuming and costly. Thus, screening to identify potentially effective drugs prior to experimental validation is necessary. METHOD:In this study, we applied the recently proposed method tensor decomposition (TD)-based unsupervised feature extraction (FE) to gene expression profiles of multiple lung cancer cell lines infected with severe acute respiratory syndrome coronavirus 2. We identified drug candidate compounds that significantly altered the expression of the 163 genes selected by TD-based unsupervised FE. RESULTS:Numerous drugs were successfully screened, including many known antiviral drug compounds such as C646, chelerythrine chloride, canertinib, BX-795, sorafenib, QL-X-138, radicicol, A-443654, CGP-60474, alvocidib, mitoxantrone, QL-XII-47, geldanamycin, fluticasone, atorvastatin, quercetin, motexafin gadolinium, trovafloxacin, doxycycline, meloxicam, gentamicin, and dibromochloromethane. The screen also identified ivermectin, which was first identified as an anti-parasite drug and has recently been included in clinical trials for SARS-CoV-2. CONCLUSIONS:The drugs screened using our strategy may be effective candidates for treating patients with COVID-19.