Robust auto-weighted multi-view subspace clustering with common subspace representation matrix.
ABSTRACT: In many computer vision and machine learning applications, the data sets distribute on certain low-dimensional subspaces. Subspace clustering is a powerful technology to find the underlying subspaces and cluster data points correctly. However, traditional subspace clustering methods can only be applied on data from one source, and how to extend these methods and enable the extensions to combine information from various data sources has become a hot area of research. Previous multi-view subspace methods aim to learn multiple subspace representation matrices simultaneously and these learning task for different views are treated equally. After obtaining representation matrices, they stack up the learned representation matrices as the common underlying subspace structure. However, for many problems, the importance of sources and the importance of features in one source both can be varied, which makes the previous approaches ineffective. In this paper, we propose a novel method called Robust Auto-weighted Multi-view Subspace Clustering (RAMSC). In our method, the weight for both the sources and features can be learned automatically via utilizing a novel trick and introducing a sparse norm. More importantly, the objective of our method is a common representation matrix which directly reflects the common underlying subspace structure. A new efficient algorithm is derived to solve the formulated objective with rigorous theoretical proof on its convergency. Extensive experimental results on five benchmark multi-view datasets well demonstrate that the proposed method consistently outperforms the state-of-the-art methods.
Project description:Integration of distinct biological data types could provide a comprehensive view of biological processes or complex diseases. The combinations of molecules responsible for different phenotypes form multiple embedded (expression) subspaces, thus identifying the intrinsic data structure is challenging by regular integration methods. In this paper, we propose a novel framework of "Multi-view Subspace Clustering Analysis (MSCA)," which could measure the local similarities of samples in the same subspace and obtain the global consensus sample patterns (structures) for multiple data types, thereby comprehensively capturing the underlying heterogeneity of samples. Applied to various synthetic datasets, MSCA performs effectively to recognize the predefined sample patterns, and is robust to data noises. Given a real biological dataset, i.e., Cancer Cell Line Encyclopedia (CCLE) data, MSCA successfully identifies cell clusters of common aberrations across cancer types. A remarkable superiority over the state-of-the-art methods, such as iClusterPlus, SNF, and ANF, has also been demonstrated in our simulation and case studies.
Project description:Discovering the common modules that are co-expressed across various stages can lead to an improved understanding of the underlying molecular mechanisms of cancers. There is a shortage of efficient tools for integrative analysis of gene expression and protein interaction networks for discovering common modules associated with cancer progression. To address this issue, we propose a novel regularized multi-view subspace clustering (rMV-spc) algorithm to obtain a representation matrix for each stage and a joint representation matrix that balances the agreement across various stages. To avoid the heterogeneity of data, the protein interaction network is incorporated into the objective of rMV-spc via regularization. Based on the interior point algorithm, we solve the optimization problem to obtain the common modules. By using artificial networks, we demonstrate that the proposed algorithm outperforms state-of-the-art methods in terms of accuracy. Furthermore, the rMV-spc discovers common modules in breast cancer networks based on the breast data, and these modules serve as biomarkers to predict stages of breast cancer. The proposed model and algorithm effectively integrate heterogeneous data for dynamic modules.
Project description:Hydrodynamic interactions play an important role in the dynamics of macromolecules. The most common way to take into account hydrodynamic effects in molecular simulations is in the context of a brownian dynamics simulation. However, the calculation of correlated brownian noise vectors in these simulations is computationally very demanding and alternative methods are desirable. This paper studies methods based on Krylov subspaces for computing brownian noise vectors. These methods are related to Chebyshev polynomial approximations, but do not require eigenvalue estimates. We show that only low accuracy is required in the brownian noise vectors to accurately compute values of dynamic and static properties of polymer and monodisperse suspension models. With this level of accuracy, the computational time of Krylov subspace methods scales very nearly as O(N(2)) for the number of particles N up to 10 000, which was the limit tested. The performance of the Krylov subspace methods, especially the "block" version, is slightly better than that of the Chebyshev method, even without taking into account the additional cost of eigenvalue estimates required by the latter. Furthermore, at N = 10,000, the Krylov subspace method is 13 times faster than the exact Cholesky method. Thus, Krylov subspace methods are recommended for performing large-scale brownian dynamics simulations with hydrodynamic interactions.
Project description:BACKGROUND:Gene Co-expression Network Analysis (GCNA) helps identify gene modules with potential biological functions and has become a popular method in bioinformatics and biomedical research. However, most current GCNA algorithms use correlation to build gene co-expression networks and identify modules with highly correlated genes. There is a need to look beyond correlation and identify gene modules using other similarity measures for finding novel biologically meaningful modules. RESULTS:We propose a new generalized gene co-expression analysis algorithm via subspace clustering that can identify biologically meaningful gene co-expression modules with genes that are not all highly correlated. We use low-rank representation to construct gene co-expression networks and local maximal quasi-clique merger to identify gene co-expression modules. We applied our method on three large microarray datasets and a single-cell RNA sequencing dataset. We demonstrate that our method can identify gene modules with different biological functions than current GCNA methods and find gene modules with prognostic values. CONCLUSIONS:The presented method takes advantage of subspace clustering to generate gene co-expression networks rather than using correlation as the similarity measure between genes. Our generalized GCNA method can provide new insights from gene expression datasets and serve as a complement to current GCNA algorithms.
Project description:Accurate identification of the cancer types is essential to cancer diagnoses and treatments. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly using original gene expression profiles remains challenging due to the intrinsic high-dimension feature and the small size of the data samples. We proposed a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. The SSC-LRR was tested on two separate benchmark datasets in control with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy 89.7 percent and a general correlation 0.920, which are 18.9 and 24.4 percent higher than that of the best control method respectively. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrated a new sensitive avenue to recognize cancer classifications from large-scale gene expression data.
Project description:MOTIVATION:Technologies that generate high-throughput omics data are flourishing, creating enormous, publicly available repositories of multi-omics data. As many data repositories continue to grow, there is an urgent need for computational methods that can leverage these data to create comprehensive clusters of patients with a given disease. RESULTS:Our proposed approach creates a patient-to-patient similarity graph for each data type as an intermediate representation of each omics data type and merges the graphs through subspace analysis on a Grassmann manifold. We hypothesize that this approach generates more informative clusters by preserving the complementary information from each level of omics data. We applied our approach to The Cancer Genome Atlas (TCGA) breast cancer dataset and show that by integrating gene expression, microRNA and DNA methylation data, our proposed method can produce clinically useful subtypes of breast cancer. We then investigate the molecular characteristics underlying these subtypes. We discover a highly expressed cluster of genes on chromosome 19p13 that strongly correlates with survival in TCGA breast cancer patients and validate these results in three additional breast cancer datasets. We also compare our approach with previous integrative clustering approaches and obtain comparable or superior results. AVAILABILITY AND IMPLEMENTATION:https://github.com/michaelsharpnack/GrassmannCluster. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:Identifying subspace gene clusters from the gene expression data is useful for discovering novel functional gene interactions. In this paper, we propose to use low-rank representation (LRR) to identify the subspace gene clusters from microarray data. LRR seeks the lowest-rank representation among all the candidates that can represent the genes as linear combinations of the bases in the dataset. The clusters can be extracted based on the block diagonal representation matrix obtained using LRR, and they can well capture the intrinsic patterns of genes with similar functions. Meanwhile, the parameter of LRR can balance the effect of noise so that the method is capable of extracting useful information from the data with high level of background noise. Compared with traditional methods, our approach can identify genes with similar functions yet without similar expression profiles. Also, it could assign one gene into different clusters. Moreover, our method is robust to the noise and can identify more biologically relevant gene clusters. When applied to three public datasets, the results show that the LRR based method is superior to existing methods for identifying subspace gene clusters.
Project description:BACKGROUND: Genomes of lower organisms have been observed with a large amount of horizontal gene transfers, which cause difficulties in their evolutionary study. Bacteriophage genomes are a typical example. One recent approach that addresses this problem is the unsupervised clustering of genomes based on gene order and genome position, which helps to reveal species relationships that may not be apparent from traditional phylogenetic methods. RESULTS: We propose the use of an overlapping subspace clustering algorithm for such genome classification problems. The advantage of subspace clustering over traditional clustering is that it can associate clusters with gene arrangement patterns, preserving genomic information in the clusters produced. Additionally, overlapping capability is desirable for the discovery of multiple conserved patterns within a single genome, such as those acquired from different species via horizontal gene transfers. The proposed method involves a novel strategy to vectorize genomes based on their gene distribution. A number of existing subspace clustering and biclustering algorithms were evaluated to identify the best framework upon which to develop our algorithm; we extended a generic subspace clustering algorithm called HARP to incorporate overlapping capability. The proposed algorithm was assessed and applied on bacteriophage genomes. The phage grouping results are consistent overall with the Phage Proteomic Tree and showed common genomic characteristics among the TP901-like, Sfi21-like and sk1-like phage groups. Among 441 phage genomes, we identified four significantly conserved distribution patterns structured by the terminase, portal, integrase, holin and lysin genes. We also observed a subgroup of Sfi21-like phages comprising a distinctive divergent genome organization and identified nine new phage members to the Sfi21-like genus: Staphylococcus 71, phiPVL108, Listeria A118, 2389, Lactobacillus phi AT3, A2, Clostridium phi3626, Geobacillus GBSV1, and Listeria monocytogenes PSA. CONCLUSION: The method described in this paper can assist evolutionary study through objectively classifying genomes based on their resemblance in gene order, gene content and gene positions. The method is suitable for application to genomes with high genetic exchange and various conserved gene arrangement, as demonstrated through our application on phages.
Project description:The rapid development of single-cell transcriptome sequencing technology has provided us with a cell-level perspective to study biological problems. Identification of cell types is one of the fundamental issues in computational analysis of single-cell data. Due to the large amount of noise from single-cell technologies and high dimension of expression profiles, traditional clustering methods are not so applicable to solve it. To address the problem, we have designed an adaptive sparse subspace clustering method, called AdaptiveSSC, to identify cell types. AdaptiveSSC is based on the assumption that the expression of cells with the same type lies in the same subspace; one cell can be expressed as a linear combination of the other cells. Moreover, it uses a data-driven adaptive sparse constraint to construct the similarity matrix. The comparison results of 10 scRNA-seq datasets show that AdaptiveSSC outperforms original subspace clustering and other state-of-art methods in most cases. Moreover, the learned similarity matrix can also be integrated with a modified t-SNE to obtain an improved visualization result.
Project description:Feature subspace learning plays a significant role in pattern recognition, and many efforts have been made to generate increasingly discriminative learning models. Recently, several discriminative feature learning methods based on a representation model have been proposed, which have not only attracted considerable attention but also achieved success in practical applications. Nevertheless, these methods for constructing the learning model simply depend on the class labels of the training instances and fail to consider the essential subspace structural information hidden in them. In this paper, we propose a robust feature subspace learning approach based on a low-rank representation. In our approach, the low-rank representation coefficients are considered as weights to construct the constraint item for feature learning, which can introduce a subspace structural similarity constraint in the proposed learning model for facilitating data adaptation and robustness. Moreover, by placing the subspace learning and low-rank representation into a unified framework, they can benefit each other during the iteration process to realize an overall optimum. To achieve extra discrimination, linear regression is also incorporated into our model to enforce the projection features around and close to their label-based centers. Furthermore, an iterative numerical scheme is designed to solve our proposed objective function and ensure convergence. Extensive experimental results obtained using several public image datasets demonstrate the advantages and effectiveness of our novel approach compared with those of the existing methods.