Independent Component Analysis (ICA) based-clustering of temporal RNA-seq data.
ABSTRACT: Gene expression time series (GETS) analysis aims to characterize sets of genes according to their longitudinal patterns of expression. Due to the large number of genes evaluated in GETS analysis, an useful strategy to summarize biological functional processes and regulatory mechanisms is through clustering of genes that present similar expression pattern over time. Traditional cluster methods usually ignore the challenges in GETS, such as the lack of data normality and small number of temporal observations. Independent Component Analysis (ICA) is a statistical procedure that uses a transformation to convert raw time series data into sets of values of independent variables, which can be used for cluster analysis to identify sets of genes with similar temporal expression patterns. ICA allows clustering small series of distribution-free data while accounting for the dependence between subsequent time-points. Using temporal simulated and real (four libraries of two pig breeds at 21, 40, 70 and 90 days of gestation) RNA-seq data set we present a methodology (ICAclust) that jointly considers independent components analysis (ICA) and a hierarchical method for clustering GETS. We compare ICAclust results with those obtained for K-means clustering. ICAclust presented, on average, an absolute gain of 5.15% over the best K-means scenario. Considering the worst scenario for K-means, the gain was of 84.85%, when compared with the best ICAclust result. For the real data set, genes were grouped into six distinct clusters with 89, 51, 153, 67, 40, and 58 genes each, respectively. In general, it can be observed that the 6 clusters presented very distinct expression patterns. Overall, the proposed two-step clustering method (ICAclust) performed well compared to K-means, a traditional method used for cluster analysis of temporal gene expression data. In ICAclust, genes with similar expression pattern over time were clustered together.
Project description:Cluster analyses are used to analyze microarray time-course data for gene discovery and pattern recognition. However, in general, these methods do not take advantage of the fact that time is a continuous variable, and existing clustering methods often group biologically unrelated genes together.We propose a quadratic regression method for identification of differentially expressed genes and classification of genes based on their temporal expression profiles for non-cyclic short time-course microarray data. This method treats time as a continuous variable, therefore preserves actual time information. We applied this method to a microarray time-course study of gene expression at short time intervals following deafferentation of olfactory receptor neurons. Nine regression patterns have been identified and shown to fit gene expression profiles better than k-means clusters. EASE analysis identified over-represented functional groups in each regression pattern and each k-means cluster, which further demonstrated that the regression method provided more biologically meaningful classifications of gene expression profiles than the k-means clustering method. Comparison with Peddada et al.'s order-restricted inference method showed that our method provides a different perspective on the temporal gene profiles. Reliability study indicates that regression patterns have the highest reliabilities.Our results demonstrate that the proposed quadratic regression method improves gene discovery and pattern recognition for non-cyclic short time-course microarray data. With a freely accessible Excel macro, investigators can readily apply this method to their microarray data.
Project description:MOTIVATION:Clustering algorithms like K-Means and standard Gaussian mixture models (GMM) fail to account for the structure of variability of replicated data or repeated measures over time. Additionally, a priori cluster number assumptions add an additional complexity to the process. Current methods to optimize cluster labels and number can be inaccurate or computationally intensive for temporal gene expression data with this additional variability. RESULTS:An extension to a model-based clustering algorithm is proposed using mixtures of mixed effects polynomial regression models and the EM algorithm with an entropy penalized log-likelihood function (EPEM). The EPEM is used to cluster temporal gene expression data with this additional variability. The addition of random effects in our model decreased the misclassification error when compared to mixtures of fixed effects models or other methods such as K-Means and GMM. Applying our method to microarray data from a fracture healing study revealed distinct temporal patterns of gene expression. AVAILABILITY AND IMPLEMENTATION:https://github.com/darlenelu72/EPEM-GMM. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:High throughput gene expression analysis using qPCR is commonly used to identify molecular markers of complex cellular processes. However, statistical analysis of multi-dimensional, temporal gene expression data is complicated by limited biological replicates and large number of measurements. Moreover, many available statistical tools for analysis of time series data assume that the data sequence is static and does not evolve over time. With this assumption, the parameters used to model the time series are fixed and thus, can be estimated by pooling data together. However, in many cases, dynamic processes of biological systems involve abrupt changes at unknown time points, making the assumption of stationary time series break down. We addressed this problem using a combination of statistical methods including hierarchical clustering, change point detection, and multiple testing. We applied this multi-step method to multi-dimensional, temporal gene expression data that resulted from our study of colony size-dependent neural cell differentiation of stem cells. The gene expression data were time series as the observations were recorded sequentially over time. Hierarchical clustering segregated the genes into three distinct clusters based on their temporal expression profiles; change point detection identified specific time points at which the entire dataset was divided into several homogenous subsets to allow a separate analysis of each subset; and multiple testing procedure identified the differentially expressed genes in each cluster within each subset of data. We established that our multi-step approach pinpoints specific sets of genes that underlie colony size-mediated neural differentiation of stem cells and demonstrated its advantages over conventional parametric and non-parametric tests that do not take into account temporal dynamics of the data. Importantly, our proposed approach is broadly applicable to any multivariate data sets of limited sample size from high throughput and high content screening such as in drug and biomarker discovery studies.
Project description:Taxonomic markers such as the 16S ribosomal RNA gene are widely used in microbial community analysis. A common first step in marker-gene analysis is grouping genes into clusters to reduce data sets to a more manageable size and potentially mitigate the effects of sequencing error. Instead of clustering based on sequence identity, marker-gene data sets collected over time can be clustered based on temporal correlation to reveal ecologically meaningful associations. We present Ananke, a free and open-source algorithm and software package that complements existing sequence-identity-based clustering approaches by clustering marker-gene data based on time-series profiles and provides interactive visualization of clusters, including highlighting of internal OTU inconsistencies. Ananke is able to cluster distinct temporal patterns from simulations of multiple ecological patterns, such as periodic seasonal dynamics and organism appearances/disappearances. We apply our algorithm to two longitudinal marker gene data sets: faecal communities from the human gut of an individual sampled over one year, and communities from a freshwater lake sampled over eleven years. Within the gut, the segregation of the bacterial community around a food-poisoning event was immediately clear. In the freshwater lake, we found that high sequence identity between marker genes does not guarantee similar temporal dynamics, and Ananke time-series clusters revealed patterns obscured by clustering based on sequence identity or taxonomy. Ananke is free and open-source software available at https://github.com/beiko-lab/ananke.
Project description:<h4>Background</h4>Clustering is a widely used technique for analysis of gene expression data. Most clustering methods group genes based on the distances, while few methods group genes according to the similarities of the distributions of the gene expression levels. Furthermore, as the biological annotation resources accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.<h4>Results</h4>In this paper, we proposed the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data, in which the gene expressions of individual genes are considered as the random variables following unique Weibull distributions. Our WDCM is based on the concept that the genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via the Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets from the lung cancer, B-cell follicular lymphoma and bladder carcinoma and obtained well-clustered results. We compared the performance of WDCM with k-means and Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of WDCM are higher than those of the other methods. We also utilized the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides the better clustering performance compared to k-means and SOM algorithms. The merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of WDCM is also evaluated on the incomplete data sets.<h4>Conclusions</h4>The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.
Project description:BACKGROUND: The notion that genes are non-randomly organized within the chromosomes of eukaryotic organisms has recently received strong experimental support. Clusters of co-expressed and co-localized genes have been recognized as playing key roles in a number of functional pathways and adaptive responses including organism development, differentiation, disease states and aging. The identification of genes arranged in close proximity with each other within a particular temporal and spatial transcriptional program is anticipated to unravel possible functional links and reciprocal interactions. RESULTS: We developed a novel software tool Gepoclu (Gene Positional Clustering) that automatically selects genes based on expression values from multiple sources, including microarray, EST and qRT-PCR, and performs positional clustering. Gepoclu provides expression-based gene selection from multiple experimental sources, position-based gene clustering and cluster visualization functionalities, all as parts of the same fully integrated, and interactive, package. This means rapid iterations while exploring for emergent behavior, and full programmability of the filtering and clustering steps. CONCLUSIONS: Gepoclu is a useful data-mining tool for exploring relationships among transcriptional data deriving form different sources. It provides an easy interactive environment for analyzing positional clustering behavior of co-expressed genes, and at the same time it is fully programmable, so that it can be customized and extended to support specific analysis needs.
Project description:Oat ranks sixth in world cereal production and has a higher content of health-promoting compounds compared with other cereals. However, there is neither a robust oat reference genome nor transcriptome. Using deeply sequenced full-length mRNA libraries of oat cultivar Ogle-C, a de novo high-quality and comprehensive oat seed transcriptome was assembled. With this reference transcriptome and QuantSeq 3' mRNA sequencing, gene expression was quantified during seed development from 22 diverse lines across six time points. Transcript expression showed higher correlations between adjacent time points. Based on differentially expressed genes, we identified 22 major temporal co-expression (TCoE) patterns of gene expression and revealed enriched gene ontology biological processes. Within each TCoE set, highly correlated transcripts, putatively commonly affected by genetic background, were clustered and termed genetic co-expression (GCoE) sets. Seventeen of the 22 TCoE sets had GCoE sets with median heritabilities higher than 0.50, and these heritability estimates were much higher than that estimated from permutation analysis, with no divergence observed in cluster sizes between permutation and non-permutation analyses. Linear regression between 634 metabolites from mature seeds and the PC1 score of each of the GCoE sets showed significantly lower p-values than permutation analysis. Temporal expression patterns of oat avenanthramides and lipid biosynthetic genes were concordant with previous studies of avenanthramide biosynthetic enzyme activity and lipid accumulation. This study expands our understanding of physiological processes that occur during oat seed maturation and provides plant breeders the means to change oat seed composition through targeted manipulation of key pathways.
Project description:BACKGROUND:Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involve decisions on how to; handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can have an effect on the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization. RESULT:We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. Each cluster analysis method differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses are evaluated using known classes, such as cancer types, and the adjusted Rand index. The performances of the different analyses vary between the data sets and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on the performance. In particular, gene selection is important and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieves the highest adjusted Rand. Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is preferable, in particular if the gene selection is successful. However, this is an area that needs to be studied further in order to draw any general conclusions. CONCLUSIONS:The choice of cluster analysis, and in particular gene selection, has a large impact on the ability to cluster individuals correctly based on expression profiles. Normalization has a positive effect, but the relative performance of different normalizations is an area that needs more research. In summary, although clustering, gene selection and normalization are considered standard methods in bioinformatics, our comprehensive analysis shows that selecting the right methods, and the right combinations of methods, is far from trivial and that much is still unexplored in what is considered to be the most basic analysis of genomic data.
Project description:As the number of publically available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a requirement to develop algorithms which are fast and can cluster extremely large datasets without affecting the cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and large distance matrices calculated, most of the algomerative clustering algorithms fail on large datasets (30,000 + genes/200 + arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies algorithm with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because, the first stage traverses the data in a single scan, the performance and speed increases substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements allowing us to cluster 44,460 genes without failure and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx.Supplementary data are available at Bioinformatics online.
Project description:BACKGROUND: Searching for similarities in a set of biological data is intrinsically difficult due to possible data points that should not be clustered, or that should group within several clusters. Under these hypotheses, hierarchical agglomerative clustering is not appropriate. Moreover, if the dataset is not known enough, like often is the case, supervised classification is not appropriate either. RESULTS: CLAG (for CLusters AGgregation) is an unsupervised non hierarchical clustering algorithm designed to cluster a large variety of biological data and to provide a clustered matrix and numerical values indicating cluster strength. CLAG clusterizes correlation matrices for residues in protein families, gene-expression and miRNA data related to various cancer types, sets of species described by multidimensional vectors of characters, binary matrices. It does not ask to all data points to cluster and it converges yielding the same result at each run. Its simplicity and speed allows it to run on reasonably large datasets. CONCLUSIONS: CLAG can be used to investigate the cluster structure present in biological datasets and to identify its underlying graph. It showed to be more informative and accurate than several known clustering methods, as hierarchical agglomerative clustering, k-means, fuzzy c-means, model-based clustering, affinity propagation clustering, and not to suffer of the convergence problem proper to this latter.