Project description:Single-cell Assay Transposase Accessible Chromatin sequencing (scATAC-seq) has been widely used in profiling genome-wide chromatin accessibility in thousands of individual cells. However, compared with single-cell RNA-seq, the peaks of scATAC-seq are much sparser due to the lower copy numbers (diploid in humans) and the inherent missing signals, which makes it more challenging to classify cell type based on specific expressed gene or other canonical markers. Here, we present svmATAC, a support vector machine (SVM)-based method for accurately identifying cell types in scATAC-seq datasets by enhancing peak signal strength and imputing signals through patterns of co-accessibility. We applied svmATAC to several scATAC-seq data from human immune cells, human hematopoietic system cells, and peripheral blood mononuclear cells. The benchmark results showed that svmATAC is free of literature-based markers and robust across datasets in different libraries and platforms. The source code of svmATAC is available at https://github.com/mrcuizhe/svmATAC under the MIT license.
Project description:Chromatin accessibility profiles at single cell resolution can reveal cell type-specific regulatory programs, help dissect highly specialized cell functions and trace cell origin and evolution. Accurate cell type assignment is critical for effectively gaining biological and pathological insights, but is difficult in scATAC-seq. Hence, by extensively reviewing the literature, we designed scATAC-Ref (https://bio.liclab.net/scATAC-Ref/), a manually curated scATAC-seq database aimed at providing a comprehensive, high-quality source of chromatin accessibility profiles with known cell labels across broad cell types. Currently, scATAC-Ref comprises 1 694 372 cells with known cell labels, across various biological conditions, >400 cell/tissue types and five species. We used uniform system environment and software parameters to perform comprehensive downstream analysis on these chromatin accessibility profiles with known labels, including gene activity score, TF enrichment score, differential chromatin accessibility regions, pathway/GO term enrichment analysis and co-accessibility interactions. The scATAC-Ref also provided a user-friendly interface to query, browse and visualize cell types of interest, thereby providing a valuable resource for exploring epigenetic regulation in different tissues and cell types.
Project description:BackgroundAdvances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data.ResultsWe designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3.ConclusionThe SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.
Project description:Most existing dimensionality reduction and clustering packages for single-cell RNA-seq (scRNA-seq) data deal with dropouts by heavy modeling and computational machinery. Here, we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm that uses a novel yet very simple implicit imputation approach to alleviate the impact of dropouts in scRNA-seq data in a principled manner. Using a range of simulated and real data, we show that CIDR improves the standard principal component analysis and outperforms the state-of-the-art methods, namely t-SNE, ZIFA, and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds when processing a data set of hundreds of cells and minutes for a data set of thousands of cells. CIDR can be downloaded at https://github.com/VCCRI/CIDR .
Project description:Cells whose accessibility landscape has been profiled with scATAC-seq cannot readily be annotated to a particular cell type. In fact, annotating cell-types in scATAC-seq data is a challenging task since, unlike in scRNA-seq data, we lack knowledge of 'marker regions' which could be used for cell-type annotation. Current annotation methods typically translate accessibility to expression space and rely on gene expression patterns. We propose a novel approach, scATAcat, that leverages characterized bulk ATAC-seq data as prototypes to annotate scATAC-seq data. To mitigate the inherent sparsity of single-cell data, we aggregate cells that belong to the same cluster and create pseudobulk. To demonstrate the feasibility of our approach we collected a number of datasets with respective annotations to quantify the results and evaluate performance for scATAcat. scATAcat is available as a python package at https://github.com/aybugealtay/scATAcat.
Project description:BackgroundTechnical improvement in ATAC-seq makes it possible for high throughput profiling the chromatin states of single cells. However, data from multiple sources frequently show strong technical variations, which is referred to as batch effects. In order to perform joint analysis across multiple datasets, specialized method is required to remove technical variations between datasets while keep biological information.ResultsHere we present an algorithm named epiConv to perform joint analyses on scATAC-seq datasets. We first show that epiConv better corrects batch effects and is less prone to over-fitting problem than existing methods on a collection of PBMC datasets. In a collection of mouse brain data, we show that epiConv is capable of aligning low-depth scATAC-Seq from co-assay data (simultaneous profiling of transcriptome and chromatin) onto high-quality ATAC-seq reference and increasing the resolution of chromatin profiles of co-assay data. Finally, we show that epiConv can be used to integrate cells from different biological conditions (T cells in normal vs. germ-free mouse; normal vs. malignant hematopoiesis), which reveals hidden cell populations that would otherwise be undetectable.ConclusionsIn this study, we introduce epiConv to integrate multiple scATAC-seq datasets and perform joint analysis on them. Through several case studies, we show that epiConv removes the batch effects and retains the biological signal. Moreover, joint analysis across multiple datasets improves the performance of clustering and differentially accessible peak calling, especially when the biological signal is weak in single dataset.
Project description:Bulk ATAC-seq assays have been used to map and profile the chromatin accessibility of regulatory elements such as enhancers, promoters, and insulators. This has provided great insight into the regulation of gene expression in many cell types in a variety of organisms. To date, ATAC-seq has most often been used to provide an average evaluation of chromatin accessibility in populations of cells. The development of a single cell approach (scATAC-seq) assay enables researchers to evaluate chromatin accessibility in individual cells and identify sub-groups in mixed populations of cells. To investigate the full potential of single-cell epigenomic data, we have comprehensively compared the information derived from bulk ATAC-seq and scATAC-seq in populations of cells. We found that the chromatin architecture signal is the same using bulk ATAC-seq and scATAC-seq to analyse aliquots of the same cell population. However, scATAC-seq provides substantially higher data quality compared to bulk ATAC-seq improving the sensitivity to detect relatively weak, but functionally important ATAC-seq signals. Furthermore, we found that scATAC-seq identified differences in what was previously assumed to be a homogenous population of cells. Finally, we determined the number of cells required to generate aggregated open chromatin profiles from single cells and to identify biologically meaningful clusters after pseudo-bulking of data. This study illustrates the added value of using scATAC-seq rather than bulk ATAC-seq in evaluating both homogeneous and heterogeneous populations of cells. This paper provides a comprehensive guide on the benefits of using scATAC-seq data to study gene regulation.
Project description:With the advent of single-cell RNA sequencing (scRNA-seq), one major challenging is the so-called 'dropout' events that distort gene expression and remarkably influence downstream analysis in single-cell transcriptome. To address this issue, much effort has been done and several scRNA-seq imputation methods were developed with two categories: model-based and deep learning-based. However, comprehensively and systematically comparing existing methods are still lacking. In this work, we use six simulated and two real scRNA-seq datasets to comprehensively evaluate and compare a total of 12 available imputation methods from the following four aspects: (i) gene expression recovering, (ii) cell clustering, (iii) gene differential expression, and (iv) cellular trajectory reconstruction. We demonstrate that deep learning-based approaches generally exhibit better overall performance than model-based approaches under major benchmarking comparison, indicating the power of deep learning for imputation. Importantly, we built scIMC (single-cell Imputation Methods Comparison platform), the first online platform that integrates all available state-of-the-art imputation methods for benchmarking comparison and visualization analysis, which is expected to be a convenient and useful tool for researchers of interest. It is now freely accessible via https://server.wei-group.net/scIMC/.
Project description:Single-cell RNA sequencing (scRNA-seq) is a cutting-edge technique in molecular biology and genomics, revealing the cellular heterogeneity. However, scRNA-seq data often suffer from dropout events, meaning that certain genes exhibit very low or even zero expression levels due to technical limitations. Existing imputation methods for dropout events lack comprehensive evaluations in downstream analyses and do not demonstrate robustness across various scenarios. In response to this challenge, we propose a consensus clustering-based imputation (CCI) method. CCI performs clustering on each subset of data sampling across genes and summarizes clustering outcomes to define cellular similarities. CCI leverages the information from similar cells and employs the similarities to impute gene expression levels. Our comprehensive evaluations demonstrate that CCI not only reconstructs the original data pattern, but also improves the performance of downstream analyses. CCI outperforms existing methods for data imputation under different scenarios, exhibiting accuracy, robustness, and generalization.
Project description:The past two decades of microRNA (miRNA) research has solidified the role of these small non-coding RNAs as key regulators of many biological processes and promising biomarkers for disease. The concurrent development in high-throughput profiling technology has further advanced our understanding of the impact of their dysregulation on a global scale. Currently, next-generation sequencing is the platform of choice for the discovery and quantification of miRNAs. Despite this, there is no clear consensus on how the data should be preprocessed before conducting downstream analyses. Often overlooked, data preprocessing is an essential step in data analysis: the presence of unreliable features and noise can affect the conclusions drawn from downstream analyses. Using a spike-in dilution study, we evaluated the effects of several general-purpose aligners (BWA, Bowtie, Bowtie 2 and Novoalign), and normalization methods (counts-per-million, total count scaling, upper quartile scaling, Trimmed Mean of M, DESeq, linear regression, cyclic loess and quantile) with respect to the final miRNA count data distribution, variance, bias and accuracy of differential expression analysis. We make practical recommendations on the optimal preprocessing methods for the extraction and interpretation of miRNA count data from small RNA-sequencing experiments.