Fast analysis of scATAC-seq data using a predefined set of genomic regions.
ABSTRACT: Background: Analysis of scATAC-seq data has been recently scaled to thousands of cells. While processing of other types of single cell data was boosted by the implementation of alignment-free techniques, pipelines available to process scATAC-seq data still require large computational resources. We propose here an approach based on pseudoalignment, which reduces the execution times and hardware needs at little cost for precision. Methods: Public data for 10k PBMC were downloaded from 10x Genomics web site. Reads were aligned to various references derived from DNase I Hypersensitive Sites (DHS) using kallisto and quantified with bustools. We compared our results with the ones publicly available derived by cellranger-atac. Results: We found that kallisto does not introduce biases in quantification of known peaks and cells groups are identified in a consistent way. We also found that cell identification is robust when analysis is performed using DHS-derived reference in place of de novo identification of ATAC peaks. Lastly, we found that our approach is suitable for reliable quantification of gene activity based on scATAC-seq signal, thus allows for efficient labelling of cell groups based on marker genes. Conclusions: Analysis of scATAC-seq data by means of kallisto produces results in line with standard pipelines while being considerably faster; using a set of known DHS sites as reference does not affect the ability to characterize the cell populations.
Project description:Single-cell ATAC-seq (scATAC) yields sparse data that make conventional analysis challenging. We developed chromVAR (http://www.github.com/GreenleafLab/chromVAR), an R package for analyzing sparse chromatin-accessibility data by estimating gain or loss of accessibility within peaks sharing the same motif or annotation while controlling for technical biases. chromVAR enables accurate clustering of scATAC-seq profiles and characterization of known and de novo sequence motifs associated with variation in chromatin accessibility.
Project description:We present Model-based AnalysEs of Transcriptome and RegulOme (MAESTRO), a comprehensive open-source computational workflow ( http://github.com/liulab-dfci/MAESTRO ) for the integrative analyses of single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data from multiple platforms. MAESTRO provides functions for pre-processing, alignment, quality control, expression and chromatin accessibility quantification, clustering, differential analysis, and annotation. By modeling gene regulatory potential from chromatin accessibilities at the single-cell level, MAESTRO outperforms the existing methods for integrating the cell clusters between scRNA-seq and scATAC-seq. Furthermore, MAESTRO supports automatic cell-type annotation using predefined cell type marker genes and identifies driver regulators from differential scRNA-seq genes and scATAC-seq peaks.
Project description:BACKGROUND:Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. RESULTS:In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. CONCLUSIONS:Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
Project description:Single-cell ATAC-seq (scATAC-seq) profiles the chromatin accessibility landscape at single cell level, thus revealing cell-to-cell variability in gene regulation. However, the high dimensionality and sparsity of scATAC-seq data often complicate the analysis. Here, we introduce a method for analyzing scATAC-seq data, called Single-Cell ATAC-seq analysis via Latent feature Extraction (SCALE). SCALE combines a deep generative framework and a probabilistic Gaussian Mixture Model to learn latent features that accurately characterize scATAC-seq data. We validate SCALE on datasets generated on different platforms with different protocols, and having different overall data qualities. SCALE substantially outperforms the other tools in all aspects of scATAC-seq data analysis, including visualization, clustering, and denoising and imputation. Importantly, SCALE also generates interpretable features that directly link to cell populations, and can potentially reveal batch effects in scATAC-seq experiments.
Project description:Alignment of scRNA-Seq data are the first and one of the most critical steps of the scRNA-Seq analysis workflow, and thus the choice of proper aligners is of paramount importance. Recently, STAR an alignment method and Kallisto a pseudoalignment method have both gained a vast amount of popularity in the single cell sequencing field. However, an unbiased third-party comparison of these two methods in scRNA-Seq is lacking. Here we conduct a systematic comparison of them on a variety of Drop-seq, Fluidigm and 10x genomics data, from the aspects of gene abundance, alignment accuracy, as well as computational speed and memory use. We observe that STAR globally produces more genes and higher gene-expression values, compared to Kallisto, as well as Bowtie2, another popular alignment method for bulk RNA-Seq. STAR also yields higher correlations of the Gini index for the genes with RNA-FISH validation results. Using 10x genomics PBMC 3K scRNA-Seq and mouse cortex single nuclei RNA-Seq data, STAR shows similar or better cell-type annotation results, by detecting a larger subset of known gene markers. However, the gain of accuracy and gene abundance of STAR alignment comes with the price of significantly slower computation time (4 folds) and more memory (7.7 folds), compared to Kallisto.
Project description:SUMMARY:Single-cell assay of transposase-accessible chromatin followed by sequencing (scATAC-seq) is an emerging new technology for the study of gene regulation with single-cell resolution. The data from scATAC-seq are unique-sparse, binary and highly variable even within the same cell type. As such, neither methods developed for bulk ATAC-seq nor single-cell RNA-seq data are appropriate. Here, we present Destin, a bioinformatic and statistical framework for comprehensive scATAC-seq data analysis. Destin performs cell-type clustering via weighted principle component analysis, weighting accessible chromatin regions by existing genomic annotations and publicly available regulomic datasets. The weights and additional tuning parameters are determined via model-based likelihood. We evaluated the performance of Destin using downsampled bulk ATAC-seq data of purified samples and scATAC-seq data from seven diverse experiments. Compared to existing methods, Destin was shown to outperform across all datasets and platforms. For demonstration, we further applied Destin to 2088 adult mouse forebrain cells and identified cell-type-specific association of previously reported schizophrenia GWAS loci. AVAILABILITY AND IMPLEMENTATION:Destin toolkit is freely available as an R package at https://github.com/urrutiag/destin. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:Conventional high-throughput genomic technologies for mapping regulatory element activities in bulk samples such as ChIP-seq, DNase-seq and FAIRE-seq cannot analyze samples with small numbers of cells. The recently developed low-input and single-cell regulome mapping technologies such as ATAC-seq and single-cell ATAC-seq (scATAC-seq) allow analyses of small-cell-number and single-cell samples, but their signals remain highly discrete or noisy. Compared to these regulome mapping technologies, transcriptome profiling by RNA-seq is more widely used. Transcriptome data in single-cell and small-cell-number samples are more continuous and often less noisy. Here, we show that one can globally predict chromatin accessibility and infer regulatory element activities using RNA-seq. Genome-wide chromatin accessibility predicted by RNA-seq from 30 cells can offer better accuracy than ATAC-seq from 500 cells. Predictions based on single-cell RNA-seq (scRNA-seq) can more accurately reconstruct bulk chromatin accessibility than using scATAC-seq. Integrating ATAC-seq with predictions from RNA-seq increases the power and value of both methods. Thus, transcriptome-based prediction provides a new tool for decoding gene regulatory circuitry in samples with limited cell numbers.
Project description:ATAC-seq is a recently developed method to identify the areas of open chromatin in a cell. These regions usually correspond to active regulatory elements and their location profile is unique to a given cell type. When done at single-cell resolution, ATAC-seq provides an insight into the cell-to-cell variability that emerges from otherwise identical DNA sequences by identifying the variability in the genomic location of open chromatin sites in each of the cells. This paper presents Scasat (single-cell ATAC-seq analysis tool), a complete pipeline to process scATAC-seq data with simple steps. Scasat treats the data as binary and applies statistical methods that are especially suitable for binary data. The pipeline is developed in a Jupyter notebook environment that holds the executable code along with the necessary description and results. It is robust, flexible, interactive and easy to extend. Within Scasat we developed a novel differential accessibility analysis method based on information gain to identify the peaks that are unique to a cell. The results from Scasat showed that open chromatin locations corresponding to potential regulatory elements can account for cellular heterogeneity and can identify regulatory regions that separates cells from a complex population.
Project description:Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) is a powerful genomic technology that is used for the global mapping and analysis of open chromatin regions. However, for users to process and analyze such data they either have to use a number of complicated bioinformatic tools or attempt to use the currently available ATAC-seq analysis software, which are not very user friendly and lack visualization of the ATAC-seq results. Because of these issues, biologists with minimal bioinformatics background who wish to process and analyze their own ATAC-seq data by themselves will find these tasks difficult and ultimately will need to seek help from bioinformatics experts. Moreover, none of the available tools provide complete solution for ATAC-seq data analysis. Therefore, to enable non-programming researchers to analyze ATAC-seq data on their own, we developed a tool called Graphical User interface for the Analysis and Visualization of ATAC-seq data (GUAVA). GUAVA is a standalone software that provides users with a seamless solution from beginning to end including adapter trimming, read mapping, the identification and differential analysis of ATAC-seq peaks, functional annotation, and the visualization of ATAC-seq results. We believe GUAVA will be a highly useful and time-saving tool for analyzing ATAC-seq data for biologists with minimal or no bioinformatics background. Since GUAVA can also operate through command-line, it can easily be integrated into existing pipelines, thus providing flexibility to users with computational experience.
Project description:Characterizing epigenetic heterogeneity at the cellular level is a critical problem in the modern genomics era. Assays such as single cell ATAC-seq (scATAC-seq) offer an opportunity to interrogate cellular level epigenetic heterogeneity through patterns of variability in open chromatin. However, these assays exhibit technical variability that complicates clear classification and cell type identification in heterogeneous populations. We present scABC, an R package for the unsupervised clustering of single-cell epigenetic data, to classify scATAC-seq data and discover regions of open chromatin specific to cell identity.