Protocol for Identification and Removal of Doublets with DoubletDecon.
ABSTRACT: Retention of multiplet captures in single-cell RNA sequencing (scRNA-seq) data can hinder identification of discrete or transitional cell populations and associated marker genes. To overcome this challenge, we created DoubletDecon to identify and remove doublets, multiplets of two cells, by using a combination of deconvolution to identify putative doublets and analyses of unique gene expression. Here, we provide the protocol for running DoubletDecon on scRNA-seq data. For complete details on the use and execution of this protocol, please refer to DePasquale et al. (2019).
Project description:Methods for single-cell RNA sequencing (scRNA-seq) have greatly advanced in recent years. While droplet- and well-based methods have increased the capture frequency of cells for scRNA-seq, these technologies readily produce technical artifacts, such as doublet cell captures. Doublets occurring between distinct cell types can appear as hybrid scRNA-seq profiles, but do not have distinct transcriptomes from individual cell states. We introduce DoubletDecon, an approach that detects doublets with a combination of deconvolution analyses and the identification of unique cell-state gene expression. We demonstrate the ability of DoubletDecon to identify synthetic, mixed-species, genetic, and cell-hashing cell doublets from scRNA-seq datasets of varying cellular complexity with a high sensitivity relative to alternative approaches. Importantly, this algorithm prevents the prediction of valid mixed-lineage and transitional cell states as doublets by considering their unique gene expression. DoubletDecon has an easy-to-use graphical user interface and is compatible with diverse species and unsupervised population detection algorithms.
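The deconvolution idea in the description above can be illustrated with a minimal Python sketch. It is not DoubletDecon's implementation (the package itself is written in R): the two cluster centroids, the example cells, the function names, and the 0.3 cutoff are all hypothetical, and the second stage that rescues clusters with unique gene expression is omitted. The sketch only shows that a cell best explained by a non-negative mix of two cluster centroids, rather than by one, can be flagged as a putative doublet.

```python
# Illustrative sketch only; all numbers and names below are made up.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(x, c1, c2, a, b):
    """Squared error of approximating x by a*c1 + b*c2."""
    return sum((xi - a * p - b * q) ** 2 for xi, p, q in zip(x, c1, c2))

def deconvolve_pair(x, c1, c2):
    """Non-negative least squares for x ~ a*c1 + b*c2 (two non-zero references)."""
    g11, g12, g22 = dot(c1, c1), dot(c1, c2), dot(c2, c2)
    r1, r2 = dot(c1, x), dot(c2, x)
    # Boundary solutions: only one reference is allowed to contribute.
    candidates = [(max(0.0, r1 / g11), 0.0), (0.0, max(0.0, r2 / g22))]
    det = g11 * g22 - g12 * g12
    if det > 1e-12:
        # Unconstrained solution of the 2x2 normal equations.
        a = (r1 * g22 - r2 * g12) / det
        b = (r2 * g11 - r1 * g12) / det
        if a >= 0 and b >= 0:
            candidates.append((a, b))
    return min(candidates, key=lambda ab: residual(x, c1, c2, *ab))

def mixing_fractions(x, c1, c2):
    """Coefficients rescaled to sum to 1, i.e. each centroid's share of the cell."""
    a, b = deconvolve_pair(x, c1, c2)
    total = a + b
    return (a / total, b / total) if total > 0 else (0.0, 0.0)

def is_putative_doublet(x, c1, c2, min_minor_fraction=0.3):
    """Flag a cell when the smaller share exceeds a (hypothetical) cutoff."""
    return min(mixing_fractions(x, c1, c2)) > min_minor_fraction

# Hypothetical centroids for two clusters over four genes.
CENTROID_A = [10, 8, 0, 1]
CENTROID_B = [0, 1, 9, 12]

singlet = [11, 7, 0, 1]    # resembles cluster A only
doublet = [10, 9, 9, 13]   # sum of the two centroids

singlet_fractions = mixing_fractions(singlet, CENTROID_A, CENTROID_B)
doublet_fractions = mixing_fractions(doublet, CENTROID_A, CENTROID_B)
print(singlet_fractions, doublet_fractions)
```

In this toy setting the singlet is attributed entirely to one centroid, while the summed profile splits evenly between the two. A valid transitional state could look similar at this stage, which is why the description stresses the additional check for unique gene expression.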
Project description:Single-cell RNA-sequencing has become a widely used, powerful approach for studying cell populations. However, these methods often generate multiplet artifacts, where two or more cells receive the same barcode, resulting in a hybrid transcriptome. In most experiments, multiplets account for several percent of transcriptomes and can confound downstream data analysis. Here, we present Single-Cell Remover of Doublets (Scrublet), a framework for predicting the impact of multiplets in a given analysis and identifying problematic multiplets. Scrublet avoids the need for expert knowledge or cell clustering by simulating multiplets from the data and building a nearest neighbor classifier. To demonstrate the utility of this approach, we test Scrublet on several datasets that include independent knowledge of cell multiplets. Scrublet is freely available for download at github.com/AllonKleinLab/scrublet.
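The simulate-then-classify idea described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not Scrublet's code or API: the function names, the toy count matrix, the choice of `k`, and the simple fraction-based score are assumptions, and the sketch leaves out any dimensionality reduction or score calibration that a real tool may apply.

```python
import math
import random

def normalize(counts):
    """Scale a count vector to fractions so cells of different depth are comparable."""
    total = sum(counts)
    return [c / total for c in counts]

def simulate_doublets(cells, n_sim, rng):
    """Create synthetic doublets by adding the raw counts of two randomly chosen cells."""
    sims = []
    for _ in range(n_sim):
        i, j = rng.sample(range(len(cells)), 2)
        sims.append([a + b for a, b in zip(cells[i], cells[j])])
    return sims

def doublet_scores(cells, n_sim=None, k=5, seed=0):
    """Score each observed cell by the fraction of its k nearest neighbours that are simulated."""
    rng = random.Random(seed)
    n_sim = n_sim or 2 * len(cells)
    observed = [normalize(c) for c in cells]
    simulated = [normalize(c) for c in simulate_doublets(cells, n_sim, rng)]
    points = [(p, False) for p in observed] + [(p, True) for p in simulated]
    scores = []
    for idx, cell in enumerate(observed):
        # Distances to every other point (observed or simulated), nearest first.
        neighbours = sorted(
            (math.dist(cell, p), is_sim)
            for j, (p, is_sim) in enumerate(points)
            if j != idx
        )[:k]
        scores.append(sum(is_sim for _, is_sim in neighbours) / k)
    return scores

# Toy data: two hypothetical cell types over three genes, plus one real doublet.
type_a_cells = [[90 + i, 10, 2] for i in range(20)]
type_b_cells = [[2, 10, 90 + i] for i in range(20)]
true_doublet = [a + b for a, b in zip(type_a_cells[0], type_b_cells[0])]
cells = type_a_cells + type_b_cells + [true_doublet]

scores = doublet_scores(cells, k=5, seed=0)
print("doublet score:", scores[-1])
print("mean singlet score:", sum(scores[:-1]) / len(scores[:-1]))
```

Because simulated doublets between the two types land between the clusters, the real doublet is surrounded by simulated neighbours and scores higher than the singlets on average. Simulated doublets between cells of the same type sit among the singlets, so singlet scores are not zero; this is one reason a threshold has to be chosen with care.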
Project description:BACKGROUND: Evidence strongly suggests that spontaneous doublet mutations in normal mouse tissues generally arise from chronocoordinate events. These chronocoordinate mutations sometimes reflect "mutation showers", which are multiple chronocoordinate mutations spanning many kilobases. However, little is known about mutagenesis of doublet and multiplet mutations (domuplets) in human cancer. Lung cancer accounts for about 25% of all cancer deaths. Herein, we analyze the epidemiology of domuplets in the EGFR and TP53 genes in lung cancer. The EGFR gene is an oncogene in which doublets are generally driver plus driver mutations, while the TP53 gene is a tumor suppressor gene with a more typical situation in which doublets derive from a driver and passenger mutation. METHODOLOGY/PRINCIPAL FINDINGS: EGFR mutations identified by sequencing were collected from 66 published papers and our updated EGFR mutation database (www.egfr.org). TP53 mutations were collected from IARC version 12 (www-p53.iarc.fr). For EGFR and TP53 doublets, no clearly significant differences in race, ethnicity, gender and smoking status were observed. Doublets in the EGFR and TP53 genes in human lung cancer are elevated about eight- and three-fold, respectively, relative to spontaneous doublets in mouse (6% and 2.3% versus 0.7%). CONCLUSIONS/SIGNIFICANCE: Although no one characteristic is definitive, the aggregate properties of doublet and multiplet mutations in lung cancer are consistent with a subset derived from chronocoordinate events in the EGFR gene: i) the eight frameshift doublets (present in 0.5% of all patients with EGFR mutations) are clustered and produce a net in-frame change; ii) about 32% of doublets are very closely spaced (≤30 nt); and iii) multiplets contain two or more closely spaced mutations. TP53 mutations in lung cancer are very closely spaced (≤30 nt) in 33% of doublets, and multiplets generally contain two or more very closely spaced mutations. Work in model systems is necessary to confirm the significance of chronocoordinate events in lung and other cancers.
Project description:Single-cell RNA sequencing (scRNA-seq) is a powerful technique for deconvoluting and clustering thousands of otherwise intermingled cells based on their gene expression. Here, we present a complete protocol for the unbiased evaluation of regenerating murine skeletal muscle using scRNA-seq. The skeletal muscle is unique in its cellular composition as being primarily multinucleated muscle cells (myofibers). This protocol focuses on isolating mononuclear cells from muscle for subsequent scRNA-seq analysis and can be modified to assess cell populations in other tissues of interest. For complete details on the use and execution of this protocol, please refer to Liu et al. (2015) and Oprescu et al. (2020).
Project description:Single-cell RNA sequencing (scRNA-seq) data are commonly affected by technical artifacts known as "doublets," which limit cell throughput and lead to spurious biological conclusions. Here, we present a computational doublet detection tool, DoubletFinder, that identifies doublets using only gene expression data. DoubletFinder predicts doublets according to each real cell's proximity in gene expression space to artificial doublets created by averaging the transcriptional profile of randomly chosen cell pairs. We first use scRNA-seq datasets where the identity of doublets is known to show that DoubletFinder identifies doublets formed from transcriptionally distinct cells. When these doublets are removed, the identification of differentially expressed genes is enhanced. Second, we provide a method for estimating DoubletFinder input parameters, allowing its application across scRNA-seq datasets with diverse distributions of cell types. Lastly, we present "best practices" for DoubletFinder applications and illustrate that DoubletFinder is insensitive to an experimentally validated kidney cell type with "hybrid" expression features.
Project description:The development of high-throughput single-cell RNA sequencing (scRNA-seq) has enabled access to information about gene expression in individual cells and insights into new biological areas. Although the interest in scRNA-seq has rapidly grown in recent years, the existing methods are plagued by many challenges when performing scRNA-seq on multiple samples. To simultaneously analyze multiple samples with scRNA-seq, we developed a universal sample barcoding method through transient transfection with short barcode oligonucleotides. By conducting a species-mixing experiment, we have validated the accuracy of our method and confirmed the ability to identify multiplets and negatives. Samples from a 48-plex drug treatment experiment were pooled and analyzed by a single run of Drop-Seq. This revealed unique transcriptome responses for each drug and target-specific gene expression signatures at the single-cell level. Our cost-effective method is widely applicable for the single-cell profiling of multiple experimental conditions, enabling the widespread adoption of scRNA-seq for various applications.
Project description:The development of hormone-mediated Ca2+ signals was analysed in polarized doublets, triplets and quadruplets of rat hepatocytes by video imaging of fura2 fluorescence. These multicellular models showed dilated bile canaliculi, and gap junctions were observed by using an anti-connexin-32 antibody. They also showed highly organized Ca2+ signals in response to vasopressin or noradrenaline. Surprisingly, the primary rises in intracellular Ca2+ concentration ([Ca2+]i) did not start randomly from any cell of the multiplet. They originated invariably in the same hepatocyte (first-responding cell) and were then propagated in a sequential manner to the nearest connected cells (cell 2, then 3, in triplets; cell 2, 3, then 4 in quadruplets). The sequential activation of the cells appeared to be an intrinsic property of multiplets of rat hepatocytes. (1) In the continued presence of hormones, the same sequential order was observed up to six times, i.e. at each train of oscillations occurring between the cells. (2) The order of [Ca2+]i responses was modified neither by the repeated addition of hormones nor by the hormonal dose. (3) The mechanical disruption of an intermediate cell slowed down the speed of the propagation, suggesting a role of gap junctions in the rapidity of the sequential activation of cells. (4) The same multiplet could have a different first-responding cell for vasopressin or noradrenaline, suggesting a role of the hormonal receptors in the sequentiality of cell responses. It is postulated that a functional heterogeneity of hormonal receptors, and the presence of functional gap junctions, are involved in the existence of sequentially ordered hormone-mediated [Ca2+]i rises in the multiplets of rat hepatocytes.
Project description:Many computational methods have been developed to infer cell type proportions from bulk transcriptomics data. However, an evaluation of the impact of data transformation, pre-processing, marker selection, cell type composition and choice of methodology on the deconvolution results is still lacking. Using five single-cell RNA-sequencing (scRNA-seq) datasets, we generate pseudo-bulk mixtures to evaluate the combined impact of these factors. Both bulk deconvolution methodologies and those that use scRNA-seq data as reference perform best when applied to data in linear scale and the choice of normalization has a dramatic impact on some, but not all methods. Overall, methods that use scRNA-seq data have comparable performance to the best performing bulk methods whereas semi-supervised approaches show higher error values. Moreover, failure to include cell types in the reference that are present in a mixture leads to substantially worse results, regardless of the previous choices. Altogether, we evaluate the combined impact of factors affecting the deconvolution task across different datasets and propose general guidelines to maximize its performance.
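A small Python sketch can illustrate the pseudo-bulk setup and one plausible reason why the data scale matters. It is not the benchmark's pipeline: the two cell-type profiles, the mixing proportion, and the one-parameter least-squares estimator are hypothetical. The point it shows is that a mixture is additive in linear scale, whereas the logarithm of a mixture is not the mixture of the logarithms.

```python
import math

def pseudo_bulk(profiles, proportions):
    """Mix cell-type expression profiles (linear scale) with known proportions."""
    n_genes = len(profiles[0])
    return [
        sum(p * prof[g] for p, prof in zip(proportions, profiles))
        for g in range(n_genes)
    ]

def estimate_fraction(mix, type_a, type_b):
    """Least-squares estimate of p in mix ~ p*type_a + (1 - p)*type_b, clipped to [0, 1]."""
    diff = [a - b for a, b in zip(type_a, type_b)]
    num = sum((m - b) * d for m, b, d in zip(mix, type_b, diff))
    den = sum(d * d for d in diff)
    return min(1.0, max(0.0, num / den))

def log1p_all(values):
    """Apply log(1 + x) to every entry, a common log-scale transform."""
    return [math.log1p(v) for v in values]

# Hypothetical mean expression of two cell types over four genes.
TYPE_A = [100.0, 5.0, 50.0, 0.0]
TYPE_B = [2.0, 80.0, 50.0, 30.0]
TRUE_P = 0.3  # assumed proportion of type A in the mixture

mix = pseudo_bulk([TYPE_A, TYPE_B], [TRUE_P, 1.0 - TRUE_P])

p_linear = estimate_fraction(mix, TYPE_A, TYPE_B)
p_log = estimate_fraction(log1p_all(mix), log1p_all(TYPE_A), log1p_all(TYPE_B))
print("linear-scale estimate:", p_linear)
print("log-scale estimate:", p_log)
```

With these made-up numbers the linear-scale estimate recovers the mixing proportion, while the same estimator applied to log-transformed data is biased. Real deconvolution methods differ in many other ways, so this is an intuition rather than an explanation of the reported results.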
Project description:Complex interactions between different host immune cell types can determine the outcome of pathogen infections. Advances in single cell RNA-sequencing (scRNA-seq) allow probing of these immune interactions, such as cell-type compositions, which are then interpreted by deconvolution algorithms using bulk RNA-seq measurements. However, not all aspects of immune surveillance are represented by current algorithms. Here, using scRNA-seq of human peripheral blood cells infected with Salmonella, we develop a deconvolution algorithm for inferring cell-type specific infection responses from bulk measurements. We apply our dynamic deconvolution algorithm to a cohort of healthy individuals challenged ex vivo with Salmonella, and to three cohorts of tuberculosis patients during different stages of disease. We reveal cell-type specific immune responses associated not only with ex vivo infection phenotype but also with clinical disease stage. We propose that our approach provides a predictive power to identify risk for disease, and human infection outcomes.
Project description:Tissue fibrosis is a major cause of mortality that results from the deposition of matrix proteins by an activated mesenchyme. Macrophages accumulate in fibrosis, but the role of specific subgroups in supporting fibrogenesis has not been investigated in vivo. Here, we used single-cell RNA sequencing (scRNA-seq) to characterize the heterogeneity of macrophages in bleomycin-induced lung fibrosis in mice. A novel computational framework for the annotation of scRNA-seq by reference to bulk transcriptomes (SingleR) enabled the subclustering of macrophages and revealed a disease-associated subgroup with a transitional gene expression profile intermediate between monocyte-derived and alveolar macrophages. These CX3CR1+SiglecF+ transitional macrophages localized to the fibrotic niche and had a profibrotic effect in vivo. Human orthologs of genes expressed by the transitional macrophages were upregulated in samples from patients with idiopathic pulmonary fibrosis. Thus, we have identified a pathological subgroup of transitional macrophages that are required for the fibrotic response to injury.