Single-cell RNA-seq data of mammary gland epithelial cells from different gestational stages to detect and remove barcode swapping
ABSTRACT: Barcode swapping results in the mislabeling of sequencing reads between multiplexed samples on the new patterned flow cell Illumina sequencing machines. This may compromise the validity of numerous genomic assays, especially for single-cell studies where many samples are routinely multiplexed together. The severity and consequences of barcode swapping for single-cell transcriptomic studies remain poorly understood. We have used two statistical approaches to robustly quantify the fraction of swapped reads in each of two plate-based single-cell RNA sequencing datasets. We found that approximately 2.5% of reads were mislabeled between samples on the HiSeq 4000 machine, which is lower than previous reports. We observed no correlation between the swapped fraction of reads and the concentration of free barcode across plates. Furthermore, we have demonstrated that barcode swapping may generate complex but artefactual cell libraries in droplet-based single-cell RNA sequencing studies. To eliminate these artefacts, we have developed an algorithm to exclude individual molecules that have swapped between samples in 10X Genomics experiments, exploiting the combinatorial complexity present in the data. This permits the continued use of cutting-edge sequencing machines for droplet-based experiments while avoiding the confounding effects of barcode swapping. This data repository contains the sequencing files associated with the droplet-based scRNA-seq dataset in Griffiths et al. (2018). The data presented here should be used purely for technical analysis; the biological motivation is nonetheless briefly described below: The mammary gland is a unique organ as it undergoes most of its development during puberty and adulthood. Characterising the hierarchy of the various mammary epithelial cells and how they are regulated in response to gestation, lactation and involution is important for understanding how breast cancer develops. 
Recent studies have used numerous markers to enrich, isolate and characterise the different epithelial cell compartments within the adult mammary gland. However, in all of these studies only a handful of markers were used to define and trace cell populations. Therefore, there is a need for an unbiased and comprehensive description of mammary epithelial cells within the gland at different developmental stages. To this end, we used single-cell RNA sequencing (scRNAseq) to determine the gene expression profile of individual mammary epithelial cells across four adult developmental stages: nulliparous, mid gestation, lactation and post weaning (full natural involution).
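The abstract above mentions an algorithm that excludes swapped molecules in 10X Genomics experiments by exploiting the combinatorial complexity of the data. The following is a minimal sketch of one way such a filter could work, not the published implementation. It assumes that a molecule with the same cell barcode, UMI and gene appearing in several multiplexed samples is unlikely to arise by chance, and that it should be kept only in the sample holding most of its reads. The function name `remove_swapped`, the input tuple layout and the `min_frac` threshold of 0.8 are all hypothetical.

```python
from collections import defaultdict


def remove_swapped(reads, min_frac=0.8):
    """Filter molecules that appear to have swapped between samples.

    reads: iterable of (sample, cell_barcode, umi, gene, n_reads) tuples.
    Returns the set of (sample, cell_barcode, umi, gene) molecules to keep.
    min_frac is a hypothetical threshold, not a value from the source.
    """
    # Collect read counts per sample for each (cell barcode, UMI, gene) combination.
    by_molecule = defaultdict(dict)
    for sample, cell, umi, gene, n_reads in reads:
        key = (cell, umi, gene)
        by_molecule[key][sample] = by_molecule[key].get(sample, 0) + n_reads

    kept = set()
    for (cell, umi, gene), per_sample in by_molecule.items():
        total = sum(per_sample.values())
        top_sample, top_reads = max(per_sample.items(), key=lambda kv: kv[1])
        # Keep the molecule only in the sample that holds most of its reads;
        # if no sample dominates, discard it from every sample.
        if top_reads / total >= min_frac:
            kept.add((top_sample, cell, umi, gene))
    return kept
```

A molecule seen in only one sample always passes, since that sample holds all of its reads. A molecule split evenly between two samples is removed from both.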
Project description:We combined CRISPR genome editing with single-cell RNA sequencing to assess complex phenotypes in pooled cellular screens. Our method for CRISPR droplet sequencing (CROP-seq) comprises four key components: a gRNA vector that makes individual gRNAs detectable in single-cell transcriptomes, a high-throughput assay for single-cell RNA-seq, a computational pipeline for assigning single-cell transcriptomes to gRNAs, and a bioinformatic method for analyzing and interpreting gRNA-induced transcriptional profiles. CROP-seq allowed us to link gRNA expression to the associated transcriptome responses in thousands of single cells using a straightforward and broadly applicable screening workflow. Additional information is available from the CROP-seq website (http://crop-seq.computational-epigenetics.org). Overall design: Drop-seq species mixing experiment was performed with human HEK293T and mouse 3T3 cells in a 1:1 proportion as described by Macosko et al. For CROP-seq, Jurkat cells were transduced with a gRNA library targeting high-level regulators of T cell receptor signaling and a set of transcription factors. After 10 days of antibiotic selection and expansion, cells were stimulated with anti-CD3 and anti-CD28 antibodies or left untreated. Both conditions were analyzed using CROP-seq, measuring TCR activation for each gene knockout. Our dataset comprises 5,905 high-quality single-cell transcriptomes with uniquely assigned gRNAs. All CROP-seq raw data files are multiplexed with single-cell reads. Each read 1 contains the cell barcode (12 bp) and a molecule barcode (8 bp), and read 2 contains the transcriptome read. The libraries are pooled by nature but also intrinsically labelled. The file CROP-seq_Jurkat_TCR.digital_expression.csv.gz contains gene-level expression quantifications for each gene in each cell, where each cell corresponds to a cell barcode in read 1. 
For the Drop-seq_HEK293T-3T3 sample (Drop-seq species mixing), reads aligning to the two genomes were used to quantify, for each cell barcode, the number of reads coming from each genome. Similarly, in the CROP-seq_HEK293T sample (CROP-seq gRNA mixing), the number of gRNA molecules detected per cell barcode was counted; this is possible because these gRNA-containing transcripts are polyadenylated when expressed from a Pol2 promoter, as engineered.
| GSE92872 | GEO
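The read layout and species-mixing quantification described for this dataset can be sketched as follows. This is a minimal illustration, not the CROP-seq pipeline. It assumes the 12 bp cell barcode comes first in read 1 and is immediately followed by the 8 bp molecule barcode, and that each read has already been labelled with the genome it aligned to. The function names and the input format are hypothetical.

```python
from collections import Counter, defaultdict

CELL_BC_LEN = 12  # cell barcode length stated in the description
UMI_LEN = 8       # molecule barcode length stated in the description


def split_read1(seq):
    """Return (cell_barcode, molecule_barcode) from a read 1 sequence.

    Assumes the cell barcode precedes the molecule barcode.
    """
    if len(seq) < CELL_BC_LEN + UMI_LEN:
        raise ValueError("read 1 is shorter than 20 bp")
    return seq[:CELL_BC_LEN], seq[CELL_BC_LEN:CELL_BC_LEN + UMI_LEN]


def reads_per_genome(records):
    """Count reads per genome for each cell barcode.

    records: iterable of (read1_sequence, genome_label) pairs, where
    genome_label (for example "human" or "mouse") is assumed to come
    from an upstream alignment step.
    """
    counts = defaultdict(Counter)
    for read1, genome in records:
        cell, _umi = split_read1(read1)
        counts[cell][genome] += 1
    return counts
```

In a species-mixing experiment, a cell barcode whose reads come almost entirely from one genome is consistent with a single cell, while a barcode with substantial counts from both genomes suggests a doublet.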
Project description:Droplet Barcode Sequencing for Targeted Linked-Read Haplotyping of Single DNA Molecules
Project description:Using integrated genomics we identify a role for CLEC12A in antibacterial autophagy. Clec12a-/- mice are more susceptible to bacterial infection and CLEC12A-deficient cells exhibit impaired antibacterial autophagy. We used transcriptional profiling to understand the role of CLEC12A in the response to Salmonella and Listeria. Bone marrow-derived macrophages from WT or Clec12a-/- mice were infected with Salmonella enterica serovar Typhimurium or Listeria monocytogenes. Cells were harvested at 0, 3, 6, and 24 hours post-infection for RNA analysis. Please note that single-end sequencing was performed, but two files were generated: R1 files containing the sample barcodes (19 or 17 bp reads) and R2 files containing the single-end-sequenced 46 bp cDNA reads. Since the barcode information is mostly redundant, only R2 reads were submitted (described in 'raw_file_readme.txt').
Project description:Hepatitis C virus uniquely requires the liver-specific microRNA-122 for replication, yet global effects on endogenous miRNA targets during infection are unexplored. Here, high-throughput sequencing and crosslinking immunoprecipitation (HITS-CLIP) experiments of human Argonaute (Ago) during HCV infection showed robust Ago binding on the HCV 5’UTR, at known and predicted miR-122 sites. On the human transcriptome, we observed reduced Ago binding and functional mRNA de-repression of miR-122 targets during virus infection. This miR-122 “sponge” effect could be relieved and redirected to miR-15 targets by swapping the miRNA tropism of the virus. Single-cell expression data from reporters containing miR-122 sites showed significant de-repression during HCV infection depending on expression level and number of sites. We describe a quantitative mathematical model of HCV-induced miR-122 sequestration and propose that such miR-122 inhibition by HCV RNA may result in global de-repression of host miR-122 targets, providing an environment fertile for the long-term oncogenic potential of HCV. AGO HITS-CLIP libraries were generated from single-cell clones of miR-122-deleted Huh7.5 cells using CRISPR (KO), unedited controls (WT), or cells transfected with GFP instead of CRISPR. Libraries were generated with a 4nt index read, a common priming sequence, followed by a 5nt degenerate barcode terminating in a G. Files have been demultiplexed such that the 5nt degenerate barcode has been appended as the first 5 nucleotides of the read.
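Because the demultiplexed files carry the 5nt degenerate barcode as the first 5 nucleotides of each read, downstream processing would typically start by separating the two. A minimal sketch, assuming reads are plain sequence strings; the function names are hypothetical, and collapsing reads that share both barcode and insert is a common use of degenerate barcodes rather than a step stated in the description.

```python
DEGENERATE_LEN = 5  # degenerate barcode length stated in the description


def split_degenerate_barcode(read):
    """Return (barcode, insert) for a demultiplexed read sequence."""
    if len(read) <= DEGENERATE_LEN:
        raise ValueError("read has no insert after the 5nt barcode")
    return read[:DEGENERATE_LEN], read[DEGENERATE_LEN:]


def collapse_duplicates(reads):
    """Keep one copy of each (barcode, insert) pair, in first-seen order.

    Treating identical pairs as PCR duplicates is an assumption made
    for illustration only.
    """
    seen = set()
    unique = []
    for read in reads:
        pair = split_degenerate_barcode(read)
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```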
Project description:We use RNA-sequencing to generate gene expression profiles of fetal mammary cells with unique sorting strategies. These analyses reveal that sorting fetal mammary cells with Sox10 and EpCAM sorting markers provides a stroma-free fMaSC-enriched cell population. The gene expression profiling of these cells offers a resource to probe the molecular mechanisms that specify this unique cell state. Examination of 2 different sorting strategies for fetal mammary cells
Project description:We reasoned that by using a distinct set of oligo-tagged antibodies against ubiquitously expressed proteins, we could uniquely label multiple populations of cells, multiplex them together, and use the barcoded antibody signal as a fingerprint. We refer to this approach as cellular "hashing", as our set of oligos defines a "look up table" to assign each multiplexed cell to its original sample. We demonstrate application of the technique to combine eight samples and run them simultaneously in a single droplet-based scRNA-seq run. We show that cell hashtags allow sample multiplexing, confident multiplet identification and super-loading in the context of a commonly used droplet-based scRNA-seq method to drive down the per-cell cost of large-scale scRNA-seq experiments. Overall design: We chose a set of monoclonal antibodies directed against ubiquitously and highly expressed immune surface markers (CD45, CD98, CD44, and CD11a) and combined these antibodies into eight identical pools (pool A-H), and subsequently conjugated each pool to a distinct hashtag oligonucleotide (henceforth referred to as HTOs). The HTOs contain a unique 10- or 12-bp barcode that could be read out and linked to the cellular transcriptome, through minor modifications to standard scRNA-seq protocols.
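The "look up table" idea can be illustrated with a deliberately simple classifier that assigns a cell barcode to the sample whose HTO dominates its counts. This is a sketch for illustration, not the classification method used in the study. The function name, the labels "negative" and "multiplet", and the thresholds `min_count` and `min_ratio` are all hypothetical.

```python
def assign_hashtag(hto_counts, min_count=10, min_ratio=3.0):
    """Assign one cell barcode to a sample from its HTO counts.

    hto_counts: dict mapping HTO name -> read or UMI count for one cell.
    Returns the winning HTO name, "negative" if the signal is too weak,
    or "multiplet" if two HTOs are both strongly detected.
    Thresholds are hypothetical values chosen for illustration.
    """
    if len(hto_counts) < 2:
        raise ValueError("need counts for at least two HTOs")
    ranked = sorted(hto_counts.items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top_n), (_second_name, second_n) = ranked[0], ranked[1]
    if top_n < min_count:
        return "negative"
    # A strong runner-up suggests cells from two samples in one droplet.
    if second_n > 0 and top_n / second_n < min_ratio:
        return "multiplet"
    return top_name
```

Because multiplets between samples carry two distinct hashtags, they can be flagged and discarded, which is what makes super-loading of the droplet instrument practical.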
Project description:RNA-Seq is a powerful tool for transcriptome profiling, but is hampered by sequence-dependent bias and inaccuracy at low copy numbers intrinsic to exponential PCR amplification. We developed a simple strategy for mitigating these complications, allowing truly digital RNA-Seq. Following reverse transcription, a large set of barcode sequences is added in excess, and nearly every cDNA molecule is uniquely labeled by random attachment of barcode sequences to both ends. After PCR, we applied paired-end deep sequencing to read the two barcodes and cDNA sequences. Rather than counting the number of reads, RNA abundance is measured based on the number of unique barcode sequences observed for a given cDNA sequence. We optimized the barcodes to be unambiguously identifiable even in the presence of multiple sequencing errors. This method allows counting with single copy resolution despite sequence-dependent bias and PCR amplification noise, and is analogous to digital PCR but amenable to quantifying a whole transcriptome. We demonstrated transcriptome profiling of E. coli with more accurate and reproducible quantification than conventional RNA-Seq. We analyzed two replicates of the same bulk E. coli transcriptome sample. In each sample, we included internal standards to demonstrate that the digital RNA-Seq system can count fragments accurately.
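The counting principle described above, where abundance is the number of unique barcode combinations per cDNA rather than the number of reads, can be sketched as follows. This is a minimal illustration that assumes each read has already been resolved to a cDNA identifier and a pair of error-corrected barcodes; the function name and input format are hypothetical, and the barcode error correction mentioned in the description is not implemented here.

```python
from collections import defaultdict


def digital_counts(reads):
    """Count unique barcode pairs per cDNA.

    reads: iterable of (cdna_id, left_barcode, right_barcode) tuples.
    Returns a dict mapping cdna_id -> number of distinct barcode pairs.
    """
    pairs_seen = defaultdict(set)
    for cdna_id, left, right in reads:
        # PCR duplicates share both barcodes, so they collapse in the set.
        pairs_seen[cdna_id].add((left, right))
    return {cdna_id: len(pairs) for cdna_id, pairs in pairs_seen.items()}
```

Reads that differ only by amplification copy number contribute a single count, which is why the estimate is insensitive to PCR bias.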
Project description:In many gene expression studies, cells are extracted by tissue dissociation and Fluorescence-Activated Cell Sorting (FACS), but the effect of these protocols on cellular transcriptomes is not well characterized and often ignored. Here, we applied single-cell mRNA sequencing (scRNA-seq) to muscle stem cells, and unexpectedly found a subpopulation that is strongly affected by the widely-used dissociation protocol that we employed. One implication of this finding is that several published transcriptomics studies may need to be reinterpreted. Importantly, we detected similar subpopulations in other single-cell datasets, suggesting that cells from other tissues might be affected by this artefact as well. Overall design: Mouse satellite cells and zebrafish fin cells were extracted from Tibialis Anterior muscles of Pax7nGFP mice and wildtype zebrafish fins, respectively. For cell extraction, traditional (Supplementary Methods) dissociation protocols that combine mechanical and enzymatic dissociation were employed, and live cells were subsequently sorted into plates using FACS. Next, single-cell mRNA sequencing (CEL-Seq or SORT-Seq (robotized version of CEL-Seq2)) was applied, and data was analyzed with RaceID2 to identify clusters. 
CEL-Seq samples: Manual CEL-Seq; Satellite cells (unstained); Male Pax7nGFP mice (5-7 months old); 1h collagenase-treated (default dissociation protocol); 96 cells per plate with 96 different barcodes (see "Cel-seq_barcodes_96.csv"); some primer numbers are bulk samples (see "BulkSamples_BarcodesAndNrOfCellsUsed_perCEL-Seq1-library"); Spike-ins included (see "ERCC92.fa"); No mitochondrial reads in count tables; Sequencing lanes not concatenated in fastq-files uploaded here; "Merged_CEL-Seq_AllMiceAndLibrariesMerged.csv"-file is count table with reads from all mice and libraries merged (Annotation of columns: Zx.y, where Z = mouse, x = library and y = cell barcode; bulk samples are no longer included in this file); See Supplementary Methods for details. SORT-Seq 1h and 2h dissociated samples: Robotized CEL-Seq2; Satellite cells (unstained); Male Pax7nGFP mice (5-7 months old; 8 muscles from 4 mice); One plate of 1h (default dissociation protocol) and one plate of 2h collagenase-treated cells; 384 cells per plate with each of the 96 barcodes (see "Cel-seq_barcodes_96.csv") used 4 times per plate (therefore, each plate has 4 libraries); No bulk samples included; Spike-ins included (see "ERCC92.fa"); No mitochondrial reads in count table; In some wells, we sorted no cell (internal negative control; barcodes #95 and #96 were used for empty wells); Sequencing lanes not concatenated in fastq-files uploaded here; "Merged_SORT-Seq_DissociationTimecourse.csv"-file is count table where reads from all dissociation timepoints are merged (Annotation of columns: DZhx_y, where Z = 1 or 2 hours collagenase-treated, x = library and y = cell barcode); See Supplementary Methods for details. 
SORT-Seq MitoTracker stained samples (pilot and repeat): Robotized CEL-Seq2 samples; Satellite cells stained with MitoTracker; Female Pax7nGFP mice (one 4.7-month-old mouse for pilot experiment; three 6-month-old mice for repeat experiment); 1h collagenase-treated (default dissociation protocol); 384 cells per plate with each of the 384 barcodes (see "Cel-seq_barcodes_384.csv") used 1 time per plate (therefore, each plate has 1 library); Note: pilot experiment has 263 cells (so plate was partly empty), repeat experiment done with 4 full plates; No bulk samples included; Spike-ins included (see "ERCC92.fa"); Mitochondrial reads (rows named "*__chrM") included in count tables (these were removed prior to RaceID2); In some wells, we sorted no cell (barcodes #357-#360 and #381-#384 were used for empty wells); Sequencing lanes concatenated in fastq-files uploaded here; No merged file was generated for pilot experiment (as only one library), "Merged_MitoTracker_Repeat.csv"-file is count table where reads from all plates of repeat experiment were merged (Annotation of columns: Plx_Welly, where x = plate number (1-4) and y = cell barcode); See Supplementary Methods for details. 
SORT-Seq zebrafish fin samples: Robotized CEL-Seq2; Fin cells (unstained; all live cells); Wildtype zebrafish; Dissociated using default fin dissociation protocol (Supplementary Methods); 384 cells per plate with each of the 384 barcodes (see "Cel-seq_barcodes_384.csv") used 1 time per plate (therefore, each plate has 1 library); Note: only merged count table file ("fin_C_E_count_table.csv", Annotation of columns: Xx.py.prim.finZ, where x = cell barcode, y = plate number and Z is fish (C or E)) and no individual library count table files were uploaded to GEO for zebrafish fin data; No bulk samples included; Spike-ins not included in merged count tables file; Mitochondrial reads not included in merged count tables file; In some wells, we sorted no cell (barcodes #357-#360 and #381-#384 were used for empty wells); Sequencing lanes concatenated in fastq-files uploaded here; See Supplementary Methods for details.
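The column annotations of the merged count tables can be decoded programmatically. The sketch below handles two of the stated templates, DZhx_y (dissociation timecourse) and Plx_Welly (MitoTracker repeat), and flags the barcodes listed above as reserved for empty wells. It assumes the templates are literal, so that a column might look like "D1h2_45" or "Pl3_Well120"; these example names, the regular expressions and the function name are hypothetical and should be checked against the actual file headers.

```python
import re

# Patterns derived from the stated templates; exact header formatting is assumed.
DISSOCIATION = re.compile(r"^D([12])h(\d+)_(\d+)$")
MITOTRACKER = re.compile(r"^Pl([1-4])_Well(\d+)$")

# Barcodes reserved for empty wells, as listed in the descriptions.
EMPTY_96 = {95, 96}
EMPTY_384 = set(range(357, 361)) | set(range(381, 385))


def parse_column(name):
    """Decode a merged count table column name, or return None if unknown."""
    match = DISSOCIATION.match(name)
    if match:
        hours, library, barcode = map(int, match.groups())
        return {
            "dataset": "dissociation",
            "hours": hours,
            "library": library,
            "barcode": barcode,
            "empty_well": barcode in EMPTY_96,
        }
    match = MITOTRACKER.match(name)
    if match:
        plate, barcode = map(int, match.groups())
        return {
            "dataset": "mitotracker",
            "plate": plate,
            "barcode": barcode,
            "empty_well": barcode in EMPTY_384,
        }
    return None
```

Flagged columns could then be dropped or kept as negative controls before clustering.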