Project description:Single-cell data integration can provide a comprehensive molecular view of cells. However, how to integrate heterogeneous single-cell multi-omics as well as spatially resolved transcriptomic data remains a major challenge. Here we introduce uniPort, a unified single-cell data integration framework that combines a coupled variational autoencoder (coupled-VAE) and minibatch unbalanced optimal transport (Minibatch-UOT). It leverages both highly variable common and dataset-specific genes for integration to handle the heterogeneity across datasets, and it is scalable to large-scale datasets. uniPort jointly embeds heterogeneous single-cell multi-omics datasets into a shared latent space. It can further construct a reference atlas for gene imputation across datasets. Meanwhile, uniPort provides a flexible label transfer framework to deconvolute heterogeneous spatial transcriptomic data using an optimal transport plan, instead of embedding latent space. We demonstrate the capability of uniPort by applying it to integrate a variety of datasets, including single-cell transcriptomics, chromatin accessibility, and spatially resolved transcriptomic data.
Project description:Batch effects in single-cell RNA-seq data pose a significant challenge for comparative analyses across samples, individuals, and conditions. Although batch effect correction methods are routinely applied, data integration often leads to overcorrection and can result in the loss of biological variability. In this work we present STACAS, a batch correction method for scRNA-seq that leverages prior knowledge on cell types to preserve biological variability upon integration. Through an open-source benchmark, we show that semi-supervised STACAS outperforms state-of-the-art unsupervised methods, as well as supervised methods such as scANVI and scGen. STACAS scales well to large datasets and is robust to incomplete and imprecise input cell type labels, which are commonly encountered in real-life integration tasks. We argue that the incorporation of prior cell type information should be a common practice in single-cell data integration, and we provide a flexible framework for semi-supervised batch effect correction.
Project description:Data integration of single-cell RNA-seq (scRNA-seq) data describes the task of embedding datasets gathered from different sources or experiments into a common representation so that cells with similar types or states are embedded close to one another independently from their dataset of origin. Data integration is a crucial step in most scRNA-seq data analysis pipelines involving multiple batches. It improves data visualization, batch effect reduction, clustering, label transfer, and cell type inference. Many data integration tools have been proposed during the last decade, but a surge in the number of these methods has made it difficult to pick one for a given use case. Furthermore, these tools are provided as rigid pieces of software, making it hard to adapt them to various specific scenarios. In order to address both of these issues at once, we introduce the transmorph framework. It allows the user to engineer powerful data integration pipelines and is supported by a rich software ecosystem. We demonstrate transmorph usefulness by solving a variety of practical challenges on scRNA-seq datasets including joint datasets embedding, gene space integration, and transfer of cycle phase annotations. transmorph is provided as an open source python package.
Project description:Polyploidization drives regulatory and phenotypic innovation. How the merger of different genomes contributes to polyploid development is a fundamental issue in evolutionary developmental biology and breeding research. Clarifying this issue is challenging because of genome complexity and the difficulty in tracking stochastic subgenome divergence during development. Recent single-cell sequencing techniques enabled probing subgenome-divergent regulation in the context of cellular differentiation. However, analyzing single-cell data suffers from high error rates due to high dimensionality, noise, and sparsity, and the errors stack up in polyploid analysis due to the increased dimensionality of comparisons between subgenomes of each cell, hindering deeper mechanistic understandings. In this study, we develop a quantitative computational framework, called "pseudo-genome divergence quantification" (pgDQ), for quantifying and tracking subgenome divergence directly at the cellular level. Further comparing with cellular differentiation trajectories derived from single-cell RNA sequencing data allows for an examination of the relationship between subgenome divergence and the progression of development. pgDQ produces robust results and is insensitive to data dropout and noise, avoiding high error rates due to multiple comparisons of genes, cells, and subgenomes. A statistical diagnostic approach is proposed to identify genes that are central to subgenome divergence during development, which facilitates the integration of different data modalities, enabling the identification of factors and pathways that mediate subgenome-divergent activity during development. Case studies have demonstrated that applying pgDQ to single-cell and bulk tissue transcriptomic data promotes a systematic and deeper understanding of how dynamic subgenome divergence contributes to developmental trajectories in polyploid evolution.
Project description:Polyploidization drives regulatory and phenotypic innovation. How the merger of different genomes contributes to polyploid development is a fundamental issue in evolutionary developmental biology and breeding research. Clarifying this issue is challenging because of genome complexity and the difficulty in tracking stochastic subgenome divergence during development. Recent single-cell sequencing techniques enabled probing subgenome divergent regulation in the context of cellular differentiation. However, analyzing single-cell data suffers from high error rates due to high-dimensionality, noise, and sparsity, and the errors stack up in polyploid analysis due to the increased dimensionality of comparisons between subgenomes of each cell, hindering deeper mechanistic understandings. Here, we developed a quantitative computational framework, pseudo-genome divergence quantification (pgDQ), for quantifying and tracking subgenome divergence directly at the cellular level. Further comparing with cellular differentiation trajectories derived from scRNA-seq data allowed for an examination of the relationship between subgenome divergence and the progression of development. pgDQ produces robust results and is insensitive to data dropout and noise, avoiding high error rates due to multiple comparisons of genes, cells, and subgenomes. A statistical diagonostic approach is proposed to identify genes that are central to subgenome divergence during development, which facilitates the integration of different data modalities, enabling the identification of factors and pathways that mediate subgenome-divergent activity during development. Case studies demonstrated that applying pgDQ to single cell and bulk tissue transcriptome data promotes a systematic and deeper understanding of how dynamic subgenome divergence contributes to developmental trajectories in polyploid evolution.
Project description:Single-cell transcriptomics datasets from the same anatomical sites generated by different research labs are becoming increasingly common. However, fast and computationally inexpensive tools for standardization of cell-type annotation and data integration are still needed in order to increase research inclusivity. To standardize cell-type annotation and integrate single-cell transcriptomics datasets, we have built a fast model-free integration method, named MASI (Marker-Assisted Standardization and Integration). MASI first identifies putative cell-type markers from reference data through an ensemble approach. Then, it converts gene expression matrix to cell-type score matrix with the identified putative cell-type markers for the purpose of cell-type annotation and data integration. Because of integration through cell-type markers instead of model inference, MASI can annotate approximately one million cells on a personal laptop, which provides a cheap computational alternative for the single-cell community. We benchmark MASI with other well-established methods and demonstrate that MASI outperforms other methods based on speed. Its performance for both tasks of data integration and cell-type annotation are comparable or even superior to these existing methods. To harness knowledge from single-cell atlases, we demonstrate three case studies that cover integration across biological conditions, surveyed participants, and research groups, respectively.
Project description:MotivationSingle-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study's conclusions, and therefore computational strategies for the identification of doublets are needed.ResultsWith scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds.Availability and implementationscds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds).Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:As the number of single-cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA-seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state-of-the-art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi-tools.
Project description:Integrating single cell omics and single cell imaging allows for a more effective characterisation of the underlying mechanisms that drive a phenotype at the tissue level, creating a comprehensive profile at the cellular level. Although the use of imaging data is well established in biomedical research, its primary application has been to observe phenotypes at the tissue or organ level, often using medical imaging techniques such as MRI, CT, and PET. These imaging technologies complement omics-based data in biomedical research because they are helpful for identifying associations between genotype and phenotype, along with functional changes occurring at the tissue level. Single cell imaging can act as an intermediary between these levels. Meanwhile new technologies continue to arrive that can be used to interrogate the genome of single cells and its related omics datasets. As these two areas, single cell imaging and single cell omics, each advance independently with the development of novel techniques, the opportunity to integrate these data types becomes more and more attractive. This review outlines some of the technologies and methods currently available for generating, processing, and analysing single-cell omics- and imaging data, and how they could be integrated to further our understanding of complex biological phenomena like ageing. We include an emphasis on machine learning algorithms because of their ability to identify complex patterns in large multidimensional data.