Dataset Information

Feature selection followed by a residuals-based normalization simplifies and improves single-cell gene expression analysis.

ABSTRACT: Normalization is a critical step in the computational analysis of single-cell RNA-sequencing (scRNA-seq) counts data. The objective is to reduce systematic biases introduced by technical sources that can obscure underlying biological differences. This is typically accomplished by re-scaling the observed counts to reduce the differences in total counts between the cells and then transforming the scaled counts to stabilize the variances. In the standard scRNA-seq workflow, this is followed by feature selection to identify genes that capture most of the biologically meaningful variation across the cells. Here, we propose a simple feature selection method and show that we can perform feature selection before normalization. We also propose a novel residuals-based normalization method that includes a monotonic non-linear transformation to ensure effective variance stabilization of the residuals. We demonstrate significant improvements in downstream clustering analyses through the application of our feature selection and normalization methods to truth-known biological as well as simulated counts data sets. Based on these results, we make the case for a revised scRNA-seq analysis workflow wherein feature selection precedes and in fact informs our residuals-based normalization. This novel workflow has been implemented in an R package called Piccolo.
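
As a rough illustration of the conventional approach the abstract describes (re-scaling counts to even out per-cell totals, then transforming to stabilize variances), here is a minimal Python sketch. It is not the Piccolo method or its API. The function name `scale_and_log_normalize`, the median-based size factors, and the log1p transform are assumptions chosen for illustration only.

```python
import numpy as np


def scale_and_log_normalize(counts):
    """Conventional scale-then-transform normalization (illustrative only).

    counts: array of raw counts with shape (cells, genes).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)  # total counts per cell
    if np.any(totals == 0):
        raise ValueError("every cell needs at least one count")
    # Assumed size factor: each cell's total relative to the median total.
    size_factors = totals / np.median(totals)
    scaled = counts / size_factors[:, None]  # every cell now has the same total
    # Assumed variance-stabilizing transform: log(1 + x).
    return np.log1p(scaled)


# Hypothetical toy matrix: 3 cells x 3 genes.
toy = np.array([[10, 0, 30], [1, 0, 3], [5, 5, 10]])
normalized = scale_and_log_normalize(toy)
```

In the toy matrix the first two cells have the same relative expression profile at different sequencing depths, so they receive identical normalized values.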

SUBMITTER: Singh A

PROVIDER: S-EPMC10849523 | biostudies-literature | 2024 Jan

REPOSITORIES: biostudies-literature

Publications

Feature selection followed by a novel residuals-based normalization simplifies and improves single-cell gene expression analysis.

Amartya Singh, Hossein Khiabanian

bioRxiv: the preprint server for biology, 2024-05-09

Normalization is a crucial step in the analysis of single-cell RNA-sequencing (scRNA-seq) counts data. Its principal objectives are to reduce the systematic biases primarily introduced through technical sources and to transform the data to make it more amenable for application of established statistical frameworks. In the standard workflows, normalization is followed by feature selection to identify highly variable genes (HVGs) that capture most of the biologically meaningful variation across th ...

PMID: 38328133

Similar Datasets

Project description: Salsola ferganica is a desert herbaceous plant native to the arid regions of western and northwestern China. Because of its salt tolerance and drought resistance, it is of great significance for desert afforestation and sand fixation. Genes involved in plant responses to desert stresses have been studied extensively in recent years, and choosing the best internal reference genes for normalization is a critical step in analyzing gene expression under different conditions. Even so, no reference genes have been reported for gene expression normalization in S. ferganica. In this study, nine candidate reference genes (TUA-1726, TUA-1760, TUB, GAPDH, ACT, 50S, HSC70, APT, and U-box) in S. ferganica were analyzed under six different treatments (ABA, heat, cold, NaCl, methyl viologen (MV), and PEG). The suitability of the candidate genes was evaluated with statistical software, including geNorm, NormFinder, BestKeeper, and RefFinder, based on their stability values across all treatments. The results indicated that selecting two stable reference genes together is sufficient for normalization. To verify the suitability of these internal reference genes, the CT values of AP2/ERF transcription factor family genes were normalized using the most (ACT) and least (GAPDH) stable reference genes in S. ferganica seedlings under the six abiotic stresses. HSC70 and U-box were the most appropriate reference genes in ABA-stressed samples, and ACT and U-box were the optimal references for heat-stressed samples. TUA-1726 and U-box showed the smallest variation in expression levels under cold treatment. The best reference gene combinations for the other samples were U-box and ACT under NaCl treatment, ACT and TUA-1726 under MV stress, HSC70 and TUB under PEG treatment, and ACT across all samples. ACT and U-box showed higher stability than the other genes in the comprehensive stability ranking of RefFinder, which is determined by the geometric mean. These results will contribute to later gene expression studies in other closely related species and provide an important foundation for gene expression analysis in S. ferganica.
