Transcriptomics

Dataset Information

Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes


ABSTRACT: Medulloblastoma (MB) is a brain cancer predominantly arising in children. Roughly 70% of patients are cured today, but survivors often suffer from severe sequelae. MB has been extensively studied by molecular profiling, but often in small and scattered cohorts. To improve cure rates and reduce treatment side effects, accurate integration of such data to increase analytical power will be important, if not essential. We have integrated 23 transcription datasets, spanning 1350 MB and 291 normal brain samples. To remove batch effects, we combined the Removal of Unwanted Variation (RUV) method with a novel pipeline for determining empirical negative control genes and a panel of metrics to evaluate normalization performance. The documented approach enabled the removal of a majority of batch effects, producing a large-scale, integrative dataset of MB and cerebellar expression data. The proposed strategy will be broadly applicable for accurate integration of data and incorporation of normal reference samples for studies of various diseases. We hope that the integrated dataset will improve current research in the field of MB by allowing more large-scale gene expression analyses.

For all selected samples, raw CEL files were downloaded from GEO or ArrayExpress (AE). All raw CEL files from the same platform were then processed together using the R/Bioconductor package oligo in conjunction with the RMA algorithm. The Human Gene 1.0 ST and Human Gene 1.1 ST arrays were analysed at the core level, while the Human Exon 1.0 ST arrays were processed at the extended level. Next, we mapped the identifiers of the HG-U133 Plus 2 and Human Exon 1.0 ST arrays to Human Gene 1.0/1.1 ST identifiers using the 'Best Match' information from Affymetrix (https://www.affymetrix.com/support/technical/byproduct.affx?product=hugene-1_0-st-v1). In addition, to increase the overlap between the Human Exon 1.0 ST and Human Gene 1.0/1.1 ST data, we also inspected and added probe mappings from the 'Good Match' and 'Complex Match' files, including probes for the genes MYCN, PTCH1, NPR3, UNC5D, DKK2, and GABRA5.

After mapping of probe identifiers within each platform, multiple rows mapping to the same identifier were collapsed using the mean value. All platform datasets were then merged on probe identifiers, and gene symbols were assigned using the hugene11sttranscriptcluster.db package. Multiple rows mapping to the same gene, or multiple columns mapping to the same patient, were collapsed using the mean value. Finally, the resulting gene expression matrix was quantile normalized using the respective function in the preprocessCore package.
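The two numerical steps described above, collapsing duplicate rows by their mean and quantile normalizing the merged matrix, can be illustrated with a minimal sketch. This is not the authors' code: the original pipeline used R/Bioconductor (oligo, preprocessCore), whereas the sketch below uses plain Python, and the function names collapse_rows_by_mean and quantile_normalize as well as the toy values are hypothetical. Ties are broken by row order here, a simplification that may differ from how preprocessCore handles tied values.

```python
from collections import defaultdict
from statistics import mean


def collapse_rows_by_mean(ids, rows):
    """Average all rows that share an identifier (e.g. several probes per gene)."""
    grouped = defaultdict(list)
    for identifier, row in zip(ids, rows):
        grouped[identifier].append(row)
    # For each identifier, take the column-wise mean over its member rows.
    return {
        identifier: [mean(col) for col in zip(*members)]
        for identifier, members in grouped.items()
    }


def quantile_normalize(matrix):
    """Map every column (sample) of a genes x samples matrix onto one shared distribution."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value across all samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(sorted_cols[c][k] for c in range(n_cols)) for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        # Row indices ordered by value; ties are broken by row order (a simplification).
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = reference[rank]
    return result


# Toy data (hypothetical values): two rows map to MYCN, one to PTCH1.
collapsed = collapse_rows_by_mean(
    ["MYCN", "MYCN", "PTCH1"],
    [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]],
)
normalized = quantile_normalize([[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]])
```

After quantile normalization every sample column contains the same set of values, differing only in which gene holds which rank. This is why the step is applied last, once all platforms have been merged into a single matrix.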

ORGANISM(S): Homo sapiens

PROVIDER: GSE124814 | GEO | 2019/02/06

REPOSITORIES: GEO

Similar Datasets

2017-11-22 | GSE71648 | GEO
2015-11-01 | GSE52244 | GEO
2014-05-21 | GSE44185 | GEO
2008-06-20 | GSE11832 | GEO
2008-06-20 | GSE11833 | GEO
2016-06-15 | E-GEOD-83353 | biostudies-arrayexpress
2012-07-27 | GSE37382 | GEO
2015-12-31 | GSE62803 | GEO
2015-12-31 | GSE50765 | GEO
2017-06-13 | GSE85217 | GEO