
Dataset Information


A pairwise strategy for imputing predictive features when combining multiple datasets.


ABSTRACT:

Motivation

When training predictive models on high-dimensional genomic data, data from multiple studies are often combined to increase sample size and improve generalizability. A drawback of this approach is that each study may measure a different set of features because of variations in expression measurement platform or technology. It is common practice to work only with the intersection of features measured across all studies, which blindly discards potentially useful feature information measured in individual studies or in subsets of studies.

Results

We characterize the loss in predictive performance incurred by using only the intersection of features available across all studies when training predictors on gene expression data from microarray and sequencing datasets. We study the properties of linear and polynomial regression for imputing discarded features and demonstrate improvements in the external performance of prediction functions, both in simulation and in gene expression data collected from breast cancer patients. To improve this process, we propose a pairwise strategy that applies any imputation algorithm to two studies at a time and averages the imputed features across pairs. We demonstrate that the pairwise strategy is preferable to first merging all datasets and then imputing any resulting missing features. Finally, we provide insights on which subsets of intersected and study-specific features should be used so that missing-feature imputation best promotes cross-study replicability.
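The pairwise idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked below); it assumes each study is a Python dict mapping feature name to a 1-D array of samples, uses ordinary least-squares linear regression as the imputation algorithm, and the function names `impute_from_pair` and `pairwise_impute` are hypothetical.

```python
import numpy as np


def impute_from_pair(target, donor, missing):
    """Impute `missing` features of `target` using a single donor study.

    Assumption: a study is a dict {feature name: 1-D array of samples}.
    A linear model is fit in the donor on the features both studies share,
    then applied to the target's samples.
    """
    shared = sorted(set(target) & set(donor))
    if not shared:
        return {}
    n_donor = len(donor[shared[0]])
    n_target = len(target[shared[0]])
    # Design matrices with an intercept column followed by the shared features.
    A_donor = np.column_stack([np.ones(n_donor)] + [donor[f] for f in shared])
    A_target = np.column_stack([np.ones(n_target)] + [target[f] for f in shared])
    imputed = {}
    for f in missing:
        if f in donor:  # only features this donor actually measured
            coef, *_ = np.linalg.lstsq(A_donor, donor[f], rcond=None)
            imputed[f] = A_target @ coef
    return imputed


def pairwise_impute(studies, target_name):
    """Impute features absent from one study by pairing it with every other
    study in turn and averaging the per-pair predictions."""
    target = studies[target_name]
    all_features = set().union(*(set(s) for s in studies.values()))
    missing = sorted(all_features - set(target))
    predictions = {f: [] for f in missing}
    for name, donor in studies.items():
        if name == target_name:
            continue
        for f, values in impute_from_pair(target, donor, missing).items():
            predictions[f].append(values)
    # Average across the pairs that could supply each feature.
    return {f: np.mean(v, axis=0) for f, v in predictions.items() if v}
```

Calling `pairwise_impute(studies, "C")` on a dict of studies would return imputed arrays for every feature that study `"C"` lacks but at least one other study measures. Because each pair uses its own shared feature set, a donor that overlaps heavily with the target can contribute more predictors than the intersection across all studies would allow.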

Availability and implementation

The code is available at https://github.com/YujieWuu/Pairwise_imputation.

Supplementary information

Supplementary information is available at Bioinformatics online.

SUBMITTER: Wu Y 

PROVIDER: S-EPMC9835467 | biostudies-literature | 2023 Jan

REPOSITORIES: biostudies-literature


Publications

A pairwise strategy for imputing predictive features when combining multiple datasets.

Wu Yujie, Ren Boyu, Patil Prasad

Bioinformatics (Oxford, England) 2023-01-01 1



Similar Datasets

| S-EPMC4597059 | biostudies-literature
| S-EPMC10413136 | biostudies-literature
| S-EPMC4263197 | biostudies-literature
| S-EPMC6765422 | biostudies-literature
| S-EPMC9908066 | biostudies-literature
| S-EPMC6883070 | biostudies-literature
| S-EPMC6376058 | biostudies-literature
| S-EPMC3546004 | biostudies-literature
| S-EPMC8776850 | biostudies-literature
| S-EPMC3721170 | biostudies-literature