Dataset Information

SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies.


ABSTRACT:

Objectives

Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies relying solely on potentially error-prone EHR-derived phenotypes (ie, surrogates) are subject to bias. Analyses of low-prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to address both issues simultaneously by developing new sampling methods that select an optimal subsample in which to collect gold-standard phenotypes, thereby improving the accuracy of association estimation.

Materials and methods

We develop a surrogate-assisted two-wave (SAT) sampling method in which a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated by the A-optimality criterion (OSMAC) are applied sequentially to select a subsample for outcome validation through manual chart review, subject to budget constraints. A model is then fitted to the subsample using the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT.
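
The two-wave idea described above can be illustrated with a minimal sketch on simulated data. This is not the authors' implementation: the function names (`fit_logistic`, `wave1_sgs`, `wave2_osmac`), the balanced wave-1 design, the use of the surrogate in place of the not-yet-validated outcome when computing wave-2 probabilities, the inverse-probability-weighted final fit, and all simulation settings are assumptions made only for illustration.

```python
import numpy as np

def fit_logistic(X, y, w=None, iters=50, ridge=1e-6):
    """Weighted logistic regression via Newton's method (tiny ridge for stability)."""
    w = np.ones(len(y)) if w is None else w / np.mean(w)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (w * (y - p)) - ridge * beta
        hess = (X * (w * p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def wave1_sgs(s, n1, rng):
    """Wave 1 (assumed design): n1/2 surrogate-positive and n1/2 surrogate-negative records."""
    idx, wts = [], []
    for level in (1, 0):
        pool = np.flatnonzero(s == level)
        take = min(n1 // 2, len(pool))
        idx.append(rng.choice(pool, size=take, replace=False))
        wts.append(np.full(take, len(pool) / take))  # inverse sampling fraction
    return np.concatenate(idx), np.concatenate(wts)

def wave2_osmac(X, s, beta_pilot, exclude, n2, rng):
    """Wave 2: A-optimality-style sampling probabilities. The surrogate s stands in
    for the unobserved outcome (an assumption of this sketch)."""
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta_pilot, -30, 30)))
    M = (X * (p * (1 - p))[:, None]).T @ X / len(X)  # pilot information matrix
    cand = np.setdiff1d(np.arange(len(X)), exclude)  # records not yet reviewed
    score = np.abs(s[cand] - p[cand]) * np.linalg.norm(
        np.linalg.solve(M, X[cand].T), axis=0)
    pi = score / score.sum()
    pos = rng.choice(len(cand), size=n2, replace=True, p=pi)
    return cand[pos], 1.0 / pi[pos]  # indices and inverse-probability weights

# Simulated cohort: rare outcome y, error-prone surrogate s (hypothetical settings).
rng = np.random.default_rng(0)
N = 20000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-4.0, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
s = np.where(y == 1, rng.binomial(1, 0.90, N), rng.binomial(1, 0.05, N))

# "Chart review" here simply means reading y for the selected records.
idx1, w1 = wave1_sgs(s, 300, rng)
beta_pilot = fit_logistic(X[idx1], y[idx1], w1)
idx2, w2 = wave2_osmac(X, s, beta_pilot, idx1, 300, rng)
beta_hat = fit_logistic(X[idx2], y[idx2], w2)

print("overall case prevalence:", y.mean())
print("wave-1 case prevalence: ", y[idx1].mean())
print("wave-2 case prevalence: ", y[idx2].mean())
print("estimated coefficients: ", beta_hat)
```

In this sketch, wave 2 samples with replacement and weights each drawn record by the inverse of its sampling probability. How the two waves are combined in the final model is a design choice, and the published method may handle it differently.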

Results

We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association.

Conclusions

The proposed approach addresses the problems posed by the rarity of cases and by surrogate misclassification in EHR-based association studies where true phenotypes are not available. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.

SUBMITTER: Liu X 

PROVIDER: S-EPMC9714591 | biostudies-literature | 2022 Apr

REPOSITORIES: biostudies-literature

Publications

SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies.

Xiaokang Liu, Jessica Chubak, Rebecca A. Hubbard, Yong Chen

Journal of the American Medical Informatics Association (JAMIA), 2022-04-01, issue 5

