Dataset Information

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets.

ABSTRACT:

Background

Missing data is frequently an inevitable issue in cohort studies and it can adversely affect the study's findings. We assess the effectiveness of eight frequently utilized statistical and machine learning (ML) imputation methods for dealing with missing data in predictive modelling of cohort study datasets. This evaluation is based on real data and predictive models for cardiovascular disease (CVD) risk.

Methods

The data is from a real-world cohort study in Xinjiang, China. It includes personal information, physical examination data, questionnaires, and laboratory biochemical results from 10,164 subjects with a total of 37 variables. Simple imputation (Simple), regression imputation (Regression), expectation-maximization(EM), multiple imputation (MICE) , K nearest neighbor classification (KNN), clustering imputation (Cluster), random forest (RF), and decision tree (Cart) were the chosen imputation methods. Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are utilised to assess the performance of different methods for missing data imputation at a missing rate of 20%. The datasets processed with different missing data imputation methods were employed to construct a CVD risk prediction model utilizing the support vector machine (SVM). The predictive performance was then compared using the area under the curve (AUC).

Results

The most effective imputation results were attained by KNN (MAE: 0.2032, RMSE: 0.7438, AUC: 0.730, CI: 0.719-0.741) and RF (MAE: 0.3944, RMSE: 1.4866, AUC: 0.777, CI: 0.769-0.785). The subsequent best performances were achieved by EM, Cart, and MICE, while Simple, Regression, and Cluster attained the worst performances. The CVD risk prediction model was constructed using the complete data (AUC:0.804, CI:0.796-0.812) in comparison with all other models with p<0.05.

Conclusion

KNN and RF exhibit superior performance and are more adept at imputing missing data in predictive modelling of cohort study datasets.

SUBMITTER: Li J

PROVIDER: S-EPMC10870437 | biostudies-literature | 2024 Feb

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets.

Li JiaHang J Guo ShuXia S Ma RuLin R He Jia J Zhang XiangHui X Rui DongSheng D Ding YuSong Y Li Yu Y Jian LeYao L Cheng Jing J Guo Heng H

BMC medical research methodology 20240216 1

<h4>Background</h4>Missing data is frequently an inevitable issue in cohort studies and it can adversely affect the study's findings. We assess the effectiveness of eight frequently utilized statistical and machine learning (ML) imputation methods for dealing with missing data in predictive modelling of cohort study datasets. This evaluation is based on real data and predictive models for cardiovascular disease (CVD) risk.<h4>Methods</h4>The data is from a real-world cohort study in Xinjiang, Ch ...[more]

PMID: 38365610

Similar Datasets

Project description:In various missing data problems, values are not entirely missing, but are coarsened. For coarsened observations, instead of observing the true value, a subset of values - strictly smaller than the full sample space of the variable - is observed to which the true value belongs. In our motivating example for patients with endometrial carcinoma, the degree of lymphovascular space invasion (LVSI) can be either absent, focally present, or substantially present. For a subset of individuals, however, LVSI is reported as being present, which includes both non-absent options. In the analysis of such a dataset, difficulties arise when coarsened observations are to be used in an imputation procedure. To our knowledge, no clear-cut method has been described in the literature on how to handle an observed subset of values, and treating them as entirely missing could lead to biased estimates. Therefore, in this paper, we evaluated the best strategy to deal with coarsened and missing data in multiple imputation. We tested a number of plausible ad hoc approaches, possibly already in use by statisticians. Additionally, we propose a principled approach to this problem, consisting of an adaptation of the SMC-FCS algorithm (SMC-FCS CoCo$$ {}_{\mathrm{CoCo}} $$ : Coarsening compatible), that ensures that imputed values adhere to the coarsening information. These methods were compared in a simulation study. This comparison shows that methods that prevent imputations of incompatible values, like the SMC-FCS CoCo$$ {}_{\mathrm{CoCo}} $$ method, perform consistently better in terms of a lower bias and RMSE, and achieve better coverage than methods that ignore coarsening or handle it in a more naïve way. The analysis of the motivating example shows that the way the coarsening information is handled can matter substantially, leading to different conclusions across methods. Overall, our proposed SMC-FCS CoCo$$ {}_{\mathrm{CoCo}} $$ method outperforms other methods in handling coarsened data, requires limited additional computation cost and is easily extendable to other scenarios.

Project description:BackgroundMissing participant outcome data (MOD) are ubiquitous in systematic reviews with network meta-analysis (NMA) as they invade from the inclusion of clinical trials with reported participant losses. There are available strategies to address aggregate MOD, and in particular binary MOD, while considering the missing at random (MAR) assumption as a starting point. Little is known about their performance though regarding the meta-analytic parameters of a random-effects model for aggregate binary outcome data as obtained from trial-reports (i.e. the number of events and number of MOD out of the total randomised per arm).MethodsWe used four strategies to handle binary MOD under MAR and we classified these strategies to those modelling versus excluding/imputing MOD and to those accounting for versus ignoring uncertainty about MAR. We investigated the performance of these strategies in terms of core NMA estimates by performing both an empirical and simulation study using random-effects NMA based on electrical network theory. We used Bland-Altman plots to illustrate the agreement between the compared strategies, and we considered the mean bias, coverage probability and width of the confidence interval to be the frequentist measures of performance.ResultsModelling MOD under MAR agreed with exclusion and imputation under MAR in terms of estimated log odds ratios and inconsistency factor, whereas accountability or not of the uncertainty regarding MOD affected intervention hierarchy and precision around the NMA estimates: strategies that ignore uncertainty about MOD led to more precise NMA estimates, and increased between-trial variance. All strategies showed good performance for low MOD (<5%), consistent evidence and low between-trial variance, whereas performance was compromised for large informative MOD (> 20%), inconsistent evidence and substantial between-trial variance, especially for strategies that ignore uncertainty due to MOD.ConclusionsThe analysts should avoid applying strategies that manipulate MOD before analysis (i.e. exclusion and imputation) as they implicate the inferences negatively. Modelling MOD, on the other hand, via a pattern-mixture model to propagate the uncertainty about MAR assumption constitutes both conceptually and statistically proper strategy to address MOD in a systematic review.

Project description:BackgroundMissing observations within the univariate time series are common in real-life and cause analytical problems in the flow of the analysis. Imputation of missing values is an inevitable step in every incomplete univariate time series. Most of the existing studies focus on comparing the distributions of imputed data. There is a gap of knowledge on how different imputation methods for univariate time series affect the forecasting performance of time series models. We evaluated the prediction performance of autoregressive integrated moving average (ARIMA) and long short-term memory (LSTM) network models on imputed time series data using ten different imputation techniques.MethodsMissing values were generated under missing completely at random (MCAR) mechanism at 10%, 15%, 25%, and 35% rates of missingness using complete data of 24-h ambulatory diastolic blood pressure readings. The performance of the mean, Kalman filtering, linear, spline, and Stineman interpolations, exponentially weighted moving average (EWMA), simple moving average (SMA), k-nearest neighborhood (KNN), and last-observation-carried-forward (LOCF) imputation techniques on the time series structure and the prediction performance of the LSTM and ARIMA models were compared on imputed and original data.ResultsAll imputation techniques either increased or decreased the data autocorrelation and with this affected the forecasting performance of the ARIMA and LSTM algorithms. The best imputation technique did not guarantee better predictions obtained on the imputed data. The mean imputation, LOCF, KNN, Stineman, and cubic spline interpolations methods performed better for a small rate of missingness. Interpolation with EWMA and Kalman filtering yielded consistent performances across all scenarios of missingness. Disregarding the imputation methods, the LSTM resulted with a slightly better predictive accuracy among the best performing ARIMA and LSTM models; otherwise, the results varied. In our small sample, ARIMA tended to perform better on data with higher autocorrelation.ConclusionsWe recommend to the researchers that they consider Kalman smoothing techniques, interpolation techniques (linear, spline, and Stineman), moving average techniques (SMA and EWMA) for imputing univariate time series data as they perform well on both data distribution and forecasting with ARIMA and LSTM models. The LSTM slightly outperforms ARIMA models, however, for small samples, ARIMA is simpler and faster to execute.

Dataset Information

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets.

Background

Methods

Results

Conclusion

Publications

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets