Simple descriptive missing data indicators in longitudinal studies with attrition, intermittent missing data and a high number of follow-ups.
ABSTRACT: Missing data in longitudinal studies may constitute a source of bias. We suggest three simple missing data indicators for the initial phase of getting an overview of the missingness pattern in a dataset with a high number of follow-ups. Possible use of the indicators is exemplified in two datasets allowing wave nonresponse: a Norwegian dataset of 420 subjects examined at 21 occasions during 6.5 years and a Dutch dataset of 350 subjects with ten repeated measurements over a period of 35 years. The indicators Last response (the timing of last response), Retention (the number of responded follow-ups), and Dispersion (the evenness of the distribution of responses) are introduced. The proposed indicators reveal different aspects of the missing data pattern, and may give the researcher a better insight into the pattern of missingness in a study with several follow-ups, as a starting point for analyzing possible bias. Although the indicators are positively correlated with each other, potential predictors of missingness can have a different relationship with different indicators, leading to a better understanding of the missing data mechanism in longitudinal studies. These indicators may be useful descriptive tools when starting to look into a longitudinal dataset with many follow-ups.
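As a rough illustration of the three indicators, the sketch below computes them for one subject from a 0/1 response vector. Last response and Retention follow the definitions given in the abstract. The abstract does not give a formula for Dispersion, so the measure used here (spread of the responded waves relative to the spread of all scheduled waves) is only an assumed placeholder, and the function name is hypothetical.

```python
from statistics import pstdev


def missing_data_indicators(responses):
    """Summarise one subject's response pattern.

    responses: sequence of 0/1 flags, one per scheduled follow-up,
    in time order (1 = responded, 0 = missing).
    """
    n_waves = len(responses)
    # Wave numbers (1-based) at which the subject responded.
    responded = [wave for wave, flag in enumerate(responses, start=1) if flag]

    # Last response: timing (wave number) of the last response, 0 if none.
    last_response = responded[-1] if responded else 0

    # Retention: number of follow-ups with a response.
    retention = len(responded)

    # Dispersion: ASSUMPTION, not the paper's formula. Here it is the
    # standard deviation of the responded wave numbers divided by the
    # standard deviation of all scheduled wave numbers.
    full_spread = pstdev(list(range(1, n_waves + 1))) if n_waves > 1 else 0.0
    if responded and full_spread > 0:
        dispersion = pstdev(responded) / full_spread
    else:
        dispersion = 0.0

    return {
        "last_response": last_response,
        "retention": retention,
        "dispersion": dispersion,
    }


if __name__ == "__main__":
    # Responded at waves 1, 2 and 4 out of 6 scheduled follow-ups.
    print(missing_data_indicators([1, 1, 0, 1, 0, 0]))
```

Applied to every subject, the three values can then be tabulated or related to baseline covariates as a first look at the missingness pattern.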
Project description:Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study's purpose was to (1) introduce a three-step approach to assess and address missing data and (2) illustrate this approach using categorical and continuous-level variables from a longitudinal study of premature infants. A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification (FCS). Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and FCS. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. The rate of missingness was 16-23% for continuous variables and 1-28% for categorical variables. FCS imputation provided the least difference in mean and standard deviation estimates for continuous measures. FCS imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies.
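Two of the simpler imputation methods named above, mean replacement and last value carried forward, can be sketched as follows. This is a minimal illustration that assumes missing values are coded as None. The function names are hypothetical and the code is not taken from the study.

```python
def mean_replacement(values):
    """Replace each missing (None) value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing observed, so there is no mean to impute with.
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def last_value_carried_forward(values):
    """Fill each missing (None) value with the most recent observed value.

    Values missing before the first observation stay missing.
    """
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


if __name__ == "__main__":
    print(mean_replacement([2.0, None, 4.0]))
    print(last_value_carried_forward(["low", None, None, "high"]))
```

Both are single-imputation methods: they fill each gap with one fixed value and do not reflect the uncertainty about it, which is one reason multiple imputation approaches such as FCS are compared against them.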
Project description:Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this article, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism where the missingness mechanism is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to non-identifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset.
Project description:BACKGROUND: Imperfect follow-up in longitudinal studies commonly leads to missing outcome data that can potentially bias the inference when the missingness is nonignorable; that is, the propensity of missingness depends on missing values in the data. In the Upstate KIDS Study, we seek to determine if the missingness of child development outcomes is nonignorable, and how a simple model assuming ignorable missingness would compare with more complicated models for a nonignorable mechanism. METHODS: To correct for nonignorable missingness, the shared random effects model (SREM) jointly models the outcome and the missing mechanism. However, its computational complexity and the lack of software packages have limited its practical applications. This paper proposes a novel two-step approach to handle nonignorable missing outcomes in generalized linear mixed models. We first analyse the missing mechanism with a generalized linear mixed model and predict values of the random effects; then, the outcome model is fitted adjusting for the predicted random effects to account for heterogeneity in the missingness propensity. RESULTS: Extensive simulation studies suggest that the proposed method is a reliable approximation to SREM, with a much faster computation. The nonignorability of missing data in the Upstate KIDS Study is estimated to be mild to moderate, and the analyses using the two-step approach or SREM are similar to the model assuming ignorable missingness. CONCLUSIONS: The two-step approach is a computationally straightforward method that can be conducted as sensitivity analyses in longitudinal studies to examine violations of the ignorable missingness assumption and the implications relative to health outcomes.
Project description:BACKGROUND: Within routinely collected health data, missing data for an individual might provide useful information in itself. This occurs, for example, in the case of electronic health records, where the presence or absence of data is informative. While the naive use of missing indicators to try to exploit such information can introduce bias, its use in conjunction with multiple imputation may unlock the potential value of missingness to reduce bias in causal effect estimation, particularly in missing not at random scenarios and where missingness might be associated with unmeasured confounders. METHODS: We conducted a simulation study to determine when the use of a missing indicator, combined with multiple imputation, would reduce bias for causal effect estimation, under a range of scenarios including unmeasured variables, missing not at random, and missing at random mechanisms. We use directed acyclic graphs and structural models to elucidate a variety of causal structures of interest. We handled missing data using complete case analysis, and multiple imputation with and without missing indicator terms. RESULTS: We find that multiple imputation combined with a missing indicator gives minimal bias for causal effect estimation in most scenarios. In particular the approach: 1) does not introduce bias in missing (completely) at random scenarios; 2) reduces bias in missing not at random scenarios where the missing mechanism depends on the missing variable itself; and 3) may reduce or increase bias when unmeasured confounding is present. CONCLUSION: In the presence of missing data, careful use of missing indicators, combined with multiple imputation, can improve causal effect estimation when missingness is informative, and is not detrimental when missingness is at random.
Project description:To investigate predictors of missing data in a longitudinal study of Alzheimer disease (AD). The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a clinic-based, multicenter, longitudinal study with blood, CSF, PET, and MRI scans repeatedly measured in 229 participants with normal cognition (NC), 397 with mild cognitive impairment (MCI), and 193 with mild AD during 2005-2007. We used univariate and multivariable logistic regression models to examine the associations between baseline demographic/clinical features and loss of biomarker follow-ups in ADNI. CSF studies tended to recruit and retain patients with MCI with more AD-like features, including lower levels of baseline CSF Aβ(42). Depression was the major predictor for MCI dropouts, while family history of AD kept more patients with AD enrolled in PET and MRI studies. Poor cognitive performance was associated with loss of follow-up in most biomarker studies, even among NC participants. The presence of vascular risk factors seemed more critical than cognitive function for predicting dropouts in AD. The missing data are not missing completely at random in ADNI and likely conditional on certain features in addition to cognitive function. Missing data predictors vary across biomarkers and even MCI and AD groups do not share the same missing data pattern. Understanding the missing data structure may help in the design of future longitudinal studies and clinical trials in AD.
Project description:BACKGROUND: The data missing from patient profiles in intensive care units (ICUs) are substantial and unavoidable. However, this incompleteness is not always random or because of imperfections in the data collection process. OBJECTIVE: This study aimed to investigate the potential hidden information in data missing from electronic health records (EHRs) in an ICU and examine whether the presence or missingness of a variable itself can convey information about the patient health status. METHODS: Daily retrieval of laboratory test (LT) measurements from the Medical Information Mart for Intensive Care III database was set as our reference for defining complete patient profiles. Missingness indicators were introduced as a way of representing presence or absence of the LTs in a patient profile. Thereafter, various feature selection methods (filter and embedded feature selection methods) were used to examine the predictive power of missingness indicators. Finally, a set of well-known prediction models (logistic regression [LR], decision tree, and random forest) were used to evaluate whether the absence status itself of a variable recording can provide predictive power. We also examined the utility of missingness indicators in improving predictive performance when used with observed laboratory measurements as model input. The outcome of interest was in-hospital mortality and mortality at 30 days after ICU discharge. RESULTS: Regardless of mortality type or ICU day, more than 40% of the predictors selected by feature selection methods were missingness indicators. Notably, employing missingness indicators as the only predictors achieved reasonable mortality prediction on all days and for all mortality types (for instance, in 30-day mortality prediction with LR, we achieved area under the curve of the receiver operating characteristic [AUROC] of 0.6836±0.012).
Including indicators with observed measurements in the prediction models also improved the AUROC; the maximum improvement was 0.0426. Indicators also improved the AUROC for the Simplified Acute Physiology Score II model (a well-known ICU severity-of-illness score), confirming the additive information of the indicators (AUROC of 0.8045±0.0109 for 30-day mortality prediction for LR). CONCLUSIONS: Our study demonstrated that the presence or absence of LT measurements is informative and can be considered a potential predictor of in-hospital and 30-day mortality. The comparative analysis of prediction models also showed statistically significant prediction improvement when indicators were included. Moreover, missing data might reflect the opinions of examining clinicians. Therefore, the absence of measurements can be informative in ICUs and has predictive power beyond the measured data themselves. This initial case study shows promise for more in-depth analysis of missing data and its informativeness in ICUs. Future studies are needed to generalize these results.
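A missingness indicator of the kind described above can be built by pairing each expected laboratory test with a 0/1 flag recording whether a value was present. The sketch below is illustrative only. It assumes one patient-day profile stored as a dictionary in which an absent key or a None value means the test was not recorded. The function name and the test names are hypothetical.

```python
def add_missingness_indicators(profile, expected_tests):
    """Return model features for one patient-day profile.

    profile: dict mapping test name -> measured value; an absent key or
             a None value means the test was not recorded (assumption).
    expected_tests: test names that define a complete profile.
    """
    features = {}
    for test in expected_tests:
        value = profile.get(test)
        # Keep the observed measurement (None if the test was not recorded).
        features[test] = value
        # Indicator: 1 if the test is missing, 0 if it was measured.
        features[f"{test}_missing"] = 1 if value is None else 0
    return features


if __name__ == "__main__":
    day_profile = {"lactate": 2.1, "creatinine": None}
    print(add_missingness_indicators(day_profile, ["lactate", "creatinine", "bilirubin"]))
```

The indicator columns can be passed to a classifier on their own or alongside the measured values. In the latter case the remaining None entries still need to be imputed or otherwise handled before model fitting.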
Project description:BACKGROUND: The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model. METHODS: Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied: a) complete case analysis (CC), b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM), and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained. RESULTS: CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness. CONCLUSIONS: Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.
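The resampling design described above (draw cases with replacement from a complete dataset, then impose MAR missingness on a covariate) can be sketched as follows. The abstract does not state how the MAR mechanism was modelled, so the logistic form used here, the function names, and the column names are all assumptions made for illustration.

```python
import math
import random


def resample(rows, n, rng):
    """Draw n rows with replacement from a complete dataset."""
    return [rng.choice(rows) for _ in range(n)]


def impose_mar(rows, target, driver, intercept, slope, rng):
    """Set `target` to None with a probability that depends only on `driver`.

    ASSUMPTION: the missingness probability is logistic in the driver,
    p = 1 / (1 + exp(-(intercept + slope * driver))). Because the driver
    stays fully observed, the mechanism is missing at random (MAR).
    """
    result = []
    for row in rows:
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * row[driver])))
        new_row = dict(row)  # copy, so the complete data are left untouched
        if rng.random() < p_missing:
            new_row[target] = None
        result.append(new_row)
    return result


if __name__ == "__main__":
    rng = random.Random(2024)
    # Synthetic stand-in for the complete dataset of 7507 patients.
    complete = [{"age": rng.uniform(30, 80), "marker": rng.gauss(0, 1)}
                for _ in range(7507)]
    sample = resample(complete, 1000, rng)
    incomplete = impose_mar(sample, "marker", "age", -4.0, 0.05, rng)
    rate = sum(r["marker"] is None for r in incomplete) / len(incomplete)
    print(f"missing rate: {rate:.2f}")
```

Repeating the two steps for each replication and each target missingness level gives the set of incomplete datasets to which the competing missing data methods would then be applied.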
Project description:In longitudinal research, interest often centers on individual trajectories of change over time. When data are missing, a concern is whether they are systematically missing as a function of the individual trajectories. Such a missing data process, termed random coefficient-dependent missingness, is statistically non-ignorable and can bias parameter estimates obtained from conventional growth models that assume missing data are missing at random. This paper describes a shared-parameter mixture model (SPMM) for testing the sensitivity of growth model parameter estimates to a random coefficient-dependent missingness mechanism. Simulations show that the SPMM recovers trajectory estimates as well as or better than a standard growth model across a range of missing data conditions. The paper concludes with practical advice for longitudinal data analysts.
Project description:Linear increments (LI) are used to analyse repeated outcome data with missing values. Previously, two LI methods have been proposed, one allowing non-monotone missingness but not independent measurement error and one allowing independent measurement error but only monotone missingness. In both, it was suggested that the expected increment could depend on current outcome. We show that LI can allow non-monotone missingness and either independent measurement error of unknown variance or dependence of expected increment on current outcome but not both. A popular alternative to LI is a multivariate normal model ignoring the missingness pattern. This gives consistent estimation when data are normally distributed and missing at random (MAR). We clarify the relation between MAR and the assumptions of LI and show that for continuous outcomes multivariate normal estimators are also consistent under (non-MAR and non-normal) assumptions not much stronger than those of LI. Moreover, when missingness is non-monotone, they are typically more efficient.
Project description:Missing data are frequently encountered in longitudinal clinical trials. To better monitor and understand the progress over time, one must handle the missing data appropriately and examine whether the missing data mechanism is ignorable or nonignorable. In this article, we develop a new probit model for longitudinal binary response data. It resolves a challenging issue for estimating the variance of the random effects, and substantially improves the convergence and mixing of the Gibbs sampling algorithm. We show that when improper uniform priors are specified for the regression coefficients of the joint multinomial model via a sequence of one-dimensional conditional distributions for the missing data indicators under nonignorable missingness, the joint posterior distribution is improper. A variation of Jeffreys prior is thus established as a remedy for the improper posterior distribution. In addition, an efficient Gibbs sampling algorithm is developed using a collapsing technique. Two model assessment criteria, the deviance information criterion (DIC) and the logarithm of the pseudomarginal likelihood (LPML), are used to guide the choices of prior specifications and to compare the models under different missing data mechanisms. We report on extensive simulations conducted to investigate the empirical performance of the proposed methods. The proposed methodology is further illustrated using data from an HIV prevention clinical trial.