Improving upon the efficiency of complete case analysis when covariates are MNAR.
ABSTRACT: Missing values in covariates of regression models are a pervasive problem in empirical research. Popular approaches for analyzing partially observed datasets include complete case analysis (CCA), multiple imputation (MI), and inverse probability weighting (IPW). In the case of missing covariate values, these methods (as typically implemented) are valid under different missingness assumptions. In particular, CCA is valid under missing not at random (MNAR) mechanisms in which missingness in a covariate depends on the value of that covariate, but is conditionally independent of outcome. In this paper, we argue that in some settings such an assumption is more plausible than the missing at random assumption underpinning most implementations of MI and IPW. When the former assumption holds, although CCA gives consistent estimates, it does not make use of all observed information. We therefore propose an augmented CCA approach which makes the same conditional independence assumption for missingness as CCA, but which improves efficiency through specification of an additional model for the probability of missingness, given the fully observed variables. The new method is evaluated using simulations and illustrated through application to data on reported alcohol consumption and blood pressure from the US National Health and Nutrition Examination Survey, in which data are likely MNAR independent of outcome.
Project description:<h4>Background</h4>Multiple imputation by chained equations (MICE) requires specifying a suitable conditional imputation model for each incomplete variable and then iteratively imputes the missing values. In the presence of missing not at random (MNAR) outcomes, valid statistical inference often requires joint models for missing observations and their indicators of missingness. In this study, we derived an imputation model for missing binary data with MNAR mechanism from Heckman's model using a one-step maximum likelihood estimator. We applied this approach to improve a previously developed approach for MNAR continuous outcomes using Heckman's model and a two-step estimator. These models allow us to use a MICE process and can thus also handle missing at random (MAR) predictors in the same MICE process.<h4>Methods</h4>We simulated 1000 datasets of 500 cases. We generated the following missing data mechanisms on 30% of the outcomes: MAR mechanism, weak MNAR mechanism, and strong MNAR mechanism. We then resimulated the first three cases and added an additional 30% of MAR data on a predictor, resulting in 50% of complete cases. We evaluated and compared the performance of the developed approach to that of a complete case approach and classical Heckman's model estimates.<h4>Results</h4>With MNAR outcomes, only methods using Heckman's model were unbiased, and with a MAR predictor, the developed imputation approach outperformed all the other approaches.<h4>Conclusions</h4>In the presence of MAR predictors, we proposed a simple approach to address MNAR binary or continuous outcomes under a Heckman assumption in a MICE procedure.
Project description:In this article, we first review the literature on dealing with missing values on a covariate in randomized studies and summarize what has been done and what is lacking to date. We then investigate the situation with a continuous outcome and a missing binary covariate in more details through simulations, comparing the performance of multiple imputation (MI) with various simple alternative methods. This is finally extended to the case of time-to-event outcome. The simulations consider five different missingness scenarios: missing completely at random (MCAR), at random (MAR) with missingness depending only on the treatment, and missing not at random (MNAR) with missingness depending on the covariate itself (MNAR1), missingness depending on both the treatment and covariate (MNAR2), and missingness depending on the treatment, covariate and their interaction (MNAR3). Here, we distinguish two different cases: (1) when the covariate is measured before randomization (best practice), where only MCAR and MNAR1 are plausible, and (2) when it is measured after randomization but before treatment (which sometimes occurs in nonpharmaceutical research), where the other three missingness mechanisms can also occur. The proposed methods are compared based on the treatment effect estimate and its standard error. The simulation results suggest that the patterns of results are very similar for all missingness scenarios in case (1) and also in case (2) except for MNAR3. Furthermore, in each scenario for continuous outcome, there is at least one simple method that performs at least as well as MI, while for time-to-event outcome MI is best.
Project description:In population pharmacokinetic analyses, missing categorical data are often encountered. We evaluated several methods of performing covariate analyses with partially missing categorical covariate data. Missing data methods consisted of discarding data (DROP), additional effect parameter for the group with missing data (EXTRA), and mixture methods in which the mixing probability was fixed to the observed fraction of categories (MIX(obs)), based on the likelihood of the concentration data (MIX(conc)), or combined likelihood of observed covariate data and concentration data (MIX(joint)). Simulations were implemented to study bias and imprecision of the methods in datasets with equal-sized and unbalanced category ratios for a binary covariate as well as datasets with non-random missingness (MNAR). Additionally, the performance and feasibility of implementation was assessed in two real datasets. At either low (10%) or high (50%) levels of missingness, all methods performed similarly well. Performance was similar for situations with unbalanced datasets (3:1 covariate distribution) and balanced datasets. In the MNAR scenario, the MIX methods showed a higher bias in the estimation of CL and covariate effect than EXTRA. All methods could be applied to real datasets, except DROP. All methods perform similarly at the studied levels of missingness, but the DROP and EXTRA methods provided less bias than the mixture methods in the case of MNAR. However, EXTRA was associated with inflated type I error rates of covariate selection, while DROP handled data inefficiently.
Project description:In individually matched case-control studies, when some covariates are incomplete, an analysis based on the complete data may result in a large loss of information both in the missing and completely observed variables. This usually results in a bias and loss of efficiency. In this article, we propose a new method for handling the problem of missing covariate data based on a missing-data-induced intensity approach when the missingness mechanism does not depend on case-control status and show that this leads to a generalization of the missing indicator method. We derive the asymptotic properties of the estimates from the proposed method and, using an extensive simulation study, assess the finite sample performance in terms of bias, efficiency, and 95% confidence coverage under several missing data scenarios. We also make comparisons with complete-case analysis (CCA) and some missing data methods that have been proposed previously. Our results indicate that, under the assumption of predictable missingness, the suggested method provides valid estimation of parameters, is more efficient than CCA, and is competitive with other, more complex methods of analysis. A case-control study of multiple myeloma risk and a polymorphism in the receptor Inter-Leukin-6 (IL-6-?) is used to illustrate our findings.
Project description:BACKGROUND:Missing data are unavoidable in epidemiological research, potentially leading to bias and loss of precision. Multiple imputation (MI) is widely advocated as an improvement over complete case analysis (CCA). However, contrary to widespread belief, CCA is preferable to MI in some situations. METHODS:We provide guidance on choice of analysis when data are incomplete. Using causal diagrams to depict missingness mechanisms, we describe when CCA will not be biased by missing data and compare MI and CCA, with respect to bias and efficiency, in a range of missing data situations. We illustrate selection of an appropriate method in practice. RESULTS:For most regression models, CCA gives unbiased results when the chance of being a complete case does not depend on the outcome after taking the covariates into consideration, which includes situations where data are missing not at random. Consequently, there are situations in which CCA analyses are unbiased while MI analyses, assuming missing at random (MAR), are biased. By contrast MI, unlike CCA, is valid for all MAR situations and has the potential to use information contained in the incomplete cases and auxiliary variables to reduce bias and/or improve precision. For this reason, MI was preferred over CCA in our real data example. CONCLUSIONS:Choice of method for dealing with missing data is crucial for validity of conclusions, and should be based on careful consideration of the reasons for the missing data, missing data patterns and the availability of auxiliary information.
Project description:Cohort data are often incomplete because some subjects drop out of the study, and inverse probability weighting (IPW), multiple imputation (MI), and linear increments (LI) are methods that deal with such missing data. In cohort studies of ageing, missing data can arise from dropout or death. Methods that do not distinguish between these reasons for missingness typically provide inference about a hypothetical cohort where no one can die (immortal cohort). It has been suggested that inference about the cohort composed of those who are still alive at any time point (partly conditional inference) may be more meaningful. MI, LI, and IPW can all be adapted to provide partly conditional inference. In this article, we clarify and compare the assumptions required by these MI, LI, and IPW methods for partly conditional inference on continuous outcomes. We also propose augmented IPW estimators for making partly conditional inference. These are more efficient than IPW estimators and more robust to model misspecification. Our simulation studies show that the methods give approximately unbiased estimates of partly conditional estimands when their assumptions are met, but may be biased otherwise. We illustrate the application of the missing data methods using data from the 'Origins of Variance in the Old-old' Twin study.
Project description:With incomplete data, the "missing at random" (MAR) assumption is widely understood to enable unbiased estimation with appropriate methods. While the need to assess the plausibility of MAR and to perform sensitivity analyses considering "missing not at random" (MNAR) scenarios has been emphasized, the practical difficulty of these tasks is rarely acknowledged. With multivariable missingness, what MAR means is difficult to grasp, and in many MNAR scenarios unbiased estimation is possible using methods commonly associated with MAR. Directed acyclic graphs (DAGs) have been proposed as an alternative framework for specifying practically accessible assumptions beyond the MAR-MNAR dichotomy. However, there is currently no general algorithm for deciding how to handle the missing data given a specific DAG. Here we construct "canonical" DAGs capturing typical missingness mechanisms in epidemiologic studies with incomplete data on exposure, outcome, and confounding factors. For each DAG, we determine whether common target parameters are "recoverable," meaning that they can be expressed as functions of the available data distribution and thus estimated consistently, or whether sensitivity analyses are necessary. We investigate the performance of available-case and multiple-imputation procedures. Using data from waves 1-3 of the Longitudinal Study of Australian Children (2004-2008), we illustrate how our findings can guide the treatment of missing data in point-exposure studies.
Project description:Missing health-related quality of life (HRQOL) data in longitudinal studies can reduce precision and power and bias results. Using INTERMACS (Interagency Registry for Mechanically Assisted Circulatory Support), we sought to identify factors associated with missing HRQOL data, examine the impact of these factors on estimated HRQOL assuming missing at random missingness, and perform sensitivity analyses to examine missing not at random (MNAR) missingness because of illness severity.INTERMACS patients (n=3248) with a preimplantation profile of 1 (critical cardiogenic shock) or 2 (progressive decline) were assessed with the EQ-5D-3L visual analog scale and Kansas City Cardiomyopathy Questionnaire-12 summary scores pre-implantation and 3 months postoperatively. Mean and median observed and missing at random-imputed HRQOL scores were calculated, followed by sensitivity analyses. Independent factors associated with HRQOL scores and missing HRQOL assessments were determined using multivariable regression. Independent factors associated with preimplantation and 3-month HRQOL scores, and with the likelihood of missing HRQOL assessments, revealed few correlates of HRQOL and missing assessments (R2 range, 4.7%-11.9%). For patients with INTERMACS profiles 1 and 2 and INTERMACS profile 1 alone, missing at random-imputed mean and median HRQOL scores were similar to observed scores, before and 3 months after implantation, whereas MNAR-imputed mean scores were lower (?5 points) at baseline but not at 3 months.We recommend use of sensitivity analyses using an MNAR imputation strategy for longitudinal studies when missingness is attributable to illness severity. Conduct of MNAR sensitivity analyses may be less critical after mechanical circulatory support implant, when there are likely fewer MNAR data.
Project description:Semi-parametric methods are often used for the estimation of intervention effects on correlated outcomes in cluster-randomized trials (CRTs). When outcomes are missing at random (MAR), Inverse Probability Weighted (IPW) methods incorporating baseline covariates can be used to deal with informative missingness. Also, augmented generalized estimating equations (AUG) correct for imbalance in baseline covariates but need to be extended for MAR outcomes. However, in the presence of interactions between treatment and baseline covariates, neither method alone produces consistent estimates for the marginal treatment effect if the model for interaction is not correctly specified. We propose an AUG-IPW estimator that weights by the inverse of the probability of being a complete case and allows different outcome models in each intervention arm. This estimator is doubly robust (DR); it gives correct estimates whether the missing data process or the outcome model is correctly specified. We consider the problem of covariate interference which arises when the outcome of an individual may depend on covariates of other individuals. When interfering covariates are not modeled, the DR property prevents bias as long as covariate interference is not present simultaneously for the outcome and the missingness. An R package is developed implementing the proposed method. An extensive simulation study and an application to a CRT of HIV risk reduction-intervention in South Africa illustrate the method.
Project description:Most epidemiological studies have missing information, leading to reduced power and potential bias. Estimates of exposure-outcome associations will generally be biased if the outcome variable is missing not at random (MNAR). Linkage to administrative data containing a proxy for the missing study outcome allows assessment of whether this outcome is MNAR and the evaluation of bias. We examined this in relation to the association between infant breastfeeding and IQ at 15 years, where a proxy for IQ was available through linkage to school attainment data.Subjects were those who enrolled in the Avon Longitudinal Study of Parents and Children in 1990-91 (n = 13 795), of whom 5023 had IQ measured at age 15. For those with missing IQ, 7030 (79%) had information on educational attainment at age 16 obtained through linkage to the National Pupil Database. The association between duration of breastfeeding and IQ was estimated using a complete case analysis, multiple imputation and inverse probability-of-missingness weighting; these estimates were then compared with those derived from analyses informed by the linkage.IQ at 15 was MNAR-individuals with higher attainment were less likely to have missing IQ data, even after adjusting for socio-demographic factors. All the approaches underestimated the association between breastfeeding and IQ compared with analyses informed by linkage.Linkage to administrative data containing a proxy for the outcome variable allows the MNAR assumption to be tested and more efficient analyses to be performed. Under certain circumstances, this may produce unbiased results.