A general instrumental variable framework for regression analysis with outcome missing not at random.
ABSTRACT: The instrumental variable (IV) design is a well-known approach for unbiased evaluation of causal effects in the presence of unobserved confounding. In this article, we study the IV approach to account for selection bias in regression analysis with outcome missing not at random. In such a setting, a valid IV is a variable which (i) predicts the nonresponse process, and (ii) is independent of the outcome in the underlying population. We show that under the additional assumption (iii) that the IV is independent of the magnitude of selection bias due to nonresponse, the population regression in view is nonparametrically identified. For point estimation under (i)-(iii), we propose a simple complete-case analysis which modifies the regression of primary interest by carefully incorporating the IV to account for selection bias. The approach is developed for the identity, log and logit link functions. For inferences about the marginal mean of a binary outcome assuming (i) and (ii) only, we describe novel and approximately sharp bounds which, unlike Robins-Manski bounds, are smooth in model parameters, therefore allowing for a straightforward approach to account for uncertainty due to sampling variability. These bounds provide a more honest account of uncertainty and allow one to assess the extent to which a violation of the key identifying condition (iii) might affect inferences. For illustration, the methods are used to account for selection bias induced by HIV testing nonparticipation in the evaluation of HIV prevalence in the Zambian Demographic and Health Surveys.
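As a point of reference for the bounds discussed in the abstract above, the following is a minimal sketch of the classical worst-case (Robins-Manski style) bounds for the mean of a binary outcome, tightened by intersecting across the levels of a discrete IV under conditions (i) and (ii). It does not implement the smooth bounds proposed in the article. The function names and the counts are hypothetical and chosen only for illustration.

```python
def worst_case_bounds(n_pos, n_neg, n_missing):
    """Bounds on P(Y = 1) within one IV level, assuming nothing about nonrespondents."""
    n = n_pos + n_neg + n_missing
    lower = n_pos / n                 # as if every nonrespondent had Y = 0
    upper = (n_pos + n_missing) / n   # as if every nonrespondent had Y = 1
    return lower, upper


def iv_intersection_bounds(counts_by_iv_level):
    """Intersect the per-level bounds.

    If the IV is independent of the outcome, P(Y = 1) is the same at every
    IV level, so it must lie inside every level-specific interval.
    """
    per_level = [worst_case_bounds(*counts) for counts in counts_by_iv_level]
    lower = max(lo for lo, _ in per_level)
    upper = min(hi for _, hi in per_level)
    return lower, upper


# Hypothetical counts (positive, negative, missing) for two IV levels.
example_counts = [(10, 70, 20), (12, 48, 40)]
bounds = iv_intersection_bounds(example_counts)
```

The max and min operations are what make bounds of this kind non-smooth in the underlying proportions, which is the difficulty that smooth bounds are meant to avoid.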
Project description:Instrumental variables are routinely used to recover a consistent estimator of an exposure causal effect in the presence of unmeasured confounding. Instrumental variable approaches to account for nonignorable missing data also exist but are less familiar to epidemiologists. Like instrumental variables for exposure causal effects, instrumental variables for missing data rely on exclusion restriction and instrumental variable relevance assumptions. Yet these two conditions alone are insufficient for point identification. For estimation, researchers have invoked a third assumption, typically involving fairly restrictive parametric constraints. Inferences can be sensitive to these parametric assumptions, which are typically not empirically testable. The purpose of our article is to discuss another approach for leveraging a valid instrumental variable. Although the approach is insufficient for nonparametric identification, it can nonetheless provide informative inferences about the presence, direction, and magnitude of selection bias, without invoking a third untestable parametric assumption. An important contribution of this article is an Excel spreadsheet tool that can be used to obtain empirical evidence of selection bias and calculate bounds and corresponding Bayesian 95% credible intervals for a nonidentifiable population proportion. For illustrative purposes, we used the spreadsheet tool to analyze HIV prevalence data collected by the 2007 Zambia Demographic and Health Survey (DHS).
Project description:We evaluated alternative approaches to assessing and correcting for nonresponse bias in a longitudinal survey. We considered the changes in substance-use outcomes over a 3-year period among young adults aged 18-24 years (n = 5,199) in the United States, analyzing data from the National Epidemiologic Survey on Alcohol and Related Conditions. This survey collected a variety of substance-use information from a nationally representative sample of US adults in 2 waves: 2001-2002 and 2004-2005. We first considered nonresponse rates in the second wave as a function of key substance-use outcomes in wave 1. We then evaluated 5 alternative approaches designed to correct for nonresponse bias under different attrition mechanisms, including weighting adjustments, multiple imputation, selection models, and pattern-mixture models. Nonignorable attrition in a longitudinal survey can lead to bias in estimates of change in certain health behaviors over time, and only selected procedures enable analysts to assess the sensitivity of their inferences to different assumptions about the extent of nonignorability. We compared estimates based on these 5 approaches, and we suggest a road map for assessing the risk of nonresponse bias in longitudinal studies. We conclude with directions for future research in this area given the results of our evaluations.
Project description:We present a framework for generating multiple imputations for continuous data when the missing data mechanism is unknown. Imputations are generated from more than one imputation model in order to incorporate uncertainty regarding the missing data mechanism. Parameter estimates based on the different imputation models are combined using rules for nested multiple imputation. Through the use of simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal clinical trial of low-income women with depression where nonignorably missing data were a concern. We show that different assumptions regarding the missing data mechanism can have a substantial impact on inferences. Our method provides a simple approach for formalizing subjective notions regarding nonresponse so that they can be easily stated, communicated, and compared.
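The combining step mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the commonly cited rules for nested (two-stage) multiple imputation, assuming M imputation models with N imputations each; the abstract does not spell out the formulas, so the exact form used here, the function name, and the input layout are assumptions.

```python
from statistics import mean


def combine_nested_mi(estimates, variances):
    """Combine nested multiple-imputation results (requires M >= 2 and N >= 2).

    estimates[m][n] is the point estimate from imputation n under model m;
    variances[m][n] is its estimated sampling variance.
    Returns the pooled estimate and its total variance.
    """
    M = len(estimates)
    N = len(estimates[0])
    model_means = [mean(row) for row in estimates]
    q_bar = mean(model_means)                        # overall point estimate
    u_bar = mean(mean(row) for row in variances)     # average within-imputation variance
    # Between-model variance of the per-model means.
    between = sum((qm - q_bar) ** 2 for qm in model_means) / (M - 1)
    # Within-model (between-imputation) variance, pooled over models.
    within = sum(
        (q - qm) ** 2
        for row, qm in zip(estimates, model_means)
        for q in row
    ) / (M * (N - 1))
    total = u_bar + (1 + 1 / M) * between + (1 - 1 / N) * within
    return q_bar, total
```

The between-model term is where uncertainty about the missing data mechanism enters: the more the models disagree, the larger the total variance.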
Project description:BACKGROUND: The selection of variable sites for inclusion in genomic analyses can influence results, especially when exemplar populations are used to determine polymorphic sites. We tested the impact of ascertainment bias on the inference of population genetic parameters using empirical and simulated data representing the three major continental groups of cattle: European, African, and Indian. We simulated data under three demographic models. Each simulated data set was subjected to three ascertainment schemes: (I) random selection; (II) geographically biased selection; and (III) selection biased toward loci polymorphic in multiple groups. Empirical data comprised samples of 25 individuals representing each continental group. These cattle were genotyped for 47,506 loci from the bovine 50K SNP panel. We compared the inference of population histories for the empirical and simulated data sets across different ascertainment conditions using F_ST and principal components analysis (PCA). RESULTS: Bias toward shared polymorphism across continental groups is apparent in the empirical SNP data. Bias toward uneven levels of within-group polymorphism decreases estimates of F_ST between groups. Subpopulation-biased selection of SNPs changes the weighting of principal component axes and can affect inferences about proportions of admixture and population histories using PCA. PCA-based inferences of population relationships are largely congruent across types of ascertainment bias, even when ascertainment bias is strong. CONCLUSIONS: Analyses of ascertainment bias in genomic data have largely been conducted on human data. As genomic analyses are being applied to non-model organisms, and across taxa with deeper divergences, care must be taken to consider the potential for bias in ascertainment of variation to affect inferences. Estimates of F_ST, time of separation, and population divergence as estimated by principal components analysis can be misleading if this bias is not taken into account.
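To make the F_ST statistic referred to above concrete, here is a minimal sketch of a common heterozygosity-based definition for a single biallelic locus and two equally weighted populations. The abstract does not state which F_ST estimator was used, so this particular form and the function names are assumptions.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """F_ST = (H_T - H_S) / H_T for two equally weighted populations."""
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0   # mean within-population
    h_t = heterozygosity((p1 + p2) / 2.0)                   # pooled population
    if h_t == 0.0:
        return 0.0  # locus is monomorphic overall; no differentiation to measure
    return (h_t - h_s) / h_t
```

Because genome-wide values are summaries over many loci, which loci are ascertained determines which allele-frequency pairs enter the calculation.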
Project description:Estimation of treatment effects in randomized studies is often hampered by possible selection bias induced by conditioning on or adjusting for a variable measured post-randomization. One approach to obviate such selection bias is to consider inference about treatment effects within principal strata, that is, principal effects. A challenge with this approach is that without strong assumptions principal effects are not identifiable from the observable data. In settings where such assumptions are dubious, identifiable large sample bounds may be the preferred target of inference. In practice these bounds may be wide and not particularly informative. In this work we consider whether bounds on principal effects can be improved by adjusting for a categorical baseline covariate. Adjusted bounds are considered which are shown to never be wider than the unadjusted bounds. Necessary and sufficient conditions are given for which the adjusted bounds will be sharper (i.e., narrower) than the unadjusted bounds. The methods are illustrated using data from a recent, large study of interventions to prevent mother-to-child transmission of HIV through breastfeeding. Using a baseline covariate indicating low birth weight, the estimated adjusted bounds for the principal effect of interest are 63% narrower than the estimated unadjusted bounds.
Project description:Randomization can be used as an instrumental variable (IV) to account for unmeasured confounding when seeking to assess the impact of noncompliance with treatment allocation in a randomized trial. We present and compare different methods to calculate the treatment effect on a binary outcome as a rate ratio in a randomized surgical trial. The trial assessed the effectiveness of peeling versus not peeling the internal limiting membrane of the retina as part of the surgery for a full-thickness macular hole. We compared the IV-based estimates (nonparametric causal bound and two-stage residual inclusion approach [2SRI]) with standard treatment effect measures (intention to treat, per protocol and treatment received [TR]). Compliance was defined in two ways (initial and up to the time point of interest). Poisson regression was used for the model-based approaches with robust standard errors to calculate the risk ratio (RR) with 95% confidence intervals. Results were similar for 1-month macular hole status across methods. For 3- and 6-month macular hole status, nonparametric causal bounds provided a narrower range of uncertainty than other methods, though still had substantial imprecision. For 3-month macular hole status, the TR estimate was substantially different from the other point estimates. Nonparametric causal bound approaches are a useful addition to an IV estimation approach, which tends to have large levels of uncertainty. Methods which allow RRs to be calculated when addressing noncompliance in randomized trials exist and may be superior to standard estimates. Further research is needed to explore the properties of different IV methods in a broad range of randomized controlled trial scenarios.
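For orientation, the simplest of the standard measures named above, an intention-to-treat style risk ratio from a 2x2 table, can be sketched as follows. This uses the usual large-sample confidence interval on the log scale; it is not the Poisson regression with robust standard errors used in the study, nor either IV method. The function name and the counts in the usage line are hypothetical.

```python
import math


def risk_ratio_ci(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Risk ratio (treatment vs control) with a Wald interval on the log scale."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    # Delta-method standard error of log(RR) for two independent binomials.
    se_log_rr = math.sqrt(
        1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30/100 events under treatment, 20/100 under control.
rr, lower, upper = risk_ratio_ci(30, 100, 20, 100)
```

Grouping by allocated arm gives an intention-to-treat estimate; grouping by treatment actually received gives a TR-style estimate, which is where noncompliance can introduce confounding.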
Project description:We develop sample size formulas for studies aiming to test mean differences between a treatment and control group when all-or-none nonadherence (noncompliance) and selection bias are expected. Recent work by Fay, Halloran, and Follmann (2007, Biometrics 63, 465-474) addressed the increased variances within groups defined by treatment assignment when nonadherence occurs, compared to the scenario of full adherence, under the assumption of no selection bias. In this article, we extend the authors' approach to allow selection bias in the form of systematic differences in means and variances among latent adherence subgroups. We illustrate the approach by performing sample size calculations to plan clinical trials with and without pilot adherence data. Sample size formulas and tests for normally distributed outcomes are also developed in a Web Appendix that account for uncertainty of estimates from external or internal pilot data.
Project description:Phylogenetic tests of adaptive evolution, such as the widely used branch-site test (BST), assume that nucleotide substitutions occur singly and independently. Recent research has shown that errors at adjacent sites often occur during DNA replication, and the resulting multinucleotide mutations (MNMs) are overwhelmingly likely to be non-synonymous. To evaluate whether the BST misinterprets sequence patterns produced by MNMs as false support for positive selection, we analysed two genome-scale datasets, one from mammals and one from flies. We found that codons with multiple differences account for virtually all the support for lineage-specific positive selection in the BST. Simulations under conditions derived from these alignments but without positive selection show that realistic rates of MNMs cause a strong and systematic bias towards false inferences of selection. This bias is sufficient under empirically derived conditions to produce false positive inferences as often as the BST infers positive selection from the empirical data. Although some genes with BST-positive results may have evolved adaptively, the test cannot distinguish sequence patterns produced by authentic positive selection from those caused by neutral fixation of MNMs. Many published inferences of adaptive evolution using this technique may therefore be artefacts of model violation caused by unincorporated neutral mutational processes. We introduce a model that incorporates MNMs and may help to ameliorate this bias.
Project description:In longitudinal studies, nonresponse to follow-up surveys poses a major threat to validity, interpretability and generalisation of results. The problem of nonresponse is further complicated by the possibility that nonresponse may depend on the outcome of interest. We identified sociodemographic, general health and wellbeing characteristics associated with nonresponse to the follow-up questionnaire and assessed the extent and effect of nonresponse on statistical inference in a large-scale population cohort study. We obtained the data from the baseline and first wave of the follow-up survey of the 45 and Up Study. Of those who were invited to participate in the follow-up survey, 65.2% responded. A logistic regression model was used to identify baseline characteristics associated with follow-up response. A Bayesian selection model approach with sensitivity analysis was implemented to model nonignorable nonresponse. Characteristics associated with a higher likelihood of responding to the follow-up survey include female gender, age categories 55-74, high educational qualification, married/de facto, worked part or partially or fully retired and higher household income. Parameter estimates and conclusions are generally consistent across different assumptions on the missing data mechanism. However, we observed some sensitivity for variables that are strong predictors for both the outcome and nonresponse. Results indicated that, in the context of the binary outcome under study, nonresponse did not result in substantial bias and did not alter the interpretation of results in general. Conclusions were still largely robust under a nonignorable missing data mechanism. Use of a Bayesian selection model is recommended as a useful strategy for assessing potential sensitivity of results to missing data.
Project description:The site frequency spectrum (SFS) is of primary interest in population genetic studies, because the SFS compresses variation data into a simple summary from which many population genetic inferences can proceed. However, inferring the SFS from sequencing data is challenging because genotype calls from sequencing data are often inaccurate due to high error rates. If not accounted for, this genotype uncertainty can lead to serious bias in downstream analysis based on the inferred SFS. Here, we compare two approaches to estimate the SFS from sequencing data: one approach infers individual genotypes from aligned sequencing reads and then estimates the SFS based on the inferred genotypes (call-based approach), and the other approach directly estimates the SFS from aligned sequencing reads by maximum likelihood (direct estimation approach). We find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS estimated by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline used to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) results in underestimation of the number of rare variants, whereas estimating genotypes in each individual and merging them later (single-sample calling) leads to overestimation of rare variants. We characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. Our work highlights that depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference with the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inferences.
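The final tabulation step of the call-based approach described above can be sketched as follows. It assumes diploid individuals, genotypes already called and coded as derived-allele counts (0, 1, or 2), and an unfolded spectrum; the function name and data layout are hypothetical. The direct maximum likelihood approach from aligned reads is not shown.

```python
def sfs_from_genotypes(genotypes, n_individuals):
    """Tabulate an unfolded SFS from called diploid genotypes.

    genotypes is a list of sites; each site is a list holding the number of
    derived alleles (0, 1, or 2) carried by each individual.
    Entry k of the result is the number of sites with k derived alleles.
    """
    n_chromosomes = 2 * n_individuals
    sfs = [0] * (n_chromosomes + 1)
    for site in genotypes:
        if len(site) != n_individuals:
            raise ValueError("each site needs one genotype per individual")
        sfs[sum(site)] += 1
    return sfs


# Hypothetical calls for 4 sites in 3 individuals.
example = [[0, 1, 0], [2, 2, 2], [0, 0, 0], [1, 1, 0]]
spectrum = sfs_from_genotypes(example, 3)
```

Because each site contributes to exactly one bin, a single miscalled genotype moves that site to a neighbouring bin, which is how calling errors at low coverage propagate into the spectrum.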