Kernel-based covariate functional balancing for observational studies.
ABSTRACT: Covariate balance is often advocated for objective causal inference since it mimics randomization in observational data. Unlike methods that balance specific moments of covariates, our proposal attains uniform approximate balance for covariate functions in a reproducing-kernel Hilbert space. The corresponding infinite-dimensional optimization problem is shown to have a finite-dimensional representation in terms of an eigenvalue optimization problem. Large-sample results are studied, and numerical examples show that the proposed method achieves better balance with smaller sampling variability than existing methods.
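To make the notion of uniform approximate balance concrete, the following minimal sketch (not the paper's exact estimator) computes, for a fixed weight vector, the worst-case imbalance of covariate functions over the unit ball of an RKHS, which has the closed form sqrt(a' K a) with K the kernel Gram matrix and a the signed contrast of weights. The Gaussian kernel, the data, and all names below are illustrative assumptions.

```python
# Minimal sketch: worst-case RKHS imbalance of a given weighting.
# Kernel choice, bandwidth, data and names are illustrative assumptions.
import numpy as np

def gaussian_kernel(X, bandwidth=1.0):
    """Gram matrix of the Gaussian (RBF) kernel."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

def worst_case_imbalance(X, treated, w, bandwidth=1.0):
    """sup over f in the unit RKHS ball of
    | sum_{i treated} w_i f(X_i) - (1/n) sum_i f(X_i) |."""
    n = X.shape[0]
    a = np.where(treated, w, 0.0) - 1.0 / n   # signed contrast of weights
    K = gaussian_kernel(X, bandwidth)
    return np.sqrt(a @ K @ a)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
treated = rng.random(200) < 0.4
w_unif = np.full(200, 1.0 / treated.sum())    # naive uniform weights on treated units
print(worst_case_imbalance(X, treated, w_unif))
```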
Project description: Many scientific studies collect data in which the response and predictor variables are both functions of time, location, or some other covariate. Understanding the relationship between these functional variables is a common goal in such studies. Motivated by two real-life examples, we present in this paper a function-on-function regression model that can be used to analyze this kind of functional data. Our estimator of the two-dimensional coefficient function is the optimizer of a form of penalized least squares in which the penalty enforces a certain level of smoothness on the estimator. Our first result is a Representer Theorem, which states that the exact optimizer of the penalized least squares resides in a data-adaptive finite-dimensional subspace even though the optimization problem is defined on an infinite-dimensional function space. This theorem allows us to easily incorporate Gaussian quadrature into the optimization of the penalized least squares, which can then be carried out through standard numerical procedures. We also show that our estimator achieves the minimax convergence rate in mean prediction under the function-on-function regression framework. Extensive simulation studies demonstrate the numerical advantages of our method over existing ones, and a sparse functional data extension is also introduced. The proposed method is then applied to our motivating examples: the benchmark Canadian weather data and a histone regulation study.
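As a rough illustration of the penalized least squares idea (not the paper's estimator), the sketch below discretizes the function-on-function model on grids, approximates the inner integral with trapezoidal quadrature weights, and uses a plain ridge penalty as a stand-in for the smoothness penalty; grid sizes, the penalty, and names are assumptions.

```python
# Toy sketch: function-on-function regression on a grid with quadrature.
# A plain ridge penalty stands in for the paper's smoothness penalty.
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 50, 40, 30                               # curves, s-grid size, t-grid size
s = np.linspace(0.0, 1.0, p)
t = np.linspace(0.0, 1.0, m)
X = rng.normal(size=(n, p)).cumsum(axis=1)         # rough predictor curves X_i(s)
beta_true = np.outer(np.sin(np.pi * s), np.cos(np.pi * t))

q = np.full(p, s[1] - s[0]); q[[0, -1]] *= 0.5     # trapezoidal quadrature weights
A = X * q                                          # row i approximates integration against X_i
Y = A @ beta_true + 0.1 * rng.normal(size=(n, m))  # Y_i(t) = int beta(s,t) X_i(s) ds + noise

lam = 1e-2
B_hat = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)   # ridge-penalized least squares
print("max abs error:", np.abs(B_hat - beta_true).max())
```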
Project description: To assess the covariate balancing properties of propensity score-based algorithms when covariates affecting treatment choice are both measured and unmeasured, we use a simulation model of treatment choice and outcome. Eight simulation scenarios varied the values placed on measured and unmeasured covariates and the strength of the relationships between them. The balance of both measured and unmeasured covariates was compared across patients either grouped or reweighted by propensity score methods. Propensity score algorithms require unmeasured covariate variation that is unrelated to measured covariates, and they exacerbate the imbalance in this variation between treated and untreated patients relative to the full unweighted sample. The balance of measured covariates between treated and untreated patients has opposite implications for unmeasured covariates in randomized and observational studies. Measured covariate balance between treated and untreated patients in randomized studies reinforces the notion that all covariates are balanced. In contrast, forcing balance of measured covariates with propensity score methods in observational studies exacerbates the imbalance in the independent portion of the variation in the unmeasured covariates, which can be likened to squeezing a balloon. If the unmeasured covariates affecting treatment choice are confounders, propensity score methods can exacerbate the bias in treatment effect estimates.
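The following hedged simulation sketch mimics the described setup: a measured covariate X and an unmeasured covariate U jointly drive treatment, a propensity score is fit on X only, and the weighted and unweighted standardized mean differences of X and of the X-independent part of U are compared. The logistic models, effect sizes, and sample size are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch: how IPW on a measured covariate affects balance in the
# X-independent part of an unmeasured covariate.  All settings are assumptions.
import numpy as np

def std_diff(v, treat, w=None):
    # Weighted standardized mean difference between treated and controls.
    w = np.ones_like(v) if w is None else w
    m1 = np.average(v[treat], weights=w[treat])
    m0 = np.average(v[~treat], weights=w[~treat])
    return (m1 - m0) / v.std()

def fit_logistic(x, y, iters=25):
    # Newton-Raphson for an intercept + slope logistic regression of y on x.
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-X @ beta))

rng = np.random.default_rng(2)
n, rho = 100_000, 0.5
X = rng.normal(size=n)                        # measured covariate
U_indep = rng.normal(size=n)                  # part of U unrelated to X
U = rho * X + np.sqrt(1 - rho**2) * U_indep   # unmeasured covariate
treat = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.8 * X + 0.8 * U)))

ps = fit_logistic(X, treat.astype(float))     # propensity score estimated from X only
w = np.where(treat, 1.0 / ps, 1.0 / (1.0 - ps))   # IPW weights

for name, v in [("X", X), ("U_indep", U_indep)]:
    print(name, "unweighted:", round(std_diff(v, treat), 3),
          " IPW:", round(std_diff(v, treat, w), 3))
```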
Project description: We consider the problem of dichotomizing a continuous covariate when performing a regression analysis based on a generalized estimation approach. The problem involves estimating the cutpoint for the covariate and testing the hypothesis that the binary covariate constructed from the continuous covariate has a significant impact on the outcome. Because of the multiple testing used to find the optimal cutpoint, the usual significance test must be adjusted to preserve the type I error rate. We illustrate the techniques on a data set of patients given unrelated hematopoietic stem cell transplantation. Here the questions are whether the CD34 cell dose given to the patient affects the outcome of the transplant and what the smallest cell dose needed for good outcomes is.
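A hedged sketch of the cutpoint search and multiple-testing adjustment: candidate cutpoints are scanned, the maximally selected two-sample statistic is taken, and its p-value is adjusted by permutation. The simple t-type statistic stands in for the paper's generalized estimation approach; the simulated dose and outcome are assumptions.

```python
# Hedged sketch: maximally selected cutpoint statistic with a permutation
# adjustment for the multiple looks.  Data and statistic are illustrative.
import numpy as np

def max_cutpoint_stat(x, y, cutpoints):
    """Largest absolute two-sample z-statistic over candidate cutpoints."""
    best = 0.0
    for c in cutpoints:
        g = x > c
        if g.sum() < 5 or (~g).sum() < 5:
            continue
        diff = y[g].mean() - y[~g].mean()
        se = np.sqrt(y[g].var(ddof=1) / g.sum() + y[~g].var(ddof=1) / (~g).sum())
        best = max(best, abs(diff / se))
    return best

rng = np.random.default_rng(3)
n = 300
dose = rng.lognormal(mean=1.0, sigma=0.5, size=n)      # e.g. a CD34-like cell dose
outcome = 0.5 * (dose > 2.5) + rng.normal(size=n)      # true cutpoint at 2.5 (assumed)
cuts = np.quantile(dose, np.linspace(0.1, 0.9, 33))

obs = max_cutpoint_stat(dose, outcome, cuts)
perm = [max_cutpoint_stat(dose, rng.permutation(outcome), cuts) for _ in range(500)]
p_adj = np.mean(np.array(perm) >= obs)
print("max statistic:", round(obs, 2), " permutation-adjusted p:", p_adj)
```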
Project description: We introduce a non-myopic, covariate-adjusted response adaptive (CARA) allocation design for multi-armed clinical trials. The allocation scheme is a computationally tractable procedure based on the Gittins index solution to the classic multi-armed bandit problem and extends the procedure recently proposed in Villar et al. (2015). Our proposed CARA randomization procedure is defined by reformulating the bandit problem with covariates as a classic bandit problem with multiple combination arms, treating each combination of arm and covariate category as a distinct treatment arm. We then apply a heuristically modified Gittins index rule to solve the problem and define allocation probabilities from the resulting solution. We report the efficiency, balance, and ethical performance of our approach compared to existing CARA methods, using a recently published clinical trial as motivation. The net savings in terms of the expected number of treatment failures are considerably larger and probably enough to make this design attractive for certain studies where known covariates are expected to be important, stratification is not desired, treatment failures have a high ethical cost, and the disease under study is rare. In a two-armed context, this patient-benefit advantage comes at the expense of increased variability in the allocation proportions and a reduction in statistical power. However, in a multi-armed context, simple modifications of the proposed CARA rule can be incorporated so that an ethical advantage can be offered without sacrificing power in comparison with balanced designs.
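As a toy illustration of the combination-arm reformulation (not the paper's procedure), the sketch below indexes Bernoulli bandit arms by (treatment, covariate category) and allocates within a patient's category using a simple posterior-mean index as a crude stand-in for the modified Gittins index; the priors, success probabilities, and randomization rule are assumptions.

```python
# Toy sketch: "combination arms" indexed by (treatment, covariate category),
# with a posterior-mean index standing in for the modified Gittins index.
import numpy as np

rng = np.random.default_rng(4)
K, C = 3, 2                                        # treatments, covariate categories
p_true = rng.uniform(0.2, 0.8, size=(K, C))        # assumed success probabilities
succ = np.ones((K, C)); fail = np.ones((K, C))     # Beta(1, 1) priors per combination arm

for patient in range(500):
    c = rng.integers(C)                            # observed covariate category
    index = succ[:, c] / (succ[:, c] + fail[:, c]) # stand-in for the Gittins index
    probs = np.exp(5 * index); probs /= probs.sum()  # randomize around the index
    k = rng.choice(K, p=probs)
    y = rng.random() < p_true[k, c]
    succ[k, c] += y; fail[k, c] += 1 - y

print("allocations per combination arm:\n", (succ + fail - 2).astype(int))
```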
Project description: Covariate shift is a prevalent setting for supervised learning in the wild, where the training and test data are drawn from different time periods, from different but related domains, or via different sampling strategies. This paper addresses a transfer learning setting with covariate shift between source and target domains. Most existing methods for correcting covariate shift exploit density ratios of the features to reweight the source-domain data; when the features are high-dimensional, the estimated density ratios may suffer large estimation variances, leading to poor prediction performance. In this work, we investigate how covariate shift correction performance depends on the dimensionality of the features, and we propose a correction method that finds a low-dimensional representation of the features, taking into account features relevant to the target Y, and exploits the density ratio of this representation for importance reweighting. We discuss the factors affecting the performance of our method and demonstrate its capabilities on both pseudo-real and real-world data.
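A hedged sketch of the general recipe: project the features to a low-dimensional representation, estimate the target/source density ratio on that representation with a logistic "domain classifier", and reweight the source examples. PCA below is a generic stand-in for the paper's target-relevant representation, and all data and names are assumptions.

```python
# Hedged sketch: density-ratio reweighting on a low-dimensional representation.
# PCA is a generic stand-in, not the paper's target-relevant representation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
d = 50
source = rng.normal(0.0, 1.0, size=(1000, d))
target = rng.normal(0.3, 1.0, size=(1000, d))      # shifted covariates

Z = PCA(n_components=3).fit(np.vstack([source, target]))
zs, zt = Z.transform(source), Z.transform(target)

# Domain classifier gives P(domain = target | z); ratio = p/(1-p) * n_s/n_t.
clf = LogisticRegression().fit(np.vstack([zs, zt]),
                               np.r_[np.zeros(len(zs)), np.ones(len(zt))])
p = clf.predict_proba(zs)[:, 1]
weights = p / (1 - p) * (len(zs) / len(zt))        # importance weights for source data
print("weight summary (min / mean / max):",
      weights.min().round(2), weights.mean().round(2), weights.max().round(2))
```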
Project description: This paper studies inference for the average treatment effect in randomized controlled trials with covariate-adaptive randomization. Here, by covariate-adaptive randomization, we mean randomization schemes that first stratify according to baseline covariates and then assign treatment status so as to achieve "balance" within each stratum. Our main requirement is that the randomization scheme assigns treatment status within each stratum so that the fraction of units assigned to treatment within each stratum has a well-behaved distribution centered around a proportion π as the sample size tends to infinity. Such schemes include, for example, Efron's biased-coin design and stratified block randomization. When testing the null hypothesis that the average treatment effect equals a pre-specified value in such settings, we first show the usual two-sample t-test is conservative in the sense that it has limiting rejection probability under the null hypothesis no greater than, and typically strictly less than, the nominal level. We show, however, that a simple adjustment to the usual standard error of the two-sample t-test leads to a test that is exact in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. Next, we consider the usual t-test (on the coefficient on treatment assignment) in a linear regression of outcomes on treatment assignment and indicators for each of the strata. We show that this test is exact for the important special case of randomization schemes with π = 1/2, but is otherwise conservative. We again provide a simple adjustment to the standard errors that yields an exact test more generally. Finally, we study the behavior of a modified version of a permutation test, which we refer to as the covariate-adaptive permutation test, that only permutes treatment status for units within the same stratum. When applied to the usual two-sample t-statistic, we show that this test is exact for randomization schemes with π = 1/2 that additionally achieve what we refer to as "strong balance." For randomization schemes with π ≠ 1/2, this test may have limiting rejection probability under the null hypothesis strictly greater than the nominal level. When applied to a suitably adjusted version of the two-sample t-statistic, however, we show that this test is exact for all randomization schemes that achieve "strong balance," including those with π ≠ 1/2. A simulation study confirms the practical relevance of our theoretical results. We conclude with recommendations for empirical practice and an empirical illustration.
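A minimal sketch of the covariate-adaptive permutation test described above: the two-sample t-statistic is recomputed after permuting treatment labels only within each stratum. The data-generating design and the target proportion π = 1/2 are illustrative assumptions.

```python
# Minimal sketch: permutation test that permutes treatment only within strata.
# Data generation and the stratified block randomization are illustrative.
import numpy as np

def t_stat(y, d):
    y1, y0 = y[d == 1], y[d == 0]
    return (y1.mean() - y0.mean()) / np.sqrt(y1.var(ddof=1) / len(y1)
                                             + y0.var(ddof=1) / len(y0))

rng = np.random.default_rng(6)
n = 400
stratum = rng.integers(0, 4, size=n)               # baseline strata
d = np.zeros(n, dtype=int)
for s in range(4):                                  # stratified block randomization, pi = 1/2
    idx = np.flatnonzero(stratum == s)
    d[rng.permutation(idx)[: len(idx) // 2]] = 1
y = 0.5 * stratum + rng.normal(size=n)              # outcome under the null (no treatment effect)

obs = t_stat(y, d)
perm_stats = []
for _ in range(1000):
    d_perm = d.copy()
    for s in range(4):                              # permute labels within each stratum only
        idx = np.flatnonzero(stratum == s)
        d_perm[idx] = d[rng.permutation(idx)]
    perm_stats.append(t_stat(y, d_perm))
print("permutation p-value:", np.mean(np.abs(perm_stats) >= abs(obs)))
```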
Project description: We consider quantile regression for partially linear models where an outcome of interest is related to covariates and a marker set (e.g., a gene or pathway). The covariate effects are modeled parametrically, and the marker-set effect of multiple loci is modeled using a kernel machine. We propose an efficient algorithm to solve the corresponding optimization problem for estimating the covariate effects, and we also introduce a powerful test for detecting the overall effect of the marker set. Our test is motivated by the traditional score test and borrows the idea of a permutation test. Our estimation and testing procedures are evaluated numerically and applied to assess the genetic association of change in fasting homocysteine level using data from the Vitamin Intervention for Stroke Prevention Trial.
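A hedged sketch of a score-type permutation test in this spirit: fit the null median regression with covariates only, form the quantile score ψ = τ − 1{residual < 0}, take Q = ψ'Kψ with a linear kernel on the marker set, and compare Q to its permutation distribution. This is an illustrative analogue rather than the paper's exact procedure; the data and names are assumptions.

```python
# Hedged sketch: kernel-machine score-type permutation test for a marker set
# in median regression.  An illustrative analogue, not the paper's procedure.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, tau = 300, 0.5
Z = sm.add_constant(rng.normal(size=(n, 2)))          # clinical covariates
G = rng.integers(0, 3, size=(n, 10)).astype(float)    # marker set (e.g. SNP dosages)
y = Z[:, 1] + 0.3 * G[:, 0] + rng.standard_t(df=3, size=n)

null_fit = sm.QuantReg(y, Z).fit(q=tau)               # null model: covariates only
psi = tau - (y - Z @ null_fit.params < 0).astype(float)   # quantile score
K = G @ G.T                                            # linear kernel on the marker set

obs = psi @ K @ psi
perm = []
for _ in range(500):
    p = rng.permutation(n)                             # break marker-score pairing
    perm.append(psi[p] @ K @ psi[p])
print("permutation p-value:", np.mean(np.array(perm) >= obs))
```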
Project description: We consider nonparametric regression of a scalar outcome on a covariate when the outcome is missing at random (MAR) given the covariate and other observed auxiliary variables. We propose a class of augmented inverse probability weighted (AIPW) kernel estimating equations for nonparametric regression under MAR. We show that AIPW kernel estimators are consistent when the probability that the outcome is observed, that is, the selection probability, is either known by design or estimated under a correctly specified model. In addition, we show that a specific AIPW kernel estimator in our class, which employs the fitted values from a model for the conditional mean of the outcome given covariates and auxiliaries, is double-robust: it remains consistent if this model is correctly specified even if the selection probabilities are modeled or specified incorrectly. Furthermore, when both models happen to be right, this double-robust estimator attains the smallest possible asymptotic variance of all AIPW kernel estimators and maximally extracts the information in the auxiliary variables. We also describe a simple correction to the AIPW kernel estimating equations that, while preserving double robustness, ensures an efficiency improvement over non-augmented IPW estimation when the selection model is correctly specified, regardless of the validity of the second model used in the augmentation term. We perform simulations to evaluate the finite-sample performance of the proposed estimators and apply the methods to the analysis of the AIDS Costs and Services Utilization Survey data. Technical proofs are available online.
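A minimal sketch of an AIPW kernel regression estimator under MAR: at each evaluation point, the augmented pseudo-outcomes R·Y/π − (R − π)/π·g(X, V) are averaged with Nadaraya-Watson weights. Here the selection probability π and the working outcome model g are simply taken as given; the design and names are illustrative assumptions.

```python
# Minimal sketch: AIPW local-constant (Nadaraya-Watson) regression under MAR.
# The selection probability and outcome model are taken as given; all
# data-generating choices are illustrative assumptions.
import numpy as np

def aipw_kernel_regression(x0, X, Y, R, pi, g, h=0.2):
    """AIPW local-constant estimate of E[Y | X = x0]."""
    pseudo = R * np.nan_to_num(Y) / pi - (R - pi) / pi * g   # augmented pseudo-outcomes
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)                   # Gaussian kernel weights
    return np.sum(w * pseudo) / np.sum(w)

rng = np.random.default_rng(8)
n = 2000
X = rng.uniform(-1, 1, n)
V = rng.normal(size=n)                         # auxiliary variable
Y_full = np.sin(np.pi * X) + 0.5 * V + 0.3 * rng.normal(size=n)
pi = 1 / (1 + np.exp(-(0.5 + X + V)))          # selection probability (known by design here)
R = (rng.random(n) < pi).astype(float)         # observation indicator
Y = np.where(R == 1, Y_full, np.nan)           # outcome, missing when R = 0
g = np.sin(np.pi * X) + 0.5 * V                # working model for E[Y | X, V]

grid = np.linspace(-0.8, 0.8, 5)
print([round(aipw_kernel_regression(x0, X, Y, R, pi, g), 3) for x0 in grid])
```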
Project description: Eigenvalue problems are ubiquitous in computer vision, covering a very broad spectrum of applications ranging from estimation problems in multi-view geometry to image segmentation. Few other linear algebra problems have a more mature set of numerical routines available, and many computer vision libraries leverage such tools extensively. However, the ability to call the underlying solver only as a "black box" can often become restrictive. Many 'human in the loop' settings in vision frequently exploit supervision from an expert, to the extent that the user can be considered a subroutine in the overall system. In other cases, there is additional domain knowledge, side information, or even partial information that one may want to incorporate into the formulation. In general, regularizing a (generalized) eigenvalue problem with such side information remains difficult. Motivated by these needs, this paper presents an optimization scheme to solve generalized eigenvalue problems (GEP) involving a (nonsmooth) regularizer. We start from an alternative formulation of the GEP in which the feasible set of the model involves the Stiefel manifold. The core of this paper presents an end-to-end stochastic optimization scheme for the resulting problem. We show how this general algorithm enables improved statistical analysis of brain imaging data where the regularizer is derived from other 'views' of the disease pathology, involving clinical measurements and other image-derived representations.
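A bare-bones sketch of one way to regularize a GEP on the Stiefel manifold (a deterministic stand-in for the paper's stochastic scheme): maximize trace(V'AV) − λ‖V‖₁ subject to V'BV = I by whitening with B^{-1/2}, so the constraint becomes U'U = I, and running (sub)gradient ascent with a QR retraction. The matrices, step size, and λ are assumptions.

```python
# Bare-bones sketch: L1-regularized generalized eigenvalue problem solved by
# (sub)gradient ascent on the Stiefel manifold after whitening.  A deterministic
# stand-in for the paper's stochastic scheme; all settings are assumptions.
import numpy as np

rng = np.random.default_rng(9)
d, k, lam, step = 20, 3, 0.1, 0.01
M = rng.normal(size=(d, d)); A = M + M.T                    # symmetric "numerator" matrix
N = rng.normal(size=(d, d)); B = N @ N.T + d * np.eye(d)    # SPD "denominator" matrix

evals, evecs = np.linalg.eigh(B)
B_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
C = B_inv_sqrt @ A @ B_inv_sqrt                             # whitened problem matrix

U = np.linalg.qr(rng.normal(size=(d, k)))[0]                # start on the Stiefel manifold
for _ in range(500):
    V = B_inv_sqrt @ U
    grad = 2 * C @ U - lam * B_inv_sqrt @ np.sign(V)        # Euclidean (sub)gradient in U
    G = grad - U @ (U.T @ grad + grad.T @ U) / 2            # project onto the tangent space
    U = np.linalg.qr(U + step * G)[0]                       # QR retraction back to U'U = I

V = B_inv_sqrt @ U
print("constraint error:", np.abs(V.T @ B @ V - np.eye(k)).max().round(6))
print("objective:", (np.trace(V.T @ A @ V) - lam * np.abs(V).sum()).round(3))
```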
Project description: Under covariate-adaptive randomization, the covariate is tied to both randomization and analysis. Misclassification of such a covariate will affect the intended treatment assignment; moreover, it is unclear what the appropriate analysis strategy should be. We explore the impact of such misclassification on the trial's statistical operating characteristics. Simulation scenarios were created by varying the misclassification rate and the covariate effect on the outcome. Models that were unadjusted, adjusted for the misclassified covariate, or adjusted for the correctly classified covariate were compared using logistic regression for a binary outcome and Poisson regression for a count outcome. For the binary outcome using logistic regression, type I error can be maintained in the adjusted model, but the test is conservative using an unadjusted model. Power decreased with both an increasing covariate effect on the outcome and an increasing misclassification rate. Treatment effect estimates were biased towards the null for both the misclassified-adjusted and unadjusted models. For the count outcome using a Poisson model, covariate misclassification led to inflated type I error probabilities and reduced power in both the misclassified-adjusted and the unadjusted models. The impact of covariate misclassification under covariate-adaptive randomization differs depending on the underlying distribution of the outcome.
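A hedged simulation sketch of the binary-outcome case: a prognostic binary covariate is misclassified at a given rate, randomization is stratified on the misclassified value, and logistic models that are unadjusted, adjusted for the misclassified covariate, or adjusted for the true covariate are compared on rejection rates for a null treatment effect. The effect sizes, misclassification rate, and sample size are illustrative assumptions.

```python
# Hedged sketch: covariate misclassification under stratified (covariate-adaptive)
# randomization, comparing unadjusted / misclassified-adjusted / true-adjusted
# logistic models under the null.  All settings are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)

def one_trial(n=400, misclass_rate=0.2, cov_effect=1.0, trt_effect=0.0):
    z = rng.random(n) < 0.5                                   # true binary covariate
    z_obs = np.where(rng.random(n) < misclass_rate, ~z, z)    # misclassified copy
    d = np.zeros(n, dtype=int)
    for level in (0, 1):                                      # stratify randomization on z_obs
        idx = np.flatnonzero(z_obs == level)
        d[rng.permutation(idx)[: len(idx) // 2]] = 1
    logit = trt_effect * d + cov_effect * z - 0.5
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    pvals = []
    for covs in ([d], [d, z_obs], [d, z]):                    # unadjusted / misclassified / true
        X = sm.add_constant(np.column_stack(covs).astype(float))
        pvals.append(sm.Logit(y, X).fit(disp=0).pvalues[1])   # p-value for treatment
    return np.array(pvals) < 0.05

rejections = np.mean([one_trial() for _ in range(200)], axis=0)
print("rejection rates (unadjusted, misclassified-adjusted, true-adjusted):", rejections)
```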