Low-Dimensional Density Ratio Estimation for Covariate Shift Correction.
ABSTRACT: Covariate shift is a prevalent setting for supervised learning in the wild, arising when the training and test data are drawn from different time periods, from different but related domains, or via different sampling strategies. This paper addresses a transfer learning setting with covariate shift between source and target domains. Most existing methods for correcting covariate shift exploit density ratios of the features to reweight the source-domain data; when the features are high-dimensional, the estimated density ratios may suffer from large estimation variance, leading to poor prediction performance. In this work, we investigate how the performance of covariate shift correction depends on the dimensionality of the features, and propose a correction method that finds a low-dimensional representation of the features, taking into account the features relevant to the target Y, and exploits the density ratio of this representation for importance reweighting. We discuss the factors affecting the performance of our method and demonstrate its capabilities on both pseudo-real and real-world data.
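The reweighting scheme the abstract describes can be illustrated with a common classifier-based density-ratio estimator. Everything below (the synthetic data, the fixed one-dimensional projection standing in for a learned low-dimensional representation, and the logistic-regression ratio model) is an illustrative assumption, not the paper's actual method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 5

# Source: X ~ N(0, I); target: identical except the first coordinate is shifted.
Xs = rng.normal(size=(1000, d))
Xt = rng.normal(size=(1000, d))
Xt[:, 0] += 1.0

def density_ratio_weights(Xs, Xt, proj):
    """Estimate w(x) = p_target(z) / p_source(z) on the representation
    z = x @ proj, via a probabilistic source-vs-target classifier."""
    Zs, Zt = Xs @ proj, Xt @ proj
    Z = np.vstack([Zs, Zt])
    lab = np.r_[np.zeros(len(Zs)), np.ones(len(Zt))]
    clf = LogisticRegression().fit(Z, lab)
    p = clf.predict_proba(Zs)[:, 1]
    # P(target|z) / P(source|z) * n_s/n_t estimates the density ratio.
    return p / (1.0 - p) * (len(Zs) / len(Zt))

# A hypothetical 1-D projection onto the shifted direction.
proj = np.zeros((d, 1)); proj[0, 0] = 1.0
w = density_ratio_weights(Xs, Xt, proj)
```

The weights `w` would then multiply the source-domain loss when training the predictor; working in the low-dimensional `z` rather than in all of `x` is what keeps the ratio estimate's variance down.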
Project description: Three (3) different methods (logistic regression, covariate shift reweighting, and k-NN) were applied to five (5) internal datasets and one (1) external, publicly available dataset in which covariate shift existed. In all cases, k-NN's performance was inferior to both logistic regression and covariate shift reweighting. Surprisingly, there was no obvious advantage to using covariate shift reweighting of the training data in the examined datasets.
Project description: In ROC analysis, covariate adjustment is advocated when covariates affect the magnitude or accuracy of the test under study. Meanwhile, in many large-scale screening tests, the true condition status may be subject to missingness because it is expensive and/or invasive to ascertain the disease status. A complete-case analysis may then yield biased inference, a problem known as "verification bias." To address covariate adjustment with verification bias in ROC analysis, we propose several estimators for the area under the covariate-specific and covariate-adjusted ROC curves (AUCx and AAUC). The AUCx is modeled directly in the form of binary regression, with estimating equations based on U-statistics. The AAUC is estimated as the weighted average of the AUCx over the covariate distribution of the diseased subjects. We employ reweighting and imputation techniques to overcome the verification bias problem. Our proposed estimators are initially derived under the assumption that the true disease status is missing at random (MAR); with some modification, the estimators can be extended to the not-missing-at-random (NMAR) situation. Asymptotic distributions are derived for the proposed estimators, and finite-sample performance is evaluated in a series of simulation studies. Our method is applied to a data set from Alzheimer's disease research.
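One standard way to correct verification bias under MAR, in the spirit of the reweighting technique mentioned above, is an inverse-probability-weighted Mann-Whitney (U-statistic) estimate of the AUC. The simulation below is a minimal sketch with made-up parameters, assuming the verification probabilities are known rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Simulated disease status and test score; true AUC = Phi(1/sqrt(2)) ~= 0.76.
D = rng.binomial(1, 0.3, n)
T = rng.normal(D * 1.0, 1.0)

# Verification depends on the observed test score only (a MAR mechanism):
pi = 1.0 / (1.0 + np.exp(-(T - 0.5)))   # P(verified | T), assumed known here
V = rng.binomial(1, pi)

def ipw_auc(T, D, V, pi):
    """Mann-Whitney AUC over verified subjects, each reweighted by 1/pi
    to undo the selective verification."""
    t, d, w = T[V == 1], D[V == 1], 1.0 / pi[V == 1]
    tp, tn = t[d == 1], t[d == 0]
    wp, wn = w[d == 1], w[d == 0]
    comp = (tp[:, None] > tn[None, :]) + 0.5 * (tp[:, None] == tn[None, :])
    return float((wp[:, None] * wn[None, :] * comp).sum() / (wp.sum() * wn.sum()))

auc_hat = ipw_auc(T, D, V, pi)
```

Because verified subjects are tilted toward high test scores in both disease classes, the unweighted complete-case AUC is distorted; the 1/pi weights restore each class's score distribution in expectation.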
Project description: Automatic emotion recognition is one of the most challenging tasks. To detect emotion from nonstationary EEG signals, a sophisticated learning algorithm that can represent high-level abstraction is required. This study proposes a deep learning network (DLN) to discover unknown feature correlations among input signals that are crucial for the learning task. The DLN is implemented as a stacked autoencoder (SAE) using a hierarchical feature learning approach. The input features of the network are the power spectral densities of 32-channel EEG signals from 32 subjects. To alleviate the overfitting problem, principal component analysis (PCA) is applied to extract the most important components of the initial input features. Furthermore, covariate shift adaptation of the principal components is implemented to minimize the nonstationary effect of the EEG signals. Experimental results show that the DLN is capable of classifying three different levels of valence and arousal with accuracies of 49.52% and 46.03%, respectively. Principal-component-based covariate shift adaptation improves the respective classification accuracies by 5.55% and 6.53%. Moreover, the DLN outperforms SVM and naive Bayes classifiers.
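A rough sketch of the PCA-plus-covariate-shift-adaptation stage might look as follows; the random stand-in data, dimensionalities, and the generic classifier-based density-ratio weighting are assumptions, not the paper's specific adaptation procedure:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Stand-ins for band-power features from multi-channel EEG; the later
# session drifts (nonstationarity), producing covariate shift.
X_train = rng.normal(size=(300, 160))
y_train = rng.binomial(1, 0.5, 300)
X_test = rng.normal(0.3, 1.0, size=(300, 160))

# 1) PCA keeps the leading components of the spectral features.
pca = PCA(n_components=20).fit(X_train)
Z_tr, Z_te = pca.transform(X_train), pca.transform(X_test)

# 2) Covariate shift adaptation: weight training points by an estimated
#    density ratio p_test(z) / p_train(z) from a train-vs-test classifier.
Z = np.vstack([Z_tr, Z_te])
s = np.r_[np.zeros(len(Z_tr)), np.ones(len(Z_te))]
disc = LogisticRegression(max_iter=1000).fit(Z, s)
p = disc.predict_proba(Z_tr)[:, 1]
w = p / (1 - p)

# 3) Train the final classifier with importance weights (a plain logistic
#    model here stands in for the stacked-autoencoder network).
clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_train, sample_weight=w)
```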
Project description: Using an asymmetrical set of vernier stimuli (−15″, −10″, −5″, +10″, +15″) together with reverse feedback on the small subthreshold offset stimulus (−5″) induces a response bias in performance (Aberg & Herzog, 2012; Herzog, Ewald, Hermens, & Fahle, 2006; Herzog & Fahle, 1999). These conditions are of interest for testing models of perceptual learning because the world does not always present balanced stimulus frequencies or accurate feedback. Here we provide a comprehensive model of this complex set of asymmetric training results using the augmented Hebbian reweighting model (Liu, Dosher, & Lu, 2014; Petrov, Dosher, & Lu, 2005, 2006) and the multilocation integrated reweighting theory (Dosher, Jeter, Liu, & Lu, 2013). The augmented Hebbian learning algorithm incorporates trial-by-trial feedback, when present, as another input to the decision unit, and uses the observer's internal response to update the weights otherwise; block feedback alters the weights on bias correction (Liu et al., 2014). Asymmetric training with reversed feedback incorporates biases into the weights between representation and decision. The model correctly predicts the basic induction effect, its dependence on trial-by-trial feedback, and the specificity of the bias to stimulus orientation and spatial location, extending the range of augmented Hebbian reweighting accounts of perceptual learning.
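A minimal caricature of an augmented Hebbian update, in which feedback enters as an extra input to the decision unit and doubles as the teaching signal, could be sketched as follows. The learning rate, activation function, and exact update rule below are assumptions for illustration, not the published AHRM:

```python
import numpy as np

def ahrm_step(w, a, feedback=None, eta=0.05, wf=1.0):
    """One augmented-Hebbian update of the weights w from a feature
    activation vector a to a single decision unit. When trial feedback
    (+1/-1) is present it is fed in as an extra input and serves as the
    teaching signal; otherwise the unit's own response is used."""
    o = np.tanh(w @ a + (wf * feedback if feedback is not None else 0.0))
    teach = feedback if feedback is not None else np.sign(o)
    w = w + eta * a * (teach - o)   # Hebbian: input times post-synaptic error
    return w, o

# Consistently one-sided "reverse" feedback drives a bias into the weights:
w = np.zeros(5)
a = np.ones(5) / 5
for _ in range(100):
    w, _ = ahrm_step(w, a, feedback=1.0)
```

After such training, the unit responds positively even without feedback, mirroring the induced response bias the abstract describes.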
Project description: In studies that compare several diagnostic groups, subjects can be measured on certain features, and classification trees can be used to identify which of these features best characterize the differences among groups. However, subjects may also be measured on additional covariates whose ability to characterize group differences is not meaningful or of interest but which may still affect the examined features. It is therefore important to adjust for the effects of covariates on these features. We present a new, readily implementable semi-parametric approach for adjusting for covariate effects when constructing classification trees based on the features of interest. An application is given for postmortem brain tissue data comparing the neurobiological characteristics of subjects with schizophrenia to those of normal controls. We also evaluate the performance of our approach in a simulation study.
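One simple form of covariate adjustment, regressing each feature on the nuisance covariate and growing the tree on the residuals, can be sketched as below. This linear residualization is only a stand-in for the paper's semi-parametric approach, and the data are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n = 400

age = rng.uniform(20, 80, size=(n, 1))      # nuisance covariate
group = rng.binomial(1, 0.5, n)             # diagnostic group
# Feature = covariate effect + group effect + noise.
feat = 0.05 * age[:, 0] + 0.8 * group + rng.normal(0, 0.5, n)
X = feat[:, None]

# Adjust: remove the covariate's effect from the feature, keep residuals,
# so the tree splits on group differences rather than on age.
X_adj = X - LinearRegression().fit(age, X).predict(age)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_adj, group)
```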
Project description: Covariate balance is often advocated for objective causal inference since it mimics randomization in observational data. Unlike methods that balance specific moments of covariates, our proposal attains uniform approximate balance for covariate functions in a reproducing-kernel Hilbert space. The corresponding infinite-dimensional optimization problem is shown to have a finite-dimensional representation in terms of an eigenvalue optimization problem. Large-sample results are studied, and numerical examples show that the proposed method achieves better balance with smaller sampling variability than existing methods.
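A simpler cousin of uniform RKHS balance is kernel mean matching, which chooses weights so that the weighted kernel mean embedding of one sample matches another's. The ridge-regularized closed form below is an illustrative sketch under assumed data and kernel parameters, not the eigenvalue-optimization method of the abstract:

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_balance_weights(Xc, Xt, lam=1e-3):
    """Weights on the source/control sample Xc so that its kernel mean
    embedding approximately matches that of the target sample Xt.
    Minimizes ||sum_i w_i k(x_i,.)/n - sum_j k(x_j,.)/m||^2 in the RKHS,
    solved here as a regularized linear system."""
    n, m = len(Xc), len(Xt)
    K = rbf(Xc, Xc)
    kappa = rbf(Xc, Xt).sum(axis=1) * (n / m)
    w = np.linalg.solve(K + lam * n * np.eye(n), kappa)
    return np.clip(w, 0, None)

Xc = rng.normal(0.0, 1.0, size=(200, 2))
Xt = rng.normal(0.7, 1.0, size=(200, 2))
w = kernel_balance_weights(Xc, Xt)
```

Matching the embedding of a characteristic kernel balances all functions in its RKHS at once, which is the sense in which such methods go beyond balancing a fixed list of moments.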
Project description: An important feature of Bayesian statistics is the opportunity to do sequential inference: the posterior distribution obtained after seeing one dataset can be used as the prior for a second inference. However, when Monte Carlo sampling methods are used for inference, we only have a set of samples from the posterior distribution. To do sequential inference, we must then either evaluate the second posterior at only these locations and reweight the samples accordingly, or estimate a functional description of the posterior probability distribution from the samples and use that as the prior for the second inference. Here, we investigated to what extent an accurate joint posterior can be obtained from two datasets when the inference is done sequentially rather than jointly, under the condition that each inference step is done using Monte Carlo sampling. To test this, we evaluated the accuracy of kernel density estimates, Gaussian mixtures, mixtures of factor analyzers, vine copulas, and Gaussian processes in approximating posterior distributions, and then tested whether these approximations can be used in sequential inference. In low dimensionality, Gaussian processes are more accurate, whereas in higher dimensionality Gaussian mixtures, mixtures of factor analyzers, or vine copulas perform better. In our test cases of sequential inference, using posterior approximations gives more accurate results than direct sample reweighting, but joint inference is still preferable to sequential inference whenever possible. Since performance is case-specific, we provide an R package, mvdens, with a unified interface to the density approximation methods.
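The sequential scheme can be illustrated in a toy conjugate model where the joint posterior is known in closed form: draw samples from the first posterior, fit a mixture approximation, and use it as the prior for the second dataset. Everything below (the unit-variance Gaussian model, the grid evaluation, the choice of two mixture components) is an illustrative assumption, unrelated to the mvdens test cases:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)

# Unknown mean mu, unit-variance Gaussian data, flat prior on mu.
d1 = rng.normal(1.0, 1.0, 50)
d2 = rng.normal(1.0, 1.0, 50)

# Stage 1: "Monte Carlo" samples from the first posterior N(mean(d1), 1/50).
s1 = rng.normal(d1.mean(), np.sqrt(1 / 50), size=(2000, 1))

# Functional approximation of the first posterior (here a Gaussian mixture).
gmm = GaussianMixture(n_components=2, random_state=0).fit(s1)

# Stage 2: use the approximation as the prior, on a grid over mu.
grid = np.linspace(0, 2, 2001)[:, None]
lp = gmm.score_samples(grid) - 0.5 * ((d2[None, :] - grid) ** 2).sum(axis=1)
post = np.exp(lp - lp.max())
post /= post.sum()
seq_mean = float((grid[:, 0] * post).sum())

# Analytic joint posterior mean from both datasets at once.
joint_mean = (d1.sum() + d2.sum()) / 100
```

In this conjugate setting the sequential mean should land very close to the joint one; the interesting failures the abstract studies arise when the approximating family fits the first posterior poorly.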
Project description: In the development of risk prediction models, predictors are often measured with error. In this paper, we investigate the impact of covariate measurement error on risk prediction. We compare the prediction performance using a costly variable measured without error, along with error-free covariates, to that of a model based on an inexpensive surrogate along with the error-free covariates. We consider continuous error-prone covariates with homoscedastic and heteroscedastic errors, and also a discrete misclassified covariate. Prediction performance is evaluated by the area under the receiver operating characteristic curve (AUC), the Brier score (BS), and the ratio of the observed to the expected number of events (calibration). In an extensive numerical study, we show that (i) the prediction model with the error-prone covariate is very well calibrated, even when it is mis-specified; (ii) using the error-prone covariate instead of the true covariate can reduce the AUC and increase the BS dramatically; (iii) adding an auxiliary variable, which is correlated with the error-prone covariate but conditionally independent of the outcome given all covariates in the true model, can improve the AUC and BS substantially. We conclude that reducing measurement error in covariates will improve the ensuing risk prediction, unless the association between the error-free and error-prone covariates is very high. Finally, we demonstrate how a validation study can be used to assess the effect of mismeasured covariates on risk prediction. These concepts are illustrated in a breast cancer risk prediction model developed in the Nurses' Health Study.
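The attenuation of discrimination by classical measurement error, point (ii) above, is easy to reproduce in a small simulation; the model coefficients and error variance below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000

# True covariate drives the risk; the surrogate adds classical measurement
# error with variance equal to the signal variance.
x = rng.normal(size=n)
x_err = x + rng.normal(0, 1.0, n)
p = 1 / (1 + np.exp(-(-1.0 + 1.5 * x)))
y = rng.binomial(1, p)

def auc(score, y):
    """Rank-based (Mann-Whitney) AUC for continuous scores."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = y.sum()
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * (len(score) - n1))

auc_true, auc_err = auc(x, y), auc(x_err, y)
```

Scoring by the error-prone surrogate `x_err` noticeably shrinks the AUC relative to scoring by the true covariate `x`, while leaving the marginal event rate, and hence calibration-in-the-large, untouched.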
Project description: Receiver operating characteristic (ROC) curves are widely used to measure the discriminating power of medical tests and other classification procedures. In many practical applications, the performance of these procedures can depend on covariates such as age, naturally leading to a collection of curves associated with different covariate levels. This paper develops a Bayesian heteroscedastic semiparametric regression model and applies it to the estimation of covariate-dependent ROC curves. More specifically, our approach uses Gaussian process priors to model the conditional mean and conditional variance of the biomarker of interest for each of the populations under study. The model is illustrated through an application to the evaluation of prostate-specific antigen for the diagnosis of prostate cancer, which contrasts the performance of our model against alternative models.
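If two regressions (e.g. the Gaussian process fits above) supply the conditional mean and SD of the biomarker in each population at a given covariate value, and the conditional biomarker distributions are taken as normal, the induced covariate-specific ROC has the familiar binormal form. The helper below is a sketch under that normality assumption, not the paper's semiparametric model:

```python
import numpy as np
from scipy.stats import norm

def covariate_roc(mu0, s0, mu1, s1, t):
    """Binormal ROC at false-positive rate t, given the conditional mean and
    SD of the biomarker in healthy (mu0, s0) and diseased (mu1, s1) subjects
    at a fixed covariate value. ROC(t) = Phi(a + b * Phi^{-1}(t)) with
    a = (mu1 - mu0) / s1 and b = s0 / s1."""
    a = (mu1 - mu0) / s1
    b = s0 / s1
    return norm.cdf(a + b * norm.ppf(t))

def covariate_auc(mu0, s0, mu1, s1):
    """Closed-form AUC of the binormal ROC: Phi(delta / sqrt(s0^2 + s1^2))."""
    return norm.cdf((mu1 - mu0) / np.hypot(s0, s1))
```

Evaluating these at a grid of covariate values traces out the collection of covariate-specific curves, and averaging the AUC over a covariate distribution gives a covariate-adjusted summary.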