ABSTRACT: In studies that compare several diagnostic groups, subjects can be measured on certain features and classification trees can be used to identify which of them best characterize the differences among groups. However, subjects may also be measured on additional covariates whose ability to characterize group differences is not meaningful or of interest, but may still have an impact on the examined features. Therefore, it is important to adjust for the effects of covariates on these features. We present a new semi-parametric approach to adjust for covariate effects when constructing classification trees based on the features of interest that is readily implementable. An application is given for postmortem brain tissue data to compare the neurobiological characteristics of subjects with schizophrenia to those of normal controls. We also evaluate the performance of our approach using a simulation study.
Project description:BACKGROUND:A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. RESULTS:We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj . CONCLUSIONS:In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
Project description:Receiver operating characteristic (ROC) curves are widely used to measure the discriminating power of medical tests and other classification procedures. In many practical applications, the performance of these procedures can depend on covariates such as age, naturally leading to a collection of curves associated with different covariate levels. This paper develops a Bayesian heteroscedastic semiparametric regression model and applies it to the estimation of covariate-dependent ROC curves. More specifically, our approach uses Gaussian process priors to model the conditional mean and conditional variance of the biomarker of interest for each of the populations under study. The model is illustrated through an application to the evaluation of prostate-specific antigen for the diagnosis of prostate cancer, which contrasts the performance of our model against alternative models.
Project description:In several common study designs, regression modeling is complicated by the presence of censored covariates. Examples of such covariates include maternal age of onset of dementia that may be right censored in an Alzheimer's amyloid imaging study of healthy subjects, metabolite measurements that are subject to limit of detection censoring in a case-control study of cardiovascular disease, and progressive biomarkers whose baseline values are of interest, but are measured post-baseline in longitudinal neuropsychological studies of Alzheimer's disease. We propose threshold regression approaches for linear regression models with a covariate that is subject to random censoring. Threshold regression methods allow for immediate testing of the significance of the effect of a censored covariate. In addition, they provide for unbiased estimation of the regression coefficient of the censored covariate. We derive the asymptotic properties of the resulting estimators under mild regularity conditions. Simulations demonstrate that the proposed estimators have good finite-sample performance, and often offer improved efficiency over existing methods. We also derive a principled method for selection of the threshold. We illustrate the approach in application to an Alzheimer's disease study that investigated brain amyloid levels in older individuals, as measured through positron emission tomography scans, as a function of maternal age of dementia onset, with adjustment for other covariates. We have developed an R package, censCov, for implementation of our method, available at CRAN.
Project description:The covariate-balancing propensity score (CBPS) extends logistic regression to simultaneously optimize covariate balance and treatment prediction. Although the CBPS has been shown to perform well in certain settings, its performance has not been evaluated in settings specific to pharmacoepidemiology and large database research. In this study, we use both simulations and empirical data to compare the performance of the CBPS with logistic regression and boosted classification and regression trees. We simulated various degrees of model misspecification to evaluate the robustness of each propensity score (PS) estimation method. We then applied these methods to compare the effect of initiating glucagonlike peptide-1 agonists versus sulfonylureas on cardiovascular events and all-cause mortality in the US Medicare population in 2007-2009. In simulations, the CBPS was generally more robust in terms of balancing covariates and reducing bias compared with misspecified logistic PS models and boosted classification and regression trees. All PS estimation methods performed similarly in the empirical example. For settings common to pharmacoepidemiology, logistic regression with balance checks to assess model specification is a valid method for PS estimation, but it can require refitting multiple models until covariate balance is achieved. The CBPS is a promising method to improve the robustness of PS models.
Project description:Many statistical tests rely on the assumption that the residuals of a model are normally distributed. Rank-based inverse normal transformation (INT) of the dependent variable is one of the most popular approaches to satisfy the normality assumption. When covariates are included in the analysis, a common approach is to first adjust for the covariates and then normalize the residuals. This study investigated the effect of regressing covariates against the dependent variable and then applying rank-based INT to the residuals. The correlation between the dependent variable and covariates at each stage of processing was assessed. An alternative approach was tested in which rank-based INT was applied to the dependent variable before regressing covariates. Analyses based on both simulated and real data examples demonstrated that applying rank-based INT to the dependent variable residuals after regressing out covariates re-introduces a linear correlation between the dependent variable and covariates, increasing type-I errors and reducing power. On the other hand, when rank-based INT was applied prior to controlling for covariate effects, residuals were normally distributed and linearly uncorrelated with covariates. This latter approach is therefore recommended in situations were normality of the dependent variable is required.
Project description:In their recent Health Services Research article titled "Squeezing the Balloon: Propensity Scores and Unmeasured Covariate Balance," Brooks and Ohsfeldt (2013) addressed an important topic on the balancing property of the propensity score (PS) with respect to unmeasured covariates. They concluded that PS methods that balance measured covariates between treated and untreated subjects exacerbate imbalance in unmeasured covariates that are unrelated to measured covariates. Furthermore, they emphasized that for PS algorithms, an imbalance on unmeasured covariates between treatment and untreated subjects is a necessary condition to achieve balance on measured covariates between the groups. We argue that these conclusions are the results of their assumptions on the mechanism of treatment allocation. In addition, we discuss the underlying assumptions of PS methods, their advantages compared with multivariate regression methods, as well as the interpretation of the effect estimates from PS methods.
Project description:It is desirable to adjust Spearman's rank correlation for covariates, yet existing approaches have limitations. For example, the traditionally defined partial Spearman's correlation does not have a sensible population parameter, and the conditional Spearman's correlation defined with copulas cannot be easily generalized to discrete variables. We define population parameters for both partial and conditional Spearman's correlation through concordance-discordance probabilities. The definitions are natural extensions of Spearman's rank correlation in the presence of covariates and are general for any orderable random variables. We show that they can be neatly expressed using probability-scale residuals (PSRs). This connection allows us to derive simple estimators. Our partial estimator for Spearman's correlation between X and Y adjusted for Z is the correlation of PSRs from models of X on Z and of Y on Z, which is analogous to the partial Pearson's correlation derived as the correlation of observed-minus-expected residuals. Our conditional estimator is the conditional correlation of PSRs. We describe estimation and inference, and highlight the use of semiparametric cumulative probability models, which allow preservation of the rank-based nature of Spearman's correlation. We conduct simulations to evaluate the performance of our estimators and compare them with other popular measures of association, demonstrating their robustness and efficiency. We illustrate our method in two applications, a biomarker study and a large survey.
Project description:We consider a logistic regression model for a binary response where part of its covariates are subject-specific random intercepts and slopes from a large number of longitudinal covariates. These random effect covariates must be estimated from the observed data, and therefore, the model essentially involves errors in covariates. Because of high dimension and high correlation of the random effects, we employ longitudinal principal component analysis to reduce the total number of random effects to some manageable number of random effects. To deal with errors in covariates, we extend the conditional-score equation approach to this moderate dimensional logistic regression model with random effect covariates. To reliably solve the conditional-score equations in moderate/high dimension, we apply a majorization on the first derivative of the conditional-score functions and a penalized estimation by the smoothly clipped absolute deviation. The method was evaluated through a set of simulation studies and applied to a data set with longitudinal cortical thickness of 68 regions of interest to identify biomarkers that are related to dementia transition.
Project description:We propose a subgroup identification approach for inferring optimal and interpretable personalized treatment rules with high-dimensional covariates. Our approach is based on a two-step greedy tree algorithm to pursue signals in a high-dimensional space. In the first step, we transform the treatment selection problem into a weighted classification problem that can utilize tree-based methods. In the second step, we adopt a newly proposed tree-based method, known as reinforcement learning trees, to detect features involved in the optimal treatment rules and to construct binary splitting rules. The method is further extended to right censored survival data by using the accelerated failure time model and introducing double weighting to the classification trees. The performance of the proposed method is demonstrated via simulation studies, as well as analyses of the Cancer Cell Line Encyclopedia (CCLE) data and the Tamoxifen breast cancer data.