Validity of using ad hoc methods to analyze secondary traits in case-control association studies.
ABSTRACT: Case-control association studies often collect from their subjects information on secondary phenotypes. Reusing the data and studying the association between genes and secondary phenotypes provide an attractive and cost-effective approach that can lead to discovery of new genetic associations. A number of approaches have been proposed, including simple and computationally efficient ad hoc methods that ignore ascertainment or stratify on case-control status. Justification for these approaches relies on the assumption of no covariates and the correct specification of the primary disease model as a logistic model. Both might not be true in practice, for example, in the presence of population stratification or the primary disease model following a probit model. In this paper, we investigate the validity of ad hoc methods in the presence of covariates and possible disease model misspecification. We show that in taking an ad hoc approach, it may be desirable to include covariates that affect the primary disease in the secondary phenotype model, even though these covariates are not necessarily associated with the secondary phenotype. We also show that when the disease is rare, ad hoc methods can lead to severely biased estimation and inference if the true disease model follows a probit model instead of a logistic model. Our results are justified theoretically and via simulations. Applied to real data analysis of genetic associations with cigarette smoking, ad hoc methods collectively identified as highly significant (P < 10^-5) single nucleotide polymorphisms from over 10 genes, genes that were identified in previous studies of smoking cessation.
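The two ad hoc strategies named in the abstract can be compared in a small exact calculation. The following is a minimal sketch, not the paper's analysis: it assumes a binary genotype G, a binary secondary trait Y that is independent of G in the population, and a logistic disease model logit P(D=1 | G, Y) = a + b*G + c*Y with no covariates. The function name and all parameter values are hypothetical. It computes the G-Y odds ratio in the population, in a 1:1 case-control sample analyzed while ignoring ascertainment (pooled), and within cases and controls separately (stratified).

```python
import math


def expit(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def odds_ratio(cell):
    """G-Y odds ratio from a dict of cell weights keyed by (g, y)."""
    return (cell[1, 1] * cell[0, 0]) / (cell[1, 0] * cell[0, 1])


def secondary_trait_odds_ratios(p_g, p_y, a, b, c):
    """Exact G-Y odds ratios under an assumed logistic disease model.

    p_g, p_y: population frequencies of G=1 and Y=1 (G and Y independent).
    a, b, c:  intercept and log odds ratios of G and Y for the disease D.
    """
    cells = [(g, y) for g in (0, 1) for y in (0, 1)]
    pop = {(g, y): (p_g if g else 1 - p_g) * (p_y if y else 1 - p_y)
           for g, y in cells}
    risk = {(g, y): expit(a + b * g + c * y) for g, y in cells}

    # Unnormalized cell weights among cases and among controls.
    cases = {k: pop[k] * risk[k] for k in cells}
    controls = {k: pop[k] * (1 - risk[k]) for k in cells}
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())

    # 1:1 case-control sample analyzed as if it were a random sample.
    pooled = {k: 0.5 * cases[k] / n_cases + 0.5 * controls[k] / n_controls
              for k in cells}

    return {
        "population": odds_ratio(pop),
        "pooled": odds_ratio(pooled),
        "cases": odds_ratio(cases),
        "controls": odds_ratio(controls),
    }


if __name__ == "__main__":
    result = secondary_trait_odds_ratios(0.3, 0.4, -5.0,
                                         math.log(2), math.log(2))
    for name, value in result.items():
        print(f"{name:>10}: {value:.4f}")
```

With these illustrative values the disease is rare and both G and Y raise disease risk. The pooled odds ratio moves noticeably away from the population value of 1, while the odds ratios within cases and within controls stay close to 1. This toy setting has no covariates and a correctly specified logistic model, which are exactly the assumptions the paper relaxes.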
Project description:We are interested in developing integrative approaches for variable selection problems that incorporate external knowledge on a set of predictors of interest. In particular, we have developed an integrative Bayesian model uncertainty (iBMU) method, which formally incorporates multiple sources of data via a second-stage probit model on the probability that any predictor is associated with the outcome of interest. Using simulations, we demonstrate that iBMU leads to an increase in power to detect true marginal associations over more commonly used variable selection techniques, such as least absolute shrinkage and selection operator and elastic net. In addition, iBMU leads to a more efficient model search algorithm over the basic BMU method even when the predictor-level covariates are only modestly informative. The increase in power and efficiency of our method becomes more substantial as the predictor-level covariates become more informative. Finally, we demonstrate the power and flexibility of iBMU for integrating both gene structure and functional biomarker information into a candidate gene study investigating over 50 genes in the brain reward system and their role with smoking cessation from the Pharmacogenetics of Nicotine Addiction and Treatment Consortium.
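As an illustration of the second-stage probit model mentioned above, the sketch below maps predictor-level covariates to a prior probability that a predictor is associated with the outcome. It is only a schematic reading of the abstract: the function names, the covariate coding, and the coefficient values are hypothetical, and the actual iBMU prior structure and model search are not reproduced here.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prior_inclusion_probability(covariates, coefficients, intercept=0.0):
    """Probit model: P(predictor is associated) = Phi(intercept + z . alpha).

    covariates:   predictor-level covariates z (e.g. annotation indicators).
    coefficients: second-stage effects alpha (hypothetical values here).
    """
    linear = intercept + sum(z * a for z, a in zip(covariates, coefficients))
    return std_normal_cdf(linear)


if __name__ == "__main__":
    alpha = [1.0, -0.5]  # hypothetical second-stage coefficients
    print(prior_inclusion_probability([0, 0], alpha))  # uninformative predictor
    print(prior_inclusion_probability([1, 0], alpha))  # favourable annotation
```

A predictor whose covariates are all zero gets a prior probability of 0.5 when the intercept is 0, and informative covariates shift that probability up or down through the normal CDF.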
Project description:INTRODUCTION: The identification of early, preferably presymptomatic, biomarkers and true etiologic factors for Alzheimer's disease (AD) is the first step toward establishing effective primary and secondary prevention programs. Consequently, the search for a relatively inexpensive and harmless biomarker for AD continues. Despite intensive research worldwide, to date there is no definitive plasma or blood biomarker indicating high or low risk of conversion to AD. METHODS: Magnetic resonance imaging and β-amyloid (Aβ) levels in three blood compartments (diluted in plasma, undiluted in plasma and cell-bound) were measured in 96 subjects (33 with mild cognitive impairment, 14 with AD and 49 healthy controls). Pearson correlations were completed between 113 regions of interest (ROIs) (45 subcortical and 68 cortical) and Aβ levels. Pearson correlation analyses adjusted for the covariates age, sex, apolipoprotein E (ApoE), education and creatinine levels showed neuroimaging ROIs were associated with Aβ levels. Two statistical methods were applied to study the major relationships identified: (1) Pearson correlation with phenotype added as a covariate and (2) a meta-analysis stratified by phenotype. Neuroimaging data and plasma Aβ measurements were taken from 630 Alzheimer's Disease Neuroimaging Initiative (ADNI) subjects to be compared with our results. RESULTS: The left hippocampus was the brain region most correlated with Aβ(1-40) bound to blood cell pellets (partial correlation (pcor) = -0.37, P = 0.0007) after adjustment for the covariates age, gender and education, ApoE and creatinine levels. The correlation remained almost the same (pcor = -0.35, P = 0.002) if phenotype is also added as a covariate. The association between both measurements was independent of cognitive status. The left hemisphere entorhinal cortex also correlated with Aβ(1-40) cell-bound fraction. AB128 and ADNI plasma Aβ measurements were not related to any brain morphometric measurement. CONCLUSIONS: Association of cell-bound Aβ(1-40) in blood with left hippocampal volume was much stronger than previously observed in Aβ plasma fractions. If confirmed, this observation will require careful interpretation and must be taken into account for blood amyloid-based biomarker development.
Project description:Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
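Once a model has been fitted to each imputed data set, the per-imputation results have to be combined. A standard way to do this is Rubin's rules, sketched below. The abstract does not say which pooling rule was used, so treating Rubin's rules as the combination step is an assumption, and the function name and the numbers in the example are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates with Rubin's rules.

    estimates: one point estimate (e.g. a log odds ratio) per imputed data set.
    variances: the squared standard error of each of those estimates.
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Hypothetical log odds ratios and squared standard errors from 3 imputations.
    est, var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
    print(est, var ** 0.5)
```

The total variance adds the between-imputation spread to the average within-imputation variance, which is how multiple imputation reflects the extra uncertainty due to the censored covariate values.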
Project description:Partially linear models provide a useful class of tools for modeling complex data by naturally incorporating a combination of linear and nonlinear effects within one framework. One key question in partially linear models is the choice of model structure, that is, how to decide which covariates are linear and which are nonlinear. This is a fundamental, yet largely unsolved problem for partially linear models. In practice, one often assumes that the model structure is given or known and then makes estimation and inference based on that structure. Alternatively, there are two methods in common use for tackling the problem: hypotheses testing and visual screening based on the marginal fits. Both methods are quite useful in practice but have their drawbacks. First, it is difficult to construct a powerful procedure for testing multiple hypotheses of linear against nonlinear fits. Second, the screening procedure based on the scatterplots of individual covariate fits may provide an educated guess on the regression function form, but the procedure is ad hoc and lacks theoretical justifications. In this article, we propose a new approach to structure selection for partially linear models, called the LAND (Linear And Nonlinear Discoverer). The procedure is developed in an elegant mathematical framework and possesses desired theoretical and computational properties. Under certain regularity conditions, we show that the LAND estimator is able to identify the underlying true model structure correctly and at the same time estimate the multivariate regression function consistently. The convergence rate of the new estimator is established as well. We further propose an iterative algorithm to implement the procedure and illustrate its performance by simulated and real examples. Supplementary materials for this article are available online.
Project description:Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
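The stratified correction described above can be sketched as follows. This is a simplified reading of the abstract, not the authors' estimator: it assumes each subject has already been assigned to a stratum of similar propensity score within their test-result group (the estimation of the propensity score itself is omitted), and that disease status is missing at random within a stratum. The function name and the record layout are hypothetical.

```python
from collections import defaultdict


def corrected_sens_spec(records):
    """Verification-bias-corrected sensitivity and specificity.

    records: iterable of (test, stratum, verified, disease) where
      test     is 1 (positive) or 0 (negative),
      stratum  labels a propensity-score stratum within that test result,
      verified is True if disease status was ascertained,
      disease  is 1 or 0 when verified, and ignored otherwise.
    """
    n_all = defaultdict(int)   # all subjects per (test, stratum)
    n_ver = defaultdict(int)   # verified subjects per (test, stratum)
    n_dis = defaultdict(int)   # verified and diseased per (test, stratum)
    for test, stratum, verified, disease in records:
        key = (test, stratum)
        n_all[key] += 1
        if verified:
            n_ver[key] += 1
            n_dis[key] += disease

    diseased = {0: 0.0, 1: 0.0}   # expected diseased count by test result
    healthy = {0: 0.0, 1: 0.0}    # expected non-diseased count by test result
    for (test, stratum), total in n_all.items():
        if n_ver[(test, stratum)] == 0:
            raise ValueError(f"no verified subjects in stratum {stratum!r}")
        # Disease prevalence among the verified, applied to the whole stratum.
        prevalence = n_dis[(test, stratum)] / n_ver[(test, stratum)]
        diseased[test] += total * prevalence
        healthy[test] += total * (1.0 - prevalence)

    sensitivity = diseased[1] / (diseased[1] + diseased[0])
    specificity = healthy[0] / (healthy[0] + healthy[1])
    return sensitivity, specificity
```

In this sketch, within-stratum disease prevalence is estimated from verified subjects only and then applied to every subject in the stratum, so strata where few subjects were verified are weighted back up to their true size. An estimate based on verified subjects alone would instead over-represent the strata that were verified more often.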
Project description:Algorithmic prediction of RNA secondary structure has been an area of active inquiry since the 1970s. Despite many innovations since then, our best techniques are not yet perfect. The workhorses of the RNA secondary structure prediction engine are recursions first described by Zuker and Stiegler in 1981. These have well understood caveats; a notable flaw is the ad-hoc treatment of multi-loops, also called helical-junctions, that persists today. While several advanced models for multi-loops have been proposed, it seems to have been assumed that incorporating them into the recursions would lead to intractability, and so no algorithms for these models exist. Some of these models include the classical model based on Jacobson-Stockmayer polymer theory, and another by Aalberts and Nadagopal that incorporates two-length-scale polymer physics. We have realized practical, tractable algorithms for each of these models. However, after implementing these algorithms, we found that no advanced model was better than the original, ad-hoc model used for multi-loops. While this is unexpected, it supports the praxis of the current model.
Project description:Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of Stability Selection.
Project description:BACKGROUND:Bryostatin-activated PKC epsilon pre-clinically induces synaptogenesis, anti-apoptosis, anti-amyloid-β oligomers, and anti-hyperphosphorylated tau. OBJECTIVES:To investigate bryostatin safety, tolerability, and efficacy to improve cognition in advanced Alzheimer's disease (AD) patients. METHODS:A double-blind, randomized, placebo-controlled Phase II, 12-week trial of i.v. bryostatin for 150 advanced AD patients (55-85) with MMSE-2 of 4-15, randomized 1:1:1 into 20 μg and 40 μg bryostatin, and placebo arms. The Full Analysis Set (FAS) and the Completer Analysis Set (CAS) were pre-specified alternative assessments (1-sided, p < 0.1 for primary efficacy, and 2-sided, p < 0.05 for pre-specified and post hoc exploratory analyses). RESULTS:The safety profile was similar for 20 μg treatment and placebo patients. The 40 μg patients showed safety and drop-out issues, but no efficacy. Primary improvement of Severe Impairment Battery (SIB) scores at 13 weeks was not significant (p = 0.134) in the FAS, although in the CAS, the SIB comparison favored 20 μg bryostatin compared to placebo patients (p < 0.07). Secondary analyses at weeks 5 and 15 (i.e., 30 days post-final dosing) also favored 20 μg bryostatin compared to placebo patients. A pre-specified ANCOVA for baseline memantine blocking bryostatin and positive post-hoc trend analyses were statistically significant (2-sided, p < 0.05). CONCLUSION:Although the primary endpoint was not significant in the FAS, primary and secondary analyses in the CAS, and pre-specified and post-hoc exploratory analyses did favor bryostatin 20 μg compared to the placebo cohort. These promising Phase II results support further trials of 20 μg bryostatin- without memantine- to treat AD.
Project description:Introduction:The objective of this study was to estimate longitudinal changes in disease progression (measured by Alzheimer's disease assessment scale-cognitive 11-item [ADAS-cog/11] scale) after bapineuzumab treatment and to identify covariates (demographics or baseline characteristics) contributing to the variability in disease progression rate and baseline disease status. Methods:A population-based disease progression model was developed using pooled placebo and bapineuzumab data from two phase-3 studies in APOE ε4 noncarrier and carrier Alzheimer's disease (AD) patients. Results:A beta regression model with the Richards function as the structural component best described ADAS-cog/11 disease progression for mild-to-moderate AD population. This analysis confirmed no effect of bapineuzumab exposure on ADAS-cog/11 progression rate, consistent with the lack of clinical efficacy observed in the statistical analysis of ADAS-cog/11 data in both studies. Assessment of covariates affecting baseline severity revealed that men had a 6% lower baseline ADAS-cog/11 score than women; patients who took two AD concomitant medications had a 19% higher (worse) baseline score; APOE ε4 noncarriers had a 5% lower baseline score; and patients who had AD for a longer duration had a higher baseline score. Furthermore, shorter AD duration, younger age, APOE ε4 carrier status, and use of two AD concomitant medications were associated with faster disease progression rates. Patients who had an ADAS-cog/11 score progression rate that was not statistically significantly different from 0 typically took no AD concomitant medications. Discussion:The beta regression model is a sensible modeling approach to characterize cognitive decline in AD patients. The influence of bapineuzumab exposure on disease progression measured by ADAS-cog/11 was not significant. Trial Registration:ClinicalTrials.gov identifier: NCT00575055 and NCT00574132.
Project description:BACKGROUND:In the Phase 3 REFLECT trial in patients with unresectable hepatocellular carcinoma (uHCC), the multitargeted tyrosine kinase inhibitor, lenvatinib, was noninferior to sorafenib in the primary outcome of overall survival. Post-hoc review revealed imbalances in prognostic variables between treatment arms. Here, we re-analyse overall survival data from REFLECT to adjust for the imbalance in covariates. METHODS:Univariable and multivariable adjustments were undertaken for a candidate set of covariate values that a physician panel indicated could be prognostically associated with overall survival in uHCC. The values included baseline variables observed pre- and post-randomisation. Univariable analyses were based on a stratified Cox model. The multivariable analysis used a "forwards stepwise" Cox model. RESULTS:Univariable analysis identified alpha-fetoprotein (AFP) as the most influential variable. The chosen multivariable Cox model analysis resulted in an estimated adjusted hazard ratio for lenvatinib of 0.814 (95% CI: 0.699-0.948) when only baseline variables were included. Adjusting for post-randomisation treatment variables further increased the estimated superiority of lenvatinib. CONCLUSIONS:Covariate adjustment of REFLECT suggests that the original noninferiority trial likely underestimated the true effect of lenvatinib on overall survival due to an imbalance in baseline prognostic covariates and the greater use of post-treatment therapies in the sorafenib arm. TRIAL REGISTRATION:Trial number: NCT01761266 (Submitted January 2, 2013).