A partition-based approach to identify gene-environment interactions in genome wide association studies.
ABSTRACT: It is believed that almost all common diseases are the consequence of complex interactions between genetic markers and environmental factors. However, few such interactions have been documented to date. Conventional statistical methods for detecting gene and environmental interactions are often based on the linear regression model, which assumes a linear interaction effect. In this study, we propose a nonparametric partition-based approach that is able to capture complex interaction patterns. We apply this method to the real data set of hypertension provided by Genetic Analysis Workshop 18. Compared with the linear regression model, the proposed approach is able to identify many additional variants with significant gene-environmental interaction effects. We further investigate one single-nucleotide polymorphism identified by our method and show that its gene-environmental interaction effect is, indeed, nonlinear. To adjust for the family dependence of phenotypes, we apply different permutation strategies and investigate their effects on the outcomes.
Project description:Gene-environment (G×E) interactions play key roles in many complex diseases. An increasing number of epidemiological studies have shown the combined effect of multiple environmental exposures on disease risk. However, no appropriate statistical models have been developed to conduct a rigorous assessment of such combined effects when G×E interactions are considered. In this paper, we propose a partial linear varying multi-index coefficient model (PLVMICM) to assess how multiple environmental factors act jointly to modify individual genetic risk on complex disease. Our model includes the varying-index coefficient model as a special case, where discrete variables are admitted as the linear part. Thus PLVMICM allows one to study nonlinear interaction effects between genes and continuous environments as well as linear interactions between genes and discrete environments, simultaneously. We derive a profile method to estimate the parametric coefficients and a B-spline backfitted kernel method to estimate the nonlinear interaction functions. Consistency and asymptotic normality of the parametric and nonparametric estimates are established under some regularity conditions. Hypothesis tests for the parametric coefficients and the nonparametric functions are conducted. Results show that the statistics for testing the parametric coefficients and the nonparametric functions asymptotically follow a χ²-distribution with different degrees of freedom. The utility of the method is demonstrated through extensive simulations and a case study.
Project description:Statistical inference in neuroimaging research often involves testing the significance of regression coefficients in a general linear model. In many applications, the researcher assumes a model of the form Y = α + Xβ + Zγ + ε, where Y is the observed brain signal, and X and Z contain explanatory variables that are thought to be related to the brain signal. The goal is to test the null hypothesis H0: β = 0 with the nuisance parameters γ included in the model. Several nonparametric (permutation) methods have been proposed for this problem, and each method uses some variant of the F ratio as the test statistic. However, recent research suggests that the F ratio can produce invalid permutation tests of H0: β = 0 when the ε terms are heteroscedastic (i.e., have non-constant variance), which can occur for a variety of reasons. This study compares the classic F test statistic to the robust W (Wald) test statistic using eight different permutation methods. The results reveal that permutation tests using the F ratio can produce accurate results when the errors are homoscedastic, but high false positive rates when the errors are heteroscedastic. In contrast, permutation tests using the W test statistic produced valid results when the errors were homoscedastic, and asymptotically valid results when the errors were heteroscedastic. In the situation with homoscedastic errors, permutation tests using the W statistic showed slightly reduced power compared to the F statistic, but the difference disappeared as the sample size n increased. Consequently, the W test statistic is recommended for robust nonparametric hypothesis tests of regression coefficients in neuroimaging research.
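As an illustration of the kind of procedure compared in the abstract above, the following is a minimal sketch of a permutation test for the coefficient on a single predictor X in a linear model with nuisance covariates Z, using a heteroscedasticity-robust Wald statistic. The abstract does not name its eight permutation methods or give the exact form of W, so the choices here are assumptions: the Freedman-Lane scheme (permuting residuals of the nuisance-only model) and an HC0 sandwich variance. The function names `wald_stat` and `freedman_lane_pvalue` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def wald_stat(y, x, Z):
    """Robust Wald statistic for the coefficient on x (HC0 sandwich variance)."""
    n = len(y)
    D = np.column_stack([np.ones(n), x, Z])       # design: intercept, X, nuisance Z
    DtD_inv = np.linalg.inv(D.T @ D)
    coef = DtD_inv @ D.T @ y                      # ordinary least squares estimate
    resid = y - D @ coef
    meat = D.T @ (D * resid[:, None] ** 2)        # sum_i e_i^2 d_i d_i'
    cov = DtD_inv @ meat @ DtD_inv                # sandwich covariance of coef
    return coef[1] ** 2 / cov[1, 1]

def freedman_lane_pvalue(y, x, Z, n_perm=999, seed=0):
    """Permutation p-value for H0: coefficient on x is zero (assumed scheme)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    N = np.column_stack([np.ones(n), Z])          # reduced model: nuisance terms only
    fitted = N @ np.linalg.lstsq(N, y, rcond=None)[0]
    resid = y - fitted
    observed = wald_stat(y, x, Z)
    exceed = 0
    for _ in range(n_perm):
        # Permute reduced-model residuals and rebuild a response under H0.
        y_star = fitted + rng.permutation(resid)
        if wald_stat(y_star, x, Z) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)            # observed data counts as one permutation
```

Swapping `wald_stat` for a classical F ratio would give the non-robust variant of the same scheme; only the statistic changes, not the permutation step.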
Project description:Biomedical studies have a common interest in assessing relationships between multiple related health outcomes and high-dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods for selecting true predictors of adverse health responses from a high-dimensional set of candidate predictors. To address this problem, one may either consider linear regression models for the continuous outcomes or convert these outcomes into binary indicators of adverse responses using predefined cutoffs. The former strategy has the disadvantage of often leading to a poorly fitting model that does not predict risk well, whereas the latter approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for adverse subpopulation regression, which relies on a two-component latent class model, with the dominant component corresponding to (presumed) healthy individuals and the risk of falling in the minority component characterized via a logistic regression. The logistic regression model is designed to accommodate high-dimensional predictors, as occur in studies with a large number of gene by environment interactions, through the use of a flexible nonparametric multiple shrinkage approach. The Gibbs sampler is developed for posterior computation. We evaluate the methods with the use of simulation studies and apply these to a genetic epidemiology study of pregnancy outcomes.
Project description:BACKGROUND:The interactive effect of the IGF pathway genes with the environment may contribute to childhood obesity. Such gene-environment interactions can take on complex forms. Detecting those relationships using longitudinal family studies requires simultaneously accounting for correlations within individuals and families. METHODS:We studied three methods for detecting interaction effects in longitudinal family studies. The twin model and the nonparametric partition-based score test utilized individual outcome averages, whereas the linear mixed model used all available longitudinal data points. Simulation experiments were performed to evaluate the methods' power to detect different gene-environment interaction relationships. These methods were applied to the Quebec Newborn Twin Study data to test for interaction effects between the IGF pathway genes (IGF-1, IGFALS) and environmental factors (physical activity, daycare attendance and sleep duration) on body mass index outcomes. RESULTS:For the simulated data, the twin model with the mean time summary statistic yielded good performance overall. Modelling an interaction as linear when the true model had a different relationship influenced power; for certain non-linear interactions, none of the three methods were effective. Our analysis of the IGF pathway genes showed suggestive association for the joint effect of IGF-1 variant at position 102,791,894 of chromosome 12 and physical activity. However, this association was not statistically significant after multiple testing correction. CONCLUSIONS:The analytical approaches considered in this study were not robust to different gene-environment interactions. Methodological innovations are needed to improve the current methods' performances for detecting non-linear interactions. More studies are needed in order to better understand the IGF pathway's role in childhood obesity development.
Project description:The central aim in this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the "RelATive cEntrality" (RATE) measure to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem from significant covarying relationships with other variants in the data. We illustrate RATE through Bayesian Gaussian process regression, but the methodological innovations apply to other "black box" methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for phenotypes generated by complex genetic architectures. With detailed simulations and two real data association mapping studies, we show that applying RATE enables an explanation for this improved performance.
Project description:We consider in this paper testing for interactions between a genetic marker set and an environmental variable. A common practice in studying gene-environment (GE) interactions is to analyze one single-nucleotide polymorphism (SNP) at a time. It is of significant interest to analyze SNPs in a biologically defined set simultaneously, e.g. gene or pathway. In this paper, we first show that if the main effects of multiple SNPs in a set are associated with a disease/trait, the classical single SNP-GE interaction analysis can be biased. We derive the asymptotic bias and study the conditions under which the classical single SNP-GE interaction analysis is unbiased. We further show that the simple minimum p-value-based SNP-set GE analysis can be biased and have an inflated Type 1 error rate. To overcome these difficulties, we propose a computationally efficient and powerful gene-environment set association test (GESAT) in generalized linear models. Our method tests for SNP-set by environment interactions using a variance component test, and estimates the main SNP effects under the null hypothesis using ridge regression. We evaluate the performance of GESAT using simulation studies, and apply GESAT to data from the Harvard lung cancer genetic study to investigate GE interactions between the SNPs in the 15q24-25.1 region and smoking on lung cancer risk.
Project description:Complex interplay between genetic and environmental factors characterizes the etiology of many diseases. Modeling gene-environment (GxE) interactions is often challenged by the unknown functional form of the environment term in the true data-generating mechanism. We study the impact of misspecification of the environmental exposure effect on inference for the GxE interaction term in linear and logistic regression models. We first examine the asymptotic bias of the GxE interaction regression coefficient, allowing for confounders as well as arbitrary misspecification of the exposure and confounder effects. For linear regression, we show that under gene-environment independence and some confounder-dependent conditions, when the environment effect is misspecified, the regression coefficient of the GxE interaction can be unbiased. However, inference on the GxE interaction is still often incorrect. In logistic regression, we show that the regression coefficient is generally biased if the genetic factor is associated with the outcome directly or indirectly. Further, we show that the standard robust sandwich variance estimator for the GxE interaction does not perform well in practical GxE studies, and we provide an alternative testing procedure that has better finite sample properties.
Project description:The genetic basis of complex traits often involves the function of multiple genetic factors, their interactions and the interaction between the genetic and environmental factors. Gene-environment (G×E) interaction is considered pivotal in determining trait variations and susceptibility of many genetic disorders such as neurodegenerative diseases or mental disorders. Regression-based methods that assume a linear relationship between a disease response and the genetic and environmental factors, as well as their interaction, are the commonly used approach for detecting G×E interaction. The linearity assumption, however, could be easily violated due to non-linear genetic penetrance, which induces non-linear G×E interaction. In this work, we propose to relax the linear G×E assumption and allow for non-linear G×E interaction under a varying coefficient model framework. We propose to estimate the varying coefficients with a regression spline technique. The model allows one to assess the non-linear penetrance of a genetic variant under different environmental stimuli, and therefore helps us gain novel insights into the etiology of a complex disease. Several statistical tests are proposed for a complete dissection of G×E interaction. A wild bootstrap method is adopted to assess the statistical significance. Both simulation and real data analysis demonstrate the power and utility of the proposed method. Our method provides a powerful and testable framework for assessing non-linear G×E interaction.
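To make the varying coefficient idea above concrete, here is a minimal sketch, assuming a model of the form y = b0(E) + b1(E)·G + error, in which both coefficient functions are expanded in a regression spline basis and a wild bootstrap is used to test whether b1(E) is constant (that is, no G×E interaction). The abstract does not specify the basis, the test statistics, or the bootstrap weights, so everything here is an assumption made for illustration: a truncated power cubic basis, a relative reduction in residual sum of squares as the statistic, and Rademacher (±1) weights. The names `spline_basis`, `gxe_stat`, and `wild_bootstrap_gxe` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def spline_basis(e, knots, degree=3):
    """Truncated power basis: 1, e, ..., e^degree, then (e - k)_+^degree per knot."""
    e = np.asarray(e, dtype=float)
    cols = [e ** d for d in range(degree + 1)]
    cols += [np.clip(e - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

def _fit(D, y):
    """Least squares fit; returns fitted values and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    fitted = D @ coef
    return fitted, float(np.sum((y - fitted) ** 2))

def gxe_stat(y, null_D, full_D):
    """Relative drop in RSS when b1 is allowed to vary with the environment."""
    _, rss0 = _fit(null_D, y)
    _, rss1 = _fit(full_D, y)
    return (rss0 - rss1) / rss1

def wild_bootstrap_gxe(y, g, e, knots, n_boot=499, seed=0):
    """Wild bootstrap p-value for H0: b1(E) is constant (assumed test)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    g = np.asarray(g, dtype=float)
    B = spline_basis(e, knots)
    null_D = np.column_stack([B, g])               # b0(E) smooth, b1 constant
    full_D = np.column_stack([B, B * g[:, None]])  # b0(E) and b1(E) both smooth
    observed = gxe_stat(y, null_D, full_D)
    fitted0, _ = _fit(null_D, y)
    resid0 = y - fitted0
    exceed = 0
    for _ in range(n_boot):
        # Flip the sign of each null residual independently (Rademacher weights).
        signs = rng.choice([-1.0, 1.0], size=len(y))
        y_star = fitted0 + resid0 * signs
        if gxe_stat(y_star, null_D, full_D) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Because each residual keeps its own magnitude and only its sign is resampled, the bootstrap responses preserve any dependence of the error variance on G or E, which is the usual motivation for a wild bootstrap over a plain residual bootstrap.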
Project description:Random forest (RF) analysis of genetic data does not require specification of the mode of inheritance, and provides measures of variable importance that incorporate interaction effects. In this paper we describe RF-based approaches for assessment of gene and haplotype importance, and apply these approaches to a subset of the North American Rheumatoid Arthritis Consortium case-control data provided by Genetic Analysis Workshop 16. The RF analyses of 37 genes identified many of the same genes as logistic regression, but also suggested the importance of certain single-nucleotide polymorphisms and genes that were not ranked highly by logistic regression. A new permutation method did not reveal strong evidence of gene-gene interaction effects in these data. Although RFs are a promising approach for genetic data analysis, extensions beyond simple single-nucleotide polymorphism analyses and modifications to improve computational feasibility are needed.
Project description:INTRODUCTION: Many studies examine gene expression data that has been obtained under the influence of multiple factors, such as genetic background, environmental conditions, or exposure to diseases. The interplay of multiple factors may lead to effect modification and confounding. Higher order linear regression models can account for these effects. We present a new methodology for linear model selection and apply it to microarray data of bone marrow-derived macrophages. This experiment investigates the influence of three variable factors: the genetic background of the mice from which the macrophages were obtained, Yersinia enterocolitica infection (two strains, and a mock control), and treatment/non-treatment with interferon-γ. RESULTS: We set up four different linear regression models in a hierarchical order. We introduce the eruption plot as a new practical tool for model selection complementary to global testing. It visually compares the size and significance of effect estimates between two nested models. Using this methodology we were able to select the most appropriate model by keeping only relevant factors showing additional explanatory power. Application to experimental data allowed us to qualify the interaction of factors as either neutral (no interaction), alleviating (co-occurring effects are weaker than expected from the single effects), or aggravating (stronger than expected). We find a biologically meaningful gene cluster of putative C2TA target genes that appear to be co-regulated with MHC class II genes. CONCLUSIONS: We introduced the eruption plot as a tool for visual model comparison to identify relevant higher order interactions in the analysis of expression data obtained under the influence of multiple factors. We conclude that model selection in higher order linear regression models should generally be performed for the analysis of multi-factorial microarray data.