ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies.
ABSTRACT: Set-based analysis that jointly tests the association of variants in a group has emerged as a popular tool for analyzing rare and low-frequency variants in sequencing studies. The existing set-based tests can suffer significant power loss when only a small proportion of variants are causal, and their powers can be sensitive to the number, effect sizes, and effect directions of the causal variants and the choices of weights. Here we propose an aggregated Cauchy association test (ACAT), a general, powerful, and computationally efficient p value combination method for boosting power in sequencing studies. First, by combining variant-level p values, we use ACAT to construct a set-based test (ACAT-V) that is particularly powerful in the presence of only a small number of causal variants in a variant set. Second, by combining different variant-set-level p values, we use ACAT to construct an omnibus test (ACAT-O) that combines the strength of multiple complimentary set-based tests, including the burden test, sequence kernel association test (SKAT), and ACAT-V. Through analysis of extensively simulated data and the whole-genome sequencing data from the Atherosclerosis Risk in Communities (ARIC) study, we demonstrate that ACAT-V complements the SKAT and the burden test, and that ACAT-O has a substantially more robust and higher power than those of the alternative tests.
Project description:There is an emerging interest in sequencing-based association studies of multiple rare variants. Most association tests suggested in the literature involve collapsing rare variants with or without weighting. Recently, a variance-component score test [sequence kernel association test (SKAT)] was proposed to address the limitations of collapsing method. Although SKAT was shown to outperform most of the alternative tests, its applications and power might be restricted and influenced by missing genotypes. In this paper, we suggest a new method based on testing whether the fraction of causal variants in a region is zero. The new association test, T REM , is derived from a random-effects model and allows for missing genotypes, and the choice of weighting function is not required when common and rare variants are analyzed simultaneously. We performed simulations to study the type I error rates and power of four competing tests under various conditions on the sample size, genotype missing rate, variant frequency, effect directionality, and the number of non-causal rare variant and/or causal common variant. The simulation results showed that T REM was a valid test and less sensitive to the inclusion of non-causal rare variants and/or low effect common variants or to the presence of missing genotypes. When the effects were more consistent in the same direction, T REM also had better power performance. Finally, an application to the Shanghai Breast Cancer Study showed that rare causal variants at the FGFR2 gene were detected by T REM and SKAT, but T REM produced more consistent results for different sets of rare and common variants.
Project description:In association studies of complex traits, fixed-effect regression models are usually used to test for association between traits and major gene loci. In recent years, variance-component tests based on mixed models were developed for region-based genetic variant association tests. In the mixed models, the association is tested by a null hypothesis of zero variance via a sequence kernel association test (SKAT), its optimal unified test (SKAT-O), and a combined sum test of rare and common variant effect (SKAT-C). Although there are some comparison studies to evaluate the performance of mixed and fixed models, there is no systematic analysis to determine when the mixed models perform better and when the fixed models perform better. Here we evaluated, based on extensive simulations, the performance of the fixed and mixed model statistics, using genetic variants located in 3, 6, 9, 12, and 15 kb simulated regions. We compared the performance of three models: (i) mixed models that lead to SKAT, SKAT-O, and SKAT-C, (ii) traditional fixed-effect additive models, and (iii) fixed-effect functional regression models. To evaluate the type I error rates of the tests of fixed models, we generated genotype data by two methods: (i) using all variants, (ii) using only rare variants. We found that the fixed-effect tests accurately control or have low false positive rates. We performed simulation analyses to compare power for two scenarios: (i) all causal variants are rare, (ii) some causal variants are rare and some are common. Either one or both of the fixed-effect models performed better than or similar to the mixed models except when (1) the region sizes are 12 and 15 kb and (2) effect sizes are small. Therefore, the assumption of mixed models could be satisfied and SKAT/SKAT-O/SKAT-C could perform better if the number of causal variants is large and each causal variant contributes a small amount to the traits (i.e., polygenes). In major gene association studies, we argue that the fixed-effect models perform better or similarly to mixed models in most cases because some variants should affect the traits relatively large. In practice, it makes sense to perform analysis by both the fixed and mixed effect models and to make a comparison, and this can be readily done using our R codes and the SKAT packages.
Project description:With development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93], an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165], provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formula for SKAT with a new family of kernels to facilitate designing new sequence association studies.
Project description:By using functional data analysis techniques, we developed generalized functional linear models for testing association between a dichotomous trait and multiple genetic variants in a genetic region while adjusting for covariates. Both fixed and mixed effect models are developed and compared. Extensive simulations show that Rao's efficient score tests of the fixed effect models are very conservative since they generate lower type I errors than nominal levels, and global tests of the mixed effect models generate accurate type I errors. Furthermore, we found that the Rao's efficient score test statistics of the fixed effect models have higher power than the sequence kernel association test (SKAT) and its optimal unified version (SKAT-O) in most cases when the causal variants are both rare and common. When the causal variants are all rare (i.e., minor allele frequencies less than 0.03), the Rao's efficient score test statistics and the global tests have similar or slightly lower power than SKAT and SKAT-O. In practice, it is not known whether rare variants or common variants in a gene region are disease related. All we can assume is that a combination of rare and common variants influences disease susceptibility. Thus, the improved performance of our models when the causal variants are both rare and common shows that the proposed models can be very useful in dissecting complex traits. We compare the performance of our methods with SKAT and SKAT-O on real neural tube defects and Hirschsprung's disease datasets. The Rao's efficient score test statistics and the global tests are more sensitive than SKAT and SKAT-O in the real data analysis. Our methods can be used in either gene-disease genome-wide/exome-wide association studies or candidate gene analyses.
Project description:In genetic association analysis, a joint test of multiple distinct phenotypes can increase power to identify sets of trait-associated variants within genes or regions of interest. Existing multiphenotype tests for rare variants make specific assumptions about the patterns of association with underlying causal variants, and the violation of these assumptions can reduce power to detect association. Here, we develop a general framework for testing pleiotropic effects of rare variants on multiple continuous phenotypes using multivariate kernel regression (Multi-SKAT). Multi-SKAT models affect sizes of variants on the phenotypes through a kernel matrix and perform a variance component test of association. We show that many existing tests are equivalent to specific choices of kernel matrices with the Multi-SKAT framework. To increase power of detecting association across tests with different kernel matrices, we developed a fast and accurate approximation of the significance of the minimum observed P value across tests. To account for related individuals, our framework uses random effects for the kinship matrix. Using simulated data and amino acid and exome-array data from the METabolic Syndrome In Men (METSIM) study, we show that Multi-SKAT can improve power over single-phenotype SKAT-O test and existing multiple-phenotype tests, while maintaining Type I error rate.
Project description:We propose in this paper a unified approach for testing the association between rare variants and phenotypes in sequencing association studies. This approach maximizes power by adaptively using the data to optimally combine the burden test and the nonburden sequence kernel association test (SKAT). Burden tests are more powerful when most variants in a region are causal and the effects are in the same direction, whereas SKAT is more powerful when a large fraction of the variants in a region are noncausal or the effects of causal variants are in different directions. The proposed unified test maintains the power in both scenarios. We show that the unified test corresponds to the optimal test in an extended family of SKAT tests, which we refer to as SKAT-O. The second goal of this paper is to develop a small-sample adjustment procedure for the proposed methods for the correction of conservative type I error rates of SKAT family tests when the trait of interest is dichotomous and the sample size is small. Both small-sample-adjusted SKAT and the optimal unified test (SKAT-O) are computationally efficient and can easily be applied to genome-wide sequencing association studies. We evaluate the finite sample performance of the proposed methods using extensive simulation studies and illustrate their application using the acute-lung-injury exome-sequencing data of the National Heart, Lung, and Blood Institute Exome Sequencing Project.
Project description:Rare-variant association testing usually requires some method of aggregation. The next important step is to pinpoint individual rare causal variants among a large number of variants within a genetic region. Recently Ionita-Laza et al. propose a backward elimination (BE) procedure that can identify individual causal variants among the many variants in a gene. The BE procedure removes a variant if excluding this variant can lead to a smaller P-value for the BURDEN test (referred to as "BE-BURDEN") or the SKAT test (referred to as "BE-SKAT"). We here use the adaptive combination of P-values (ADA) method to pinpoint causal variants. Unlike most gene-based association tests, the ADA statistic is built upon per-site P-values of individual variants. It is straightforward to select important variants given the optimal P-value truncation threshold found by ADA. We performed comprehensive simulations to compare ADA with BE-SKAT and BE-BURDEN. Ranking these three approaches according to positive predictive values (PPVs), the percentage of truly causal variants among the total selected variants, we found ADA > BE-SKAT?> BE-BURDEN across all simulation scenarios. We therefore recommend using ADA to pinpoint plausible rare causal variants in a gene.
Project description:<h4>Background</h4>With the development of next generation sequencing (NGS) technology and genotype imputation methods, statistical methods have been proposed to test a set of genomic variants together to detect if any of them is associated with the phenotype or disease. In practice, within the set, there is an unknown proportion of variants truly causal or associated with the disease. There is a demand for statistical methods with high power in both dense and sparse scenarios, where the proportion of causal or associated variants is large or small respectively.<h4>Results</h4>We propose a new association test - weighted Adaptive Fisher (wAF) that can adapt to both dense and sparse scenarios by adding weights to the Adaptive Fisher (AF) method we developed before. Using simulation, we show that wAF enjoys comparable or better power to popular methods such as sequence kernel association tests (SKAT and SKAT-O) and adaptive SPU (aSPU) test. We apply wAF to a publicly available schizophrenia dataset, and successfully detect thirteen genes. Among them, three genes are supported by existing literature; six are plausible as they either relate to other neurological diseases or have relevant biological functions.<h4>Conclusions</h4>The proposed wAF method is a powerful disease-variants association test in both dense and sparse scenarios. Both simulation studies and real data analysis indicate the potential of wAF for new biological findings.
Project description:Single nucleotide polymorphism (SNP) set tests have been a powerful method in analyzing next-generation sequencing (NGS) data. The popular sequence kernel association test (SKAT) method tests a set of variants as random effects in the linear mixed model setting. Its P-value is calculated based on asymptotic theory that requires a large sample size. Therefore, it is known that SKAT is conservative and can lose power at small or moderate sample sizes. Given the current cost of sequencing technology, scales of NGS are still limited. In this report, we derive and implement computationally efficient, exact (nonasymptotic) score (eScore), likelihood ratio (eLRT), and restricted likelihood ratio (eRLRT) tests, ExactVCTest, that can achieve high power even when sample sizes are small. We perform simulation studies under various genetic scenarios. Our ExactVCTest (i.e., eScore, eLRT, eRLRT) exhibits well-controlled type I error. Under the alternative model, eScore P-values are universally smaller than those from SKAT. eLRT and eRLRT demonstrate significantly higher power than eScore, SKAT, and SKAT optimal (SKAT-o) across all scenarios and various samples sizes. We applied these tests to an exome sequencing study. Our findings replicate previous results and shed light on rare variant effects within genes. The software package is implemented in the open source, high-performance technical computing language Julia, and is freely available at https://github.com/Tao-Hu/VarianceComponentTest.jl Analysis of each trait in the exome sequencing data set with 399 individuals and 16,619 genes takes around 1 min on a desktop computer.
Project description:We apply a family-based extension of the sequence kernel association test (SKAT) to 93 trios extracted from the 20 pedigrees in the Genetic Analysis Workshop 18 simulated data. Each extracted trio includes a unique set of parents to ensure conditionally independent trios are sampled. We compare the empirical type I error and power between the family-based SKAT and the burden test under varying percentages of causal single-nucleotide polymorphisms included in the analysis. Our investigation using simulated data suggests that, under the setting used for Genetic Analysis Workshop 18 data, both the family-based SKAT and the burden test have limited power, and that there is no substantial impact of percentage of signal on the power of either test. The low power is partially a result of the small sample size. However, we find that both the family-based SKAT and the burden test are more powerful when we use only rare variants, rather than common variants, to test the association.