On multi-marker tests for association in case-control studies.
ABSTRACT: Genome-wide association studies (GWAs) have identified thousands of DNA loci associated with a variety of traits. Statistical inference is almost always based on single marker hypothesis tests of association and the respective p-values with Bonferroni correction. Since commercially available genomic arrays interrogate hundreds of thousands or even millions of loci simultaneously, many causal yet undetected loci are believed to exist because the conditional power to achieve a genome-wide significance level can be low, in particular for markers with small effect sizes and low minor allele frequencies and in studies with modest sample size. However, the correlation between neighboring markers in the human genome due to linkage disequilibrium (LD) resulting in correlated marker test statistics can be incorporated into multi-marker hypothesis tests, thereby increasing power to detect association. Herein, we establish a theoretical benchmark by quantifying the maximum power achievable for multi-marker tests of association in case-control studies, achievable only when the causal marker is known. Using that genotype correlations within an LD block translate into an asymptotically multivariate normal distribution for score test statistics, we develop a set of weights for the markers that maximize the non-centrality parameter, and assess the relative loss of power for other approaches. We find that the method of Conneely and Boehnke (2007) based on the maximum absolute test statistic observed in an LD block is a practical and powerful method in a variety of settings. We also explore the effect on the power that prior biological or functional knowledge used to narrow down the locus of the causal marker can have, and conclude that this prior knowledge has to be very strong and specific for the power to approach the maximum achievable level, or even beat the power observed for methods such as the one proposed by Conneely and Boehnke (2007).
Project description:In genome-wide association studies hundreds of thousands of loci are scanned in thousands of cases and controls, with the goal of identifying genomic loci underpinning disease. This is a challenging statistical problem requiring strong evidence. Only a small proportion of the heritability of common diseases has so far been explained. This "dark matter of the genome" is a subject of much discussion. It is critical to have experimental design criteria that ensure that associations between genomic loci and phenotypes are robustly detected. To ensure associations are robustly detected we require good power (e.g., 0.8) and sufficiently strong evidence [i.e., a high Bayes factor (e.g., 10(6), meaning the data are 1 million times more likely if the association is real than if there is no association)] to overcome the low prior odds for any given marker in a genome scan to be associated with a causal locus. Power calculations are given for determining the sample sizes necessary to detect effects with the required power and Bayes factor for biallelic markers in linkage disequilibrium with causal loci in additive, dominant, and recessive genetic models. Significantly stronger evidence and larger sample sizes are required than indicated by traditional hypothesis tests and power calculations. Many reported putative effects are not robustly detected and many effects including some large moderately low-frequency effects may remain undetected. These results may explain the dark matter in the genome. The power calculations have been implemented in R and will be available in the R package ldDesign.
Project description:Gene-based tests of association are frequently applied to common SNPs (MAF>5%) as an alternative to single-marker tests. In this analysis we conduct a variety of simulation studies applied to five popular gene-based tests investigating general trends related to their performance in realistic situations. In particular, we focus on the impact of non-causal SNPs and a variety of LD structures on the behavior of these tests. Ultimately, we find that non-causal SNPs can significantly impact the power of all gene-based tests. On average, we find that the "noise" from 6-12 non-causal SNPs will cancel out the "signal" of one causal SNP across five popular gene-based tests. Furthermore, we find complex and differing behavior of the methods in the presence of LD within and between non-causal and causal SNPs. Ultimately, better approaches for a priori prioritization of potentially causal SNPs (e.g., predicting functionality of non-synonymous SNPs), application of these methods to sequenced or fully imputed datasets, and limited use of window-based methods for assigning inter-genic SNPs to genes will improve power. However, significant power loss from non-causal SNPs may remain unless alternative statistical approaches robust to the inclusion of non-causal SNPs are developed.
Project description:In genetic association studies of rare variants, low statistical power and potential violations of established estimator properties are among the main challenges of association tests. Multi-marker tests (MMTs) have been proposed to target these challenges, but any comparison with single-marker tests (SMTs) has to consider that their aim is to identify causal genomic regions instead of variants. Valid power comparisons have been performed for the analysis of binary traits indicating that MMTs have higher power, but there is a lack of conclusive studies for quantitative traits. The aim of our study was therefore to fairly compare SMTs and MMTs in their empirical power to identify the same causal loci associated with a quantitative trait. The results of extensive simulation studies indicate that previous results for binary traits cannot be generalized. First, we show that for the analysis of quantitative traits, conventional estimation methods and test statistics of single-marker approaches have valid properties yielding association tests with valid type I error, even when investigating singletons or doubletons. Furthermore, SMTs lead to more powerful association tests for identifying causal genes than MMTs when the effect sizes of causal variants are large, and less powerful tests when causal variants have small effect sizes. For moderate effect sizes, whether SMTs or MMTs have higher power depends on the sample size and percentage of causal SNVs. For a more complete picture, we also compare the power in studies of quantitative and binary traits, and the power to identify causal genes with the power to identify causal rare variants. In a genetic association analysis of systolic blood pressure in the Genetic Analysis Workshop 19 data, SMTs yielded smaller p-values compared to MMTs for most of the investigated blood pressure genes, and were least influenced by the definition of gene regions.
Project description:Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS - the ability to detect genetic association by linkage disequilibrium (LD) - is also its limitation. Whilst the ever-increasing study size and improved design have augmented the power of GWAS to detect effects, differentiation of causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show important promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on prioritizations of complex disease associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game with consequent wide-ranging translational impact.
Project description:Genome-wide association studies (GWAS) are a widely used study design for detecting genetic causes of complex diseases. Current studies provide good coverage of common causal SNPs, but not rare ones. A popular method to detect rare causal variants is haplotype testing. A disadvantage of this approach is that many parameters are estimated simultaneously, which can mean a loss of power and slower fitting to large datasets.Haplotype testing effectively tests both the allele frequencies and the linkage disequilibrium (LD) structure of the data. LD has previously been shown to be mostly attributable to LD between adjacent SNPs. We propose a generalised linear model (GLM) which models the effects of each SNP in a region as well as the statistical interactions between adjacent pairs. This is compared to two other commonly used multimarker GLMs: one with a main-effect parameter for each SNP; one with a parameter for each haplotype.We show the haplotype model has higher power for rare untyped causal SNPs, the main-effects model has higher power for common untyped causal SNPs, and the proposed model generally has power in between the two others. We show that the relative power of the three methods is dependent on the number of marker haplotypes the causal allele is present on, which depends on the age of the mutation. Except in the case of a common causal variant in high LD with markers, all three multimarker models are superior in power to single-SNP tests.Including the adjacent statistical interactions results in lower inflation in test statistics when a realistic level of population stratification is present in a dataset.Using the multimarker models, we analyse data from the Molecular Genetics of Schizophrenia study. The multimarker models find potential associations that are not found by single-SNP tests. However, multimarker models also require stricter control of data quality since biases can have a larger inflationary effect on multimarker test statistics than on single-SNP test statistics.Analysing a GWAS with multimarker models can yield candidate regions which may contain rare untyped causal variants. This is useful for increasing prior odds of association in future whole-genome sequence analyses.
Project description:The understanding of non-random association between loci, termed linkage disequilibrium (LD), plays a central role in genomic research. Since causal mutations are generally not included in genomic marker data, LD between those and available markers is essential for capturing the effects of causal loci on localizing genes responsible for traits. Thus, the interpretation of association studies requires a detailed knowledge of LD patterns. It is well known that most LD measures depend on minor allele frequencies (MAF) of the considered loci and the magnitude of LD is influenced by the physical distances between loci. In the present study, a procedure to compare the LD structure between genomic regions comprising several markers each is suggested. The approach accounts for different scaling factors, namely the distribution of MAF, the distribution of pair-wise differences in MAF, and the physical extent of compared regions, reflected by the distribution of pair-wise physical distances. In the first step, genomic regions are matched based on similarity in these scaling factors. In the second step, chromosome- and genome-wide significance tests for differences in medians of LD measures in each pair are performed. The proposed framework was applied to test the hypothesis that the average LD is different in genic and non-genic regions. This was tested with a genome-wide approach with data sets for humans (Homo sapiens), a highly selected chicken line (Gallus gallus domesticus) and the model plant Arabidopsis thaliana. In all three data sets we found a significantly higher level of LD in genic regions compared to non-genic regions. About 31% more LD was detected genome-wide in genic compared to non-genic regions in Arabidopsis thaliana, followed by 13.6% in human and 6% chicken. Chromosome-wide comparison discovered significant differences on all 5 chromosomes in Arabidopsis thaliana and on one third of the human and of the chicken chromosomes.
Project description:Integrating genome-wide association (GWAS) and expression quantitative trait locus (eQTL) data into transcriptome-wide association studies (TWAS) based on predicted expression can boost power to detect novel disease loci or pinpoint the susceptibility gene at a known disease locus. However, it is often the case that multiple eQTL genes colocalize at disease loci, making the identification of the true susceptibility gene challenging, due to confounding through linkage disequilibrium (LD). To distinguish between true susceptibility genes (where the genetic effect on phenotype is mediated through expression) and colocalization due to LD, we examine an extension of the Mendelian randomization (MR) egger regression method that allows for LD while only requiring summary association data for both GWAS and eQTL. We derive the standard TWAS approach in the context of MR and show in simulations that the standard TWAS does not control type I error for causal gene identification when eQTLs have pleiotropic or LD-confounded effects on disease. In contrast, LD-aware MR-Egger (LDA MR-Egger) regression can control type I error in this case while attaining similar power as other methods in situations where these provide valid tests. However, when the direct effects of genetic variants on traits are correlated with the eQTL associations, all of the methods we examined including LDA MR-Egger regression can have inflated type I error. We illustrate these methods by integrating gene expression within a recent large-scale breast cancer GWAS to provide guidance on susceptibility gene identification.
Project description:Detecting uncommon causal variants (minor allele frequency [MAF] < 5%) is difficult with commercial single-nucleotide polymorphism (SNP) arrays that are designed to capture common variants (MAF > 5%). Haplotypes can provide insights into underlying linkage disequilibrium (LD) structure and can tag uncommon variants that are not well tagged by common variants. In this work, we propose a wei-SIMc-matching test that inversely weights haplotype similarities with the estimated standard deviation of haplotype counts to boost the power of similarity-based approaches for detecting uncommon causal variants. We then compare the power of the wei-SIMc-matching test with that of several popular haplotype-based tests, including four other similarity-based tests, a global score test for haplotypes (global), a test based on the maximum score statistic over all haplotypes (max), and two newly proposed haplotype-based tests for rare variant detection. With systematic simulations under a wide range of LD patterns, the results show that wei-SIMc-matching and global are the two most powerful tests. Among these two tests, wei-SIMc-matching has reliable asymptotic P-values, whereas global needs permutations to obtain reliable P-values when the frequencies of some haplotype categories are low or when the trait is skewed. Therefore, we recommend wei-SIMc-matching for detecting uncommon causal variants with surrounding common SNPs, in light of its power and computational feasibility.
Project description:While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.
Project description:BACKGROUND:Generally, populations differ in terms of environmental and genetic factors, which can create differences in allele substitution effects between populations. Therefore, a single genotype may have different additive genetic values in different populations. The correlation between the two additive genetic values of a single genotype in two populations is known as the additive genetic correlation between populations and thus, can differ from 1. Our objective was to investigate whether differences in linkage disequilibrium (LD) and allele frequencies of markers and causal loci between populations affect the bias of the estimated genetic correlation. We simulated two populations that were separated by 50 generations and differed in LD pattern between markers and causal loci, as measured by the LD-statistic r. We used a high marker density to represent a high consistency of LD between populations, and lower marker densities to represent situations with a lower consistency of LD between populations. Markers and causal loci were selected to have either similar or different allele frequencies in the two populations. RESULTS:Our results show that genetic correlations were underestimated only slightly when the difference in allele frequencies between the two populations was similar for the markers and the causal loci. A lower marker density, representing a lower consistency of LD between populations, had only a minor effect on the underestimation of the genetic correlation. When the difference in allele frequencies between the two populations was not similar for markers and causal loci, genetic correlations were severely underestimated. This bias occurred because the markers did not predict accurately the relationships at causal loci. CONCLUSIONS:For an unbiased estimation of the genetic correlation between populations, the markers should accurately predict the relationships at the causal loci. To achieve this, it is essential that the difference in allele frequencies between populations is similar for markers and causal loci. Our results show that differences in LD phase between causal loci and markers across populations have little effect on the estimated genetic correlation.