The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies.
ABSTRACT: To infer that a single-nucleotide polymorphism (SNP) either affects a phenotype or is in linkage disequilibrium with a causal site, we must have some assurance that any SNP-phenotype correlation is not the result of confounding with environmental variables that also affect the trait. In this study, we examine the properties of linkage disequilibrium (LD) Score regression, a recently developed method for using summary statistics from genome-wide association studies to ensure that confounding does not inflate the number of false positives. We do not treat the effects of genetic variation as a random variable and thus are able to obtain results about the unbiasedness of this method. We demonstrate that LD Score regression can produce estimates of confounding at null SNPs that are unbiased or conservative under fairly general conditions. This robustness holds in the case of the parent genotype affecting the offspring phenotype through some environmental mechanism, despite the resulting correlation over SNPs between LD Scores and the degree of confounding. Additionally, we demonstrate that LD Score regression can produce reasonably robust estimates of the genetic correlation, even when its estimates of the genetic covariance and the two univariate heritabilities are substantially biased.
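One way to see how a genetic correlation can stay stable while its components are biased is to recall that it is a ratio: the genetic covariance divided by the square root of the product of the two heritabilities. The minimal Python sketch below assumes, purely for illustration, that all three estimates are attenuated by the same multiplicative factor; this common-factor assumption and the numeric values are hypothetical and are not taken from the abstract.

```python
import math


def genetic_correlation(cov_g: float, h2_a: float, h2_b: float) -> float:
    """r_g = genetic covariance / sqrt(h2 of trait A * h2 of trait B)."""
    if h2_a <= 0 or h2_b <= 0:
        raise ValueError("heritabilities must be positive")
    return cov_g / math.sqrt(h2_a * h2_b)


# Hypothetical "true" values, chosen only for illustration.
true_rg = genetic_correlation(0.10, 0.25, 0.16)

# Assumption: every estimate is shrunk by the same factor (here 0.7).
shrink = 0.7
biased_rg = genetic_correlation(0.10 * shrink, 0.25 * shrink, 0.16 * shrink)

print(true_rg, biased_rg)
```

Under this assumption the factor cancels between numerator and denominator, so both calls give the same correlation. Biases that differ across the three quantities would not cancel in this way.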
Project description:Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
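The idea of regressing test statistics on LD can be sketched in a few lines of Python. The sketch assumes the commonly cited model in which the expected chi-square statistic of SNP j is intercept + (N * h2 / M) * l_j, where l_j is the LD Score, N the sample size, and M the number of SNPs. It uses plain unweighted least squares on noise-free toy data; the published method uses a weighted regression with resampling-based standard errors, which is not reproduced here. The function names and toy values are hypothetical.

```python
from typing import Sequence, Tuple


def ols_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def ld_score_regression(
    chi2: Sequence[float],
    ld_scores: Sequence[float],
    n_samples: int,
    n_snps: int,
) -> Tuple[float, float]:
    """Unweighted sketch: regress chi-square statistics on LD Scores.

    Returns (intercept, h2). An intercept above 1 suggests inflation from
    confounding; the slope is rescaled by M / N to give SNP heritability.
    """
    intercept, slope = ols_line(ld_scores, chi2)
    return intercept, slope * n_snps / n_samples


# Noise-free toy data built from assumed values (intercept 1.1, h2 0.4).
N, M = 50_000, 1_000_000
toy_ld = [10.0, 50.0, 100.0, 200.0, 400.0]
toy_chi2 = [1.1 + (N * 0.4 / M) * l for l in toy_ld]

est_intercept, est_h2 = ld_score_regression(toy_chi2, toy_ld, N, M)
print(est_intercept, est_h2)
```

Because the toy data lie exactly on the assumed line, the fit recovers the intercept and heritability used to generate them. With real summary statistics the points scatter around the line, and the intercept separates confounding from the LD-dependent polygenic signal.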
Project description:LD score regression is a reliable and efficient method of using genome-wide association study (GWAS) summary-level results data to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes. Because the method relies on summary level results data, LD score regression is computationally tractable even for very large sample sizes. However, publicly available GWAS summary-level data are typically stored in different databases and have different formats, making it difficult to apply LD score regression to estimate genetic correlations across many different traits simultaneously. In this manuscript, we describe LD Hub - a centralized database of summary-level GWAS results for 173 diseases/traits from different publicly available resources/consortia and a web interface that automates the LD score regression analysis pipeline. To demonstrate functionality and validate our software, we replicated previously reported LD score regression analyses of 49 traits/diseases using LD Hub, and estimated SNP heritability and the genetic correlation across the different phenotypes. We also present new results obtained by uploading a recent atopic dermatitis GWAS meta-analysis to examine the genetic correlation between the condition and other potentially related traits.
In response to the growing availability of publicly accessible GWAS summary-level results data, our database and the accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies. The web interface and instructions for using LD Hub are available at http://ldsc.broadinstitute.org/ CONTACT: firstname.lastname@example.org Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set-based analyses by leveraging public genome-wide association study (GWAS) datasets. PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or different phenotype estimated from independent data. We propose the application of PRSs as a set-based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld-Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD-adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second independent dataset, POLARIS can be used for set-based analysis. MAGMA is an alternative set-based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, both with simple constructed and real LD-structure, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less or comparable power to combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self-contained test of association in the test dataset.
Project description:Complex diseases will have multiple functional sites, and it will be invaluable to understand the cross-locus interaction in terms of linkage disequilibrium (LD) between those sites (epistasis) in addition to the haplotype-LD effects. We investigated the statistical properties of a class of matrix-based statistics to assess this epistasis. These statistical methods include two LD contrast tests (Zaykin et al., 2006) and partial least squares regression (Wang et al., 2008). To estimate Type 1 error rates and power, we simulated multiple two-variant disease models using the SIMLA software package. SIMLA allows for the joint action of up to two disease genes in the simulated data with all possible multiplicative interaction effects between them. Our goal was to detect an interaction between multiple disease-causing variants by means of their LD patterns with other markers. We measured the effects that marginal disease effect size, haplotype LD, disease prevalence and minor allele frequency have on cross-locus interaction (epistasis). In the setting of strong allele effects and strong interaction, the correlation between the two disease genes was weak (r=0.2). In a complex system with multiple correlations (both marginal and interaction), it was difficult to determine the source of a significant result. Despite these complications, the partial least squares and modified LD contrast methods maintained adequate power to detect the epistatic effects; however, for many of the analyses we could not separate interaction from a strong marginal effect. While we did not exhaust the entire parameter space of possible models, we do provide guidance on the effects that population parameters have on cross-locus interaction.
Project description:Linkage disequilibrium (LD) provides information about positional cloning, linkage, and evolution that cannot be inferred from other evidence, even when a correct sequence and a linkage map based on more than a handful of families become available. We present theory to construct an LD map for which distances are additive and population-specific maps are expected to be approximately proportional. For this purpose, there is only a modest difference in relative efficiency of haplotypes and diplotypes: resolving the latter into 2-locus haplotypes has significant cost or error and increases information by about 50%. LD maps for a cold spot in 19p13.3 and a more typical region in 3q21 are optimized by interval estimates. For a random sample and trustworthy map the value of LD at large distance can be predicted reliably from information over a small distance and does not depend on the evolutionary variance unless the sample size approaches the population size. Values of the association probability that can be distinguished from the value at large distance are determined not by population size but by time since a critical bottleneck. In these examples, omission of markers with significant Hardy-Weinberg disequilibrium does not improve the map, and widely discrepant draft sequences have similar estimates of the genetic parameters. The LD cold spot in 19p13.3 gives an unusually high estimate of time, supporting an argument that this relationship is general. As predicted for a region with ancient haplotypes or uniformly high recombination, there is no clear evidence of LD clustering. On the contrary, the 3q21 region is resolved into alternating blocks of stable and decreasing LD, as expected from crossover clustering. Construction of a genomewide LD map requires data not yet available, which may be complemented but not replaced by a catalog of haplotypes.
Project description:The number of SNPs required for QTL discovery is justified by the distance at which linkage disequilibrium has decayed. Simulations and real potato SNP data showed how to estimate and interpret LD decay. The magnitude of linkage disequilibrium (LD) and its decay with genetic distance determine the resolution of association mapping, and are useful for assessing the desired numbers of SNPs on arrays. To study LD and LD decay in tetraploid potato, we simulated autotetraploid genotypes and used them to explore the dependence on: (1) the number of haplotypes in the population (the amount of genetic variation) and (2) the percentage of haplotype specific SNPs (hs-SNPs). Several estimators for short-range LD were explored, such as the average r², median r², and other percentiles of r² (80, 90, and 95%). For LD decay, we looked at LD½,90, the distance at which the short-range LD is halved when using the 90th percentile of r² at short range as the estimator for LD. Simulations showed that the performance of various estimators for LD decay strongly depended on the number of haplotypes, although the real value of LD decay was not influenced very much by this number. The estimator LD½,90 was chosen to evaluate LD decay in 537 tetraploid varieties. LD½,90 values were 1.5 Mb for varieties released before 1945 and 0.6 Mb in varieties released after 2005. LD½,90 values within three different subpopulations ranged from 0.7 to 0.9 Mb. LD½,90 was 2.5 Mb for introgressed regions, indicating large haplotype blocks. In pericentromeric heterochromatin, LD decay was negligible. This study demonstrates that several related factors influencing LD decay could be disentangled, that no universal approach can be suggested, and that the estimation of LD decay has to be performed with great care and knowledge of the sampled material.
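A percentile-based half-decay distance of this kind could be computed roughly as in the Python sketch below. It is one possible reading of the estimator, not the authors' procedure: the fixed-width distance bins, the use of the first bin as the short-range reference, the bin-midpoint return value, and all names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation; q is in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def ld_half_decay(
    pairs: Iterable[Tuple[float, float]],
    bin_width: float,
    q: float = 90.0,
) -> Optional[float]:
    """Estimate the distance at which the q-th percentile of r^2 is halved.

    pairs holds (distance, r2) for SNP pairs. Pairs are grouped into
    fixed-width distance bins (an assumption); the first occupied bin gives
    the short-range LD. Returns the midpoint of the first bin whose
    percentile is at most half of that value, or None if LD never halves.
    """
    bins: Dict[int, List[float]] = {}
    for distance, r2 in pairs:
        bins.setdefault(int(distance // bin_width), []).append(r2)
    if not bins:
        return None
    indices = sorted(bins)
    short_range = percentile(bins[indices[0]], q)
    for idx in indices[1:]:
        if percentile(bins[idx], q) <= short_range / 2:
            return (idx + 0.5) * bin_width
    return None


# Toy data (hypothetical): distances in kb, r^2 falling from 0.8 to 0.3.
toy_pairs = [(50, 0.8), (60, 0.8), (150, 0.6), (160, 0.6), (250, 0.3), (260, 0.3)]
half_distance = ld_half_decay(toy_pairs, bin_width=100)
print(half_distance)
```

In the toy data the short-range value is 0.8, so the estimate is the midpoint of the first bin whose 90th percentile is 0.4 or lower. Real data would need a choice of bin width suited to marker density, and sparse bins make percentile estimates unstable.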
Project description:BMapBuilder builds maps of pairwise linkage disequilibrium (LD) in either two or three dimensions. The optimized resolution allows for graphical display of LD for single nucleotide polymorphisms (SNPs) in a whole chromosome. The program is coded in Java, which runs on all relevant operating systems, including Windows, Mac and Unix/Linux, and is available from http://bios.ugr.es/BMapBuilder.
Project description:Population substructure can lead to confounding in tests for genetic association, and failure to adjust properly can result in spurious findings. Here we address this issue of confounding by considering the impact of global ancestry (average ancestry across the genome) and local ancestry (ancestry at a specific chromosomal location) on regression parameters and relative power in ancestry-adjusted and -unadjusted models. We examine theoretical expectations under different scenarios for population substructure, applying different regression models, verifying and generalizing using simulations, and exploring the findings in real-world admixed populations. We show that admixture does not lead to confounding when the trait locus is tested directly in a single admixed population. However, if there is more complex population structure or a marker locus in linkage disequilibrium (LD) with the trait locus is tested, both global and local ancestry can be confounders. Additionally, we show that the genotype parameters of adjusted and unadjusted models all provide tests for LD between the marker and trait locus, but in different contexts. The local ancestry adjusted model tests for LD in the ancestral populations, while tests using the unadjusted and the global ancestry adjusted models depend on LD in the admixed population(s), which may be enriched due to different ancestral allele frequencies. Practically, this implies that global-ancestry adjustment should be used for screening, but local-ancestry adjustment may better inform fine mapping and provide better effect estimates at trait loci.
Project description:Primary open-angle glaucoma (POAG) is the most common chronic optic neuropathy worldwide. Epidemiological studies show a robust positive relation between intraocular pressure (IOP) and POAG and a modest positive association between IOP and blood pressure (BP), while the relation between BP and POAG is controversial. The International Glaucoma Genetics Consortium (n=27,558), the International Consortium on Blood Pressure (n=69,395), and the National Eye Institute Glaucoma Human Genetics Collaboration Heritable Overall Operational Database (n=37,333) represent genome-wide data sets for IOP, BP traits and POAG, respectively. We formed genome-wide significant variant panels for IOP and diastolic BP and found a strong relation with POAG (odds ratio and 95% confidence interval: 1.18 (1.14-1.21), P=1.8 × 10⁻²⁷) for the former trait but no association for the latter (P=0.93). Next, we used linkage disequilibrium (LD) score regression to provide genome-wide estimates of correlation between traits without the need for additional phenotyping. We also compared our genome-wide estimate of heritability between IOP and BP to an estimate based solely on direct measures of these traits in the Erasmus Rucphen Family (ERF; n=2519) study using Sequential Oligogenic Linkage Analysis Routines (SOLAR). LD score regression revealed high genetic correlation between IOP and POAG (48.5%, P=2.1 × 10⁻⁵); however, genetic correlation between IOP and diastolic BP (P=0.86) and between diastolic BP and POAG (P=0.42) were negligible. Using SOLAR in the ERF study, we confirmed the minimal heritability between IOP and diastolic BP (P=0.63). Overall, IOP shares genetic basis with POAG, whereas BP has limited shared genetic correlation with IOP or POAG.
Project description:Genetic correlation is a key population parameter that describes the shared genetic architecture of complex traits and diseases. It can be estimated by current state-of-the-art methods, i.e., linkage disequilibrium score regression (LDSC) and genomic restricted maximum likelihood (GREML). The massively reduced computing burden of LDSC compared to GREML makes it an attractive tool, although the accuracy (i.e., magnitude of standard errors) of LDSC estimates has not been thoroughly studied. In simulation, we show that the accuracy of GREML is generally higher than that of LDSC. When there is genetic heterogeneity between the actual sample and reference data from which LD scores are estimated, the accuracy of LDSC decreases further. In real data analyses estimating the genetic correlation between schizophrenia (SCZ) and body mass index, we show that GREML estimates based on ~150,000 individuals give a higher accuracy than LDSC estimates based on ~400,000 individuals (from combined meta-data). A GREML genomic partitioning analysis reveals that the genetic correlation between SCZ and height is significantly negative for regulatory regions, which whole-genome or LDSC approaches have less power to detect. We conclude that LDSC estimates should be carefully interpreted as there can be uncertainty about homogeneity among combined meta-datasets. We suggest that any interesting findings from massive LDSC analysis for a large number of complex traits should be followed up, where possible, with more detailed analyses with GREML methods, even if sample sizes are smaller.