Haplotype kernel association test as a powerful method to identify chromosomal regions harboring uncommon causal variants.
ABSTRACT: For most complex diseases, the fraction of heritability that can be explained by the variants discovered from genome-wide association studies is minor. Although the so-called "rare variants" (minor allele frequency [MAF] < 1%) have attracted increasing attention, they are unlikely to account for much of the "missing heritability" because very few people may carry these rare variants. The genetic variants that are likely to fill in the "missing heritability" include uncommon causal variants (MAF < 5%), which are generally untyped in association studies using tagging single-nucleotide polymorphisms (SNPs) or commercial SNP arrays. Developing powerful statistical methods can help to identify chromosomal regions harboring uncommon causal variants, while bypassing the genome-wide or exome-wide next-generation sequencing. In this work, we propose a haplotype kernel association test (HKAT) that is equivalent to testing the variance component of random effects for distinct haplotypes. With an appropriate weighting scheme given to haplotypes, we can further enhance the ability of HKAT to detect uncommon causal variants. With scenarios simulated according to the population genetics theory, HKAT is shown to be a powerful method for detecting chromosomal regions harboring uncommon causal variants.
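A minimal sketch of a haplotype kernel score statistic of the kind this abstract describes; it is not the authors' HKAT implementation. It assumes phased haplotype counts per subject, a quantitative trait with no covariates, a linear kernel K = H W H', and weights supplied by the caller. All function names are hypothetical, and a permutation p-value stands in for whatever null distribution the paper derives.

```python
import random


def kernel_score_statistic(y, hap_counts, weights):
    """Q = (y - mean)' H W H' (y - mean) for a weighted linear haplotype kernel.

    y          : list of trait values, one per subject
    hap_counts : hap_counts[i][h] = copies (0, 1, 2) of haplotype h in subject i
    weights    : one non-negative weight per distinct haplotype (assumed scheme)
    """
    n = len(y)
    mean_y = sum(y) / n
    resid = [yi - mean_y for yi in y]
    q = 0.0
    for h, w in enumerate(weights):
        # Score for haplotype h: covariance-like sum of counts times residuals.
        score_h = sum(hap_counts[i][h] * resid[i] for i in range(n))
        q += w * score_h ** 2
    return q


def permutation_p_value(y, hap_counts, weights, n_perm=999, seed=0):
    """Empirical p-value obtained by permuting trait values across subjects."""
    rng = random.Random(seed)
    observed = kernel_score_statistic(y, hap_counts, weights)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if kernel_score_statistic(y_perm, hap_counts, weights) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    trait = [1.0, 0.0, 0.0, 1.0]
    counts = [[2, 0], [1, 1], [0, 2], [1, 1]]  # toy data, two haplotypes
    print(kernel_score_statistic(trait, counts, [1.0, 1.0]))
    print(permutation_p_value(trait, counts, [1.0, 1.0]))
```

A large Q relative to its null distribution suggests that the variance component for haplotype effects is non-zero. Giving larger weights to low-frequency haplotypes is one way to emphasize uncommon variants, but the specific weighting used in the paper is not stated in the abstract.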
Project description:Detecting uncommon causal variants (minor allele frequency [MAF] < 5%) is difficult with commercial single-nucleotide polymorphism (SNP) arrays that are designed to capture common variants (MAF > 5%). Haplotypes can provide insights into underlying linkage disequilibrium (LD) structure and can tag uncommon variants that are not well tagged by common variants. In this work, we propose a wei-SIMc-matching test that inversely weights haplotype similarities with the estimated standard deviation of haplotype counts to boost the power of similarity-based approaches for detecting uncommon causal variants. We then compare the power of the wei-SIMc-matching test with that of several popular haplotype-based tests, including four other similarity-based tests, a global score test for haplotypes (global), a test based on the maximum score statistic over all haplotypes (max), and two newly proposed haplotype-based tests for rare variant detection. With systematic simulations under a wide range of LD patterns, the results show that wei-SIMc-matching and global are the two most powerful tests. Among these two tests, wei-SIMc-matching has reliable asymptotic P-values, whereas global needs permutations to obtain reliable P-values when the frequencies of some haplotype categories are low or when the trait is skewed. Therefore, we recommend wei-SIMc-matching for detecting uncommon causal variants with surrounding common SNPs, in light of its power and computational feasibility.
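The weighting idea in this abstract can be illustrated with a short sketch. It assumes phased diplotypes, a matching measure that scores 1 when two haplotypes are identical and 0 otherwise, and a binomial estimate of the standard deviation of each haplotype count, sqrt(2n p (1 - p)). These choices and the function names are assumptions made for illustration; the paper's exact wei-SIMc-matching statistic may differ.

```python
import math
from collections import Counter


def inverse_sd_weights(diplotypes):
    """Weight each haplotype by 1 / (estimated SD of its count).

    diplotypes: list of (hap1, hap2) tuples, one per subject.
    The SD uses a binomial approximation over 2n chromosomes (assumption).
    """
    n_chrom = 2 * len(diplotypes)
    counts = Counter(h for pair in diplotypes for h in pair)
    weights = {}
    for hap, c in counts.items():
        p = c / n_chrom
        sd = math.sqrt(n_chrom * p * (1.0 - p))
        # A haplotype carried on every chromosome has SD 0 and is uninformative.
        weights[hap] = 1.0 / sd if sd > 0 else 0.0
    return weights


def weighted_matching_similarity(dip_a, dip_b, weights):
    """Average weighted match over the four haplotype pairs of two subjects."""
    total = 0.0
    for ha in dip_a:
        for hb in dip_b:
            if ha == hb:
                total += weights[ha]
    return total / 4.0


if __name__ == "__main__":
    sample = [("A", "A"), ("A", "B"), ("A", "C"), ("A", "A")]
    w = inverse_sd_weights(sample)
    # Sharing the uncommon haplotype B counts for more than sharing common A.
    print(w["A"], w["B"])
    print(weighted_matching_similarity(("A", "B"), ("A", "B"), w))
```

Under this weighting, two subjects who share an uncommon haplotype are scored as more similar than two who share a common one, which is the property the abstract credits for the gain in power.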
Project description:GWAS have been successful in identifying disease susceptibility loci, but it remains a challenge to pinpoint the causal variants in subsequent fine-mapping studies. A conventional fine-mapping effort starts by sequencing dozens of randomly selected samples at susceptibility loci to discover candidate variants, which are then placed on custom arrays or used in imputation algorithms to find the causal variants. We propose that one or several rare or low-frequency causal variants can hitchhike the same common tag SNP, so causal variants may not be easily unveiled by conventional efforts. Here, we first demonstrate that the true effect size and proportion of variance explained by a collection of rare causal variants can be underestimated by a common tag SNP, thereby accounting for some of the "missing heritability" in GWAS. We then describe a case-selection approach based on phasing long-range haplotypes and sequencing cases predicted to harbor causal variants. We compare this approach with conventional strategies on a simulated data set, and we demonstrate its advantages when multiple causal variants are present. We also evaluate this approach in a GWAS on hearing loss, where the most common causal variant has a minor allele frequency (MAF) of 1.3% in the general population and 8.2% in 329 cases. With our case-selection approach, it is present in 88% of the 32 selected cases (MAF = 66%), so sequencing a subset of these cases can readily reveal the causal allele. Our results suggest that thinking beyond common variants is essential in interpreting GWAS signals and identifying causal variants.
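The underestimation described in this abstract can be made concrete with a small calculation. The sketch assumes a single rare causal allele that occurs only on haplotypes carrying the common tag allele, an additive effect on a quantitative trait, and Hardy-Weinberg proportions; the abstract itself considers a collection of rare variants, so this is a simplified single-variant case with a hypothetical function name.

```python
def tag_snp_attenuation(p_causal, p_tag, beta_causal):
    """Effect and variance seen at a tag SNP when a rare causal allele hitchhikes on it.

    p_causal    : frequency of the causal allele
    p_tag       : frequency of the tag allele (must be >= p_causal)
    beta_causal : per-allele effect of the causal variant
    Returns (beta_tag, r2, var_causal, var_tag).
    """
    if not 0.0 < p_causal <= p_tag < 1.0:
        raise ValueError("require 0 < p_causal <= p_tag < 1")
    # LD coefficient D when every causal allele sits on a tag-allele haplotype.
    d = p_causal * (1.0 - p_tag)
    r2 = d ** 2 / (p_causal * (1.0 - p_causal) * p_tag * (1.0 - p_tag))
    # Regression of the trait on the tag genotype: beta * cov(g_c, g_t) / var(g_t).
    beta_tag = beta_causal * d / (p_tag * (1.0 - p_tag))
    var_causal = 2.0 * p_causal * (1.0 - p_causal) * beta_causal ** 2
    var_tag = 2.0 * p_tag * (1.0 - p_tag) * beta_tag ** 2
    return beta_tag, r2, var_causal, var_tag


if __name__ == "__main__":
    beta_tag, r2, var_c, var_t = tag_snp_attenuation(0.01, 0.20, 1.0)
    print(beta_tag, r2, var_t / var_c)
```

In this setting the tag effect equals the causal effect multiplied by p_causal / p_tag, and the variance explained by the tag is r² times the variance explained by the causal variant, so both quantities shrink sharply when the causal allele is much rarer than the tag.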
Project description:Genome-wide association studies (GWAS) have identified hundreds of associated loci across many common diseases. Most risk variants identified by GWAS will merely be tags for as-yet-unknown causal variants. It is therefore possible that identification of the causal variant, by fine mapping, will identify alleles with larger effects on genetic risk than those currently estimated from GWAS replication studies. We show that under plausible assumptions, whilst the majority of the per-allele relative risks (RR) estimated from GWAS data will be close to the true risk at the causal variant, some could be considerable underestimates. For example, for an estimated RR in the range 1.2-1.3, there is approximately a 38% chance that it exceeds 1.4 and a 10% chance that it is over 2. We show how these probabilities can vary depending on the true effects associated with low-frequency variants and on the minor allele frequency (MAF) of the most associated SNP. We investigate the consequences of the underestimation of effect sizes for predictions of an individual's disease risk and interpret our results for the design of fine mapping experiments. Although these effects mean that the amount of heritability explained by known GWAS loci is expected to be larger than current projections, this increase is likely to explain a relatively small amount of the so-called "missing" heritability.
Project description:Genome-wide association studies (GWAS) have provided valuable insights into the genetic basis of complex traits. However, they have explained relatively little trait heritability. Recently, we proposed a new analytical approach called regional heritability mapping (RHM) that captures more of the missing genetic variation. This method is applicable to both related and unrelated populations. Here, we demonstrate the power of RHM in comparison with single-SNP GWAS and gene-based association approaches under a wide range of scenarios with variable numbers of quantitative trait loci (QTL) with common and rare causal variants in a narrow genomic region. Simulations based on real genotype data were performed to assess power to capture QTL variance, and we demonstrate that RHM has greater power to detect rare variants and/or multiple alleles in a region than other approaches. In addition, we show that RHM can capture the QTL variance more accurately when it is caused by multiple independent effects and/or rare variants. We applied RHM to analyze three biometrical eye traits for which single-SNP GWAS have been published or performed to evaluate the effectiveness of this method in real data analysis and detected some additional loci which were not detected by other GWAS methods. RHM has the potential to explain some of the missing heritability by capturing variance caused by QTL with low MAF and multiple independent QTL in a region, not captured by other GWAS methods. RHM analyses can be implemented using the software REACTA (http://www.epcc.ed.ac.uk/projects-portfolio/reacta).
Project description:The proportion of genetic variation in complex traits explained by rare variants is a key question for genomic prediction, and for identifying the basis of "missing heritability"--the proportion of additive genetic variation not captured by common variants on SNP arrays. Sequence variants in transcript and regulatory regions from 429 sequenced animals were used to impute high density SNP genotypes of 3311 Holstein sires to sequence. There were 675,062 common variants (MAF>0.05), 102,549 uncommon variants (0.01<MAF<0.05), and 83,856 rare variants (MAF<0.01). We describe a novel method for estimating the proportion of the rare variants that are sequencing errors using parent-progeny duos. We then used mixed model methodology to estimate the proportion of variance captured by these different classes of variants for fat, milk and protein yields, as well as for fertility. Common sequence variants captured 83%, 77%, 76% and 84% of the total genetic variance for fat, milk, and protein yields and fertility, respectively. This was between 2 and 5% more variance than that captured from 600k SNPs on a high density chip, although the difference was not significant. Rare variants captured 3%, 0%, 1% and 14% of the genetic variance for fat, milk and protein yields, and fertility respectively, whereas pedigree explained the remaining amount of genetic variance (none for fertility). The proportion of variation explained by rare variants is likely to be under-estimated due to reduced accuracies of imputation for this class of variants. Using common sequence variants slightly improved accuracy of genomic predictions for fat and milk yield, compared to high density SNP array genotypes. However, including rare variants from transcript regions did not increase the accuracy of genomic predictions. 
These results suggest that rare variants recover a small percentage of the missing heritability for complex traits; however, very large reference sets will be required to exploit this to improve the accuracy of genomic predictions. Our results do suggest that the contribution of rare variants to genetic variation may be greater for fitness traits.
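The three frequency classes used in this abstract (common, MAF > 0.05; uncommon, 0.01 < MAF < 0.05; rare, MAF < 0.01) can be applied to a variant list as follows. The abstract does not say where variants with MAF exactly 0.01 or 0.05 fall, so the boundary handling here is an assumption, as are the function names.

```python
from collections import Counter


def maf_class(maf):
    """Assign a variant to a frequency class from its minor allele frequency.

    Boundary values (exactly 0.01 or 0.05) go to the higher-frequency class;
    that choice is an assumption, since the abstract leaves it unspecified.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "uncommon"
    return "common"


def tally_classes(mafs):
    """Count how many variants fall into each frequency class."""
    return Counter(maf_class(m) for m in mafs)


if __name__ == "__main__":
    example_mafs = [0.002, 0.03, 0.2, 0.045, 0.4]  # illustrative values only
    print(tally_classes(example_mafs))
```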
Project description:It has been hypothesized that low frequency (1-5% minor allele frequency (MAF)) and rare (<1% MAF) variants with large effect sizes may contribute to the missing heritability in complex traits. Here, we report an association analysis of lipid traits (total cholesterol, LDL-cholesterol, HDL-cholesterol, triglycerides) in up to 27 312 individuals with a comprehensive set of low frequency coding variants (ExomeChip), combined with conditional analysis in the known lipid loci. No new locus reached genome-wide significance. However, we found a new lead variant in 26 known lipid association regions of which 16 were >1000-fold more significant than the previous sentinel variant and not in close LD (six had MAF <5%). Furthermore, conditional analysis revealed multiple independent signals (ranging from 1 to 5) in a third of the 98 lipid loci tested, including rare variants. Addition of our novel associations resulted in between 1.5- and 2.5-fold increase in the proportion of heritability explained for the different lipid traits. Our findings suggest that rare coding variants contribute to the genetic architecture of lipid traits.
Project description:Schizophrenia (SZ) is a common and severe psychiatric disorder with both environmental and genetic risk factors, and a high heritability. After over 20 years of molecular genetics research, new molecular strategies, primarily genome-wide association studies (GWAS), have generated major tangible progress. These new data provide evidence for: (1) a number of chromosomal regions with common polymorphisms showing genome-wide association with SZ (the major histocompatibility complex, MHC, region at 6p22-p21; 18q21.2; and 2q32.1). The associated alleles present small odds ratios (the odds of a risk variant being present in cases vs. controls) and suggest causative involvement of gene regulatory mechanisms in SZ. (2) Polygenic inheritance. (3) Involvement of rare (<1%) and large (>100kb) copy number variants (CNVs). (4) A genetic overlap of SZ with autism and with bipolar disorder (BP) challenging the classical clinical classifications. Most new SZ findings (chromosomal regions and genes) have generated new biological leads. These new findings, however, still need to be translated into a better understanding of the underlying biology and into causal mechanisms. Furthermore, a considerable amount of heritability still remains unexplained (missing heritability). Deep resequencing for rare variants and systems biology approaches (e.g., integrating DNA sequence and functional data) are expected to further improve our understanding of the genetic architecture of SZ and its underlying biology.
Project description:Autoimmune vitiligo is a complex disease involving polygenic risk from at least 50 loci previously identified by genome-wide association studies. The objectives of this study were to estimate and compare vitiligo heritability in European-derived patients using both family-based and 'deep imputation' genotype-based approaches. We estimated family-based heritability (h2FAM) by vitiligo recurrence among a total of 8034 first-degree relatives (3776 siblings, 4258 parents or offspring) of 2122 unrelated vitiligo probands. We estimated genotype-based heritability (h2SNP) by deep imputation to Haplotype Reference Consortium and the 1000 Genomes Project data in 2812 unrelated vitiligo cases and 37 079 controls genotyped genome wide, achieving high-quality imputation from markers with minor allele frequency (MAF) as low as 0.0001. Heritability estimated by both approaches was exceedingly high; h2FAM = 0.75-0.83 and h2SNP = 0.78. These estimates are statistically identical, indicating there is essentially no remaining 'missing heritability' for vitiligo. Overall, ~70% of h2SNP is represented by common variants (MAF > 0.01) and 30% by rare variants. These results demonstrate that essentially all vitiligo heritable risk is captured by array-based genotyping and deep imputation. These findings suggest that vitiligo may provide a particularly tractable model for investigation of complex disease genetic architecture and predictive aspects of personalized medicine.
Project description:Genome-wide association studies have identified several loci associated with plasma lipid levels, but those common variants together account for only a small proportion of the genetic variance of lipid traits. It has been hypothesized that the remaining heritability may partly be explained by rare variants with strong effect sizes. Here, we have comprehensively investigated the associations of both common and uncommon/rare variants in the lipoprotein lipase (LPL) gene in relation to plasma lipoprotein-lipid levels in African Blacks (ABs). For variant discovery purposes, the entire LPL gene and flanking regions were resequenced in 95 ABs with extreme high-density lipoprotein cholesterol (HDL-C) levels. A total of 308 variants were identified, of which 64 were novel. Selected common tagSNPs and uncommon/rare variants were genotyped in the entire sample (n=788), and 126 QC-passed variants were evaluated for their associations with lipoprotein-lipid levels by using single-site, haplotype and rare variant (SKAT-O) association analyses. We found eight not highly correlated (r(2)<0.40) signals (rs1801177:G>A, rs8176337:G>C, rs74304285:G>A, rs252:delA, rs316:C>A, rs329:A>G, rs12679834:T>C, and rs4921684:C>T) nominally (P<0.05) associated with lipid traits (HDL-C, LDL-C, ApoA1 or ApoB levels) in our sample. The most significant SNP, rs252:delA, represented a novel association observed with LDL-C (P=0.002) and ApoB (P=0.012). For TG and LDL-C, the haplotype analysis was more informative than the single-site analysis. The SKAT-O analysis revealed that the bin (group) containing 22 rare variants with MAF≤0.01 exhibited nominal association with TG (P=0.039) and LDL-C (P=0.027). Our study indicates that both common and uncommon/rare LPL variants/haplotypes may affect plasma lipoprotein-lipid levels in the general African population.
Project description:Polygenic scores (PGS) have been widely used to predict disease risk using variants identified from genome-wide association studies (GWAS). To date, most GWAS have been conducted in populations of European ancestry, which limits the use of GWAS-derived PGS in non-European ancestry populations. Here, we derive a theoretical model of the relative accuracy (RA) of PGS across ancestries. We show through extensive simulations that the RA of PGS based on genome-wide significant SNPs can be predicted accurately from modelling linkage disequilibrium (LD), minor allele frequencies (MAF), cross-population correlations of causal SNP effects and heritability. We find that LD and MAF differences between ancestries can explain between 70 and 80% of the loss of RA of European-based PGS in African ancestry for traits like body mass index and type 2 diabetes. Our results suggest that causal variants underlying common genetic variation identified in European ancestry GWAS are mostly shared across continents.