Deep whole-genome sequencing of 90 Han Chinese genomes.
ABSTRACT: Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (?×80) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects.
Project description:Noncoding variants at human chromosome 9p21 near CDKN2A and CDKN2B are associated with type 2 diabetes, myocardial infarction, aneurysm, vertical cup disc ratio and at least five cancers. Here we compare approaches to more comprehensively assess genetic variation in the region. We carried out targeted sequencing at high coverage in 47 individuals and compared the results to pilot data from the 1000 Genomes Project. We imputed variants into type 2 diabetes and myocardial infarction cohorts directly from targeted sequencing, from a genotyped reference panel derived from sequencing and from 1000 Genomes Project low-coverage data. Polymorphisms with frequency >5% were captured well by all strategies. Imputation of intermediate-frequency polymorphisms required a higher density of tag SNPs in disease samples than is available on first-generation genome-wide association study (GWAS) arrays. Our association analyses identified more comprehensive sets of variants showing equivalent statistical association with type 2 diabetes or myocardial infarction, but did not identify stronger associations than the original GWAS signals.
Project description:The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Project description:Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity at detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies.
Project description:Next-generation sequencing technologies can effectively detect the entire spectrum of genomic variation and provide a powerful tool for systematic exploration of the universe of common, low frequency and rare variants in the entire genome. However, the current paradigm for genome-wide association studies (GWAS) is to catalogue and genotype common variants (5% < MAF). The methods and study design for testing the association of low frequency (0.5% < MAF ? 5%) and rare variation (MAF ? 0.5%) have not been thoroughly investigated. The 1000 Genomes Project represents one such endeavour to characterize the human genetic variation pattern at the MAF = 1% level as a foundation for association studies. In this report, we explore different strategies and study designs for the near future GWAS in the post-era, based on both low coverage pilot data and exon pilot data in 1000 Genomes Project.We investigated the linkage disequilibrium (LD) pattern among common and low frequency SNPs and its implication for association studies. We found that the LD between low frequency alleles and low frequency alleles, and low frequency alleles and common alleles are much weaker than the LD between common and common alleles. We examined various tagging designs with and without statistical imputation approaches and compare their power against de novo resequencing in mapping causal variants under various disease models. We used the low coverage pilot data which contain ~14 M SNPs as a hypothetical genotype-array platform (Pilot 14 M) to interrogate its impact on the selection of tag SNPs, mapping coverage and power of association tests. We found that even after imputation we still observed 45.4% of low frequency SNPs which were untaggable and only 67.7% of the low frequency variation was covered by the Pilot 14 M array.This suggested GWAS based on SNP arrays would be ill-suited for association studies of low frequency variation.
Project description:Since publication of the human genome in 2003, geneticists have been interested in risk variant associations to resolve the etiology of traits and complex diseases. The International HapMap Consortium undertook an effort to catalog all common variation across the genome (variants with a minor allele frequency (MAF) of at least 5% in one or more ethnic groups). HapMap along with advances in genotyping technology led to genome-wide association studies which have identified common variants associated with many traits and diseases. In 2008 the 1000 Genomes Project aimed to sequence 2500 individuals and identify rare variants and 99% of variants with a MAF of <1%.To determine whether the 1000 Genomes Project includes all the variants in HapMap, we examined the overlap between single nucleotide polymorphisms (SNPs) genotyped in the two resources using merged phase II/III HapMap data and low coverage pilot data from 1000 Genomes.Comparison of the two data sets showed that approximately 72% of HapMap SNPs were also found in 1000 Genomes Project pilot data. After filtering out HapMap variants with a MAF of <5% (separately for each population), 99% of HapMap SNPs were found in 1000 Genomes data.Not all variants cataloged in HapMap are also cataloged in 1000 Genomes. This could affect decisions about which resource to use for SNP queries, rare variant validation, or imputation. Both the HapMap and 1000 Genomes Project databases are useful resources for human genetics, but it is important to understand the assumptions made and filtering strategies employed by these projects.
Project description:The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation by sequencing at a level that should allow the genome-wide detection of most variants with frequencies as low as 1%. However, in the major histocompatibility complex (MHC), only the top 10 most frequent haplotypes are in the 1% frequency range whereas thousands of haplotypes are present at lower frequencies. Given the limitation of both the coverage and the read length of the sequences generated by the 1000 Genomes Project, the highly variable positions that define HLA alleles may be difficult to identify. We used classical Sanger sequencing techniques to type the HLA-A, HLA-B, HLA-C, HLA-DRB1 and HLA-DQB1 genes in the available 1000 Genomes samples and combined the results with the 103,310 variants in the MHC region genotyped by the 1000 Genomes Project. Using pairwise identity-by-descent distances between individuals and principal component analysis, we established the relationship between ancestry and genetic diversity in the MHC region. As expected, both the MHC variants and the HLA phenotype can identify the major ancestry lineage, informed mainly by the most frequent HLA haplotypes. To some extent, regions of the genome with similar genetic or similar recombination rate have similar properties. An MHC-centric analysis underlines departures between the ancestral background of the MHC and the genome-wide picture. Our analysis of linkage disequilibrium (LD) decay in these samples suggests that overestimation of pairwise LD occurs due to a limited sampling of the MHC diversity. This collection of HLA-specific MHC variants, available on the dbMHC portal, is a valuable resource for future analyses of the role of MHC in population and disease studies.
Project description:Microarray single-nucleotide polymorphism genotyping, combined with imputation of untyped variants, has been widely adopted as an efficient means to interrogate variation across the human genome. "Genomic coverage" is the total proportion of genomic variation captured by an array, either by direct observation or through an indirect means such as linkage disequilibrium or imputation. We have performed imputation-based genomic coverage assessments of eight current genotyping arrays that assay from ~0.3 to ~5 million variants. Coverage was determined separately in each of the four continental ancestry groups in the 1000 Genomes Project phase 1 release. We used the subset of 1000 Genomes variants present on each array to impute the remaining variants and assessed coverage based on correlation between imputed and observed allelic dosages. More than 75% of common variants (minor allele frequency > 0.05) are covered by all arrays in all groups except for African ancestry, and up to ~90% in all ancestries for the highest density arrays. In contrast, less than 40% of less common variants (0.01 < minor allele frequency < 0.05) are covered by low density arrays in all ancestries and 50-80% in high density arrays, depending on ancestry. We also calculated genome-wide power to detect variant-trait association in a case-control design, across varying sample sizes, effect sizes, and minor allele frequency ranges, and compare these array-based power estimates with a hypothetical array that would type all variants in 1000 Genomes. These imputation-based genomic coverage and power analyses are intended as a practical guide to researchers planning genetic studies.
Project description:Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, and the majority of which (?96%) have a minor allele frequency less than 5%. On average, each individual genome carried ?3.3 million SNPs and ?492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely "knock out" the corresponding genes. Across all the 44 genomes, a total of 182 genes were "knocked-out" in at least one individual genome, among which 46 genes were "knocked out" in over 30% of our samples, suggesting that a number of genes are commonly "knocked-out" in general populations. Gene ontology analysis suggested that these commonly "knocked-out" genes are enriched in biological process related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.
Project description:Does genotype imputation with public reference panels identify variants contributing to disease? Genotype imputation using the 1000 Genomes Project (1KG; 2504 individuals) displayed poor coverage at the causal cystic fibrosis (CF) transmembrane conductance regulator (CFTR) locus for the International CF Gene Modifier Consortium. Imputation with the larger Haplotype Reference Consortium (HRC; 32,470 individuals) displayed improved coverage but low sensitivity of variants clinically relevant for CF. A hybrid reference that combined whole genome sequencing (WGS) from 101 CF individuals with the 1KG imputed a greater number of single-nucleotide variants (SNVs) that would be analyzed in a genetic association study (r2???0.3 and MAF???0.5%) than imputation with the HRC, while the HRC excelled in the lower frequency spectrum. Using the 1KG or HRC as reference panels missed the most common CF-causing variants or displayed low imputation accuracy. Designs that incorporate population-specific WGS can improve imputation accuracy at disease-specific loci, while imputation using public data sets can omit disease-relevant genotypes.
Project description:Indels are an important cause of human variation and central to the study of human disease. The 1000 Genomes Project Low-Coverage Pilot identified over 1.3 million indels shorter than 50 bp, of which over 890 were identified as potentially disruptive variants. Yet, despite their ubiquity, the local genomic characteristics of indels remain unexplored.Herein we describe population- and minor allele frequency-based differences in linkage disequilibrium and imputation characteristics for indels included in the 1000 Genomes Project Low-Coverage Pilot for the CEU, YRI and CHB+JPT populations. Common indels were well tagged by nearby SNPs in all studied populations, and were also tagged at a similar rate to common SNPs. Both neutral and functionally deleterious common indels were imputed with greater than 95% concordance from HapMap Phase 3 and OMNI SNP sites. Further, 38 to 56% of low frequency indels were tagged by low frequency SNPs. We were able to impute heterozygous low frequency indels with over 50% concordance. Lastly, our analysis also revealed evidence of ascertainment bias. This bias prevents us from extending the applicability of our results to highly polymorphic indels that could not be identified in the Low-Coverage Pilot.Although further scope exists to improve the imputation of low frequency indels, our study demonstrates that there are already ample opportunities to retrospectively impute indels for prior genome-wide association studies and to incorporate indel imputation into future case/control studies.