An ancestry informative marker panel design for individual ancestry estimation of Hispanic population using whole exome sequencing data.
ABSTRACT: BACKGROUND: Europeans and American Indians are the major genetic ancestries of Hispanics in the U.S. These ancestral groups have markedly different incidence rates and outcomes in many types of cancers. Therefore, genetic admixture may bias genetic association studies of cancer susceptibility variants, specifically in Hispanics. For example, the incidence rate of liver cancer shows substantial disparity among Hispanic, Asian and non-Hispanic white populations. Currently, ancestry informative marker (AIM) panels with up to a few hundred ancestry-informative single nucleotide polymorphisms (SNPs) are widely used to infer ancestry admixture. Notably, currently available AIMs are predominantly located in intronic and intergenic regions, while the whole exome sequencing (WES) protocols commonly used in translational research and clinical practice do not cover these markers. Thus, it remains challenging to accurately determine a patient's admixture proportion without additional DNA testing. RESULTS: In this study we designed a unique AIM panel that infers 3-way genetic admixture from three distinct, selected continental populations (African (AFR), European (EUR), and East Asian (EAS)) within evolutionarily conserved exonic regions. Initially, about 1 million exonic SNPs from the three selected populations in the 1000 Genomes Project were pruned by linkage disequilibrium (LD), restricted to biallelic variants, and finally optimized into an AIM panel of 250 SNP markers, the UT-AIM250 panel, using their ancestral informativeness statistics. Compared with published AIM panels, UT-AIM250 achieved better accuracy when tested on the three ancestral populations (accuracy: 0.995 ± 0.012 for AFR, 0.997 ± 0.007 for EUR, and 0.994 ± 0.012 for EAS).
We further evaluated the performance of the UT-AIM250 panel on admixed American (AMR) samples of the 1000 Genomes Project and obtained results (AFR, 0.085 ± 0.098; EUR, 0.665 ± 0.182; and EAS, 0.250 ± 0.205) similar to those of previously published AIM panels (Phillips-AIM34: AFR, 0.096 ± 0.127, EUR, 0.575 ± 0.290, and EAS, 0.330 ± 0.315; Wei-AIM278: AFR, 0.070 ± 0.096, EUR, 0.537 ± 0.267, and EAS, 0.393 ± 0.300). Subsequently, we applied the UT-AIM250 panel to a clinical dataset of 26 self-reported Hispanic patients in South Texas with hepatocellular carcinoma (HCC). We estimated the admixture proportions using WES data of adjacent non-cancer liver tissues (AFR, 0.065 ± 0.043; EUR, 0.594 ± 0.150; and EAS, 0.341 ± 0.160). Similar admixture proportions were identified from the corresponding tumor tissues. In addition, we estimated admixture proportions of The Cancer Genome Atlas (TCGA) collection of hepatocellular carcinoma (TCGA-LIHC) samples (376 patients) using the UT-AIM250 panel. The panel obtained consistent admixture proportions from tumor and matched normal tissues, identified 3 cases of possibly misreported race/ethnicity, and provided race/ethnicity determination where needed. CONCLUSIONS: Here we demonstrated the feasibility of using evolutionarily conserved exonic regions to infer admixture proportions and provided a robust and reliable control for sample collection or patient stratification for genetic analysis. An R implementation of UT-AIM250 is available at https://github.com/chenlabgccri/UT-AIM250.
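As a rough illustration of how individual admixture proportions can be estimated from a small AIM panel, the sketch below runs a simple EM update on one individual's genotypes while holding the reference allele frequencies fixed. It is not the UT-AIM250 R implementation; the function name `estimate_admixture`, the toy allele frequencies, and the choice of EM are assumptions made purely for illustration.

```python
def estimate_admixture(genotypes, ref_freqs, n_iter=200):
    """Estimate ancestry proportions for one individual (illustrative sketch).

    genotypes: list of reference-allele counts (0, 1 or 2), one per marker.
    ref_freqs: list of per-marker lists, each holding the reference-allele
               frequency in every ancestral population (same order throughout).
    Returns a list of proportions, one per ancestral population, summing to 1.
    """
    k = len(ref_freqs[0])
    q = [1.0 / k] * k          # start from equal proportions
    eps = 1e-6                 # keep frequencies away from exactly 0 or 1
    for _ in range(n_iter):
        totals = [0.0] * k
        for g, freqs in zip(genotypes, ref_freqs):
            p = [min(max(f, eps), 1.0 - eps) for f in freqs]
            # Weight of each ancestry for a reference / alternate allele copy
            ref_w = [q[i] * p[i] for i in range(k)]
            alt_w = [q[i] * (1.0 - p[i]) for i in range(k)]
            ref_sum, alt_sum = sum(ref_w), sum(alt_w)
            for i in range(k):
                # E-step: expected number of allele copies from ancestry i
                totals[i] += g * ref_w[i] / ref_sum + (2 - g) * alt_w[i] / alt_sum
        # M-step: proportions are the expected share of the 2 * n_markers copies
        q = [t / (2.0 * len(genotypes)) for t in totals]
    return q


# Toy panel (hypothetical values): 60 markers, each highly informative for one
# of three populations listed in the order AFR, EUR, EAS.
toy_freqs = [[0.95, 0.05, 0.05], [0.05, 0.95, 0.05], [0.05, 0.05, 0.95]] * 20
toy_genotypes = [2, 0, 0] * 20   # genotypes typical of the first population
print([round(x, 3) for x in estimate_admixture(toy_genotypes, toy_freqs)])
```

With real data the reference frequencies would come from the ancestral reference samples, and the per-individual estimates would then be summarized as means and standard deviations per group, as in the figures reported above.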
Project description:Publicly available genetic summary data have high utility in research and the clinic, including prioritizing putative causal variants, polygenic scoring, and leveraging common controls. However, summarizing individual-level data can mask population structure, resulting in confounding, reduced power, and incorrect prioritization of putative causal variants. This limits the utility of publicly available data, especially for understudied or admixed populations where additional research and resources are most needed. Although several methods exist to estimate ancestry in individual-level data, methods to estimate ancestry proportions in summary data are lacking. Here, we present Summix, a method to efficiently deconvolute ancestry and provide ancestry-adjusted allele frequencies (AFs) from summary data. Using continental reference ancestry, African (AFR), non-Finnish European (EUR), East Asian (EAS), Indigenous American (IAM), South Asian (SAS), we obtain accurate and precise estimates (within 0.1%) for all simulation scenarios. We apply Summix to gnomAD v.2.1 exome and genome groups and subgroups, finding heterogeneous continental ancestry for several groups, including African/African American (∼84% AFR, ∼14% EUR) and American/Latinx (∼4% AFR, ∼5% EAS, ∼43% EUR, ∼46% IAM). Compared to the unadjusted gnomAD AFs, Summix's ancestry-adjusted AFs more closely match respective African and Latinx reference samples. Even on modern, dense panels of summary statistics, Summix yields results in seconds, allowing for estimation of confidence intervals via block bootstrap. Given an accompanying R package, Summix increases the utility and equity of public genetic resources, empowering novel research opportunities.
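The core idea of estimating ancestry proportions from summary allele frequencies can be sketched as a constrained least-squares fit: find non-negative proportions that sum to one and make the mixture of reference frequencies match the observed frequencies. The code below is a minimal sketch under that assumption, using projected gradient descent; it is not the Summix package's code, and the function names and toy frequencies are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t          # keep the shift from the last valid index
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(observed_af, reference_af, n_iter=2000):
    """Fit mixture proportions to summary allele frequencies (sketch).

    observed_af:  list of allele frequencies in the admixed summary data.
    reference_af: list of per-marker lists of reference allele frequencies,
                  one entry per reference ancestry.
    """
    k = len(reference_af[0])
    # Step size from a simple upper bound on the gradient's Lipschitz constant
    lipschitz = 2.0 * sum(f * f for row in reference_af for f in row)
    step = 1.0 / lipschitz
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        for obs, row in zip(observed_af, reference_af):
            resid = sum(p * f for p, f in zip(pi, row)) - obs
            for i in range(k):
                grad[i] += 2.0 * resid * row[i]
        pi = project_to_simplex([p - step * g for p, g in zip(pi, grad)])
    return pi


# Toy check with made-up reference frequencies and a known mixture
toy_reference = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9]] * 2
true_pi = [0.84, 0.14, 0.02]
toy_observed = [sum(p * f for p, f in zip(true_pi, row)) for row in toy_reference]
print([round(x, 4) for x in estimate_proportions(toy_observed, toy_reference)])
```

Because the fit needs only one pass over the markers per iteration, resampling blocks of markers and refitting is cheap, which is what makes a block-bootstrap confidence interval practical.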
Project description:Following up on our previous study, we conducted a genome-wide analysis of admixture for two Uyghur population samples (HGDP-UG and PanAsia-UG), collected from the northern and southern regions of Xinjiang in China, respectively. Both HGDP-UG and PanAsia-UG showed a substantial admixture of East-Asian (EAS) and European (EUR) ancestries, with an empirical estimation of ancestry contribution of 53:47 (EAS:EUR) and 48:52 for HGDP-UG and PanAsia-UG, respectively. The effective admixture time under a model with a single pulse of admixture was estimated as 110 generations and 129 generations, corresponding to admixture events about 2200 and 2580 years ago for HGDP-UG and PanAsia-UG, respectively, assuming an average of 20 years per generation. Despite Uyghurs' earlier admixture history compared to other admixed populations, admixture mapping holds promise for this population because of its large size and its mixture of ancestry from different continents. We screened multiple databases and identified a genome-wide single-nucleotide polymorphism panel that can distinguish EAS and EUR ancestry of chromosomal segments in Uyghurs. The panel contains 8150 ancestry-informative markers (AIMs) showing large frequency differences between EAS and EUR populations (F(ST) > 0.25, mean F(ST) = 0.43) but small frequency differences (7999 AIMs validated) within both populations (F(ST) < 0.05, mean F(ST) < 0.01). We evaluated the effectiveness of this admixture map for localizing disease genes in two Uyghur populations. To our knowledge, our map constitutes the first practical resource for admixture mapping in Uyghurs, and it will enable studies of diseases showing differences in genetic risk between EUR and EAS populations.
Project description:BACKGROUND: Understanding how biological factors contribute to prostate cancer (PCa) health disparities requires mechanistic functional analysis of specific genes or pathways in pre-clinical cellular and animal models of this malignancy. The 22Rv1 human prostatic carcinoma cell line was originally derived from the parental CWR22R cell line. Although 22Rv1 has been well characterized and used in numerous mechanistic studies, no racial identifier has ever been disclosed for this cell line. In accordance with the need for racial diversity in cancer biospecimens and recent guidelines by the NIH on authentication of key biological resources, we sought to determine the ancestry of 22Rv1 and authenticate previously reported racial identifications for four other PCa cell lines. METHODS: We used 29 established Ancestry Informative Marker (AIM) single nucleotide polymorphisms (SNPs) to conduct DNA ancestry analysis and assign ancestral proportions to a panel of five PCa cell lines that included 22Rv1, PC3, DU145, MDA-PCa-2b, and RC-77T/E. RESULTS: We found that 22Rv1 carries mixed genetic ancestry. The main ancestry proportions for this cell line were 0.41 West African (AFR) and 0.42 European (EUR). In addition, we verified the previously reported racial identifications for PC3 (0.73 EUR), DU145 (0.63 EUR), MDA-PCa-2b (0.73 AFR), and RC-77T/E (0.74 AFR) cell lines. CONCLUSIONS: Considering the mortality disparities associated with PCa, which disproportionately affect African American men, there remains a burden on the scientific community to diversify the availability of biospecimens, including cell lines, for mechanistic studies on potential biological mediators of these disparities. This study is beneficial by identifying another PCa cell line that carries substantial AFR ancestry. This finding may also open the door to new perspectives on previously published studies using this cell line.
Project description:Schizophrenia is a common polygenic disease affecting 0.5-1% of individuals across distinct ethnic populations. PGC-II, the largest genome-wide association study investigating genetic risk factors for schizophrenia, previously identified 128 independent schizophrenia-associated genetic variants (GVs). The current study examined the genetic variability of GVs across ethnic populations. To assess the genetic variability across populations, the 'variability indices' (VIs) of the 128 schizophrenia-associated GVs were calculated. We used 2504 genomes from the 1000 Genomes Project taken from 26 worldwide healthy samples comprising five major ethnicities: East Asian (EAS: n=504), European (EUR: n=503), African (AFR: n=661), American (AMR: n=347) and South Asian (SAS: n=489). The GV with the lowest variability was rs36068923 (VI=1.07). The minor allele frequencies (MAFs) were 0.189, 0.192, 0.256, 0.183 and 0.194 for EAS, EUR, AFR, AMR and SAS, respectively. The GV with the highest variability was rs7432375 (VI=9.46). The MAFs were 0.791, 0.435, 0.041, 0.594 and 0.508 for EAS, EUR, AFR, AMR and SAS, respectively. When we focused on the EAS and EUR populations, the allele frequencies of 86 GVs significantly differed between the EAS and EUR (P<3.91 × 10<sup>-4</sup>). The GV with the highest variability was rs4330281 (P=1.55 × 10<sup>-138</sup>). The MAFs were 0.023 and 0.519 for the EAS and EUR, respectively. The GV with the lowest variability was rs2332700 (P=9.80 × 10<sup>-1</sup>). The MAFs were similar between these populations (that is, 0.246 and 0.247 for the EAS and EUR, respectively). Interestingly, the mean allele frequencies of the GVs did not significantly differ between these populations (P>0.05). Although genetic heterogeneities were observed in the schizophrenia-associated GVs across ethnic groups, the combination of these GVs might increase the risk of schizophrenia.
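The significance cutoff quoted above is numerically consistent with a Bonferroni correction of 0.05 across the 128 variants, although the abstract does not name the correction. The short sketch below shows that arithmetic; treating the cutoff as Bonferroni is an assumption, and the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# 128 schizophrenia-associated variants tested at a family-wise alpha of 0.05
threshold = bonferroni_threshold(0.05, 128)
print(f"{threshold:.2e}")  # prints 3.91e-04

# The two P values quoted in the abstract, compared against that threshold
for rsid, p_value in [("rs4330281", 1.55e-138), ("rs2332700", 9.80e-1)]:
    print(rsid, "differs significantly" if p_value < threshold else "does not differ")
```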
Project description:We evaluated the performance of three PGx panels to estimate biogeographical ancestry: the DMET panel, and the VIP and Preemptive PGx panels described in the literature. Our analyses indicate that the three panels capture the individual variation in admixture proportions observed in recently admixed populations throughout the Americas quite well, with the Preemptive PGx and DMET panels performing better than the VIP panel. We show that these panels provide reliable information about biogeographic ancestry and can be used to guide the implementation of PGx clinical decision-support (CDS) tools. We also report that these panels make it possible to control for the effects of population stratification in association studies in recently admixed populations, as exemplified with a warfarin dosing GWA study in a sample from Brazil.
Project description:For admixture mapping studies in Mexican Americans (MAM), we define a genomewide single-nucleotide-polymorphism (SNP) panel that can distinguish between chromosomal segments of Amerindian (AMI) or European (EUR) ancestry. These studies used genotypes for >400,000 SNPs, defined in EUR and both Pima and Mayan AMI, to define a set of ancestry-informative markers (AIMs). The use of two AMI populations was necessary to remove a subset of SNPs that distinguished genotypes of only one AMI subgroup from EUR genotypes. The AIMs set contained 8,144 SNPs separated by a minimum of 50 kb with only three intermarker intervals >1 Mb and had EUR/AMI FST values >0.30 (mean FST = 0.48) and Mayan/Pima FST values <0.05 (mean FST < 0.01). Analysis of a subset of these SNP AIMs suggested that this panel may also distinguish ancestry between EUR and other disparate AMI groups, including Quechuan from South America. We show, using realistic simulation parameters that are based on our analyses of MAM genotyping results, that this panel of SNP AIMs provides good power for detecting disease-associated chromosomal segments for genes with modest ethnicity risk ratios. A reduced set of 5,287 SNP AIMs captured almost the same admixture mapping information, but smaller SNP sets showed substantial drop-off in admixture mapping information and power. The results will enable studies of type 2 diabetes, rheumatoid arthritis, and other diseases among which epidemiological studies suggest differences in the distribution of ancestry-associated susceptibility.
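The two FST criteria described above (large EUR/AMI differentiation, small Mayan/Pima differentiation) can be sketched as a per-SNP filter. The code below uses a simple two-population FST computed from allele frequencies with equal population weights, and pools the two AMI groups by averaging their frequencies; the estimator, the pooling rule, and the function names are assumptions for illustration, not the authors' procedure. Only the 0.30 and 0.05 cutoffs come from the text.

```python
def fst_two_pop(p1, p2):
    """Two-population FST for a biallelic SNP, equal population weights."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)            # pooled heterozygosity
    if h_total == 0.0:
        return 0.0                                   # monomorphic in both groups
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def is_candidate_aim(eur, pima, mayan, between_min=0.30, within_max=0.05):
    """Keep a SNP that separates EUR from AMI but not Pima from Mayan."""
    ami = (pima + mayan) / 2.0                       # assumed pooling of AMI groups
    return (fst_two_pop(eur, ami) > between_min
            and fst_two_pop(pima, mayan) < within_max)


# Hypothetical allele frequencies for two SNPs
print(is_candidate_aim(eur=0.10, pima=0.90, mayan=0.85))  # informative, AMI groups agree
print(is_candidate_aim(eur=0.10, pima=0.90, mayan=0.20))  # AMI groups disagree
```

A full panel design would additionally enforce the minimum 50 kb spacing between retained markers.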
Project description:To assess the statistical significance of associations between variants and traits, genome-wide association studies (GWAS) should employ an appropriate threshold that accounts for the massive burden of multiple testing in the study. Although most studies in the current literature commonly set a genome-wide significance threshold at the level of P=5.0 × 10<sup>-8</sup>, the adequacy of this value for respective populations has not been fully investigated. To empirically estimate thresholds for different ancestral populations, we conducted GWAS simulations using the 1000 Genomes Phase 3 data set for Africans (AFR), Europeans (EUR), Admixed Americans (AMR), East Asians (EAS) and South Asians (SAS). The estimated empirical genome-wide significance thresholds were P<sub>sig</sub>=3.24 × 10<sup>-8</sup> (AFR), 9.26 × 10<sup>-8</sup> (EUR), 1.83 × 10<sup>-7</sup> (AMR), 1.61 × 10<sup>-7</sup> (EAS) and 9.46 × 10<sup>-8</sup> (SAS). We additionally conducted trans-ethnic meta-analyses across all populations (ALL) and all populations except for AFR (ΔAFR), which yielded P<sub>sig</sub>=3.25 × 10<sup>-8</sup> (ALL) and 4.20 × 10<sup>-8</sup> (ΔAFR). Our results indicate that the current threshold (P=5.0 × 10<sup>-8</sup>) is overly stringent for all ancestral populations except for Africans; however, we should employ a more stringent threshold when conducting a meta-analysis, regardless of the presence of African samples.
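One way to read these thresholds is as the number of independent tests each would correspond to under a Bonferroni correction. The sketch below does that conversion, assuming a family-wise error rate of 0.05 (an assumption; the abstract does not state the rate used in the simulations). The threshold values are copied from the text, and the function name is hypothetical.

```python
def implied_independent_tests(p_sig, alpha=0.05):
    """Number of independent tests for which alpha / n equals p_sig."""
    return alpha / p_sig


# Empirical thresholds reported in the abstract
empirical_thresholds = {
    "AFR": 3.24e-8,
    "EUR": 9.26e-8,
    "AMR": 1.83e-7,
    "EAS": 1.61e-7,
    "SAS": 9.46e-8,
}

conventional = implied_independent_tests(5.0e-8)   # about one million tests
for population, p_sig in empirical_thresholds.items():
    n_tests = implied_independent_tests(p_sig)
    relation = "more" if n_tests > conventional else "fewer"
    print(f"{population}: ~{n_tests:,.0f} implied tests ({relation} than conventional)")
```

Under this reading, only the AFR threshold implies more independent tests than the conventional one million, which matches the abstract's conclusion that the conventional threshold is too lenient only for African samples.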
Project description:Variation in individual admixture proportions leads to heterogeneity within populations. Though novel methods and marker panels have been developed to quantify individual admixture, empirical data describing individual admixture distributions are limited. We investigated variation in individual admixture in four U.S. populations (European American [EA], African American [AA], Hispanics from Connecticut [East Coast, or EC], and Hispanics from California [West Coast, or WC]) assuming three-way intermixture among Europeans, Africans, and Indigenous Americans. Admixture estimates were inferred using a panel of 36 microsatellites and one SNP, which have significant allele frequency differences between ancestral populations, and by using both a maximum likelihood (ML)-based method and a Bayesian method implemented in the program STRUCTURE. Simulation studies showed that estimates obtained with this marker panel are within 96% of expected values. EAs had the lowest non-European admixture with both methods, but showed greater homogeneity with STRUCTURE than with ML. In all other samples, admixture estimates varied widely with both methods, were highly concordant between the methods, and showed evidence of admixture stratification. With both methods, AA subjects had, on average, 16% European and <10% Indigenous American admixture. EC Hispanics had higher mean African admixture and WC Hispanics had higher mean Indigenous American admixture, possibly reflecting their different continental origins.
Project description:Genetic admixture has been utilized as a tool for identifying loci associated with complex traits and diseases in recently admixed populations such as African Americans. In particular, admixture mapping is an efficient approach to identifying the genetic basis of complex diseases with substantial racial or ethnic disparities. Though current advances in admixture mapping algorithms may utilize the entire panel of SNPs, providing ancestry-informative markers (AIMs) that can differentiate parental populations and estimate ancestry proportions in an admixed population may particularly benefit admixture mapping in studies of limited samples, help identify unsuitable individuals (e.g., through genotyping the most informative ancestry markers) before starting large genome-wide association studies (GWAS), or guide larger scale targeted deep re-sequencing for determining specific disease-causing variants. Defining panels of AIMs based on commercial, high-throughput genotyping platforms will facilitate the utilization of these platforms for simultaneous admixture mapping of complex traits and diseases, in addition to conventional GWAS. Here, we describe AIMs detected based on the Shannon Information Content (SIC) or Fst for African Americans with genome-wide coverage that were selected from ~2.3 million single nucleotide polymorphisms (SNPs) covered by the Affymetrix Axiom Pan-African array, a newly developed genotyping platform optimized for individuals of African ancestry.
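An information-based marker score of the kind mentioned above can be illustrated as the mutual information, in bits, between the ancestry of an allele copy and the allele observed. The sketch below is one plausible formulation under that assumption and may differ from the exact SIC definition the authors used; the function names, the mixing proportion `m`, and the example frequencies are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def ancestry_information(p1, p2, m=0.5):
    """Mutual information between ancestry and allele at one biallelic SNP.

    p1, p2: allele frequency in parental populations 1 and 2.
    m:      assumed proportion of allele copies derived from population 1.
    """
    mixed = m * p1 + (1.0 - m) * p2                  # frequency in the admixed pool
    conditional = m * binary_entropy(p1) + (1.0 - m) * binary_entropy(p2)
    return binary_entropy(mixed) - conditional


# Rank a few hypothetical SNPs from most to least informative
candidates = {"snp_a": (0.95, 0.05), "snp_b": (0.60, 0.40), "snp_c": (0.30, 0.30)}
ranked = sorted(candidates, key=lambda s: ancestry_information(*candidates[s]), reverse=True)
print(ranked)
```

A fixed marker carries one full bit when the parental populations contribute equally, and a marker with identical frequencies in both populations carries none, so sorting by this score puts the most ancestry-informative SNPs first.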
Project description:BACKGROUND: Given the scarcity of cell lines from underrepresented populations, it is imperative that genetic ancestry for these cell lines is characterized. Consequences of cell line mischaracterization include squandered resources and publication retractions. METHODS: We calculated genetic ancestry proportions for 15 cell lines to assess the accuracy of previous race/ethnicity classification and determine previously unknown estimates. DNA was extracted from cell lines and genotyped for ancestry informative markers representing West African (WA), Native American (NA), and European (EUR) ancestry. RESULTS: Of the cell lines tested, all previously classified as White/Caucasian were accurately described with mean EUR ancestry proportions of 97%. Cell lines previously classified as Black/African American were not always accurately described. For instance, the 22Rv1 prostate cancer cell line was recently found to carry mixed genetic ancestry using a much smaller panel of markers. However, our more comprehensive analysis determined the 22Rv1 cell line carries 99% EUR ancestry. Most notably, the E006AA-hT prostate cancer cell line, classified as African American, was found to carry 92% EUR ancestry. We also determined the MDA-MB-468 breast cancer cell line carries 23% NA ancestry, suggesting possible Afro-Hispanic/Latina ancestry. CONCLUSIONS: Our results suggest predominantly EUR ancestry for the White/Caucasian-designated cell lines, yet high variance in ancestry for the Black/African American-designated cell lines. In addition, we revealed an extreme misclassification of the E006AA-hT cell line. IMPACT: Genetic ancestry estimates offer more sophisticated characterization leading to better contextualization of findings. Ancestry estimates should be provided for all cell lines to avoid erroneous conclusions in disparities literature.