SummaryAUC: a tool for evaluating the performance of polygenic risk prediction models in validation datasets with only summary level statistics.
ABSTRACT: MOTIVATION:Polygenic risk score (PRS) methods based on genome-wide association studies (GWAS) have the potential to predict the risk of developing complex diseases and are expected to become more accurate with larger training datasets and innovative statistical methods. The area under the ROC curve (AUC) is often used to evaluate the performance of PRSs, which requires individual genotypic and phenotypic data in an independent GWAS validation dataset. We are motivated to develop methods for approximating the AUC of PRSs based on the summary level data of the validation dataset, which will greatly facilitate the development of PRS models for complex diseases. RESULTS:We develop statistical methods and an R package SummaryAUC for approximating the AUC and its variance of a PRS when only the summary level data of the validation dataset are available. SummaryAUC can be applied to PRSs with SNPs either genotyped or imputed in the validation dataset. We examined the performance of SummaryAUC using a large-scale GWAS of schizophrenia. SummaryAUC provides accurate approximations to AUCs and their variances. The bias of the AUC is <0.5% in most analyses. SummaryAUC cannot be applied to PRSs that use all SNPs in the genome because it is computationally prohibitive. AVAILABILITY AND IMPLEMENTATION:https://github.com/lsncibb/SummaryAUC. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
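To make the contrast in the abstract above concrete, here is a minimal Python sketch of two ways to obtain an AUC. It is not the SummaryAUC algorithm, which the abstract does not spell out. The first function needs per-individual scores (the usual requirement the abstract describes). The second needs only group means and standard deviations, under the labeled assumption that the score is normally distributed in cases and in controls. The function names `empirical_auc` and `binormal_auc` are hypothetical.

```python
import math
from typing import Sequence


def empirical_auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Mann-Whitney estimate of the AUC from individual-level scores.

    Returns the fraction of (case, control) pairs in which the case has
    the higher score; tied pairs count as one half.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def binormal_auc(mean_case: float, mean_control: float,
                 sd_case: float, sd_control: float) -> float:
    """AUC from summary moments only, ASSUMING normal scores in each group.

    Under that assumption AUC = Phi(delta / sqrt(sd_case^2 + sd_control^2)),
    where delta is the difference in group means and Phi is the standard
    normal CDF (computed here from math.erf).
    """
    z = (mean_case - mean_control) / math.sqrt(sd_case ** 2 + sd_control ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    print(empirical_auc([0.9, 1.4, 2.1], [0.2, 1.0, 1.1]))
    print(binormal_auc(1.0, 0.0, 1.0, 1.0))
```

The second function illustrates why a summary-level approximation is plausible in principle: once a distributional assumption is made, the AUC depends only on a few moments of the score rather than on each individual's data.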
Project description:Stratification of women according to their risk of breast cancer based on polygenic risk scores (PRSs) could improve screening and prevention strategies. Our aim was to develop PRSs, optimized for prediction of estrogen receptor (ER)-specific disease, from the largest available genome-wide association dataset and to empirically validate the PRSs in prospective studies. The development dataset comprised 94,075 case subjects and 75,017 control subjects of European ancestry from 69 studies, divided into training and validation sets. Samples were genotyped using genome-wide arrays, and single-nucleotide polymorphisms (SNPs) were selected by stepwise regression or lasso penalized regression. The best performing PRSs were validated in an independent test set comprising 11,428 case subjects and 18,323 control subjects from 10 prospective studies and 190,040 women from UK Biobank (3,215 incident breast cancers). For the best PRSs (313 SNPs), the odds ratio for overall disease per 1 standard deviation in ten prospective studies was 1.61 (95%CI: 1.57-1.65) with area under receiver-operator curve (AUC) = 0.630 (95%CI: 0.628-0.651). The lifetime risk of overall breast cancer in the top centile of the PRSs was 32.6%. Compared with women in the middle quintile, those in the highest 1% of risk had 4.37- and 2.78-fold risks, and those in the lowest 1% of risk had 0.16- and 0.27-fold risks, of developing ER-positive and ER-negative disease, respectively. Goodness-of-fit tests indicated that this PRS was well calibrated and predicts disease risk accurately in the tails of the distribution. This PRS is a powerful and reliable predictor of breast cancer risk that may improve breast cancer prevention programs.
Project description:Purpose:Elevated intraocular pressure (IOP) is an important risk factor for glaucoma. We constructed polygenic risk scores (PRSs) for IOP using the UK Biobank (UKB) data set to test whether the PRSs are associated with IOP and whether using them improves glaucoma prediction. Methods:We conducted this study using 435,678 European participants from the UKB. We constructed weighted and unweighted PRSs using single nucleotide polymorphisms (SNPs) derived from the UKB data and previously reported IOP SNPs. We examined the associations of the PRSs with IOP and primary open-angle glaucoma (POAG) using linear and logistic regression, respectively. To quantify the discriminatory ability of the PRSs on POAG, we used the area under the receiver operating characteristic curve (AUC). Results:The weighted PRS was significantly associated with IOP (P ∼ 10^-200), after adjusting for age and sex. The PRS explained an additional 4% of variance in IOP. The weighted PRS was also significantly associated with POAG (P = 1.8 × 10^-77). Subjects in the top quintile of the IOP PRS were 6.34 (95% confidence interval [CI]: 4.82-8.33; P = 2.1 × 10^-57) times more likely to have POAG, compared to those in the bottom quintile. The weighted PRS improved the discriminatory power for POAG (AUC increased by 5%, P = 6.2 × 10^-22) when added to the other covariates. The unweighted PRS exhibited similar results. Conclusions:We determined that IOP PRSs are significantly associated with IOP and improve the prediction of POAG. Translational Relevance:PRSs help reduce the burden of glaucoma by early detection of genetically susceptible individuals.
Project description:Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set-based analyses by leveraging public genome-wide association study (GWAS) datasets. PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or different phenotype estimated from independent data. We propose the application of PRSs as a set-based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld-Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD-adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second independent dataset, POLARIS can be used for set-based analysis. MAGMA is an alternative set-based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, both with simple constructed and real LD-structure, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less or comparable power to combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self-contained test of association in the test dataset.
Project description:Gestational diabetes mellitus (GDM) affects 1 in 7 births and is associated with numerous adverse health outcomes for both mother and child. GDM is suspected to share a large common genetic background with type 2 diabetes (T2D). The aim of our study was to characterize different GDM polygenic risk scores (PRSs) and test their association with GDM using data from the South Asian Birth Cohort (START). PRSs were derived for 832 South Asian women from START using the pruning and thresholding (P + T), LDpred, and GraBLD methods. Weights were derived from a multi-ethnic and a white Caucasian study of the DIAGRAM consortium. GDM status was defined using South Asian-specific glucose values in response to an oral glucose tolerance test. Association with GDM was tested using logistic regression. Results were replicated in South Asian women from the UK Biobank (UKB) study. The top ranking P + T, LDpred and GraBLD PRSs were all based on DIAGRAM's multi-ethnic study. The best PRS was highly associated with GDM in START (AUC = 0.62, OR = 1.60 [95% CI = 1.44-1.69]), and in South Asian women from UKB (AUC = 0.65, OR = 1.69 [95% CI = 1.28-2.24]). Our results highlight the importance of combining genome-wide genotypes and summary statistics from large multi-ethnic studies to optimize PRSs in South Asians.
Project description:Many psychiatric disorders are associated with impaired executive functioning (EF). The associated EF component varies by psychiatric disorder, and this variation might be due to genetic liability. We explored the genetic association between five psychiatric disorders and EF in clinically recruited children with attention deficit hyperactivity disorder (ADHD) using polygenic risk score (PRS) methodology. Genome-wide association study (GWAS) summary data for ADHD, major depressive disorder (MDD), schizophrenia (SZ), bipolar disorder (BIP) and autism were used to calculate the PRSs. EF was evaluated by the Stroop test for inhibitory control, the trail-making test for cognitive flexibility, and the digital span test for working memory in a Chinese ADHD cohort (n = 1147). Exploratory factor analysis of the three measures identified one principal component for EF (EF-PC). Linear regression models were used to analyze the association between each PRS and the EF measures. The role of EF measures in mediating the effects of the PRSs on ADHD symptoms was also analyzed. The results showed that the PRSs for MDD, ADHD and BIP were all significantly associated with the EF-PC. For each EF component, the association results were different for the PRSs of the five psychiatric disorders: the PRSs for ADHD and MDD were associated with inhibitory control (adjusted P = 0.0183 and 0.0313, respectively), the PRS for BIP was associated with working memory (adjusted P = 0.0416), and the PRS for SZ was associated with cognitive flexibility (adjusted P = 0.0335). All three EF measures were significantly correlated with ADHD symptoms. In mediation analyses, the ADHD and MDD PRSs, which were associated with inhibitory control, had significant indirect effects on ADHD symptoms through the mediation of inhibitory control. These findings indicate that the polygenic risks for several psychiatric disorders influence specific executive dysfunction in children with ADHD.
The results helped to clarify the relationship between risk genes of each mental disorder and the intermediate cognitive domain, which may further help elucidate the risk genes and motivate efforts to develop EF measures as a diagnostic marker and future treatment target.
Project description:BACKGROUND:Few studies have evaluated the performance of existing breast cancer risk prediction models among women of African ancestry. In replication studies of genetic variants, a change in direction of the risk association is a common phenomenon. Termed flip-flop, it means that a variant is a risk factor in one population but protective in another, affecting the performance of risk prediction models. METHODS:We used data from the genome-wide association study (GWAS) of breast cancer in the African diaspora (The Root consortium), which included 3686 participants of African ancestry from Nigeria, USA, and Barbados. Polygenic risk scores (PRSs) were constructed from the published odds ratios (ORs) of four sets of susceptibility loci for breast cancer. Discrimination capacity was measured using the area under the receiver operating characteristic curve (AUC). RESULTS:The flip-flop phenomenon was observed among 30-40% of variants across studies. Using the 34 variants with consistent directionality among previous studies, we constructed a PRS with an AUC of 0.531 (95% confidence interval [CI]: 0.512-0.550), which is similar to the PRS using 93 variants and ORs from European ancestry populations (AUC = 0.525, 95% CI: 0.506-0.544). Additionally, we found the 34-variant PRS has good discriminative accuracy in women with a family history of breast cancer (AUC = 0.586, 95% CI: 0.532-0.640). CONCLUSIONS:We found that PRSs based on variants identified from prior GWASs conducted in women of European and Asian ancestries did not provide a comparable degree of risk stratification for women of African ancestry. Further large-scale fine-mapping studies in African ancestry populations are desirable to discover population-specific genetic risk variants.
Project description:OBJECTIVE:Pharmacogenomic studies of antipsychotics have typically examined effects of individual polymorphisms. By contrast, polygenic risk scores (PRSs) derived from genome-wide association studies (GWAS) can quantify the influence of thousands of common alleles of small effect in a single measure. The authors examined whether PRSs for schizophrenia were predictive of antipsychotic efficacy in four independent cohorts of patients with first-episode psychosis (total N=510). METHOD:All study subjects received initial treatment with antipsychotic medication for first-episode psychosis, and all were genotyped on standard single-nucleotide polymorphism (SNP) arrays imputed to the 1000 Genomes Project reference panel. PRS was computed based on the results of the large-scale schizophrenia GWAS reported by the Psychiatric Genomics Consortium. Symptoms were measured by using total symptom rating scales at baseline and at week 12 or at the last follow-up visit before dropout. RESULTS:In the discovery cohort, higher PRS significantly predicted higher symptom scores at the 12-week follow-up (controlling for baseline symptoms, sex, age, and ethnicity). The PRS threshold set at a p value <0.01 gave the strongest result in the discovery cohort and was used to replicate the findings in the other three cohorts. Higher PRS significantly predicted greater posttreatment symptoms in the combined replication analysis and was individually significant in two of the three replication cohorts. Across the four cohorts, PRS was significantly predictive of adjusted 12-week symptom scores (pooled partial r=0.18; 3.24% of variance explained). Patients with low PRS were more likely to be treatment responders than patients with high PRS (odds ratio=1.91 in the two Caucasian samples). CONCLUSIONS:Patients with higher PRS for schizophrenia tended to have less improvement with antipsychotic drug treatment. PRS burden may have potential utility as a prognostic biomarker.
Project description:BACKGROUND & AIMS:We developed comprehensive models to determine risk of Barrett's esophagus (BE) or esophageal adenocarcinoma (EAC) based on genetic and non-genetic factors. METHODS:We used pooled data from 3288 patients with BE, 2511 patients with EAC, and 2177 individuals without either (controls) from participants in the international Barrett's and EAC consortium as well as the United Kingdom's BE gene study and stomach and esophageal cancer study. We collected data on 23 genetic variants associated with risk for BE or EAC, and constructed a polygenic risk score (PRS) for cases and controls by summing the risk allele counts for the variants weighted by their natural log-transformed effect estimates (odds ratios) extracted from genome-wide association studies. We also collected data on demographic and lifestyle factors (age, sex, smoking, body mass index, use of nonsteroidal anti-inflammatory drugs) and symptoms of gastroesophageal reflux disease (GERD). Risk models with various combinations of non-genetic factors and the PRS were compared for their accuracy in identifying patients with BE or EAC using the area under the receiver operating characteristic curve (AUC) analysis. RESULTS:Individuals in the highest quartile of risk, based on genetic factors (PRS), had a 2-fold higher risk of BE (odds ratio, 2.22; 95% confidence interval, 1.89-2.60) or EAC (odds ratio, 2.46; 95% confidence interval, 2.07-2.92) than individuals in the lowest quartile of risk based on PRS. Risk models developed based on only demographic or lifestyle factors or GERD symptoms identified patients with BE or EAC with AUC values ranging from 0.637 to 0.667. Combining data on demographic or lifestyle factors with data on GERD symptoms identified patients with BE with an AUC of 0.793 and patients with EAC with an AUC of 0.745. Including PRSs with these data only minimally increased the AUC values for BE (to 0.799) and EAC (to 0.754).
Including the PRSs in the model developed based on non-genetic factors resulted in a net reclassification improvement for BE of 3.0% and for EAC of 5.6%. CONCLUSIONS:We used data from 3 large databases of patients from studies of BE or EAC to develop a risk prediction model based on genetic, clinical, and demographic/lifestyle factors. We identified a PRS that increases discrimination and net reclassification of individuals with vs without BE and EAC. However, the absolute magnitude of improvement is not sufficient to justify its clinical use.
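The PRS construction described in the BE/EAC abstract above (risk allele counts weighted by the natural log of each variant's odds ratio, then summed) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `weighted_prs` and the toy odds ratios are assumptions, not values or code from the study.

```python
import math
from typing import Sequence


def weighted_prs(risk_allele_counts: Sequence[int],
                 odds_ratios: Sequence[float]) -> float:
    """Sum of risk allele counts, each weighted by ln(odds ratio).

    risk_allele_counts: per-variant counts (0, 1 or 2) for one individual.
    odds_ratios: per-variant odds ratios from a GWAS, in the same order.
    """
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("counts and odds ratios must have the same length")
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, odds_ratios))


if __name__ == "__main__":
    # Hypothetical individual with three variants; odds ratios are made up.
    print(weighted_prs([0, 1, 2], [1.5, 1.2, 1.1]))
```

A variant with an odds ratio of 1 contributes nothing to the score, since ln(1) = 0, whatever the allele count.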
Project description:Polygenic prediction using genome-wide SNPs can provide high prediction accuracy for complex traits. Here, we investigate the question of how to account for genetic ancestry when conducting polygenic prediction. We show that the accuracy of polygenic prediction in structured populations may be partly due to genetic ancestry. However, we hypothesized that explicitly modeling ancestry could improve polygenic prediction accuracy. We analyzed three GWAS of hair color (HC), tanning ability (TA), and basal cell carcinoma (BCC) in European Americans (sample size from 7,440 to 9,822) and considered two widely used polygenic prediction approaches: polygenic risk scores (PRSs) and best linear unbiased prediction (BLUP). We compared polygenic prediction without correction for ancestry to polygenic prediction with ancestry as a separate component in the model. In 10-fold cross-validation using the PRS approach, the R^2 for HC increased by 66% (0.0456 to 0.0755; P < 10^-16), the R^2 for TA increased by 123% (0.0154 to 0.0344; P < 10^-16), and the liability-scale R^2 for BCC increased by 68% (0.0138 to 0.0232; P < 10^-16) when explicitly modeling ancestry, which prevents ancestry effects from entering into each SNP effect and being overweighted. Surprisingly, explicitly modeling ancestry produces a similar improvement when using the BLUP approach, which fits all SNPs simultaneously in a single variance component and causes ancestry to be underweighted. We validate our findings via simulations, which show that the differences in prediction accuracy will increase in magnitude as sample sizes increase. In summary, our results show that explicitly modeling ancestry can be important in both PRS and BLUP prediction.
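The percentage gains quoted in the ancestry abstract above are relative increases in R^2, which a short Python check can reproduce from the reported before and after values. The helper name `relative_increase_percent` is hypothetical; the six R^2 values are the ones given in the abstract.

```python
def relative_increase_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # (trait, R^2 without ancestry component, R^2 with ancestry component)
    reported = [
        ("HC", 0.0456, 0.0755),
        ("TA", 0.0154, 0.0344),
        ("BCC", 0.0138, 0.0232),
    ]
    for trait, before, after in reported:
        print(trait, round(relative_increase_percent(before, after)))
```

Rounded to whole percentages, the three results match the 66%, 123% and 68% stated in the abstract.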
Project description:MOTIVATION:Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some 'truth' is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. RESULTS:We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. AVAILABILITY AND IMPLEMENTATION:Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.