Inferring the Molecular Mechanisms of Noncoding Alzheimer's Disease-Associated Genetic Variants.
ABSTRACT: Most of the loci identified by genome-wide association studies (GWAS) for late-onset Alzheimer's disease (LOAD) are in strong linkage disequilibrium (LD) with nearby variants, any of which could be the actual functional variant; these variants often lie in non-protein-coding regions, implicating underlying gene regulatory mechanisms. We set out to characterize the causal variants, regulatory mechanisms, tissue contexts, and target genes underlying these associations. We applied our INFERNO algorithm to the top 19 non-APOE loci from the IGAP GWAS. INFERNO annotated all LD-expanded variants at each locus with tissue-specific regulatory activity. Bayesian co-localization analysis of summary statistics and eQTL data was performed to identify tissue-specific target genes. INFERNO identified enhancer dysregulation in all 19 tag regions analyzed, significant enrichments of enhancer overlaps in the immune-related blood category, and co-localized eQTL signals overlapping enhancers from the matching tissue class in ten regions (ABCA7, BIN1, CASS4, CD2AP, CD33, CELF1, CLU, EPHA1, FERMT2, ZCWPW1). In several cases, we identified dysregulation of long noncoding RNA (lncRNA) transcripts and applied the lncRNA target identification algorithm from INFERNO to characterize their downstream biological effects. We also validated the allele-specific effects of several variants on enhancer function using luciferase expression assays. By integrating functional genomics with GWAS signals, our analysis yielded insights into the regulatory mechanisms, tissue contexts, genes, and biological processes affected by noncoding genetic variation associated with LOAD risk.
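The Bayesian co-localization step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the widely used five-hypothesis formulation, in which per-variant log approximate Bayes factors for the GWAS and the eQTL signal are combined with prior probabilities. The function name, the toy inputs, and the default priors (p1, p2, p12) are illustrative assumptions; the abstract does not state which priors or implementation INFERNO uses.

```python
import math

def coloc_posteriors(labf_gwas, labf_eqtl, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of hypotheses H0..H4 for one locus.

    labf_gwas, labf_eqtl: per-variant log approximate Bayes factors
    (same variants, same order). The priors are assumed defaults.
    Works in linear space for readability; real data with large
    log Bayes factors would need a log-sum-exp formulation.
    """
    if len(labf_gwas) != len(labf_eqtl):
        raise ValueError("both traits must cover the same variants")
    s1 = sum(math.exp(l) for l in labf_gwas)      # evidence, GWAS only
    s2 = sum(math.exp(l) for l in labf_eqtl)      # evidence, eQTL only
    s12 = sum(math.exp(a + b) for a, b in zip(labf_gwas, labf_eqtl))
    unnormalized = [
        1.0,                          # H0: no association with either trait
        p1 * s1,                      # H1: GWAS association only
        p2 * s2,                      # H2: eQTL association only
        p1 * p2 * (s1 * s2 - s12),    # H3: two distinct causal variants
        p12 * s12,                    # H4: one shared causal variant
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy locus with three variants: both signals peak at the first variant.
shared = coloc_posteriors([10.0, 0.0, 0.0], [10.0, 0.0, 0.0])
# Toy locus where the two signals peak at different variants.
distinct = coloc_posteriors([10.0, 0.0, 0.0], [0.0, 0.0, 10.0])
print("P(H4), shared peak:   ", round(shared[4], 3))
print("P(H3), distinct peaks:", round(distinct[3], 3))
```

A high posterior for H4 is the kind of evidence that would support a shared causal variant between the GWAS and eQTL signals; the threshold applied in the study is not given here.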
Project description:The majority of variants identified by genome-wide association studies (GWAS) reside in the noncoding genome, affecting regulatory elements including transcriptional enhancers. However, characterizing their effects requires the integration of GWAS results with context-specific regulatory activity and linkage disequilibrium annotations to identify causal variants underlying noncoding association signals and the regulatory elements, tissue contexts, and target genes they affect. We propose INFERNO, a novel method which integrates hundreds of functional genomics datasets spanning enhancer activity, transcription factor binding sites, and expression quantitative trait loci with GWAS summary statistics. INFERNO includes novel statistical methods to quantify empirical enrichments of tissue-specific enhancer overlap and to identify co-regulatory networks of dysregulated long noncoding RNAs (lncRNAs). We applied INFERNO to two large GWAS studies. For schizophrenia (36,989 cases, 113,075 controls), INFERNO identified putatively causal variants affecting brain enhancers for known schizophrenia-related genes. For inflammatory bowel disease (IBD) (12,882 cases, 21,770 controls), INFERNO found enrichments of immune and digestive enhancers and lncRNAs involved in regulation of the adaptive immune response. In summary, INFERNO comprehensively infers the molecular mechanisms of causal noncoding variants, providing a sensitive hypothesis generation method for post-GWAS analysis. The software is available as an open source pipeline and a web server.
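The idea of an empirical enrichment of tissue-specific enhancer overlap, as described above, can be sketched as a sampling-based p-value: the observed number of enhancer-overlapping variants is compared with the counts obtained from background variant sets. This is only an illustration under stated assumptions. The function name and the toy counts are hypothetical, and how INFERNO actually samples and matches its background variants is not described in this abstract.

```python
def empirical_enrichment_p(observed_overlaps, background_overlaps):
    """One-sided empirical p-value for enhancer-overlap enrichment.

    observed_overlaps: number of variants overlapping enhancers in one
        tissue category for the real GWAS variant set.
    background_overlaps: the same count for each sampled background set
        (assumed to be matched to the real set by the caller).
    The +1 terms keep the estimate away from an impossible p of zero.
    """
    if not background_overlaps:
        raise ValueError("at least one background sample is required")
    as_extreme = sum(1 for b in background_overlaps if b >= observed_overlaps)
    return (as_extreme + 1) / (len(background_overlaps) + 1)

# Toy numbers (hypothetical): 12 observed overlaps vs. 10 background draws.
background = [3, 5, 4, 6, 2, 7, 5, 4, 13, 3]
p_value = empirical_enrichment_p(12, background)
print(f"empirical p = {p_value:.3f}")
```

With only ten background draws the smallest attainable p-value is 1/11, so a real analysis would need many more samples to reach a stringent significance threshold.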
Project description:DNA variants (SNPs) that predispose to common traits often localize within noncoding regulatory elements such as enhancers. Moreover, loci identified by genome-wide association studies (GWAS) often contain multiple SNPs in linkage disequilibrium (LD), any of which may be causal. Thus, determining the effect of these multiple variant SNPs on target transcript levels has been a major challenge. Here, we provide evidence that for six common autoimmune disorders (rheumatoid arthritis, Crohn's disease, celiac disease, multiple sclerosis, lupus, and ulcerative colitis), the GWAS association arises from multiple polymorphisms in LD that map to clusters of enhancer elements active in the same cell type. This finding suggests a "multiple enhancer variant" hypothesis for common traits, where several variants in LD impact multiple enhancers and cooperatively affect gene expression. Using a novel method to delineate enhancer-gene interactions, we show that multiple enhancer variants within a given locus typically target the same gene. Using available data from HapMap and B lymphoblasts as a model system, we provide evidence at numerous loci that multiple enhancer variants cooperatively contribute to altered expression of their gene targets. The effects on target transcript levels tend to be modest and can be either gain- or loss-of-function. Additionally, the genes associated with multiple enhancer variants encode proteins that are often functionally related and enriched in common pathways. Overall, the multiple enhancer variant hypothesis offers a new paradigm by which noncoding variants can confer susceptibility to common traits.
Project description:SUMMARY:We report Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO), a scalable bioinformatics pipeline characterizing non-coding genome-wide association study (GWAS) association findings. SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional datasets across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that SparkINFERNO is more than 60 times more efficient and scales with data size and amount of computational resources. AVAILABILITY AND IMPLEMENTATION:SparkINFERNO runs on clusters or a single server with Apache Spark environment, and is available at https://bitbucket.org/wanglab-upenn/SparkINFERNO or https://hub.docker.com/r/wanglab/spark-inferno. CONTACT:email@example.com. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:Genome-wide association studies (GWAS) have linked dozens of single nucleotide polymorphisms (SNPs) with Parkinson's disease (PD) risk. Ascertaining the functional and eventual causal mechanisms underlying these relationships has proven difficult. The majority of risk SNPs, and nearby SNPs in linkage disequilibrium (LD), are found in intergenic or intronic regions and confer risk through allele-dependent expression of multiple unknown target genes. Combining GWAS results with publicly available GTEx data, generated through eQTL (expression quantitative trait loci) identification studies, enables a direct association of SNPs to gene expression levels and aids in narrowing the large population of potential genetic targets for hypothesis-driven experimental cell biology. Separately, overlapping of SNPs with putative enhancer segmentations can strengthen target filtering. We report here the results of analyzing 7,607 PD risk SNPs along with an additional 23,759 high linkage disequilibrium-associated variants paired with eQTL gene expression. We found that enrichment analysis on the set of genes following target filtering pointed to a single large LD block at 6p21 that contained multiple HLA-MHC-II genes. These MHC-II genes remained associated with PD when the genes were filtered for correlation between GWAS significance and eQTL levels, strongly indicating a direct effect on PD etiology.
Project description:SNPs associated with disease susceptibility often reside in enhancer clusters, or super-enhancers. Constituents of these enhancer clusters cooperate to regulate target genes and often extend beyond the linkage disequilibrium (LD) blocks containing risk SNPs identified in genome-wide association studies (GWAS). We identified 'outside variants', defined as SNPs in weak LD with GWAS risk SNPs that physically interact with risk SNPs as part of a target gene's regulatory circuitry. These outside variants further explain variation in target gene expression beyond that explained by GWAS-associated SNPs. Additionally, the clinical risk associated with GWAS SNPs is considerably modified by the genotype of outside variants. Collectively, these findings suggest a potential model in which outside variants and GWAS SNPs that physically interact in 3D chromatin collude to influence target transcript levels as well as clinical risk. This model offers an additional hypothesis for the source of missing heritability for complex traits.
Project description:Most expression quantitative trait locus (eQTL) studies to date have been performed in heterogeneous tissues as opposed to specific cell types. To better understand the cell-type-specific regulatory landscape of human melanocytes, which give rise to melanoma but account for <5% of typical human skin biopsies, we performed an eQTL analysis in primary melanocyte cultures from 106 newborn males. We identified 597,335 cis-eQTL SNPs prior to linkage disequilibrium (LD) pruning and 4997 eGenes (FDR < 0.05). Melanocyte eQTLs differed considerably from those identified in the 44 GTEx tissue types, including skin. Over a third of melanocyte eGenes, including key genes in melanin synthesis pathways, were unique to melanocytes compared to those of GTEx skin tissues or TCGA melanomas. The melanocyte data set also identified trans-eQTLs, including those connecting a pigmentation-associated functional SNP with four genes, likely through cis-regulation of IRF4. Melanocyte eQTLs are enriched in cis-regulatory signatures found in melanocytes as well as in melanoma-associated variants identified through genome-wide association studies. Melanocyte eQTLs also colocalized with melanoma GWAS variants in five known loci. Finally, a transcriptome-wide association study using melanocyte eQTLs uncovered four novel susceptibility loci, where imputed expression levels of five genes (ZFP90, HEBP1, MSC, CBWD1, and RP11-383H13.1) were associated with melanoma at genome-wide significant P-values. Our data highlight the utility of lineage-specific eQTL resources for annotating GWAS findings, and present a robust database for genomic research of melanoma risk and melanocyte biology.
Project description:Using RefSeq annotations, most disease/trait-associated genetic variants identified by genome-wide association studies (GWAS) appear to be located within intronic or intergenic regions, which makes it difficult to interpret their functions. We reassessed GWAS-Associated single-nucleotide polymorphisms (herein termed GASs) for their potential functionalities using integrative approaches. 8834 of 9184 RefSeq "noncoding" GASs were reassessed to have potential regulatory functionalities. As examples, 3 variants (rs3130320, rs3806932 and rs6890853) were shown to have regulatory properties in HepG2, A549 and 293T cells. Whereas rs3130320 is a known expression quantitative trait locus (eQTL), rs3806932 and rs6890853 had not previously been reported as eQTLs. 1999 of 9184 "noncoding" GASs were re-annotated to promoters or intragenic regions using Ensembl, UCSC and AceView gene annotations, although they were not annotated to the corresponding regions in the RefSeq database. Moreover, these GAS-harboring genes were broadly expressed across different tissues and a portion of them were expressed in a tissue-specific manner, suggesting that they could be functional. Collectively, our study demonstrates the benefits of using integrative analyses to interpret genetic variants and may help to predict or explain disease susceptibility more accurately and comprehensively.
Project description:BACKGROUND:There are two main types of lung cancer: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC has many subtypes, but the two most common are lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). These subtypes are mainly classified by physiological and pathological characteristics, although there is increasing evidence of genetic and molecular differences as well. Although some work has been done at the somatic level to explore the genetic and biological differences among subtypes, little work has been done that interrogates these differences at the germline level to characterize the unique and shared susceptibility genes for each subtype. METHODS:We used single-nucleotide polymorphisms (SNPs) from a genome-wide association study (GWAS) of European samples to interrogate the similarity of the subtypes at the SNP, gene, pathway, and regulatory levels. We expanded these genotyped SNPs to include all SNPs in linkage disequilibrium (LD) using data from the 1000 Genomes Project. We mapped these SNPs to several lung tissue expression quantitative trait loci (eQTL) and enhancer datasets to identify regulatory SNPs and their target genes. We used these genes to perform a biological pathway analysis for each subtype. RESULTS:We identified 8295, 8734, and 8361 SNPs with moderate association signals for LUAD, LUSC, and SCLC, respectively. Those SNPs had p < 1 × 10⁻³ in the original GWAS or were in LD (r² > 0.8, Europeans) with the genotyped SNPs. We identified 215, 320, and 172 disease-associated genes for LUAD, LUSC, and SCLC, respectively. Only five genes (CHRNA5, IDH3A, PSMA4, RP11-650L12.2, and TBC1D2B) overlapped all subtypes. Furthermore, we observed only two pathways from the Kyoto Encyclopedia of Genes and Genomes shared by all subtypes. At the regulatory level, only three eQTL target genes and two enhancer target genes overlapped between all subtypes. CONCLUSIONS:Our results suggest that the three lung cancer subtypes do not share much genetic signal at the SNP, gene, pathway, or regulatory level, which differs from the common subtype classification based upon histology. However, three (CHRNA5, IDH3A, and PSMA4) of the five genes shared between the subtypes are well-known lung cancer genes that may act as general lung cancer genes regardless of subtype.
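The SNP selection and cross-subtype comparison described in the lung cancer abstract above can be sketched in a few lines. The thresholds (p below 1e-3, r² above 0.8) come from the abstract; everything else is an assumption made for illustration, including the function names, the dictionary-based input format, the toy rsIDs and gene names, and the reading that LD partners are collected only for genotyped SNPs that pass the p-value threshold.

```python
def select_snps(gwas_p, ld_r2, p_max=1e-3, r2_min=0.8):
    """Return GWAS-selected SNPs plus their LD partners.

    gwas_p: dict mapping genotyped SNP id -> GWAS p-value.
    ld_r2:  dict mapping (genotyped SNP id, partner SNP id) -> r^2,
            e.g. precomputed from a reference panel (assumed input).
    """
    tags = {snp for snp, p in gwas_p.items() if p < p_max}
    partners = {
        partner
        for (tag, partner), r2 in ld_r2.items()
        if tag in tags and r2 > r2_min
    }
    return tags | partners

def shared_by_all(genes_by_subtype):
    """Genes present in every subtype's gene set."""
    gene_sets = [set(genes) for genes in genes_by_subtype.values()]
    return set.intersection(*gene_sets) if gene_sets else set()

# Toy inputs (hypothetical identifiers, not data from the study).
gwas_p = {"rsA": 5e-4, "rsB": 0.2}
ld_r2 = {("rsA", "rsC"): 0.9, ("rsA", "rsD"): 0.5, ("rsB", "rsE"): 0.95}
print(sorted(select_snps(gwas_p, ld_r2)))

genes = {
    "LUAD": {"GENE1", "GENE2", "GENE3"},
    "LUSC": {"GENE1", "GENE3", "GENE4"},
    "SCLC": {"GENE1", "GENE5"},
}
print(sorted(shared_by_all(genes)))
```

The same intersection helper would apply equally to pathways or to eQTL and enhancer target genes, which is how the abstract compares the subtypes at each level.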
Project description:BACKGROUND:Psoriasis is a chronic inflammatory skin disease, for which genome-wide association studies (GWAS) have identified many genetic variants as risk markers. However, the details of underlying molecular mechanisms, especially which variants are functional, are poorly understood. METHODS:We utilized a computational approach to survey psoriasis-associated functional variants that might affect protein functions or gene expression levels. We developed a pipeline by integrating publicly available datasets provided by GWAS Catalog, FANTOM5, GTEx, SNP2TFBS, and DeepBlue. To identify functional variants on exons or splice sites, we used a web-based annotation tool in the Ensembl database. To search for noncoding functional variants within promoters or enhancers, we used eQTL data calculated by GTEx. The data of variants lying on transcription factor binding sites provided by SNP2TFBS were used to predict detailed functions of the variants. RESULTS:We discovered 22 functional variant candidates, of which 8 were in noncoding regions. We focused on the enhancer variant rs72635708 (T > C) in the 1p36.23 region; this variant is within the enhancer region of the ERRFI1 gene, which regulates lipid metabolism in the liver and skin morphogenesis via EGF signaling. Further analysis showed that the ERRFI1 promoter spatially contacts with the enhancer, despite the 170 kb distance between them. We found that this variant lies on the AP-1 complex binding motif and may modulate binding levels. CONCLUSIONS:The minor allele rs72635708 (rs72635708-C) might affect the ERRFI1 promoter activity, which results in unstable expression of ERRFI1, enhancing the risk of psoriasis via disruption of lipid metabolism and skin cell proliferation. Our study represents a successful example of predicting molecular pathogenesis by integration and reanalysis of public data.
Project description:Integrating genome-wide association (GWAS) and expression quantitative trait locus (eQTL) data into transcriptome-wide association studies (TWAS) based on predicted expression can boost power to detect novel disease loci or pinpoint the susceptibility gene at a known disease locus. However, it is often the case that multiple eQTL genes colocalize at disease loci, making the identification of the true susceptibility gene challenging, due to confounding through linkage disequilibrium (LD). To distinguish between true susceptibility genes (where the genetic effect on phenotype is mediated through expression) and colocalization due to LD, we examine an extension of the Mendelian randomization (MR) Egger regression method that allows for LD while only requiring summary association data for both GWAS and eQTL. We derive the standard TWAS approach in the context of MR and show in simulations that the standard TWAS does not control type I error for causal gene identification when eQTLs have pleiotropic or LD-confounded effects on disease. In contrast, LD-aware MR-Egger (LDA MR-Egger) regression can control type I error in this case while attaining power similar to that of other methods in situations where these provide valid tests. However, when the direct effects of genetic variants on traits are correlated with the eQTL associations, all of the methods we examined, including LDA MR-Egger regression, can have inflated type I error. We illustrate these methods by integrating gene expression within a recent large-scale breast cancer GWAS to provide guidance on susceptibility gene identification.
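For orientation, the regression underlying MR-Egger can be sketched from summary statistics alone. The code below is the plain MR-Egger estimator for independent variants: a weighted least-squares fit of GWAS effects on eQTL effects with an intercept. It is not the LD-aware extension examined in the abstract above, which additionally accounts for correlation between variants. The function name, argument names, and toy numbers are assumptions for illustration.

```python
def mr_egger(beta_eqtl, beta_gwas, se_gwas):
    """Plain MR-Egger for independent variants (no LD adjustment).

    Regresses GWAS effect sizes on eQTL effect sizes with inverse-variance
    weights 1 / se_gwas^2. Returns (intercept, slope): the intercept
    estimates average directional pleiotropy and the slope estimates the
    effect of expression on the trait.
    """
    xs, ys, ws = [], [], []
    for bx, by, se in zip(beta_eqtl, beta_gwas, se_gwas):
        sign = -1.0 if bx < 0 else 1.0   # orient each variant so bx >= 0
        xs.append(sign * bx)
        ys.append(sign * by)
        ws.append(1.0 / se ** 2)
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        raise ValueError("eQTL effects must vary across variants")
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy summary statistics (hypothetical): effects follow by = 0.05 + 0.5 * bx
# after orientation; the second variant is reported on the opposite allele.
beta_eqtl = [0.1, -0.2, 0.3, 0.4]
beta_gwas = [0.10, -0.15, 0.20, 0.25]
se_gwas = [0.1, 0.1, 0.1, 0.1]
intercept, slope = mr_egger(beta_eqtl, beta_gwas, se_gwas)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

A non-zero intercept signals directional pleiotropy, which is the situation in which the abstract reports that standard TWAS fails to control type I error.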