Allele-specific enhancers mediate associations between LCAT and ABCA1 polymorphisms and HDL metabolism.
ABSTRACT: For most complex traits, the majority of SNPs identified through genome-wide association studies (GWAS) reside within noncoding regions that have no known function. However, these regions are enriched for the regulatory enhancers specific to the cells relevant to the specific trait. Indeed, many of the GWAS loci that have been functionally characterized lie within enhancers that regulate expression levels of key genes. In order to identify polymorphisms with potential allele-specific regulatory effects, we developed a bioinformatics pipeline that harnesses epigenetic signatures as well as transcription factor (TF) binding motifs to identify putative enhancers containing a SNP with potential allele-specific TF binding in linkage disequilibrium (LD) with a GWAS-identified SNP. We applied the approach to GWAS findings for blood lipids, revealing 7 putative enhancers harboring associated SNPs, 3 of which lie within the introns of LCAT and ABCA1, genes that play crucial roles in cholesterol biogenesis and lipoprotein metabolism. All 3 enhancers demonstrated allele-specific in vitro regulatory activity in liver-derived cell lines. We demonstrated that these putative enhancers are in close physical proximity to the promoters of their respective genes, in situ, likely through chromatin looping. In addition, the associated alleles altered the likelihood of transcription activator STAT3 binding. Our results demonstrate that through our approach, the LD blocks that contain GWAS signals, often hundreds of kilobases in size with multiple SNPs serving as statistical proxies to the true functional site, can provide an experimentally testable hypothesis for the underlying regulatory mechanism linking genetic variants to complex traits.
Project description:Recent studies have shown that disease-susceptibility variants frequently lie in cell-type-specific enhancer elements. To identify, interpret, and prioritize such risk variants, we must identify the enhancers active in disease-relevant cell types, their upstream transcription factor (TF) binding, and their downstream target genes. To address this need, we built HACER (http://bioinfo.vanderbilt.edu/AE/HACER/), an atlas of Human ACtive Enhancers to interpret Regulatory variants. The HACER atlas catalogues and annotates in-vivo transcribed cell-type-specific enhancers, as well as placing enhancers within transcriptional regulatory networks by integrating ENCODE TF ChIP-Seq and predicted/validated chromatin interaction data. We demonstrate the utility of HACER in (i) offering a mechanistic hypothesis to explain the association of SNP rs614367 with ER-positive breast cancer risk, (ii) exploring tumor-specific enhancers in selective MYC dysregulation and (iii) prioritizing/annotating non-coding regulatory regions targeting CCND1. HACER provides a valuable resource for studies of GWAS, non-coding variants, and enhancer-mediated regulation.
Project description:Plasma levels of high-density lipoprotein cholesterol (HDL-C) have been associated with cardiovascular disease. The high heritability of HDL-C plasma levels has been an incentive for several genome-wide association studies (GWASs), which identified, among others, variants in the first intron of the GALNT2 gene strongly associated with HDL-C levels. However, the lead GWAS SNP associated with HDL-C levels in this genomic region, rs4846914, is located outside of transcription factor (TF) binding sites defined by chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) experiments in the ENCODE project and is therefore unlikely to be functional. In this study we apply a bioinformatics approach that relies on the premise that ChIP-seq reads can identify allele-specific binding of a TF at cell-specific regulatory elements harboring allele-specific SNPs (AS-SNPs). EMSA and luciferase assays were used to validate the allele-specific binding and to test the enhancer activity of the regulatory element harboring the AS-SNP rs4846913 as well as the neighboring rs2144300, which are in high LD with rs4846914. Using luciferase assays we found that rs4846913 and the neighboring rs2144300 displayed allele-specific enhancer activity. We propose that an inhibitor binds preferentially to the rs4846913-C allele with an inhibitory boost from the synergistic binding of other TFs at the neighboring SNP rs2144300. These events influence the transcription level of GALNT2. The results suggest that rs4846913 and rs2144300, rather than the reported lead GWAS SNP rs4846914, drive the association with HDL-C plasma levels through an inhibitory regulation of GALNT2.
Project description:Lung cancer is the leading cause of cancer death. To date, many SNPs have been found associated with lung cancer risk through genome-wide association studies (GWAS). However, since most GWAS SNPs lie in non-coding regions and are co-inherited with hundreds of SNPs in linkage disequilibrium (LD), which SNP(s) play a causal role in the disease remains poorly understood. Here we aim to identify causal SNPs associated with lung cancer risk through investigating allele-specific effects (ASE). By integrating data from 374 sequencing experiments (including ChIP-seq, DNase-seq, ATAC-seq, and FAIRE-seq) performed in 71 lung-relevant cells, we found 30 lung cancer risk-associated SNPs that showed ASE. Of particular interest, seven SNPs from four loci (12p13.33, 5p15.33, 6p21.33, 22q12.22) were also found as the top-ranked SNPs in fine mapping studies, lending statistical support to our hypothesis that SNPs showing ASE are strong candidates for causal SNPs. Three SNPs, rs11571379, rs7725218, and rs3101018, showed allele-specific enhancer/promoter activities in luciferase reporter assays, supporting their causal roles. Predictions of transcription factor (TF) binding sites and target genes suggest that allele-specific binding of TFs to rs11571379, rs7725218, and rs3101018 regulates the expression of RAD52, TERT and C4A respectively, which could contribute to lung cancer risk through a variety of mechanisms. In conclusion, we have performed a comprehensive ASE evaluation of lung cancer risk-associated SNPs. Our findings highlight three potential causal SNPs and provide insights into the mechanism by which these risk loci can contribute to lung cancer. This dataset contains the whole-genome bisulfite sequencing data from AECs used to call SNPtypes. Overall design: WGBS data from 3 individual experiments were used to determine allele-specific effects of SNPs on functional epigenomes. This dataset presents methylation domain calling data.
Project description:Large experimental efforts are characterizing the regulatory genome, yet we are still missing a systematic definition of functional and silent genetic variants in non-coding regions. Here, we integrated DNaseI footprinting data with sequence-based transcription factor (TF) motif models to predict the impact of a genetic variant on TF binding across 153 tissues and 1,372 TF motifs. Each annotation we derived is specific for a cell-type condition or assay and is locally motif-driven. We found 5.8 million genetic variants in footprints, 66% of which are predicted by our model to affect TF binding. Comprehensive examination using allele-specific hypersensitivity (ASH) reveals that only the latter group consistently shows evidence for ASH (3,217 SNPs at 20% FDR), suggesting that most (97%) genetic variants in footprinted regulatory regions are indeed silent. Combining this information with GWAS data reveals that our annotation helps in computationally fine-mapping 86 SNPs in GWAS hit regions with at least a 2-fold increase in the posterior odds of picking the causal SNP. The rich meta information provided by the tissue-specificity and the identity of the putative TF binding site being affected also helps in identifying the underlying mechanism supporting the association. As an example, the enrichment for LDL level-associated SNPs is 9.1-fold higher among SNPs predicted to affect HNF4 binding sites than in a background model already including tissue-specific annotation.
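The idea of scoring a variant by its predicted effect on a TF motif can be illustrated with a small sketch. This is not the model used in the study above; it is a simplified log-odds comparison of the reference and alternate alleles against a position frequency matrix. The matrix `PFM`, the uniform background, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 4-position frequency matrix (one dict per motif position).
# Illustrative values only; not taken from any real TF motif database.
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds_score(window, pfm=PFM, background=BACKGROUND):
    """Sum of per-position log2(motif frequency / background) for one window."""
    if len(window) != len(pfm):
        raise ValueError("window length must equal motif width")
    return sum(math.log2(col[base] / background) for col, base in zip(pfm, window))


def best_window_score(seq, pfm=PFM):
    """Highest log-odds score over all motif-width windows of seq."""
    width = len(pfm)
    if len(seq) < width:
        raise ValueError("sequence shorter than motif")
    return max(log_odds_score(seq[i:i + width], pfm)
               for i in range(len(seq) - width + 1))


def allelic_score_change(flank5, ref, alt, flank3, pfm=PFM):
    """Best-match score of the alt allele minus that of the ref allele.

    Flanks are trimmed to motif width - 1 so every window overlaps the SNP.
    """
    w = len(pfm)
    left = flank5[max(0, len(flank5) - (w - 1)):]
    right = flank3[: w - 1]
    return (best_window_score(left + alt + right, pfm)
            - best_window_score(left + ref + right, pfm))


change = allelic_score_change("TTA", "G", "T", "CTT")
print(f"alt - ref motif score: {change:.3f}")
```

A strongly negative value suggests the alternate allele weakens the best motif match; in practice such scores would be combined with footprint evidence rather than used alone.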
Project description:Genome-wide association studies (GWAS) have discovered thousands of loci associated with disease risk and quantitative traits, yet most of the variants responsible for risk remain uncharacterized. The majority of GWAS-identified loci are enriched for non-coding single-nucleotide polymorphisms (SNPs) and defining the molecular mechanism of risk is challenging. Many non-coding causal SNPs are hypothesized to alter transcription factor (TF) binding sites as the mechanism by which they affect organismal phenotypes. We employed an integrative genomics approach to identify candidate TF binding motifs that confer breast cancer-specific phenotypes identified by GWAS. We performed de novo motif analysis of regulatory elements, analyzed evolutionary conservation of identified motifs, and assayed TF footprinting data to identify sequence elements that recruit TFs and maintain chromatin landscape in breast cancer-relevant tissue and cell lines. We identified candidate causal SNPs that are predicted to alter TF binding within breast cancer-relevant regulatory regions that are in strong linkage disequilibrium with significantly associated GWAS SNPs. We confirm that the TFs bind with predicted allele-specific preferences using CTCF ChIP-seq data. We used The Cancer Genome Atlas breast cancer patient data to identify ANKLE1 and ZNF404 as the target genes of candidate TF binding site SNPs in the 19p13.11 and 19q13.31 GWAS-identified loci. These SNPs are associated with the expression of ZNF404 and ANKLE1 in breast tissue. This integrative analysis pipeline is a general framework to identify candidate causal variants within regulatory regions and TF binding sites that confer phenotypic variation and disease risk.
Project description:SNPs associated with disease susceptibility often reside in enhancer clusters, or super-enhancers. Constituents of these enhancer clusters cooperate to regulate target genes and often extend beyond the linkage disequilibrium (LD) blocks containing risk SNPs identified in genome-wide association studies (GWAS). We identified 'outside variants', defined as SNPs in weak LD with GWAS risk SNPs that physically interact with risk SNPs as part of a target gene's regulatory circuitry. These outside variants further explain variation in target gene expression beyond that explained by GWAS-associated SNPs. Additionally, the clinical risk associated with GWAS SNPs is considerably modified by the genotype of outside variants. Collectively, these findings suggest a potential model in which outside variants and GWAS SNPs that physically interact in 3D chromatin collude to influence target transcript levels as well as clinical risk. This model offers an additional hypothesis for the source of missing heritability for complex traits.
Project description:Most variants implicated in common human disease by genome-wide association studies (GWAS) lie in noncoding sequence intervals. Despite the suggestion that regulatory element disruption represents a common theme, identifying causal risk variants within implicated genomic regions remains a major challenge. Here we present a new sequence-based computational method to predict the effect of regulatory variation, using a classifier (gkm-SVM) that encodes cell type-specific regulatory sequence vocabularies. The induced change in the gkm-SVM score, deltaSVM, quantifies the effect of variants. We show that deltaSVM accurately predicts the impact of SNPs on DNase I sensitivity in their native genomic contexts and accurately predicts the results of dense mutagenesis of several enhancers in reporter assays. Previously validated GWAS SNPs yield large deltaSVM scores, and we predict new risk-conferring SNPs for several autoimmune diseases. Thus, deltaSVM provides a powerful computational approach to systematically identify functional regulatory variants.
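The notion of a variant-induced score change can be sketched in a few lines. The example below is a toy stand-in, not the gkm-SVM itself: it assumes a precomputed table of per-k-mer weights (the `weights` dictionary and its values are hypothetical) and sums the weights of all k-mers spanning the variant for each allele. Function names are illustrative.

```python
def kmers_spanning(seq, pos, k):
    """Return every k-mer of seq that includes the base at index pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def delta_score(seq, pos, alt, weights, k):
    """Summed k-mer weight of the alt allele minus that of the ref allele.

    k-mers missing from the weight table contribute zero (an assumption
    made here for simplicity).
    """
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_total = sum(weights.get(km, 0.0) for km in kmers_spanning(seq, pos, k))
    alt_total = sum(weights.get(km, 0.0) for km in kmers_spanning(alt_seq, pos, k))
    return alt_total - ref_total


# Hypothetical 3-mer weights, for illustration only.
weights = {"ACG": 1.0, "CGT": 0.5, "ATG": -2.0}
delta = delta_score("AACGTT", pos=2, alt="T", weights=weights, k=3)
print(delta)
```

In this toy case the reference allele sits in two positively weighted k-mers and the alternate allele creates a negatively weighted one, so the score change is negative.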
Project description:The resolution of genome-wide association studies (GWAS) is limited by the linkage disequilibrium (LD) structure of the population being studied. Selecting the most likely causal variants within an LD block is relatively straightforward within coding sequence, but is more difficult when all variants are intergenic. Predicting functional non-coding sequence has been recently facilitated by the availability of conservation and epigenomic information. We present HaploReg, a tool for exploring annotations of the non-coding genome among the results of published GWAS or novel sets of variants. Using LD information from the 1000 Genomes Project, linked SNPs and small indels can be visualized along with their predicted chromatin state in nine cell types, conservation across mammals and their effect on regulatory motifs. Sets of SNPs, such as those resulting from GWAS, are analyzed for an enrichment of cell type-specific enhancers. HaploReg will be useful to researchers developing mechanistic hypotheses of the impact of non-coding variants on clinical phenotypes and normal variation. The HaploReg database is available at http://compbio.mit.edu/HaploReg.
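Because the abstract above relies on LD between SNPs, a short sketch of the standard r² statistic may help. It assumes phased haplotypes coded as 0/1 per SNP (a simplification; real pipelines read these from VCF files), and the function name is illustrative rather than part of HaploReg.

```python
def ld_r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs from phased haplotypes.

    hap_a and hap_b are equal-length lists of 0/1 allele codes, one entry
    per haplotype. Returns None if either SNP is monomorphic.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # allele frequency, SNP A
    p_b = sum(hap_b) / n                                   # allele frequency, SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # joint haplotype frequency
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


print(ld_r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
print(ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent SNPs
```

SNPs above a chosen r² threshold with a GWAS lead SNP would then be treated as its proxies; the threshold itself is a study-specific choice.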
Project description:Genome-wide association studies (GWAS) have identified >100 independent SNPs that modulate the risk of type 2 diabetes (T2D) and related traits. However, the pathogenic mechanisms of most of these SNPs remain elusive. Here, we examined genomic, epigenomic, and transcriptomic profiles in human pancreatic islets to understand the links between genetic variation, chromatin landscape, and gene expression in the context of T2D. We first integrated genome and transcriptome variation across 112 islet samples to produce dense cis-expression quantitative trait loci (cis-eQTL) maps. Additional integration with chromatin-state maps for islets and other diverse tissue types revealed that cis-eQTLs for islet-specific genes are specifically and significantly enriched in islet stretch enhancers. High-resolution chromatin accessibility profiling using assay for transposase-accessible chromatin sequencing (ATAC-seq) in two islet samples enabled us to identify specific transcription factor (TF) footprints embedded in active regulatory elements, which are highly enriched for islet cis-eQTL. Aggregate allelic bias signatures in TF footprints enabled us to reconstruct TF binding affinities genetically de novo, which supports the high quality of the TF footprint predictions. Interestingly, we found that T2D GWAS loci were strikingly and specifically enriched in islet Regulatory Factor X (RFX) footprints. Remarkably, within and across independent loci, T2D risk alleles that overlap with RFX footprints uniformly disrupt the RFX motifs at high-information content positions. Together, these results suggest that common regulatory variations have shaped islet TF footprints and the transcriptome and that a confluent RFX regulatory grammar plays a significant role in the genetic component of T2D predisposition.
Project description:Lung cancer is the leading cause of cancer death. To date, many SNPs have been found associated with lung cancer risk through genome-wide association studies (GWAS). However, since most GWAS SNPs lie in non-coding regions and are co-inherited with hundreds of SNPs in linkage disequilibrium (LD), which SNP(s) play a causal role in the disease remains poorly understood. Here we aim to identify causal SNPs associated with lung cancer risk through investigating allele-specific effects (ASE). Sequencing data used in the study included that which our group generated for primary alveolar epithelial and basal cells, described below, as well as data from publicly available websites, including: ENCODE (Encyclopedia of DNA Elements, https://genome.ucsc.edu/ENCODE), Roadmap Epigenomics (https://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics), DBTSS (Database of Transcription Start Site, Suzuki et al. 2015), and GEO (Gene Expression Omnibus, Watanabe et al. 2013). Data we generated in-house from highly pure populations of primary human AEC and basal cells included ChIP-seq, FAIRE-seq, whole-genome bisulfite sequencing (WGBS), and ATAC-seq. Libraries and FASTQ files were generated at the USC Epigenome Center. By integrating data from these 374 sequencing experiments (including ChIP-seq, DNase-seq, ATAC-seq, and FAIRE-seq) performed in 71 lung-relevant cells, we found 30 lung cancer risk-associated SNPs that showed ASE. Of particular interest, seven SNPs from four loci (12p13.33, 5p15.33, 6p21.33, 22q12.22) were also found as the top-ranked SNPs in fine mapping studies, lending statistical support to our hypothesis that SNPs showing ASE are strong candidates for causal SNPs. Three SNPs, rs11571379, rs7725218, and rs3101018, showed allele-specific enhancer/promoter activities in luciferase reporter assays, supporting their causal roles.
Predictions of transcription factor (TF) binding sites and target genes suggest that allele-specific binding of TFs to rs11571379, rs7725218, and rs3101018 regulates the expression of RAD52, TERT and C4A respectively, which could contribute to lung cancer risk through a variety of mechanisms. In conclusion, we have performed a comprehensive ASE evaluation of lung cancer risk-associated SNPs. Our findings highlight three potential causal SNPs and provide insights into the mechanism by which these risk loci can contribute to lung cancer. Overall design: Data from 374 sequencing experiments (including ChIP-seq, DNase-seq, ATAC-seq, and FAIRE-seq) performed in 71 lung-relevant cells were used to determine allele-specific effects of SNPs on functional epigenomes.
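The abstract does not state how allele-specific effects were called. One common approach, shown here purely as an assumed illustration, is an exact two-sided binomial test of the reference and alternate read counts at a heterozygous SNP against an expected 50:50 ratio. The function name and example counts are hypothetical.

```python
from math import comb


def allelic_imbalance_p(ref_reads, alt_reads):
    """Exact two-sided binomial p-value for a 50:50 allelic ratio.

    With p = 0.5 the distribution is symmetric, so the two-sided p-value
    is twice the smaller tail, capped at 1.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    k = min(ref_reads, alt_reads)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical read counts at two heterozygous SNPs.
print(allelic_imbalance_p(10, 0))  # all reads from one allele
print(allelic_imbalance_p(5, 5))   # balanced reads
```

Real analyses usually also correct for reference-mapping bias and multiple testing, neither of which this sketch attempts.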