Convolutional neural network model to predict causal risk factors that share complex regulatory features.
ABSTRACT: Major progress in disease genetics has been made through genome-wide association studies (GWASs). One of the key tasks for post-GWAS analyses is to identify causal noncoding variants with regulatory function. Here, on the basis of >2000 functional features, we developed a convolutional neural network framework for combinatorial, nonlinear modeling of complex patterns shared by risk variants scattered among multiple associated loci. When the framework was applied to major psychiatric disorders and autoimmune diseases, neural and immune features, respectively, exhibited high explanatory power while reflecting the pathophysiology of the relevant disease. The predicted causal variants were concentrated in active regulatory regions of relevant cell types and tended to be in physical contact with transcription factors while residing in evolutionarily conserved regions and resulting in expression changes of genes related to the given disease. We present examples of novel candidate causal variants and associated genes. Our method is expected to contribute to the identification and functional interpretation of potential causal noncoding variants in post-GWAS analyses.
Project description:Schizophrenia (SZ) is a devastating mental disorder afflicting 1% of the population. Recent genome-wide association studies (GWASs) of SZ have identified >100 risk loci. However, the causal variants/genes and the causal mechanisms remain largely unknown, which hinders the translation of GWAS findings into disease biology and drug targets. Most risk variants are noncoding and thus likely regulate gene expression. A major mechanism of transcriptional regulation is chromatin remodeling, and open chromatin is a versatile predictor of regulatory sequences. MicroRNA-mediated post-transcriptional regulation plays an important role in SZ pathogenesis. Neurons differentiated from patient-specific induced pluripotent stem cells (iPSCs) provide an experimental model to characterize the genetic perturbation of regulatory variants that are often specific to cell type and/or developmental stage. The emerging genome-editing technology enables the creation of isogenic iPSCs and neurons to efficiently characterize the effects of SZ-associated regulatory variants on SZ-relevant molecular and cellular phenotypes involving dopaminergic, glutamatergic, and GABAergic neurotransmission. SZ GWAS findings equipped with the emerging functional genomics approaches provide an unprecedented opportunity for understanding new disease biology and identifying novel drug targets.
Project description:Human genome-wide association studies (GWASs) have identified numerous associations between single nucleotide polymorphisms (SNPs) and pulmonary function. Proving that there is a causal relationship between GWAS SNPs, many of which are noncoding and without known functional impact, and these traits has been elusive. Furthermore, noncoding GWAS-identified SNPs may exert trans-regulatory effects rather than impact the proximal gene. Noncoding variants in 5-hydroxytryptamine (serotonin) receptor 4 (HTR4) are associated with pulmonary function in human GWASs. To gain insight into whether this association is causal, we tested whether Htr4-null mice have altered pulmonary function. We found that HTR4-deficient mice have 12% higher baseline lung resistance and also increased methacholine-induced airway hyperresponsiveness (AHR) as measured by lung resistance (27%), tissue resistance (48%), and tissue elastance (30%). Furthermore, Htr4-null mice were more sensitive to serotonin-induced AHR. In models of exposure to bacterial lipopolysaccharide, bleomycin, and allergic airway inflammation induced by house dust mites, pulmonary function and cytokine profiles in Htr4-null mice differed little from their wild-type controls. The findings of altered baseline lung function and increased AHR in Htr4-null mice support a causal relationship between genetic variation in HTR4 and pulmonary function identified in human GWAS.
Project description:To date, the widely used genome-wide association studies (GWASs) of the human genome have reported thousands of variants that are significantly associated with various human traits. However, in the vast majority of these cases, the causal variants responsible for the observed associations remain unknown. In order to facilitate the identification of causal variants, we designed a simple computational method called the "preferential linkage disequilibrium (LD)" approach, which follows the variants discovered by GWASs to pinpoint the causal variants, even if they are rare compared with the discovery variants. The approach is based on the hypothesis that the GWAS-discovered variant is better at tagging the causal variants than are most other variants evaluated in the original GWAS. Applying the preferential LD approach to the GWAS signals of five human traits for which the causal variants are already known, we successfully placed the known causal variants among the top ten candidates in the majority of these cases. Application of this method to additional GWASs, including those of hepatitis C virus treatment response, plasma levels of clotting factors, and late-onset Alzheimer disease, has led to the identification of a number of promising candidate causal variants. This method represents a useful tool for delineating causal variants by bringing together GWAS signals and the rapidly accumulating variant data from next-generation sequencing.
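The ranking idea behind the preferential LD hypothesis can be illustrated with a minimal sketch. It assumes genotypes are given as per-individual allele dosages (0, 1, or 2) and scores each candidate by the fraction of other GWAS-evaluated variants that tag it less well than the discovery variant does. The function names and the scoring rule are illustrative assumptions, not the authors' published implementation.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return cov * cov / (var_a * var_b)


def preferential_ld_ranking(candidates, discovery, other_gwas_variants):
    """Rank candidate variants by how preferentially the discovery variant tags them.

    candidates: dict mapping a (hypothetical) variant name to its dosage vector.
    discovery: dosage vector of the GWAS-discovered variant.
    other_gwas_variants: dosage vectors of other variants tested in the same GWAS.
    Returns (name, score) pairs, best first; score is the fraction of other
    GWAS variants whose r^2 with the candidate is below the discovery variant's.
    """
    scored = []
    for name, genotypes in candidates.items():
        r2_discovery = r_squared(genotypes, discovery)
        worse = sum(
            1 for other in other_gwas_variants
            if r_squared(genotypes, other) < r2_discovery
        )
        scored.append((name, worse / len(other_gwas_variants)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A candidate that the discovery variant tags better than every other tested variant receives a score of 1.0 and is ranked first.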
Project description:One of the formative goals of genetics research is to understand how genetic variation leads to phenotypic differences and human disease. Genome-wide association studies (GWASs) bring us closer to this goal by linking variation with disease faster than ever before. Despite this, GWASs alone are unable to pinpoint disease-causing single nucleotide polymorphisms (SNPs). Noncoding SNPs, which represent the majority of GWAS SNPs, present a particular challenge. To address this challenge, an array of computational tools designed to prioritize and predict the function of noncoding GWAS SNPs have been developed. However, fewer than 40% of GWAS publications from 2015 utilized these tools. We discuss several leading methods for annotating noncoding variants and how they can be integrated into research pipelines in hopes that they will be broadly applied in future GWAS analyses.
Project description:The complex language of eukaryotic gene expression remains incompletely understood. Despite the importance suggested by many noncoding variants statistically associated with human disease, nearly all such variants have unknown mechanisms. Here, we address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). We introduce the open source package Basset to apply CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNase-seq, and demonstrate greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for genome-wide association study (GWAS) SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
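The allele-comparison idea can be illustrated with a toy sketch: one-hot encode the reference and alternate sequences, score each with the same model, and take the difference. Here the "model" is a single hand-written convolution filter followed by max pooling, standing in for a trained network. None of the names below come from the Basset package, and the filter weights are an assumption chosen for illustration.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Any character outside ACGT (for example N) becomes an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]


def conv_max_score(encoded, kernel):
    """Slide one filter along the sequence, apply ReLU, then max-pool."""
    width = len(kernel)
    best = 0.0  # starting at 0 makes max() act as the ReLU floor
    for start in range(len(encoded) - width + 1):
        activation = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(len(BASES))
        )
        best = max(best, activation)
    return best


def variant_effect(ref_sequence, position, alt_base, kernel):
    """Score difference (alt minus ref) for a single-nucleotide substitution."""
    alt_sequence = ref_sequence[:position] + alt_base + ref_sequence[position + 1:]
    return conv_max_score(one_hot(alt_sequence), kernel) - conv_max_score(
        one_hot(ref_sequence), kernel
    )


# Hypothetical filter that responds most strongly to the motif GATA.
gata_filter = one_hot("GATA")
effect = variant_effect("CCGATACC", 3, "C", gata_filter)
```

In a real setting the score would come from a trained multi-layer network evaluated per cell type; the subtraction of reference from alternate predictions is the only part this sketch shares with that workflow.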
Project description:Genome-wide association studies (GWASs) have ascertained numerous trait-associated common genetic variants, frequently localized to regulatory DNA. We found that common genetic variation at BCL11A associated with fetal hemoglobin (HbF) level lies in noncoding sequences decorated by an erythroid enhancer chromatin signature. Fine-mapping uncovers a motif-disrupting common variant associated with reduced transcription factor (TF) binding, modestly diminished BCL11A expression, and elevated HbF. The surrounding sequences function in vivo as a developmental stage-specific, lineage-restricted enhancer. Genome engineering reveals the enhancer is required in erythroid but not B-lymphoid cells for BCL11A expression. These findings illustrate how GWASs may expose functional variants of modest impact within causal elements essential for appropriate gene expression. We propose the GWAS-marked BCL11A enhancer represents an attractive target for therapeutic genome engineering for the β-hemoglobinopathies.
Project description:Genome-wide association studies (GWASs) for many complex diseases, including inflammatory bowel disease (IBD), produced hundreds of disease-associated loci, the majority of which are noncoding. The number of GWAS loci is increasing very rapidly, but the process of translating single nucleotide polymorphisms (SNPs) from these loci to genomic medicine is lagging. In this study, we investigated 4,734 variants from 152 IBD-associated GWAS loci (152 IBD-associated lead noncoding SNPs identified from pooled GWAS results + 4,582 variants in strong linkage disequilibrium (LD) (r² ≥ 0.8) for the EUR population of the 1K Genomes Project) using four publicly available bioinformatics tools, e.g. dbPSHP, CADD, GWAVA, and RegulomeDB, to annotate and prioritize putative regulatory variants. Of the 152 lead noncoding SNPs, around 11% are under strong negative selection (GERP++ RS ≥ 2), and ~30% are under balancing selection (Tajima's D score >2) in the CEU population (1K Genomes Project), though these regions are positively selected (GERP++ RS <0) in mammalian evolution. The analysis of 4,734 variants using three integrative annotation tools produced 929 putative functional SNPs, of which 18 SNPs (from 15 GWAS loci) are in concordance with all three classifiers. These prioritized noncoding SNPs may contribute to IBD pathogenesis by dysregulating the expression of nearby genes. This study showed the usefulness of integrative annotation for prioritizing fewer functional variants from a large number of GWAS markers.
Project description:BACKGROUND: In the last decade, a large number of common variants underlying complex diseases have been identified through genome-wide association studies (GWASs). Summary data of the GWASs are freely and publicly available. The summary data are usually obtained through single marker analysis. Gene-based analysis offers a useful alternative and complement to single marker analysis. Results from gene-level association tests can be more readily integrated with downstream functional and pathogenic investigations. Most existing gene-based methods fall into two categories: burden tests and quadratic tests. Burden tests are usually powerful when the directions of effects of causal variants are the same. However, they may suffer loss of statistical power when different directions of effects exist at the causal variants. Quadratic tests are not affected by the directions of effects but can be less powerful due to issues such as the large number of degrees of freedom. These drawbacks of existing gene-based methods motivated us to develop a new powerful method to identify disease-associated genes using existing GWAS summary data. METHODS AND RESULTS: In this paper, we propose a new truncated statistic method (TS) by utilizing a truncated method to find the genes that have a true contribution to the genetic association. Extensive simulation studies demonstrate that our proposed test outperforms other comparable tests. We applied TS and other comparable methods to the schizophrenia GWAS data and type 2 diabetes (T2D) GWAS meta-analysis summary data. TS identified more disease-associated genes than comparable methods. Many of the significant genes identified by TS may have important mechanisms relevant to the associated traits. TS is implemented in the C program TS, which is freely and publicly available online. CONCLUSIONS: The proposed truncated statistic outperforms existing methods. It can be employed to detect novel trait-associated genes using GWAS summary data.
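The contrast between burden and quadratic tests can be made concrete with a small sketch built on per-variant z-scores from GWAS summary data. It assumes the variants in a gene are independent (no LD) and equally weighted, which is a simplification. The truncated statistic itself is not reproduced here because the abstract does not define it.

```python
import math


def burden_statistic(z_scores):
    """Equal-weight burden statistic: effects are summed before being tested.

    Under the null hypothesis with independent variants this is standard normal.
    Effects in opposite directions cancel each other out.
    """
    return sum(z_scores) / math.sqrt(len(z_scores))


def quadratic_statistic(z_scores):
    """Sum of squared z-scores: insensitive to the direction of each effect.

    Under the null hypothesis with independent variants this follows a
    chi-square distribution with len(z_scores) degrees of freedom.
    """
    return sum(z * z for z in z_scores)


same_direction = [2.0, 2.0, 2.0, 2.0]
mixed_direction = [2.0, -2.0, 2.0, -2.0]

burden_same = burden_statistic(same_direction)        # large: signals add up
burden_mixed = burden_statistic(mixed_direction)      # signals cancel
quadratic_same = quadratic_statistic(same_direction)
quadratic_mixed = quadratic_statistic(mixed_direction)  # same as quadratic_same
```

The quadratic statistic is identical for both genes, while the burden statistic drops to zero when the effect directions are mixed. The cost of the quadratic form is that its reference distribution has one degree of freedom per variant, which is the power issue described above.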
Project description:The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci.
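One way to express "the same variant is causal in both studies" numerically is to multiply, per variant, the posterior probability of causality from the GWAS by the one from the eQTL study. The sketch below assumes those per-variant posteriors have already been computed by a fine-mapping step and that the two studies are independent. The function name and input layout are hypothetical and are not the eCAVIAR software interface.

```python
def colocalization_scores(gwas_posterior, eqtl_posterior):
    """Per-variant probability of being causal in both studies.

    gwas_posterior, eqtl_posterior: dicts mapping variant ID to the posterior
    probability that the variant is causal in that study. Because a locus may
    harbor several causal variants, the values need not sum to 1.
    Only variants present in both studies are scored.
    """
    shared = gwas_posterior.keys() & eqtl_posterior.keys()
    return {v: gwas_posterior[v] * eqtl_posterior[v] for v in shared}


# Made-up posteriors for illustration only.
gwas = {"rs1": 0.5, "rs2": 0.5}
eqtl = {"rs1": 0.5, "rs2": 0.25, "rs3": 0.25}

scores = colocalization_scores(gwas, eqtl)
top_variant = max(scores, key=scores.get)
```

Repeating this for each gene and tissue and comparing the top scores is one plausible way to prioritize the tissue and target gene for a locus.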
Project description:Genome-wide association studies (GWASs) have identified thousands of loci associated with hundreds of complex diseases and traits, and progress is being made toward elucidating the causal variants and genes underlying these associations. Functional characterization of mechanisms at GWAS loci is a multi-faceted challenge. Challenges include linkage disequilibrium and allelic heterogeneity at each locus, the noncoding nature of most loci, and the time and cost needed for experimentally evaluating the potential mechanistic contributions of genes and variants. As GWAS sample sizes increase, more loci are identified, and the complexities of individual loci emerge. Loci can consist of multiple association signals, each of which can reflect the influence of multiple variants, inseparable by association analyses. Each signal within a locus can influence the same or different target genes. Experimental studies of genes and variants can differ on the basis of cell type, cellular environment, or other context-specific variables. In this review, we describe the complexity of mechanisms at GWAS loci (including multiple signals, multiple variants, and/or multiple genes) and the implications these complexities hold for experimental study design and interpretation of GWAS mechanisms.