Identifying noncoding risk variants using disease-relevant gene regulatory networks.
ABSTRACT: Identifying noncoding risk variants remains a challenging task. Because noncoding variants exert their effects in the context of a gene regulatory network (GRN), we hypothesize that explicit use of disease-relevant GRNs can significantly improve the inference accuracy of noncoding risk variants. We describe Annotation of Regulatory Variants using Integrated Networks (ARVIN), a general computational framework for predicting causal noncoding variants. It employs a set of novel regulatory network-based features, combined with sequence-based features to infer noncoding risk variants. Using known causal variants in gene promoters and enhancers in a number of diseases, we show ARVIN outperforms state-of-the-art methods that use sequence-based features alone. Additional experimental validation using reporter assay further demonstrates the accuracy of ARVIN. Application of ARVIN to seven autoimmune diseases provides a holistic view of the gene subnetwork perturbed by the combinatorial action of the entire set of risk noncoding mutations.
Project description:Major progress in disease genetics has been made through genome-wide association studies (GWASs). One of the key tasks for post-GWAS analyses is to identify causal noncoding variants with regulatory function. Here, on the basis of >2000 functional features, we developed a convolutional neural network framework for combinatorial, nonlinear modeling of complex patterns shared by risk variants scattered among multiple associated loci. When applied for major psychiatric disorders and autoimmune diseases, neural and immune features, respectively, exhibited high explanatory power while reflecting the pathophysiology of the relevant disease. The predicted causal variants were concentrated in active regulatory regions of relevant cell types and tended to be in physical contact with transcription factors while residing in evolutionarily conserved regions and resulting in expression changes of genes related to the given disease. We demonstrate some examples of novel candidate causal variants and associated genes. Our method is expected to contribute to the identification and functional interpretation of potential causal noncoding variants in post-GWAS analyses.
Project description:BACKGROUND:Genetic variants currently known to affect coronary artery disease (CAD) risk explain less than one-quarter of disease heritability. The heritability contribution of gene regulatory networks (GRNs) in CAD, which are modulated by both genetic and environmental factors, is unknown. OBJECTIVES:This study sought to determine the heritability contributions of single nucleotide polymorphisms affecting gene expression (eSNPs) in GRNs causally linked to CAD. METHODS:Seven vascular and metabolic tissues collected in 2 independent genetics-of-gene-expression studies of patients with CAD were used to identify eSNPs and to infer coexpression networks. To construct GRNs with causal relations to CAD, the prior information of eSNPs in the coexpression networks was used in a Bayesian algorithm. Narrow-sense CAD heritability conferred by the GRNs was calculated from individual-level genotype data from 9 European genome-wide association studies (GWAS) (13,612 cases, 13,758 control cases). RESULTS:The authors identified and replicated 28 independent GRNs active in CAD. The genetic variation in these networks contributed to 10.0% of CAD heritability beyond the 22% attributable to risk loci identified by GWAS. GRNs in the atherosclerotic arterial wall (n = 7) and subcutaneous or visceral abdominal fat (n = 9) were most strongly implicated, jointly explaining 8.2% of CAD heritability. In all, these 28 GRNs (each contributing to >0.2% of CAD heritability) comprised 24 to 841 genes, whereof 1 to 28 genes had strong regulatory effects (key disease drivers) and harbored many relevant functions previously associated with CAD. The gene activity in these 28 GRNs also displayed strong associations with genetic and phenotypic cardiometabolic disease variations both in humans and mice, indicative of their pivotal roles as mediators of gene-environmental interactions in CAD. CONCLUSIONS:GRNs capture a major portion of genetic variance and contribute to heritability beyond that of genetic loci currently known to affect CAD risk. These networks provide a framework to identify novel risk genes/pathways and study molecular interactions within and across disease-relevant tissues leading to CAD.
Project description:Most cancer-associated genetic variants identified from genome-wide association studies (GWAS) do not obviously change protein structure, leading to the hypothesis that the associations are attributable to regulatory polymorphisms. Translating genetic associations into mechanistic insights can be facilitated by knowledge of the causal regulatory variant (or variants) responsible for the statistical signal. Experimental validation of candidate functional variants is onerous, making bioinformatic approaches necessary to prioritize candidates for laboratory analysis. Thus, a systematic approach for recognizing functional (and, therefore, likely causal) variants in noncoding regions is an important step toward interpreting cancer risk loci. This review provides a detailed introduction to current regulatory variant annotations, followed by an overview of how to leverage these resources to prioritize candidate functional polymorphisms in regulatory regions.
Project description:Gene regulatory networks (GRNs) control cellular function and decision making during tissue development and homeostasis. Mathematical tools based on dynamical systems theory are often used to model these networks, but the size and complexity of these models mean that their behaviour is not always intuitive and the underlying mechanisms can be difficult to decipher. For this reason, methods that simplify and aid exploration of complex networks are necessary. To this end we develop a broadly applicable form of the Zwanzig-Mori projection. By first converting a thermodynamic state ensemble model of gene regulation into mass action reactions we derive a general method that produces a set of time evolution equations for a subset of components of a network. The influence of the rest of the network, the bulk, is captured by memory functions that describe how the subnetwork reacts to its own past state via components in the bulk. These memory functions provide probes of near-steady state dynamics, revealing information not easily accessible otherwise. We illustrate the method on a simple cross-repressive transcriptional motif to show that memory functions not only simplify the analysis of the subnetwork but also have a natural interpretation. We then apply the approach to a GRN from the vertebrate neural tube, a well characterised developmental transcriptional network composed of four interacting transcription factors. The memory functions reveal the function of specific links within the neural tube network and identify features of the regulatory structure that specifically increase the robustness of the network to initial conditions. Taken together, the study provides evidence that Zwanzig-Mori projections offer powerful and effective tools for simplifying and exploring the behaviour of GRNs.
Project description:The majority of variants identified by genome-wide association studies (GWAS) reside in the noncoding genome, affecting regulatory elements including transcriptional enhancers. However, characterizing their effects requires the integration of GWAS results with context-specific regulatory activity and linkage disequilibrium annotations to identify causal variants underlying noncoding association signals and the regulatory elements, tissue contexts, and target genes they affect. We propose INFERNO, a novel method which integrates hundreds of functional genomics datasets spanning enhancer activity, transcription factor binding sites, and expression quantitative trait loci with GWAS summary statistics. INFERNO includes novel statistical methods to quantify empirical enrichments of tissue-specific enhancer overlap and to identify co-regulatory networks of dysregulated long noncoding RNAs (lncRNAs). We applied INFERNO to two large GWAS studies. For schizophrenia (36,989 cases, 113,075 controls), INFERNO identified putatively causal variants affecting brain enhancers for known schizophrenia-related genes. For inflammatory bowel disease (IBD) (12,882 cases, 21,770 controls), INFERNO found enrichments of immune and digestive enhancers and lncRNAs involved in regulation of the adaptive immune response. In summary, INFERNO comprehensively infers the molecular mechanisms of causal noncoding variants, providing a sensitive hypothesis generation method for post-GWAS analysis. The software is available as an open source pipeline and a web server.
Project description:Genome-wide association studies (GWASs) have identified a large number of noncoding associations, calling for systematic mapping to causal regulatory variants and their distal target genes. A widely used method, quantitative trait loci (QTL) mapping for chromatin or expression traits, suffers from sample-to-sample experimental variation and trans-acting or environmental effects. Instead, alleles at heterozygous loci can be compared within a sample, thereby controlling for those confounding factors. Here we introduce a method for chromatin structure-based allele-specific pairing of regulatory variants and target transcripts. With phased genotypes, much of allele-specific expression could be explained by paired allelic cis-regulation across a long range. This approach showed approximately two times greater sensitivity than QTL mapping. There are cases in which allele imbalance cannot be tested because heterozygotes are not available among reference samples. Therefore, we employed a machine learning method to predict missing positive cases based on various features shared by observed allele-specific pairs. We showed that only 10 reference samples are sufficient to achieve high prediction accuracy with a low sampling variation. In conclusion, our method enables highly sensitive fine mapping and target identification for trait-associated variants based on a small number of reference samples.
Project description:Bistability plays a central role in the gene regulatory networks (GRNs) controlling many essential biological functions, including cellular differentiation and cell cycle control. However, establishing the network topologies that can exhibit bistability remains a challenge, in part due to the exceedingly large variety of GRNs that exist for even a small number of components. We begin to address this problem by employing chemical reaction network theory in a comprehensive in silico survey to determine the capacity for bistability of more than 40,000 simple networks that can be formed by two transcription factor-coding genes and their associated proteins (assuming only the most elementary biochemical processes). We find that there exist reaction rate constants leading to bistability in ?90% of these GRN models, including several circuits that do not contain any of the TF cooperativity commonly associated with bistable systems, and the majority of which could only be identified as bistable through an original subnetwork-based analysis. A topological sorting of the two-gene family of networks based on the presence or absence of biochemical reactions reveals eleven minimal bistable networks (i.e., bistable networks that do not contain within them a smaller bistable subnetwork). The large number of previously unknown bistable network topologies suggests that the capacity for switch-like behavior in GRNs arises with relative ease and is not easily lost through network evolution. To highlight the relevance of the systematic application of CRNT to bistable network identification in real biological systems, we integrated publicly available protein-protein interaction, protein-DNA interaction, and gene expression data from Saccharomyces cerevisiae, and identified several GRNs predicted to behave in a bistable fashion.
Project description:Most variants implicated in common human disease by genome-wide association studies (GWAS) lie in noncoding sequence intervals. Despite the suggestion that regulatory element disruption represents a common theme, identifying causal risk variants within implicated genomic regions remains a major challenge. Here we present a new sequence-based computational method to predict the effect of regulatory variation, using a classifier (gkm-SVM) that encodes cell type-specific regulatory sequence vocabularies. The induced change in the gkm-SVM score, deltaSVM, quantifies the effect of variants. We show that deltaSVM accurately predicts the impact of SNPs on DNase I sensitivity in their native genomic contexts and accurately predicts the results of dense mutagenesis of several enhancers in reporter assays. Previously validated GWAS SNPs yield large deltaSVM scores, and we predict new risk-conferring SNPs for several autoimmune diseases. Thus, deltaSVM provides a powerful computational approach to systematically identify functional regulatory variants.
Project description:Schizophrenia (SZ) is a devastating mental disorder afflicting 1% of the population. Recent genome-wide association studies (GWASs) of SZ have identified >100 risk loci. However, the causal variants/genes and the causal mechanisms remain largely unknown, which hinders the translation of GWAS findings into disease biology and drug targets. Most risk variants are noncoding, thus likely regulate gene expression. A major mechanism of transcriptional regulation is chromatin remodeling, and open chromatin is a versatile predictor of regulatory sequences. MicroRNA-mediated post-transcriptional regulation plays an important role in SZ pathogenesis. Neurons differentiated from patient-specific induced pluripotent stem cells (iPSCs) provide an experimental model to characterize the genetic perturbation of regulatory variants that are often specific to cell type and/or developmental stage. The emerging genome-editing technology enables the creation of isogenic iPSCs and neurons to efficiently characterize the effects of SZ-associated regulatory variants on SZ-relevant molecular and cellular phenotypes involving dopaminergic, glutamatergic, and GABAergic neurotransmissions. SZ GWAS findings equipped with the emerging functional genomics approaches provide an unprecedented opportunity for understanding new disease biology and identifying novel drug targets.
Project description:BACKGROUND:We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP "radius" as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures. RESULTS:We expanded the set of reference SNPs to 39,083 (the OSU18 set) and extracted CERENKOV SNP feature data. We computed radius empirical likelihoods and likelihood densities for rSNPs and control SNPs, and found significant likelihood differences between rSNPs and control SNPs. We fit parametric models of likelihood distributions for five different distance measures to obtain ten log-likelihood features that we combined with the 248-dimensional CERENKOV feature matrix. On the OSU18 SNP set, we measured the classification accuracy of CERENKOV with and without the new distance-based features, and found that the addition of distance-based features significantly improves rSNP recognition performance as measured by AUPVR, AUROC, and AVGRANK. Along with feature data for the OSU18 set, the software code for extracting the base feature matrix, estimating ten distance-based likelihood ratio features, and scoring candidate causal SNPs, are released as open-source software CERENKOV2. CONCLUSIONS:Accounting for the locus-specific geometry of SNPs in data-space significantly improved the accuracy with which noncoding rSNPs can be computationally identified.