Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data.
ABSTRACT: Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
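As a rough illustration of the generalized linear model idea mentioned in the abstract, the sketch below maps a vector of functional genomic annotations at one site to a score between 0 and 1 through a logistic link. This is not LINSIGHT's actual parameterization, and it omits the molecular evolution model entirely; the function name `site_score`, the feature encoding, and all weights are hypothetical values chosen for illustration.

```python
import math


def site_score(features, weights, bias):
    """Logistic-link GLM: a linear predictor passed through a sigmoid."""
    eta = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical binary annotations for one noncoding site,
# e.g. [open chromatin, TF binding site, conserved element].
features = [1.0, 0.0, 1.0]
weights = [0.8, 0.5, 1.2]  # hypothetical coefficients, not fitted values
bias = -3.0                # hypothetical intercept

score = site_score(features, weights, bias)
print(f"score = {score:.4f}")
```

In a real model the coefficients would be estimated from data rather than set by hand; here they only show how annotations combine additively before the link function is applied.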
Project description:Mutations in protein-coding genes are well established as the basis for human cancer, yet how alterations within the noncoding genome, a substantial fraction of which contain cis-regulatory elements (CRE), contribute to cancer pathophysiology remains elusive. Here, we developed an integrative approach to systematically identify and characterize noncoding regulatory variants with functional consequences in human hematopoietic malignancies. Combining targeted resequencing of hematopoietic lineage-associated CREs and mutation discovery, we uncovered 1,836 recurrently mutated CREs containing leukemia-associated noncoding variants. By enhanced CRISPR/dCas9-based CRE perturbation screening and functional analyses, we identified 218 variant-associated oncogenic or tumor-suppressive CREs in human leukemia. Noncoding variants at KRAS and PER2 enhancers reside in proximity to nuclear receptor (NR) binding regions and modulate transcriptional activities in response to NR signaling in leukemia cells. NR binding sites frequently colocalize with noncoding variants across cancer types. Hence, recurrent noncoding variants connect enhancer dysregulation with nuclear receptor signaling in hematopoietic malignancies. SIGNIFICANCE: We describe an integrative approach to identify noncoding variants in human leukemia, and reveal cohorts of variant-associated oncogenic and tumor-suppressive cis-regulatory elements including KRAS and PER2 enhancers. Our findings support a model in which noncoding regulatory variants connect enhancer dysregulation with nuclear receptor signaling to modulate gene programs in hematopoietic malignancies. See related commentary by van Galen, p. 646. This article is highlighted in the In This Issue feature, p. 627.
Project description:Identifying noncoding risk variants remains a challenging task. Because noncoding variants exert their effects in the context of a gene regulatory network (GRN), we hypothesize that explicit use of disease-relevant GRNs can significantly improve the inference accuracy of noncoding risk variants. We describe Annotation of Regulatory Variants using Integrated Networks (ARVIN), a general computational framework for predicting causal noncoding variants. It employs a set of novel regulatory network-based features, combined with sequence-based features, to infer noncoding risk variants. Using known causal variants in gene promoters and enhancers in a number of diseases, we show that ARVIN outperforms state-of-the-art methods that use sequence-based features alone. Additional experimental validation using a reporter assay further demonstrates the accuracy of ARVIN. Application of ARVIN to seven autoimmune diseases provides a holistic view of the gene subnetwork perturbed by the combinatorial action of the entire set of noncoding risk mutations.
Project description:The majority of variants identified by genome-wide association studies (GWAS) reside in the noncoding genome, affecting regulatory elements including transcriptional enhancers. However, characterizing their effects requires the integration of GWAS results with context-specific regulatory activity and linkage disequilibrium annotations to identify causal variants underlying noncoding association signals and the regulatory elements, tissue contexts, and target genes they affect. We propose INFERNO, a novel method which integrates hundreds of functional genomics datasets spanning enhancer activity, transcription factor binding sites, and expression quantitative trait loci with GWAS summary statistics. INFERNO includes novel statistical methods to quantify empirical enrichments of tissue-specific enhancer overlap and to identify co-regulatory networks of dysregulated long noncoding RNAs (lncRNAs). We applied INFERNO to two large GWAS studies. For schizophrenia (36,989 cases, 113,075 controls), INFERNO identified putatively causal variants affecting brain enhancers for known schizophrenia-related genes. For inflammatory bowel disease (IBD) (12,882 cases, 21,770 controls), INFERNO found enrichments of immune and digestive enhancers and lncRNAs involved in regulation of the adaptive immune response. In summary, INFERNO comprehensively infers the molecular mechanisms of causal noncoding variants, providing a sensitive hypothesis generation method for post-GWAS analysis. The software is available as an open source pipeline and a web server.
Project description:SUMMARY: Addressing deleterious effects of noncoding mutations is an essential step towards the identification of disease-causal mutations of gene regulatory elements. Several methods for quantifying the deleteriousness of noncoding mutations using artificial intelligence, deep learning and other approaches have been recently proposed. Although the majority of the proposed methods have demonstrated excellent accuracy on different test sets, there is rarely a consensus. In addition, advanced statistical and artificial learning approaches used by these methods make it difficult to port these methods outside of the labs that have developed them. To address these challenges and to transform the methodological advances in predicting deleterious noncoding mutations into a practical resource available for the broader functional genomics and population genetics communities, we developed SNPDelScore, which uses a panel of proposed methods for quantifying deleterious effects of noncoding mutations to precompute and compare the deleteriousness scores of all common SNPs in the human genome in 44 cell lines. The panel of deleteriousness scores of a SNP computed using different methods is supplemented by functional information from the GWAS Catalog, libraries of transcription factor-binding sites, and genic characteristics of mutations. SNPDelScore comes with a genome browser capable of displaying and comparing large sets of SNPs in a genomic locus and rapidly identifying consensus SNPs with the highest deleteriousness scores, making those prime candidates for phenotype-causal polymorphisms. AVAILABILITY AND IMPLEMENTATION: https://www.ncbi.nlm.nih.gov/research/snpdelscore/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description:The majority of genome-wide association study (GWAS) risk variants reside in non-coding DNA sequences. Understanding how these sequence modifications lead to transcriptional alterations and cell-to-cell variability can help unravel genotype-phenotype relationships. Here, we describe a computational method, dubbed CAPE, which calculates the likelihood of a genetic variant deactivating enhancers by disrupting the binding of transcription factors (TFs) in a given cellular context. CAPE learns sequence signatures associated with putative enhancers originating from large-scale sequencing experiments (such as ChIP-seq or DNase-seq) and models the change in enhancer signature upon a single nucleotide substitution. CAPE accurately identifies causative cis-regulatory variation including expression quantitative trait loci (eQTLs) and DNase I sensitivity quantitative trait loci (dsQTLs) in a tissue-specific manner with precision superior to several currently available methods. The presented method can be trained on any tissue-specific dataset of enhancers and known functional variants and applied to prioritize disease-associated variants in the corresponding tissue.
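The general idea of scoring how a single nucleotide substitution changes a sequence signature can be sketched with a position weight matrix (PWM). A PWM log-odds score is a deliberately simpler stand-in and is not the model CAPE learns; the motif probabilities, the uniform background, and the helper names `log_odds_score` and `substitution_effect` are all hypothetical.

```python
import math

BASES = "ACGT"


def log_odds_score(seq, pwm, background=0.25):
    """Sum of per-position log2(motif probability / background probability)."""
    return sum(
        math.log2(pwm[i][BASES.index(base)] / background)
        for i, base in enumerate(seq)
    )


def substitution_effect(seq, pos, alt, pwm):
    """Score of the mutated sequence minus score of the reference sequence."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    return log_odds_score(mutated, pwm) - log_odds_score(seq, pwm)


# Hypothetical 3-position motif; each row holds P(A), P(C), P(G), P(T).
pwm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]

# Effect of a G>T substitution at position 1 of the reference "AGC".
delta = substitution_effect("AGC", 1, "T", pwm)
print(f"score change = {delta:.3f}")
```

A negative change means the substitution weakens the match to the motif, which is the kind of signal a binding-disruption method would look for.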
Project description:Evidence that noncoding mutation can result in cancer driver events is mounting. However, it is more difficult to assign molecular biological consequences to noncoding mutations than to coding mutations, and a typical cancer genome contains many more noncoding mutations than protein-coding mutations. Accordingly, parsing functional noncoding mutation signal from noise remains an important challenge. Here we use an empirical approach to identify putatively functional noncoding somatic single nucleotide variants (SNVs) from liver cancer genomes. Annotation of candidate variants by publicly available epigenome datasets finds that 40.5% of SNVs fall in regulatory elements. When assigned to specific regulatory elements, we find that the distribution of regulatory element mutation mirrors that of nonsynonymous coding mutation, where few regulatory elements are recurrently mutated in a patient population but many are singly mutated. We find potential gain-of-binding site events among candidate SNVs, suggesting a mechanism of action for these variants. When aggregating noncoding somatic mutation in promoters, we find that genes in the ERBB signaling and MAPK signaling pathways are significantly enriched for promoter mutations. Altogether, our results suggest that functional somatic SNVs in cancer are sporadic, but occasionally occur in regulatory elements and may affect phenotype by creating binding sites for transcriptional regulators. Accordingly, we propose that noncoding mutation should be formally accounted for when determining gene- and pathway-mutation burden in cancer.
Project description:A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; P = 8.7 × 10⁻⁴). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, P = 1 × 10⁻⁵). We observed significant overlap between these transcription-based approaches (odds ratio (OR) = 2.5, 95% confidence interval (CI) 1.1-5.0, P = 5.4 × 10⁻³). CHD DNVs altered transcription levels in 5 of 31 enhancers assayed. Finally, we observed a DNV burden in RNA-binding-protein regulatory sites (OR = 1.13, 95% CI 1.1-1.2, P = 8.8 × 10⁻⁵). Our findings demonstrate an enrichment of potentially disruptive regulatory noncoding DNVs in a fraction of CHD at least as high as that observed for damaging coding DNVs.
Project description:Common diseases are complex, multifactorial disorders whose pathogenesis is influenced by the interplay of genetic predisposition and environmental factors. Genome-wide association studies have interrogated genetic polymorphisms across genomes of individuals to test associations between genotype and susceptibility to specific disorders, providing insights into the genetic architecture of several complex disorders. However, genetic variants associated with the susceptibility to common diseases are often located in noncoding regions of the genome, such as tissue-specific enhancers or long noncoding RNAs, suggesting that regulatory elements might play a relevant role in human diseases. Enhancers are cis-regulatory genomic sequences that act in concert with promoters to regulate gene expression in a precise spatiotemporal manner. They can be located at a considerable distance from their cognate target promoters, increasing the difficulty of their identification. Genomes are organized in domains of chromatin folding, namely topologically associating domains (TADs). Identification of enhancer-promoter interactions within TADs has revealed principles of cell-type specificity across several organisms and tissues. The vast majority of mammalian genomes are pervasively transcribed, accounting for a previously unappreciated complexity of the noncoding RNA fraction. Particularly, long noncoding RNAs have emerged as key players for the establishment of chromatin architecture and regulation of gene expression. In this perspective, we describe the new advances in the fields of transcriptomics and genome organization, focusing on the role of noncoding genomic variants in predisposition to common diseases. Finally, we propose a new framework for the identification of the next generation of pharmacological targets for common human diseases.
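For readers unfamiliar with the odds ratio and confidence interval notation used above, the sketch below computes an odds ratio from a 2×2 table with a Wald-type (Woolf) 95% interval on the log scale. The abstract does not state which interval method the authors used, so this is one common choice assumed for illustration. The counts are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for the table [[a, b], [c, d]] with a Woolf (log-scale) CI.

    a, b: cases with / without the feature
    c, d: controls with / without the feature
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to exercise the function.
or_, lo, hi = odds_ratio_ci(30, 70, 15, 85)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

An interval whose lower bound stays above 1 indicates enrichment in cases at the chosen confidence level. Small cell counts make this approximation unreliable, in which case an exact method is usually preferred.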
Project description:Genomes carry millions of noncoding variants, and identifying the tiny fraction with functional consequences is a major challenge for genomics. We assessed the role of selection on long noncoding RNAs (lncRNAs) for domestication-related changes in rice grains. Among 3363 lncRNA transcripts identified in early developing panicles, 95% of those with differential expression (329 lncRNAs) between Oryza sativa ssp. japonica and wild rice were significantly down-regulated in the domestication event. Joint genome and transcriptome analyses reveal that directional selection on lncRNAs altered the expression of energy metabolism genes during domestication. Transgenic experiments and population analyses with three focal lncRNAs illustrate that selection on these loci led to increased starch content and grain weight. Together, our findings indicate that genome-wide selection for lncRNA down-regulation was an important mechanism for the emergence of rice domestication traits.
Project description:Identifying functionally relevant variants against the background of ubiquitous genetic variation is a major challenge in human genetics. For variants in protein-coding regions, our understanding of the genetic code and splicing allows us to identify likely candidates, but interpreting variants outside genic regions is more difficult. Here we present genome-wide annotation of variants (GWAVA), a tool that supports prioritization of noncoding variants by integrating various genomic and epigenomic annotations.