A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events.
ABSTRACT: Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly it is also computationally efficient-experiments took one hour and a half on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis, and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa-along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.
Project description:Most disease-associated variants identified by genome-wide association studies (GWAS) exist in noncoding regions. In spite of the common agreement that such variants may disrupt biological functions of their hosting regulatory elements, it remains a great challenge to characterize the risk of a genetic variant within the implicated genome sequence. Therefore, it is essential to develop an effective computational model that is not only capable of predicting the potential risk of a genetic variant but also valid in interpreting how the function of the genome is affected with the occurrence of the variant.We developed a method named kmerForest that used a random forest classifier with k-mer counts to predict accessible chromatin regions purely based on DNA sequences. We demonstrated that our method outperforms existing methods in distinguishing known accessible chromatin regions from random genomic sequences. Furthermore, the performance of our method can further be improved with the incorporation of sequence conservation features. Based on this model, we assessed importance of the k-mer features by a series of permutation experiments, and we characterized the risk of a single nucleotide polymorphism (SNP) on the function of the genome using the difference between the importance of the k-mer features affected by the occurrence of the SNP. We conducted a series of experiments and showed that our model can well discriminate between pathogenic and normal SNPs. Particularly, our model correctly prioritized SNPs that are proved to be enriched for the binding sites of FOXA1 in breast cancer cell lines from previous studies.We presented a novel method to interpret functional genetic variants purely base on DNA sequences. The proposed k-mer based score offers an effective means of measuring the impact of SNPs on the function of the genome, and thus shedding light on the identification of genetic risk factors underlying complex traits and diseases.
Project description:Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations creates challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
Project description:Genome-wide association studies (GWAS) aim to identify genetic factors associated with phenotypes. Standard analyses test variants for associations individually. However, variant-level associations are hard to identify and can be difficult to interpret biologically. Enrichment analyses help address both problems by targeting sets of biologically related variants. Here we introduce a new model-based enrichment method that requires only GWAS summary statistics. Applying this method to interrogate 4,026 gene sets in 31 human phenotypes identifies many previously-unreported enrichments, including enrichments of endochondral ossification pathway for height, NFAT-dependent transcription pathway for rheumatoid arthritis, brain-related genes for coronary artery disease, and liver-related genes for Alzheimer's disease. A key feature of our method is that inferred enrichments automatically help identify new trait-associated genes. For example, accounting for enrichment in lipid transport genes highlights association between MTTP and low-density lipoprotein levels, whereas conventional analyses of the same data found no significant variants near this gene.
Project description:Background:Multivariate testing tools that integrate multiple genome-wide association studies (GWAS) have become important as the number of phenotypes gathered from study cohorts and biobanks has increased. While these tools have been shown to boost statistical power considerably over univariate tests, an important remaining challenge is to interpret which traits are driving the multivariate association and which traits are just passengers with minor contributions to the genotype-phenotypes association statistic. Results:We introduce MetaPhat, a novel bioinformatics tool to conduct GWAS of multiple correlated traits using univariate GWAS results and to decompose multivariate associations into sets of central traits based on intuitive trace plots that visualize Bayesian Information Criterion (BIC) and P-value statistics of multivariate association models. We validate MetaPhat with Global Lipids Genetics Consortium GWAS results, and we apply MetaPhat to univariate GWAS results for 21 heritable and correlated polyunsaturated lipid species from 2,045 Finnish samples, detecting seven independent loci associated with a cluster of lipid species. In most cases, we are able to decompose these multivariate associations to only three to five central traits out of all 21 traits included in the analyses. We release MetaPhat as an open source tool written in Python with built-in support for multi-processing, quality control, clumping and intuitive visualizations using the R software. Conclusion:MetaPhat efficiently decomposes associations between multivariate phenotypes and genetic variants into smaller sets of central traits and improves the interpretation and specificity of genome-phenome associations. MetaPhat is freely available under the MIT license at: https://sourceforge.net/projects/meta-pheno-association-tracer.
Project description:BACKGROUND:Continuing evolution of the Mycobacterium tuberculosis (Mtb) complex genomes associated with resistance to anti-tuberculosis drugs is threatening tuberculosis disease control efforts. Both multi- and extensively drug resistant Mtb (MDR and XDR, respectively) are increasing in prevalence, but the full set of Mtb genes involved are not known. There is a need for increased sensitivity of genome-wide approaches in order to elucidate the genetic basis of anti-microbial drug resistance and gain a more detailed understanding of Mtb genome evolution in a context of widespread antimicrobial therapy. Population structure within the Mtb complex, due to clonal expansion, lack of lateral gene transfer and low levels of recombination between lineages, may be reducing statistical power to detect drug resistance associated variants. RESULTS:To investigate the effect of lineage-specific effects on the identification of drug resistance associations, we applied genome-wide association study (GWAS) and convergence-based (PhyC) methods to multiple drug resistance phenotypes of a global dataset of Mtb lineages 2 and 4, using both lineage-wise and combined approaches. We identify both well-established drug resistance variants and novel associations; uniquely identifying associations for both lineage-specific and -combined GWAS analyses. We report 17 potential novel associations between antimicrobial resistance phenotypes and Mtb genomic variants. CONCLUSIONS:For GWAS, both lineage-specific and -combined analyses are useful, whereas PhyC may perform better in contexts of greater diversity. Unique associations with XDR in lineage-specific analyses provide evidence of diverging evolutionary trajectories between lineages 2 and 4 in response to antimicrobial drug therapy.
Project description:Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, which have provided clinical and medical benefits to patients with novel biomarkers and therapeutic targets. However, identification of risk variants associated with complex diseases remains challenging as they are often affected by many genetic variants with small or moderate effects. There has been accumulating evidence suggesting that different complex traits share common risk basis, namely pleiotropy. Recently, several statistical methods have been developed to improve statistical power to identify risk variants for complex traits through a joint analysis of multiple GWAS datasets by leveraging pleiotropy. While these methods were shown to improve statistical power for association mapping compared to separate analyses, they are still limited in the number of phenotypes that can be integrated. In order to address this challenge, in this paper, we propose a novel statistical framework, graph-GPA, to integrate a large number of GWAS datasets for multiple phenotypes using a hidden Markov random field approach. Application of graph-GPA to a joint analysis of GWAS datasets for 12 phenotypes shows that graph-GPA improves statistical power to identify risk variants compared to statistical methods based on smaller number of GWAS datasets. In addition, graph-GPA also promotes better understanding of genetic mechanisms shared among phenotypes, which can potentially be useful for the development of improved diagnosis and therapeutics. The R implementation of graph-GPA is currently available at https://dongjunchung.github.io/GGPA/.
Project description:Advances in genome sequencing technologies and genome-wide association studies (GWAS) have provided unprecedented insights into the molecular basis of microbial phenotypes and enabled the identification of the underlying genetic variants in real populations. However, utilization of genome sequencing in clinical phenotyping of bacteria is challenging due to the lack of reliable and accurate approaches. Here, we report a method for predicting microbial resistance patterns using genome sequencing data. We analyzed whole genome sequences of 1,680 Streptococcus pneumoniae isolates from four independent populations using GWAS and identified probable hotspots of genetic variation which correlate with phenotypes of resistance to essential classes of antibiotics. With the premise that accumulation of putative resistance-conferring SNPs, potentially in combination with specific resistance genes, precedes full resistance, we retrogressively surveyed the hotspot loci and quantified the number of SNPs and/or genes, which if accumulated would confer full resistance to an otherwise susceptible strain. We name this approach the 'distance to resistance'. It can be used to identify the creep towards complete antibiotics resistance in bacteria using genome sequencing. This approach serves as a basis for the development of future sequencing-based methods for predicting resistance profiles of bacterial strains in hospital microbiology and public health settings.
Project description:Although genome-wide association studies (GWAS) have been successful at finding thousands of disease-associated genetic variants (GVs), identifying causal variants and elucidating the mechanisms by which genotypes influence phenotypes are critical open questions. A key challenge is that a large percentage of disease-associated GVs are potential regulatory variants located in noncoding regions, making them difficult to interpret. Recent research efforts focus on going beyond annotating GVs by integrating functional annotation data with GWAS to prioritize GVs. However, applicability of these approaches is challenged by high dimensionality and heterogeneity of functional annotation data. Furthermore, existing methods often assume global associations of GVs with annotation data. This strong assumption is susceptible to violations for GVs involved in many complex diseases. To address these issues, we develop a general regression framework, named Annotation Regression for GWAS (ARoG). ARoG is based on a finite mixture of linear regressions model where GWAS association measures are viewed as responses and functional annotations as predictors. This mixture framework addresses heterogeneity of effects of GVs by grouping them into clusters and high dimensionality of the functional annotations by enabling annotation selection within each cluster. ARoG further employs permutation testing to evaluate the significance of selected annotations. Computational experiments indicate that ARoG can discover distinct associations between disease risk and functional annotations. Application of ARoG to autism and schizophrenia data from Psychiatric Genomics Consortium led to identification of GVs that significantly affect interactions of several transcription factors with DNA as potential mechanisms contributing to these disorders.
Project description:One of the most challenging tasks of the post-genome-wide association studies (GWAS) research era is the identification of functional variants among those associated with a trait for an observed GWAS signal. Several methods have been developed to evaluate the potential functional implications of genetic variants. Each of these tools has its own scoring system, which forces users to become acquainted with each approach to interpret their results. From an awareness of the amount of work needed to analyze and integrate results for a single locus, we proposed a flexible and versatile approach designed to help the prioritization of variants by aggregating the predictions of their potential functional implications. This approach has been made available through a graphical user interface called DSNetwork, which acts as a single point of entry to almost 60 reference predictors for both coding and non-coding variants and displays predictions in an easy-to-interpret visualization. We confirmed the usefulness of our methodology by successfully identifying functional variants in four breast cancer and nine schizophrenia susceptibility loci.
Project description:As our understanding of genetics has improved, genome-wide association studies (GWAS) have identified numerous variants associated with lifestyle behaviours and health outcomes. However, what is sometimes overlooked is the possibility that genetic variants identified in GWAS of disease might reflect the effect of modifiable risk factors as well as direct genetic effects. We discuss this possibility with illustrative examples from tobacco and alcohol research, in which genetic variants that predict behavioural phenotypes have been seen in GWAS of diseases known to be causally related to these behaviours. This consideration has implications for the interpretation of GWAS findings.