Staphylococcus aureus viewed from the perspective of 40,000+ genomes.
ABSTRACT: Low-cost Illumina sequencing of clinically important bacterial pathogens has generated thousands of publicly available genomic datasets. Analyzing these genomes and extracting relevant information for each pathogen and the associated clinical phenotypes requires not only resources and bioinformatic skills but also organism-specific knowledge. In light of these issues, we created Staphopia, an analysis pipeline, database and application programming interface, focused on Staphylococcus aureus, a common colonizer of humans and a major antibiotic-resistant pathogen responsible for a wide spectrum of hospital and community-associated infections. Written in Python, Staphopia's analysis pipeline consists of submodules running open-source tools. It accepts raw FASTQ reads as an input, which undergo quality control filtration, error correction and reduction to a maximum of approximately 100× chromosome coverage. This downsampling significantly reduces total runtime without detrimentally affecting the results. The pipeline performs de novo assembly-based and mapping-based analysis. Automated gene calling and annotation are performed on the assembled contigs. Read-mapping is used to call variants (single nucleotide polymorphisms and insertions/deletions) against a reference S. aureus chromosome (N315, ST5). We ran the analysis pipeline on more than 43,000 S. aureus shotgun Illumina genome projects in the public European Nucleotide Archive database in November 2017. We found that only a quarter of known multi-locus sequence types (STs) were represented but the top 10 STs made up 70% of all genomes. Methicillin-resistant S. aureus (MRSA) accounted for 64% of all genomes. Using the Staphopia database we selected 380 high quality genomes deposited with good metadata, each from a different multi-locus ST, as a non-redundant diversity set for studying S. aureus evolution.
In addition to answering basic science questions, Staphopia could serve as a potential platform for rapid clinical diagnostics of S. aureus isolates in the future. The system could also be adapted as a template for other organism-specific databases.
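The coverage cap described in the abstract above can be illustrated with a short calculation. This is a minimal sketch, not Staphopia's actual code: the function name, the approximate 2.8 Mb genome size, and the example read totals are all assumptions chosen for illustration. It shows how the fraction of reads to keep follows from the estimated coverage.

```python
# Hypothetical helper (not from Staphopia): compute what fraction of reads
# to keep so that expected chromosome coverage does not exceed a cap.

ASSUMED_GENOME_SIZE = 2_800_000  # assumption: approximate S. aureus chromosome size in bp


def subsample_fraction(total_bases: int,
                       genome_size: int = ASSUMED_GENOME_SIZE,
                       max_coverage: float = 100.0) -> float:
    """Return the fraction of reads to retain (between 0 and 1)."""
    if total_bases <= 0 or genome_size <= 0 or max_coverage <= 0:
        raise ValueError("total_bases, genome_size and max_coverage must be positive")
    # Estimated coverage = total sequenced bases / genome length.
    coverage = total_bases / genome_size
    # Keep everything if already under the cap, otherwise scale down.
    return min(1.0, max_coverage / coverage)


if __name__ == "__main__":
    # A run with 840 Mb of sequence is about 300x, so roughly a third is kept.
    print(round(subsample_fraction(840_000_000), 3))
    # A run with 140 Mb is about 50x, so nothing is discarded.
    print(subsample_fraction(140_000_000))
```

In practice the retained fraction would be passed to a read-sampling tool; which tool the pipeline uses for that step is not stated in the abstract.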
Project description:Background: The concept of the "pan-genome," which refers to the total complement of genes within a given sample or species, is well established in bacterial genomics. Rapid and scalable pipelines are available for managing and interpreting pan-genomes from large batches of annotated assemblies. However, despite overwhelming evidence that variation in intergenic regions in bacteria can directly influence phenotypes, most current approaches for analyzing pan-genomes focus exclusively on protein-coding sequences. Findings: To address this, we present Piggy, a novel pipeline that emulates Roary except that it is based only on intergenic regions. A key utility provided by Piggy is the detection of highly divergent ("switched") intergenic regions (IGRs) upstream of genes. We demonstrate the use of Piggy on large datasets of clinically important lineages of Staphylococcus aureus and Escherichia coli. Conclusions: For S. aureus, we show that highly divergent (switched) IGRs are associated with differences in gene expression and we establish a multilocus reference database of IGR alleles (igMLST; implemented in BIGSdb).
Project description:The Uppsala University Chlamydia trachomatis multilocus sequence type (MLST) database (http://mlstdb.bmc.uu.se) is based on five target regions (non-housekeeping genes) and the ompA gene. Each target has a different number of alleles (hctB, 89; CT058, 51; CT144, 30; CT172, 38; and pbpB, 35) derived from 13 studies. Our aims were to perform an overall analysis of all C. trachomatis MLST sequence types (STs) in the database, examine STs with global spread, and evaluate the phylogenetic capability by using the five targets. A total of 415 STs were recognized from 2,089 specimens. The addition of 49 ompA gene variants created 459 profiles. ST variation and their geographical distribution were characterized using eBURST and minimum spanning tree analyses. There were 609 samples from men having sex with men (MSM), with 4 predominating STs detected in this group, comprising 63% of MSM cases. Four other STs predominated among 1,383 heterosexual cases, comprising 31% of this group. The diversity index in ocular trachoma cases was significantly lower than in sexually transmitted chlamydia infections. Predominating STs were identified in 12 available C. trachomatis whole genomes which were compared to 22 C. trachomatis full genomes without predominating STs. No specific gene in the 12 genomes with predominating STs could be linked to successful spread of certain STs. Phylogenetic analysis showed that MLST targets provide a tree similar to trees based on whole-genome analysis. The presented MLST scheme identified C. trachomatis strains with global spread. It provides a tool for epidemiological investigations and is useful for phylogenetic analyses.
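The abstract above compares a diversity index between groups without naming the index. As an illustration only, the sketch below computes Simpson's index of diversity (in the Hunter-Gaston form often used for typing schemes); treating this as the index in question is an assumption, and the function name and toy ST labels are hypothetical.

```python
from collections import Counter
from typing import Iterable


def simpson_diversity(sequence_types: Iterable[str]) -> float:
    """Simpson's index of diversity: D = 1 - sum(n_i * (n_i - 1)) / (N * (N - 1)).

    n_i is the number of isolates assigned to ST i and N is the total number
    of isolates. D is the probability that two isolates drawn at random
    without replacement belong to different STs.
    """
    counts = Counter(sequence_types)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("at least two isolates are required")
    same_type_pairs = sum(n * (n - 1) for n in counts.values())
    return 1.0 - same_type_pairs / (total * (total - 1))


if __name__ == "__main__":
    # Toy data (hypothetical labels): one dominant ST lowers the index.
    dominated = ["ST_A"] * 8 + ["ST_B", "ST_C"]
    even = ["ST_A", "ST_B", "ST_C", "ST_D", "ST_E"] * 2
    print(simpson_diversity(dominated))
    print(simpson_diversity(even))
```

A group dominated by a few STs, such as the MSM cases described above, would score lower on this kind of index than a group with evenly spread STs.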
Project description:The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without reference genome) and the other is a reference pipeline (with reference genome). Moreover, we used a reference pipeline with two different genomes, i.e., <i>Bos taurus</i> and European bison. GBS is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline in comparison to the de novo pipeline. The decreased number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. It has been confirmed that all the SNPs obtained from the de novo/<i>Bos taurus</i> and <i>Bos taurus</i> reference pipelines were unique and not included in the 800 K BovineHD BeadChip.
Project description:Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing 2737 CATH superfamilies. Gene3D has previously featured in the Database issue of NAR and here we report updates to the website and database. The current Gene3D (v14) release has expanded its domain assignments to approximately 20,000 cellular genomes and over 43 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. Additionally, the structural models have been expanded to include an extra model organism (Drosophila melanogaster). We also document a number of additional visualization tools in the Gene3D website.
Project description:Methicillin-resistant <i>Staphylococcus aureus</i> (MRSA) presenting <i>spa</i> type t899 is commonly associated with sequence type 9 (ST9) but is also increasingly linked to ST398. This study provides genomic insight into the diversity of t899 isolates using core genome multilocus sequence typing (cgMLST), single nucleotide polymorphism (SNP)-based phylogeny, and the description of selected antimicrobial resistance and virulence markers. The SNP-based phylogenetic tree showed that isolates sharing the same <i>spa</i> type (t899) but different STs highly diverged in their core and accessory genomes, revealing discriminant antimicrobial resistance (AMR) and virulence markers. Our results highlighted the idea that in a surveillance context where only <i>spa</i> typing is used, an additional multiplex PCR for the detection of the <i>tet</i>(M), <i>sak</i>, and <i>seg</i> genes would be valuable in helping distinguish ST9 from ST398 isolates on a routine basis. <b>IMPORTANCE</b> This study showed the genetic diversity and population structure of <i>S. aureus</i> presenting the same <i>spa</i> type, t899, but belonging to different STs. Our findings revealed that these isolates vary deeply in their core and accessory genomes, contrary to what is regularly inferred from studies using <i>spa</i> typing only. Given that identical <i>spa</i> types can be associated with different STs and that <i>spa</i> typing only is not appropriate for <i>S. aureus</i> isolates that have undergone major recombination events which include the passage of the <i>spa</i> gene (such as in t899-positive MRSA), the combination of both MLST and <i>spa</i> typing methods is recommended. However, <i>spa</i> typing alone is still largely used in surveillance studies and basic characterization. Our data suggest that additional markers, such as <i>tet</i>(M), <i>sak</i>, and <i>seg</i> genes, could be implemented in an easy and inexpensive manner in order to identify <i>S. aureus</i> lineages with a higher accuracy.
Project description:The complete genomes of four Brachyspira hyodysenteriae isolates of the four different sequence types (STs) (ST6, ST66, ST196, and ST197) causing swine dysentery in Switzerland were generated by whole-genome sequencing and <i>de novo</i> hybrid assembly of reads obtained from second (Illumina) and third (Oxford Nanopore Technologies and Pacific Biosciences) generation high-throughput sequencing.
Project description:Mapping metagenome reads to reference databases is the standard approach for assessing microbial taxonomic and functional diversity from metagenomic data. However, public reference databases often lack recently generated genomic data such as metagenome-assembled genomes (MAGs), which can limit the sensitivity of read-mapping approaches. We previously developed the Struo pipeline in order to provide a straightforward method for constructing custom databases; however, the pipeline does not scale well enough to cope with the ever-increasing number of publicly available microbial genomes. Moreover, the pipeline does not allow for efficient database updating as new data are generated. To address these issues, we developed Struo2, which is >3.5-fold faster than Struo at database generation and can also efficiently update existing databases. We also provide custom Kraken2, Bracken, and HUMAnN3 databases that can be easily updated with new genomes and/or individual gene sequences. Efficient database updating, coupled with our pre-generated databases, enables "assembly-enhanced" profiling, which increases database comprehensiveness via inclusion of native genomic content. Inclusion of newly generated genomic content can greatly increase database comprehensiveness, especially for understudied biomes, which will enable more accurate assessments of microbiome diversity.
Project description:Infections due to <i>Staphylococcus argenteus</i> have been increasingly reported worldwide and the microbe cannot be distinguished from <i>Staphylococcus aureus</i> by standard methods. Its complement of virulence determinants and antibiotic resistance genes remains unclear, and how far these are distinct from those produced by <i>S. aureus</i> remains undetermined. In order to address these uncertainties, we have collected 132 publicly available sequences from fourteen different countries, including the United Kingdom, between 2005 and 2018 to study the global genetic structure of the population. We have compared the genomes for antibiotic resistance genes, virulence determinants and mobile genetic elements such as phages, pathogenicity islands and presence of plasmid groups between different clades. 20% (<i>n</i> = 26) of isolates were methicillin resistant, harboring a <i>mecA</i> gene, and 88% were penicillin resistant, harboring the <i>blaZ</i> gene. ST2250 was identified as the most frequent strain, but ST1223, which was the second largest group, contained a marginally larger number of virulence genes compared to the other STs. Novel <i>S. argenteus</i> pathogenicity islands were identified in our isolates harboring <i>tsst-1, seb, sec3, ear, selk, selq</i> toxin genes, as well as chromosomal clusters of enterotoxin and superantigen-like genes. Strain-specific type I modification systems were widespread, which would limit interstrain transfer of genetic material. In addition, ST2250 possessed a CRISPR/Cas system, lacking in most other STs. <i>S. argenteus</i> possesses important genetic differences from <i>S. aureus</i>, as well as between different STs, with the potential to produce distinct clinical manifestations.
Project description:BACKGROUND: Pathogen diagnostic assays based on polymerase chain reaction (PCR) technology provide high sensitivity and specificity. However, the design of these diagnostic assays is computationally intensive, requiring high-throughput methods to identify unique PCR signatures in the presence of an ever increasing availability of sequenced genomes. RESULTS: We present the Tool for PCR Signature Identification (TOPSI), a high-performance computing pipeline for the design of PCR-based pathogen diagnostic assays. The TOPSI pipeline efficiently designs PCR signatures common to multiple bacterial genomes by obtaining the shared regions through pairwise alignments between the input genomes. TOPSI successfully designed PCR signatures common to 18 Staphylococcus aureus genomes in less than 14 hours using 98 cores on a high-performance computing system. CONCLUSIONS: TOPSI is a computationally efficient, fully integrated tool for high-throughput design of PCR signatures common to multiple bacterial genomes. TOPSI is freely available for download at http://www.bhsai.org/downloads/topsi.tar.gz.
Project description:Background: For the plant pathogenic phytoplasmas, as well as for several fastidious prokaryotes, axenic cultivation is extremely difficult or not possible yet; therefore, even with second generation sequencing methods, obtaining the sequence of their genomes is challenging due to host sequence contamination. Objective: With the Phytoassembly pipeline presented here, we aim to provide a method to obtain high quality genome drafts for the phytoplasmas and other uncultivable plant pathogens, by exploiting the coverage differential in the Illumina sequences from the pathogen and the host, and using the sequencing of a healthy, isogenic plant as a filter. Validation: The pipeline has been benchmarked using simulated and real Illumina runs from phytoplasmas whose genome is known, and it was then used to obtain high quality drafts for three new phytoplasma genomes. Conclusion: For phytoplasma infected samples containing >2-4% of pathogen DNA and an isogenic reference healthy sample, the resulting assemblies can be next to complete. The Phytoassembly source code is available on GitHub at https://github.com/cpolano/phytoassembly.