Defining and Evaluating a Core Genome Multilocus Sequence Typing Scheme for Whole-Genome Sequence-Based Typing of Listeria monocytogenes.
ABSTRACT: Whole-genome sequencing (WGS) has emerged today as an ultimate typing tool to characterize Listeria monocytogenes outbreaks. However, data analysis and interlaboratory comparability of WGS data are still challenging for most public health laboratories. Therefore, we have developed and evaluated a new L. monocytogenes typing scheme based on genome-wide gene-by-gene comparisons (core genome multilocus sequence typing [cgMLST]) to allow for a unique typing nomenclature. Initially, we determined the breadth of the L. monocytogenes population based on MLST data with a Bayesian approach. Based on the genome sequence data of representative isolates for the whole population, cgMLST target genes were defined and reappraised with 67 L. monocytogenes isolates from two outbreaks and serotype reference strains. The Bayesian population analysis generated five L. monocytogenes groups. Using all available NCBI RefSeq genomes (n = 36) and six additionally sequenced strains, all genetic groups were covered. Pairwise comparisons of these 42 genome sequences resulted in 1,701 cgMLST targets present in all 42 genomes with 100% overlap and ≥90% sequence similarity. Overall, ≥99.1% of the cgMLST targets were present in 67 outbreak and serotype reference strains, underlining the representativeness of the cgMLST scheme. Moreover, cgMLST enabled clustering of outbreak isolates with ≤10 alleles difference and unambiguous separation from unrelated outgroup isolates. In conclusion, the novel cgMLST scheme not only improves outbreak investigations but also enables, due to the availability of the automatically curated cgMLST nomenclature, interlaboratory exchange of data that are crucial, especially for rapid responses during transsectorial outbreaks.
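The allele-difference count that underlies the clustering statement above can be illustrated with a minimal Python sketch. It is not the authors' software: the function name `allele_distance`, the toy profiles, and the choice to skip targets missing from either genome are assumptions made for illustration, since the abstract does not say how missing targets are handled.

```python
from typing import Optional, Sequence

Allele = Optional[int]  # None marks a target that was not called in a genome


def allele_distance(a: Sequence[Allele], b: Sequence[Allele]) -> int:
    """Count targets whose alleles differ, skipping targets missing in either profile."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same ordered set of targets")
    return sum(
        1
        for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )


# Hypothetical allele profiles over five targets (the scheme itself has 1,701).
profiles = {
    "outbreak_1": [1, 4, 2, 7, 3],
    "outbreak_2": [1, 4, 2, 8, 3],
    "outgroup": [5, 9, 6, None, 1],
}

d_outbreak = allele_distance(profiles["outbreak_1"], profiles["outbreak_2"])
d_outgroup = allele_distance(profiles["outbreak_1"], profiles["outgroup"])
print(d_outbreak, d_outgroup)
```

Because each target contributes at most one difference regardless of how many nucleotides changed, the distance is bounded by the number of targets in the scheme and can be compared across laboratories that share the same allele nomenclature.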
Project description:Enterococcus faecium, a common inhabitant of the human gut, has emerged in the last 2 decades as an important multidrug-resistant nosocomial pathogen. Since the start of the 21st century, multilocus sequence typing (MLST) has been used to study the molecular epidemiology of E. faecium. However, due to the use of a small number of genes, the resolution of MLST is limited. Whole-genome sequencing (WGS) now allows for high-resolution tracing of outbreaks, but current WGS-based approaches lack standardization, rendering them less suitable for interlaboratory prospective surveillance. To overcome this limitation, we developed a core genome MLST (cgMLST) scheme for E. faecium. cgMLST transfers genome-wide single nucleotide polymorphism (SNP) diversity into a standardized and portable allele numbering system that is far less computationally intensive than SNP-based analysis of WGS data. The E. faecium cgMLST scheme was built using 40 genome sequences that represented the diversity of the species. The scheme consists of 1,423 cgMLST target genes. To test the performance of the scheme, we performed WGS analysis of 103 outbreak isolates from five different hospitals in the Netherlands, Denmark, and Germany. The cgMLST scheme performed well in distinguishing between epidemiologically related and unrelated isolates, even between those that had the same sequence type (ST), which denotes the higher discriminatory power of this cgMLST scheme over that of conventional MLST. We also show that in terms of resolution, the performance of the E. faecium cgMLST scheme is equivalent to that of an SNP-based approach. In conclusion, the cgMLST scheme developed in this study facilitates rapid, standardized, and high-resolution tracing of E. faecium outbreaks.
Project description:Multi-country outbreaks of foodborne bacterial disease present challenges in their detection, tracking, and notification. As food is increasingly distributed across borders, such outbreaks are becoming more common. This increases the need for high-resolution, accessible, and replicable isolate typing schemes. Here we evaluate a core genome multilocus typing (cgMLST) scheme for the high-resolution reproducible typing of Salmonella enterica (S. enterica) isolates, by its application to a large European outbreak of S. enterica serovar Enteritidis. This outbreak had been extensively characterised using single nucleotide polymorphism (SNP)-based approaches. The cgMLST analysis was congruent with the original SNP-based analysis, the epidemiological data, and whole genome MLST (wgMLST) analysis. Combination of the cgMLST and epidemiological data confirmed that the genetic diversity among the isolates predated the outbreak, and was likely present at the infection source. There was consequently no link between country of isolation and genetic diversity, but the cgMLST clusters were congruent with date of isolation. Furthermore, comparison with publicly available Enteritidis isolate data demonstrated that the cgMLST scheme presented is highly scalable, enabling outbreaks to be contextualised within the Salmonella genus. The cgMLST scheme is therefore shown to be a standardised and scalable typing method, which allows Salmonella outbreaks to be analysed and compared across laboratories and jurisdictions.
Project description:Whole-genome sequencing (WGS) has been established for bacterial subtyping and is regularly used to study pathogen transmission, to investigate outbreaks, and to perform routine surveillance. Core-genome multilocus sequence typing (cgMLST) is a bacterial subtyping method that uses WGS data to provide a high-resolution strain characterization. This study aimed at developing a novel cgMLST scheme for Bacillus anthracis, a notorious pathogen that causes anthrax in livestock and humans worldwide. The scheme comprises 3,803 genes that were conserved in 57 B. anthracis genomes spanning the whole phylogeny. The scheme has been evaluated and applied to 584 genomes from 50 countries. On average, 99.5% of the cgMLST targets were detected. The cgMLST results confirmed the classical canonical single-nucleotide-polymorphism (SNP) grouping of B. anthracis into major clades and subclades. Genetic distances calculated based on cgMLST were comparable to distances from whole-genome-based SNP analysis with similar phylogenetic topology and comparable discriminatory power. Additionally, the application of the cgMLST scheme to anthrax outbreaks from Germany and Italy led to a definition of a cutoff threshold of five allele differences to trace epidemiologically linked strains for cluster typing and transmission analysis. Finally, the association of two clusters of B. anthracis with human cases of injectional anthrax in four European countries was confirmed using cgMLST. In summary, this study presents a novel cgMLST scheme that provides high-resolution strain genotyping for B. anthracis. This scheme can be used in parallel with SNP typing methods to facilitate rapid and harmonized interlaboratory comparisons, essential for global surveillance and outbreak analysis. The scheme is publicly available for application by users, including those with little bioinformatics knowledge.
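How a cutoff of allele differences can be turned into clusters is sketched below in Python. This is an assumption-laden illustration, not the procedure used in the study: it uses single-linkage grouping via union-find, treats the threshold as inclusive, ignores targets missing from either genome, and the isolate names and ten-target profiles are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence

Allele = Optional[int]  # None marks a target that was not called


def allele_distance(a: Sequence[Allele], b: Sequence[Allele]) -> int:
    """Number of targets called in both profiles that carry different alleles."""
    return sum(
        1 for x, y in zip(a, b) if x is not None and y is not None and x != y
    )


def single_linkage_clusters(
    profiles: Dict[str, Sequence[Allele]], threshold: int
) -> List[List[str]]:
    """Group isolates connected by chains of pairwise distances <= threshold."""
    parent = {name: name for name in profiles}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(profiles, 2):
        if allele_distance(profiles[x], profiles[y]) <= threshold:
            parent[find(x)] = find(y)

    clusters: Dict[str, List[str]] = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical profiles over ten targets.
toy_profiles = {
    "iso_A": [1] * 10,
    "iso_B": [1] * 8 + [2, 2],
    "iso_C": [1] * 7 + [2, 2, 5],
    "iso_D": [4] * 10,
    "iso_E": [4] * 9 + [6],
}

clusters = single_linkage_clusters(toy_profiles, threshold=5)
print(clusters)
```

Single linkage is only one possible choice; it can chain isolates together through intermediates, so a cluster may contain pairs that differ by more than the threshold.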
Project description:Clostridium difficile, recently renamed Clostridioides difficile, is the most common cause of antibiotic-associated nosocomial gastrointestinal infections worldwide. To differentiate endogenous infections and transmission events, highly discriminatory subtyping is necessary. Today, methods based on whole-genome sequencing data are increasingly used to subtype bacterial pathogens; however, frequently a standardized methodology and typing nomenclature are missing. Here we report a core genome multilocus sequence typing (cgMLST) approach developed for C. difficile. Initially, we determined the breadth of the C. difficile population based on all available MLST sequence types with Bayesian inference (BAPS). The resulting BAPS partitions were used in combination with C. difficile clade information to select representative isolates that were subsequently used to define cgMLST target genes. Finally, we evaluated the novel cgMLST scheme with genomes from 3,025 isolates. BAPS grouping (n = 6 groups) together with the clade information led to a total of 11 representative isolates that were included for cgMLST definition and resulted in 2,270 cgMLST genes that were present in all isolates. Overall, 2,184 to 2,268 cgMLST targets were detected in the genome sequences of 70 outbreak-associated and reference strains, and on average 99.3% cgMLST targets (1,116 to 2,270 targets) were present in 2,954 genomes downloaded from the NCBI database, underlining the representativeness of the cgMLST scheme. Moreover, reanalyses of different cluster scenarios with cgMLST were concordant with published single nucleotide variant analyses. In conclusion, the novel cgMLST is representative for the whole C. difficile population, is highly discriminatory in outbreak situations, and provides a unique nomenclature facilitating interlaboratory exchange.
Project description:The environmental bacterium Pseudomonas aeruginosa, particularly multidrug-resistant clones, is often associated with nosocomial infections and outbreaks. Today, core genome multilocus sequence typing (cgMLST) is frequently applied to delineate sporadic cases from nosocomial transmissions. However, until recently, no cgMLST scheme for a standardized typing of P. aeruginosa was available. To establish a novel cgMLST scheme for P. aeruginosa, we initially determined the breadth of the P. aeruginosa population based on MLST data with a Bayesian approach (BAPS). Using genomic data of representative isolates for the whole population and all 12 serogroups, we extracted target genes and further refined them using a random data set of 1,000 P. aeruginosa genomes. Subsequently, we investigated reproducibility and discriminatory ability with repeatedly sequenced isolates and isolates from well-defined outbreak scenarios, respectively, and compared clustering applying two recently published cgMLST schemes. BAPS generated seven P. aeruginosa groups. To cover these and all serogroups, 15 reference strains were used to determine genes common in all strains. After refinement with the data set of 1,000 genomes, the cgMLST scheme consisted of 3,867 target genes, which are representative of the P. aeruginosa population and highly reproducible using biological replicates. We finally evaluated the scheme by reanalyzing two published outbreaks where the authors used single-nucleotide polymorphism (SNP) typing. In both cases, cgMLST was concordant with the previous SNP results and the results of the two other cgMLST schemes. In conclusion, the highly reproducible novel P. aeruginosa cgMLST scheme facilitates outbreak investigations due to the publicly available cgMLST nomenclature.
Project description:Many listeriosis outbreaks are caused by a few globally distributed clonal groups, designated clonal complexes or epidemic clones, of Listeria monocytogenes, several of which have been defined by classic multilocus sequence typing (MLST) schemes targeting 6 to 8 housekeeping or virulence genes. We have developed and evaluated core genome MLST (cgMLST) schemes and applied them to isolates from multiple clonal groups, including those associated with 39 listeriosis outbreaks. The cgMLST clusters were congruent with MLST-defined clonal groups, which had various degrees of diversity at the whole-genome level. Notably, cgMLST could distinguish among outbreak strains and epidemiologically unrelated strains of the same clonal group, which could not be achieved using classic MLST schemes. The precise selection of cgMLST gene targets may not be critical for the general identification of clonal groups and outbreak strains. cgMLST analyses further identified outbreak strains, including those associated with recent outbreaks linked to contaminated French-style cheese, Hispanic-style cheese, stone fruit, caramel apple, ice cream, and packaged leafy green salad, as belonging to major clonal groups. We further developed lineage-specific cgMLST schemes, which can include accessory genes when core genomes do not possess sufficient diversity, and this provided additional resolution over species-specific cgMLST. Analyses of isolates from different common-source listeriosis outbreaks revealed various degrees of diversity, indicating that the numbers of allelic differences should always be combined with cgMLST clustering and epidemiological evidence to define a listeriosis outbreak. Classic multilocus sequence typing (MLST) schemes targeting internal fragments of 6 to 8 genes that define clonal complexes or epidemic clones have been widely employed to study L. monocytogenes biodiversity and its relation to pathogenicity potential and epidemiology.
We demonstrated that core genome MLST schemes can be used for the simultaneous identification of clonal groups and the differentiation of individual outbreak strains and epidemiologically unrelated strains of the same clonal group. We further developed lineage-specific cgMLST schemes that targeted more genomic regions than the species-specific cgMLST schemes. Our data revealed the genome-level diversity of clonal groups defined by classic MLST schemes. Our identification of U.S. and international outbreaks caused by major clonal groups can contribute to further understanding of the global epidemiology of L. monocytogenes.
Project description:We have employed whole genome sequencing to define and evaluate a core genome multilocus sequence typing (cgMLST) scheme for Acinetobacter baumannii. To define a core genome we downloaded a total of 1,573 putative A. baumannii genomes from NCBI as well as representative isolates belonging to the eight previously described international A. baumannii clonal lineages. The core genome was then employed against a total of fifty-three carbapenem-resistant A. baumannii isolates that were previously typed by PFGE and linked to hospital outbreaks in eight German cities. We defined a core genome of 2,390 genes of which an average 98.4% were called successfully from 1,339 A. baumannii genomes, while Acinetobacter nosocomialis, Acinetobacter pittii, and Acinetobacter calcoaceticus resulted in 71.2%, 33.3%, and 23.2% good targets, respectively. When tested against the previously identified outbreak strains, we found good correlation between PFGE and cgMLST clustering, with 0-8 allelic differences within a pulsotype, and 40-2,166 differences between pulsotypes. The highest number of allelic differences was between the isolates representing the international clones. This typing scheme was highly discriminatory and identified separate A. baumannii outbreaks. Moreover, because a standardised cgMLST nomenclature is used, the system will allow inter-laboratory exchange of data.
Project description:Background: As whole-genome sequencing for pathogen genomes becomes increasingly popular, the typing methods of gene-by-gene comparison, such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST), are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, genomic sequences with varying read depths, read lengths, and assemblers influence the genome assemblies, introducing error or missing alleles into the generated allelic profiles. These errors and missing alleles might create “specious discrepancy” among closely related isolates, thus making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause problems related to storage and maintenance as well as long query search times. Methods: We attempted to resolve these issues by decreasing the scheme size to reduce the occurrence of error and missing alleles, alleviate the storage burden, and improve the query search time. The challenge in this approach is maintaining the typing resolution when using fewer loci. We achieved this by using a popular artificial intelligence technique, XGBoost, coupled with Shapley additive explanations for feature selection. Finally, 370 loci from the original 1,701 cgMLST loci of Listeria monocytogenes were selected. Results: Although the size of the final scheme (LmScheme_370) was approximately 80% lower than that of the original cgMLST scheme, its discriminatory power, tested for 35 outbreaks, was concordant with that of the original cgMLST scheme. Although we used L. monocytogenes as a demonstration in this study, the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene–based epidemiology.
Project description:At present, the most used methods for Klebsiella pneumoniae subtyping are multilocus sequence typing (MLST) and pulsed-field gel electrophoresis (PFGE). However, the discriminatory power of MLST is insufficient to distinguish outbreak from non-outbreak isolates, and PFGE is time-consuming and labor-intensive. A core genome multilocus sequence typing (cgMLST) scheme for whole-genome sequence-based typing of K. pneumoniae was developed to overcome the disadvantages of these traditional molecular subtyping methods. Firstly, we used the complete genome of K. pneumoniae strain HKUOPLC as the reference genome and 907 genomes of K. pneumoniae downloaded from the NCBI database as the original genome dataset to determine cgMLST target genes. A total of 1,143 genes were retained as cgMLST target genes. Secondly, we used 26 K. pneumoniae strains from a nosocomial infection outbreak to evaluate the cgMLST scheme. cgMLST enabled clustering of outbreak strains with <10 alleles difference and unambiguous separation from unrelated outgroup strains. Moreover, cgMLST revealed that there may be several sub-clones of the epidemic ST11 clone. In conclusion, the novel cgMLST scheme not only showed higher discriminatory power than PFGE and MLST in outbreak investigations but also revealed more population structure characteristics than MLST.
Project description:Human campylobacteriosis, caused by Campylobacter jejuni and C. coli, remains a leading cause of bacterial gastroenteritis in many countries, but the epidemiology of campylobacteriosis outbreaks remains poorly defined, largely due to limitations in the resolution and comparability of isolate characterization methods. Whole-genome sequencing (WGS) data enable the improvement of sequence-based typing approaches, such as multilocus sequence typing (MLST), by substantially increasing the number of loci examined. A core genome MLST (cgMLST) scheme defines a comprehensive set of those loci present in most members of a bacterial group, balancing very high resolution with comparability across the diversity of the group. Here we propose a set of 1,343 loci as a human campylobacteriosis cgMLST scheme (v1.0), the allelic profiles of which can be assigned to core genome sequence types. The 1,343 loci chosen were a subset of the 1,643 loci identified in the reannotation of the genome sequence of C. jejuni isolate NCTC 11168, chosen as being present in >95% of draft genomes of 2,472 representative United Kingdom campylobacteriosis isolates, comprising 2,207 (89.3%) C. jejuni isolates and 265 (10.7%) C. coli isolates. Validation of the cgMLST scheme was undertaken with 1,478 further high-quality draft genomes, containing 150 or fewer contiguous sequences, from disease isolate collections: 99.5% of these isolates contained ≥95% of the 1,343 cgMLST loci. In addition to the rapid and effective high-resolution analysis of large numbers of diverse isolates, the cgMLST scheme enabled the efficient identification of very closely related isolates from a well-defined single-source campylobacteriosis outbreak.
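The presence-based selection of core loci described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published pipeline: locus detection is taken as given (each genome is represented by the set of loci found in it), and the function names `select_core_loci` and `scheme_coverage`, the locus labels, and the 40-genome toy data set are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def select_core_loci(
    genomes: Dict[str, Set[str]], min_fraction: float = 0.95
) -> List[str]:
    """Return loci found in more than min_fraction of the genomes."""
    if not genomes:
        raise ValueError("at least one genome is required")
    counts = Counter(locus for loci in genomes.values() for locus in loci)
    total = len(genomes)
    return sorted(
        locus for locus, n in counts.items() if n / total > min_fraction
    )


def scheme_coverage(found: Set[str], scheme: List[str]) -> float:
    """Fraction of scheme loci detected in a single genome."""
    return len(found.intersection(scheme)) / len(scheme)


# Hypothetical toy data: locA in 40/40 genomes, locB in 39/40, locC in 30/40.
genomes: Dict[str, Set[str]] = {}
for i in range(40):
    loci = {"locA"}
    if i < 39:
        loci.add("locB")
    if i < 30:
        loci.add("locC")
    genomes[f"genome_{i}"] = loci

core = select_core_loci(genomes)
coverage = scheme_coverage({"locA"}, core)
print(core, coverage)
```

The same `scheme_coverage` calculation, applied per isolate, corresponds to the kind of validation figure quoted in the abstract, where an isolate passes if it contains at least 95% of the scheme loci.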