Refinement of whole-genome multilocus sequence typing analysis by addressing gene paralogy.
ABSTRACT: We developed a user-friendly program, Genome Profiler (GeP), to refine whole-genome multilocus sequence typing analysis by addressing gene paralogy with conserved gene neighborhoods. In comparison to similar programs, GeP produced overall the best results in terms of accuracy and is thus a useful alternative to resolve relationships of bacterial isolates.
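The abstract does not describe how GeP uses gene neighborhoods, so the following is only an illustrative sketch of the general idea, not GeP's implementation. It assumes each genome is given as an ordered list of gene labels and that the reference gene occurs once in the reference; the function names `flanking` and `pick_by_neighborhood`, the `window` parameter, and the gene labels are all hypothetical.

```python
def flanking(order, index, window=1):
    """Return the set of gene labels within `window` positions of `index`."""
    left = order[max(0, index - window):index]
    right = order[index + 1:index + 1 + window]
    return set(left) | set(right)


def pick_by_neighborhood(reference_order, query_order, gene, window=1):
    """Among several copies of `gene` in the query genome, pick the copy whose
    flanking genes best match the reference neighborhood (hypothetical rule).

    Returns the index of the chosen copy in `query_order`, or None when the
    gene is absent or the best score is tied between copies.
    """
    ref_neighbors = flanking(reference_order, reference_order.index(gene), window)
    scored = []
    for i, label in enumerate(query_order):
        if label == gene:
            shared = len(ref_neighbors & flanking(query_order, i, window))
            scored.append((shared, i))
    if not scored:
        return None
    scored.sort(reverse=True)
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: neighborhoods do not separate the copies
    return scored[0][1]


# Hypothetical example: geneB occurs twice in the query genome.
reference = ["geneA", "geneB", "geneC", "geneD"]
query = ["geneX", "geneB", "geneY", "geneA", "geneB", "geneC"]
chosen = pick_by_neighborhood(reference, query, "geneB")
print(chosen)
```

In this toy case the second copy of geneB is selected because it is flanked by geneA and geneC, as in the reference; a tie is reported as None rather than resolved arbitrarily.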
Project description:We present ParaDB (http://abi.marseille.inserm.fr/paradb/), a new database for large-scale paralogy studies in vertebrate genomes. We intended to collect all information (sequence, mapping and phylogenetic data) needed to map and detect new paralogous regions, previously defined as Paralogons. The AceDB database software was used to generate graphical objects and to organize data. General data were automatically collated from public sources (Ensembl, GadFly and RefSeq). ParaDB provides access to data derived from whole genome sequences (Homo sapiens, Mus musculus and Drosophila melanogaster): cDNA and protein sequences, positional information, bibliographical links. In addition, we provide BLAST results for each protein sequence, InParanoid orthologs and 'In-Paralogs' data, previously established paralogy data, and, to compare vertebrates and Drosophila, orthology data.
Project description:Background: As whole-genome sequencing for pathogen genomes becomes increasingly popular, the typing methods of gene-by-gene comparison, such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST), are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, genomic sequences with varying read depths, read lengths, and assemblers influence the genome assemblies, introducing error or missing alleles into the generated allelic profiles. These errors and missing alleles might create “specious discrepancy” among closely related isolates, thus making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause problems related to storage and maintenance as well as long query search times. Methods: We attempted to resolve these issues by decreasing the scheme size to reduce the occurrence of error and missing alleles, alleviate the storage burden, and improve the query search time. The challenge in this approach is maintaining the typing resolution when using fewer loci. We achieved this by using a popular artificial intelligence technique, XGBoost, coupled with Shapley additive explanations for feature selection. Finally, 370 loci from the original 1701 cgMLST loci of Listeria monocytogenes were selected. Results: Although the size of the final scheme (LmScheme_370) was approximately 80% lower than that of the original cgMLST scheme, its discriminatory power, tested for 35 outbreaks, was concordant with that of the original cgMLST scheme. Although we used L. monocytogenes as a demonstration in this study, the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene–based epidemiology.
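How missing alleles can distort a comparison between closely related isolates can be illustrated with a minimal sketch. It assumes allelic profiles are dictionaries mapping locus names to allele numbers, with None marking a missing allele; the function name `allelic_distance`, the `ignore_missing` flag, and the example loci are hypothetical and not taken from the study.

```python
def allelic_distance(profile_a, profile_b, ignore_missing=True):
    """Count the shared loci at which two allelic profiles differ.

    When `ignore_missing` is True, loci with a missing allele (None) in either
    profile are skipped; otherwise each such locus counts as a difference.
    """
    differences = 0
    for locus in profile_a.keys() & profile_b.keys():
        a, b = profile_a[locus], profile_b[locus]
        if a is None or b is None:
            if not ignore_missing:
                differences += 1
            continue
        if a != b:
            differences += 1
    return differences


# Hypothetical profiles: identical except for one allele missing in isolate_2.
isolate_1 = {"locus_1": 3, "locus_2": 7, "locus_3": 1, "locus_4": 12}
isolate_2 = {"locus_1": 3, "locus_2": None, "locus_3": 1, "locus_4": 12}

print(allelic_distance(isolate_1, isolate_2, ignore_missing=True))
print(allelic_distance(isolate_1, isolate_2, ignore_missing=False))
```

The two calls give different distances for the same pair of isolates, which is the kind of discrepancy a smaller scheme with fewer error-prone loci is meant to reduce.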
Project description:Mycobacterium xenopi is an opportunistic mycobacterial pathogen of increasing clinical importance. Surveillance of M. xenopi is hampered by the absence of tools for genotyping and molecular epidemiology. In this study, we describe the development and evaluation of an effective multilocus sequence typing strategy for M. xenopi.
Project description:Reliable prediction of orthology is central to comparative genomics. Approaches based on phylogenetic analyses closely resemble the original definition of orthology and paralogy and are known to be highly accurate. However, the large computational cost associated with these analyses is a limiting factor that often prevents their use at genomic scales. Recently, several projects have addressed the reconstruction of large collections of high-quality phylogenetic trees from which orthology and paralogy relationships can be inferred. This provides us with the opportunity to infer the evolutionary relationships of genes from multiple, independent phylogenetic trees. Using such a strategy, we combine phylogenetic information derived from different databases to predict orthology and paralogy relationships for 4.1 million proteins in 829 fully sequenced genomes. We show that the number of independent sources from which a prediction is made, as well as the level of consistency across predictions, can be used as reliable confidence scores. A webserver has been developed to easily access these data (http://orthology.phylomedb.org), which provides users with a global repository of phylogeny-based orthology and paralogy predictions.
Project description:Multilocus sequence typing has been useful for genotyping pathogens in surveillance and epidemiologic studies. However, it cannot reflect the true relationships of isolates for species with very dynamic genomes. Using a robust genome phylogeny, we demonstrated the limitations of this method for typing Acinetobacter baumannii.
Project description:PCR primers targeting loci in the current Burkholderia cepacia complex multilocus sequence typing scheme were redesigned to (i) more reliably amplify these loci from B. cepacia complex species, (ii) amplify these same loci from additional Burkholderia species, and (iii) enable the use of a single primer set per locus for both amplification and DNA sequencing.
Project description:Accurate predictions of orthology and paralogy relationships are necessary to infer human molecular function from experiments in model organisms. Previous genome-scale approaches to predicting these relationships have been limited by their use of protein similarity and their failure to take into account multiple splicing events and gene prediction errors. We have developed PhyOP, a new phylogenetic orthology prediction pipeline based on synonymous rate estimates, which accurately predicts orthology and paralogy relationships for transcripts, genes, exons, or genomic segments between closely related genomes. We were able to identify orthologue relationships to human genes for 93% of all dog genes from Ensembl. Among 1:1 orthologues, the alignments covered a median of 97.4% of protein sequences, and 92% of orthologues shared essentially identical gene structures. PhyOP accurately recapitulated genomic maps of conserved synteny. Benchmarking against predictions from Ensembl and Inparanoid showed that PhyOP is more accurate, especially in its predictions of paralogy. Nearly half (46%) of PhyOP paralogy predictions are unique. Using PhyOP to investigate orthologues and paralogues in the human and dog genomes, we found that the human assembly contains 3-fold more gene duplications than the dog. Species-specific duplicate genes, or "in-paralogues," are generally shorter and have fewer exons than 1:1 orthologues, which is consistent with selective constraints and mutation biases based on the sizes of duplicated genes. In-paralogues have experienced elevated amino acid and synonymous nucleotide substitution rates. Duplicates possess similar biological functions for either the dog or human lineages. Having accounted for 2,954 likely pseudogenes and gene fragments, and after separating 346 erroneously merged genes, we estimated that the human genome encodes a minimum of 19,700 protein-coding genes, similar to the gene count of nematode worms. 
PhyOP is a fast and robust approach to orthology prediction that will be applicable to whole genomes from multiple closely related species. PhyOP will be particularly useful in predicting orthology for mammalian genomes that have been incompletely sequenced, and for large families of rapidly duplicating genes.
Project description:Staphylococcus pseudintermedius is an opportunistic pathogen in dogs. Four housekeeping genes with allelic polymorphisms were identified and used to develop an expanded multilocus sequence typing (MLST) scheme. The new seven-locus technique shows S. pseudintermedius to have greater genetic diversity than previous methods and discriminates more isolates based upon host origin.
Project description:We evaluated three multilocus sequence typing (MLST) schemes for Staphylococcus epidermidis and selected the seven most discriminatory loci for the formation of a new, more powerful MLST scheme. This improved scheme gave 31 sequence types (STs) and 5 clonal complexes (CCs), whereas the other schemes delineated 16 to 24 STs and 1 to 3 CCs.
Project description:A multilocus sequence typing (MLST) scheme was developed for Klebsiella pneumoniae. Sequences of seven housekeeping genes were obtained for 67 K. pneumoniae strains, including 19 ceftazidime- and ciprofloxacin-resistant isolates. Forty distinct allelic profiles were identified. MLST data were validated against ribotyping and showed high (96%) discriminatory power. The MLST approach provides unambiguous data useful for the epidemiology of K. pneumoniae isolates.
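The abstract does not state how discriminatory power was computed. A common choice for typing schemes is Simpson's index of diversity as applied by Hunter and Gaston, D = 1 - [1/(N(N-1))] Σ n_j(n_j - 1), where N is the number of isolates and n_j the number of isolates of type j. The sketch below assumes that index; the function name `discriminatory_index` and the example type labels are hypothetical and are not data from the study.

```python
from collections import Counter


def discriminatory_index(type_assignments):
    """Simpson's index of diversity (Hunter-Gaston form) for a typing result.

    `type_assignments` holds one type label per isolate. The value is the
    probability that two isolates drawn without replacement have different types.
    """
    n = len(type_assignments)
    if n < 2:
        raise ValueError("at least two isolates are required")
    counts = Counter(type_assignments)
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1 - same_type_pairs / (n * (n - 1))


# Hypothetical example: five isolates, two of which share a type.
example_types = ["ST1", "ST1", "ST2", "ST3", "ST4"]
print(discriminatory_index(example_types))
```

A value close to 1 means that two randomly chosen isolates are very likely to be assigned different types.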