ABSTRACT: With ever-increasing numbers of microbial genomes being sequenced, efficient tools are needed to perform strain-level identification of any newly sequenced genome. Here, we present the SNP identification for strain typing (SNIT) pipeline, a fast and accurate software system that compares a newly sequenced bacterial genome with other genomes of the same species to identify single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). Based on this information, the pipeline analyzes the polymorphic loci present in all input genomes to identify the genome that has the fewest differences with the newly sequenced genome. Similarly, for each of the other genomes, SNIT identifies the input genome with the fewest differences. Results from five bacterial species show that the SNIT pipeline identifies the correct closest neighbor with 75% to 100% accuracy. The SNIT pipeline is available for download at http://www.bhsai.org/snit.html.
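The closest-neighbor step described above can be illustrated with a minimal sketch. It is not the SNIT implementation: it assumes the SNP/indel calls have already been reduced to a hypothetical mapping from genome name to `{locus: allele}`, restricts the comparison to loci present in all input genomes, and picks the genome with the fewest allele differences. The names `common_loci`, `count_differences`, and `closest_neighbor`, and the toy data, are illustrative assumptions.

```python
from typing import Dict, Optional, Set

Calls = Dict[int, str]  # hypothetical format: locus position -> allele ("-" marks a deletion)


def common_loci(alleles: Dict[str, Calls]) -> Set[int]:
    """Return the polymorphic loci that are present in every input genome."""
    loci_sets = [set(calls) for calls in alleles.values()]
    return set.intersection(*loci_sets) if loci_sets else set()


def count_differences(a: Calls, b: Calls, loci: Set[int]) -> int:
    """Count the loci at which two genomes carry different alleles."""
    return sum(1 for locus in loci if a[locus] != b[locus])


def closest_neighbor(query: str, alleles: Dict[str, Calls]) -> Optional[str]:
    """Return the genome with the fewest differences from `query` (ties broken by name)."""
    loci = common_loci(alleles)
    others = [g for g in alleles if g != query]
    if not others:
        return None
    return min(
        others,
        key=lambda g: (count_differences(alleles[query], alleles[g], loci), g),
    )


# Toy data (assumed): one newly sequenced genome and two reference genomes.
alleles = {
    "new": {101: "A", 250: "G", 900: "T", 1500: "-"},
    "refA": {101: "A", 250: "G", 900: "C", 1500: "-"},
    "refB": {101: "C", 250: "T", 900: "C", 1500: "A"},
}

# Closest neighbor for the new genome and, similarly, for each other genome.
neighbors = {g: closest_neighbor(g, alleles) for g in alleles}
print(neighbors)
```

In this toy input, `new` differs from `refA` at one locus and from `refB` at four, so `refA` is reported as its closest neighbor.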
Project description:BACKGROUND: With multiple strains of various pathogens being sequenced, it is necessary to develop high-throughput methods that can simultaneously process multiple bacterial or viral genomes to find common fingerprints as well as fingerprints that are unique to each individual genome. We present algorithmic enhancements to an existing single-genome pipeline that allow for efficient design of microarray probes common to groups of target genomes. The enhanced pipeline takes advantage of the similarities in the input genomes to narrow the search to short, nonredundant regions of the target genomes and thereby significantly reduces the computation time. The pipeline also computes a three-state hybridization matrix, which gives the expected hybridization of each probe with each target. RESULTS: Design of microarray probes for eight pathogenic Burkholderia genomes shows that the multiple-genome pipeline is nearly four times faster than the single-genome pipeline for this application. The probes designed for these eight genomes were experimentally tested with one non-target and three target genomes. Hybridization experiments show that less than 10% of the designed probes cross-hybridize with non-targets. Also, more than 65% of the probes designed to identify all Burkholderia mallei and B. pseudomallei strains successfully hybridize with a B. pseudomallei strain not used for probe design. CONCLUSION: The savings in runtime suggest that the enhanced pipeline can be used to design fingerprints for tens or even hundreds of related genomes in a single run. Hybridization results with an unsequenced B. pseudomallei strain indicate that the designed probes might be useful in identifying unsequenced strains of B. mallei and B. pseudomallei.
Project description:BACKGROUND: Despite the large volume of genome sequencing data produced by next-generation sequencing technologies and the highly sophisticated software dedicated to handling these types of data, gaps are commonly found in draft genome assemblies. The existence of gaps compromises our ability to take full advantage of the genome data. This study aims to identify a practical approach for biologists to complete their own genome assemblies using commonly available tools and resources. RESULTS: A pipeline was developed to assemble complete genomes primarily from next-generation sequencing (NGS) data. The input of the pipeline is paired-end Illumina sequence reads, and the output is a high-quality complete genome sequence. The pipeline alternates between computational and biological methods in seven steps. It combines the strengths of de novo assembly, reference-based assembly, customized programming, use of public databases, and wet lab experimentation. The application of the pipeline is demonstrated by the completion of a bacterial genome, Thermotoga sp. strain RQ7, a hydrogen-producing strain. CONCLUSIONS: The developed pipeline provides an example of effective integration of computational and biological principles. It highlights the complementary roles that in silico and wet lab methodologies play in bioinformatics studies. The underlying principles and methods are applicable to similar studies on both prokaryotic and eukaryotic genomes.
Project description:BACKGROUND: Many computational methods are available for assembly and annotation of newly sequenced microbial genomes. However, when new genomes are reported in the literature, there is frequently very little critical analysis of choices made during the sequence assembly and gene annotation stages. These choices have a direct impact on the biologically relevant products of a genomic analysis--for instance identification of common and differentiating regions among genomes in a comparison, or identification of enriched gene functional categories in a specific strain. Here, we examine the outcomes of different assembly and analysis steps in typical workflows in a comparison among strains of Vibrio vulnificus. RESULTS: Using six recently sequenced strains of V. vulnificus, we demonstrate the "alternate realities" of comparative genomics, and how they depend on the choice of a robust assembly method and accurate ab initio annotation. We apply several popular assemblers for paired-end Illumina data, and three well-regarded ab initio genefinders. We demonstrate significant differences in detected gene overlap among comparative genomics workflows that depend on these two steps. The divergence between workflows, even those using widely adopted methods, is obvious both at the single genome level and when a comparison is performed. In a typical example where multiple workflows are applied to the strain V. vulnificus CECT 4606, a workflow that uses the Velvet assembler and Glimmer gene finder identifies 3275 gene features, while a workflow that uses the Velvet assembler and the RAST annotation system identifies 5011 gene features. Only 3171 genes are identical between both workflows. When we examine 9 assembly/annotation workflow scenarios as input to a three-way genome comparison, differentiating genes and even differentially represented functional categories change significantly from scenario to scenario. 
CONCLUSIONS: Inconsistencies in genomic analysis can arise depending on the choices that are made during the assembly and annotation stages. These inconsistencies can have a significant impact on the interpretation of an individual genome's content. The impact is multiplied when comparison of content and function among multiple genomes is the goal. Tracking the analysis history of the data--its analytic provenance--is critical for reproducible analysis of genome data.
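The gene counts quoted for V. vulnificus CECT 4606 (3275 features from Velvet + Glimmer, 5011 from Velvet + RAST, 3171 identical) can be turned into overlap fractions with a few lines of arithmetic. The sketch below treats the identical genes as the intersection of two sets, which is an assumption about how the counts relate; the function name `overlap_summary` is hypothetical.

```python
def overlap_summary(n_a: int, n_b: int, n_shared: int) -> dict:
    """Summarize agreement between two gene sets from their sizes and shared count."""
    union = n_a + n_b - n_shared  # inclusion-exclusion, assuming set semantics
    return {
        "shared_fraction_a": n_shared / n_a,  # share of workflow A's genes also found by B
        "shared_fraction_b": n_shared / n_b,  # share of workflow B's genes also found by A
        "jaccard": n_shared / union,          # overall similarity of the two gene sets
    }


# Counts taken from the abstract: Velvet + Glimmer, Velvet + RAST, identical genes.
summary = overlap_summary(3275, 5011, 3171)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under this reading, roughly 97% of the Glimmer-based gene set is reproduced by the RAST-based workflow, but only about 63% of the RAST-based set is reproduced by the Glimmer-based one.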
Project description:BACKGROUND: Pathogen diagnostic assays based on polymerase chain reaction (PCR) technology provide high sensitivity and specificity. However, the design of these diagnostic assays is computationally intensive, requiring high-throughput methods to identify unique PCR signatures in the presence of an ever increasing availability of sequenced genomes. RESULTS: We present the Tool for PCR Signature Identification (TOPSI), a high-performance computing pipeline for the design of PCR-based pathogen diagnostic assays. The TOPSI pipeline efficiently designs PCR signatures common to multiple bacterial genomes by obtaining the shared regions through pairwise alignments between the input genomes. TOPSI successfully designed PCR signatures common to 18 Staphylococcus aureus genomes in less than 14 hours using 98 cores on a high-performance computing system. CONCLUSIONS: TOPSI is a computationally efficient, fully integrated tool for high-throughput design of PCR signatures common to multiple bacterial genomes. TOPSI is freely available for download at http://www.bhsai.org/downloads/topsi.tar.gz.
Project description:BACKGROUND: As more and more genomes are sequenced, genome annotation becomes increasingly important in bridging the gap between sequence and biology. Gene prediction, which is at the center of genome annotation, usually integrates various resources to compute consensus gene structures. However, many newly sequenced genomes have limited resources for gene predictions. In an effort to create high-quality gene models of the cucumber genome (Cucumis sativus var. sativus), based on the EVidenceModeler gene prediction pipeline, we incorporated the massively parallel complementary DNA sequencing (RNA-Seq) reads of 10 cucumber tissues into EVidenceModeler. We applied the new pipeline to the reassembled cucumber genome and included a comparison between our predicted protein-coding gene sets and a published set. RESULTS: The reassembled cucumber genome, annotated with RNA-Seq reads from 10 tissues, has 23,248 identified protein-coding genes. Compared with the published prediction in 2009, approximately 8,700 genes reveal structural modifications and 5,285 genes only appear in the reassembled cucumber genome. All the related results, including genome sequence and annotations, are available at http://cmb.bnu.edu.cn/Cucumis_sativus_v20/. CONCLUSIONS: We conclude that RNA-Seq greatly improves the accuracy of prediction of protein-coding genes in the reassembled cucumber genome. The comparison between the two gene sets also suggests that it is feasible to use RNA-Seq reads to annotate newly sequenced or less-studied genomes.
Project description:BACKGROUND: Prophages play a significant role in prokaryotic evolution, often altering the function of the cell that they infect via transfer of new genes (e.g., virulence or antibiotic resistance factors), inactivation of existing genes, or modification of gene expression. Recently, phage therapy has attracted renewed interest as a promising alternative for controlling bacterial infections. Cataloging the repertoire of prophages in large collections of species' genomes is an important initial step in understanding their evolution and potential therapeutic utility. However, current widely used tools for identifying prophages within bacterial genome sequences are mainly web-based, can have long response times, and do not scale to keep pace with the many thousands of genomes currently being sequenced routinely. METHODOLOGY: In this work, we present ProphET, an easy-to-install prophage predictor for the Linux operating system, without the constraints associated with a web-based tool. ProphET predictions rely on similarity searches against a database of prophage genes, taking as input a bacterial genome sequence in FASTA format and its corresponding gene annotation in GFF. ProphET identifies prophages in three steps: similarity search, calculation of the density of prophage genes, and edge refinement. ProphET performance was evaluated and compared with other phage predictors based on a set of 54 bacterial genomes containing 267 manually annotated prophages. FINDINGS AND CONCLUSIONS: ProphET identifies prophages in bacterial genomes with high precision and offers a fast, highly scalable alternative to widely used web-based applications for prophage detection.
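The second and third ProphET steps (prophage-gene density and edge refinement) can be sketched as follows. This is an illustration, not ProphET's code: it assumes the similarity search has already produced one boolean per gene, in genome order, marking hits against the prophage gene database. The window size, the minimum number of hits per window, and the function name `predict_regions` are all hypothetical choices.

```python
from typing import List, Tuple


def predict_regions(
    hits: List[bool], window: int = 5, min_hits: int = 3
) -> List[Tuple[int, int]]:
    """Return (first_gene, last_gene) index pairs for putative prophage regions.

    hits[i] is True when gene i matched a prophage gene in the similarity search.
    """
    n = len(hits)
    flagged = [False] * n

    # Density step: flag every gene that lies in a window dense in prophage hits.
    for start in range(max(n - window + 1, 0)):
        if sum(hits[start:start + window]) >= min_hits:
            for i in range(start, start + window):
                flagged[i] = True

    regions = []
    i = 0
    while i < n:
        if not flagged[i]:
            i += 1
            continue
        # Merge consecutive flagged genes into one candidate region.
        j = i
        while j + 1 < n and flagged[j + 1]:
            j += 1
        # Edge refinement (simplified): trim non-hit genes from both ends.
        lo, hi = i, j
        while lo <= hi and not hits[lo]:
            lo += 1
        while hi >= lo and not hits[hi]:
            hi -= 1
        if lo <= hi:
            regions.append((lo, hi))
        i = j + 1
    return regions


# Toy input (assumed): 12 genes, 4 of which hit the prophage database.
hits = [False, False, False, True, True, False, True, True, False, False, False, False]
print(predict_regions(hits))
```

Expressing the density threshold as an integer hit count per window avoids floating-point comparisons; a real tool would likely work in base-pair coordinates taken from the GFF annotation rather than gene indices.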
Project description:Pseudomonas aeruginosa is an important opportunistic pathogen responsible for many infections in hospitalized and immunocompromised patients. Previous reports estimated that approximately 10% of its 6.6 Mbp genome varies from strain to strain and is therefore referred to as the "accessory genome". Elements within the accessory genome of P. aeruginosa have been associated with differences in virulence and antibiotic resistance. As whole genome sequencing of bacterial strains becomes more widespread and cost-effective, methods to quickly and reliably identify accessory genomic elements in newly sequenced P. aeruginosa genomes will be needed. We developed a bioinformatic method for identifying the accessory genome of P. aeruginosa. First, the core genome was determined based on sequence conserved among the completed genomes of twelve reference strains using Spine, a software program developed for this purpose. The core genome was 5.84 Mbp in size and contained 5,316 coding sequences. We then developed an in silico genome subtraction program named AGEnt to filter out core genomic sequences from P. aeruginosa whole genomes to identify accessory genomic sequences of these reference strains. This analysis determined that the accessory genome of P. aeruginosa ranged from 6.9-18.0% of the total genome, was enriched for genes associated with mobile elements, and consisted mostly of genes with unknown or unclear function. Using these genomes, we showed that AGEnt performed well compared to other publicly available programs designed to detect accessory genomic elements. We then demonstrated the utility of the AGEnt program by applying it to the draft genomes of two previously unsequenced P. aeruginosa strains, PA99 and PA103. The P. aeruginosa genome is rich in accessory genetic material. The AGEnt program accurately identified the accessory genomes of newly sequenced P. aeruginosa strains, even when draft genomes were used. As P. aeruginosa genomes become available at an increasingly rapid pace, this program will be useful in cataloging the expanding accessory genome of this bacterium and in discerning correlations between phenotype and accessory genome makeup. The combination of Spine and AGEnt should be useful in defining the accessory genomes of other bacterial species as well.
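The in silico subtraction idea described above can be illustrated with a small interval sketch. It is not the AGEnt implementation: it assumes an earlier alignment step has already produced the coordinates of core-genome matches on the query genome as half-open `(start, end)` intervals, and simply reports what is left. The function name `subtract_core`, the `min_size` filter, and the toy coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on the query genome


def subtract_core(
    genome_length: int, core: List[Interval], min_size: int = 1
) -> List[Interval]:
    """Return the regions of the genome not covered by any core interval."""
    accessory: List[Interval] = []
    cursor = 0  # first position not yet known to be covered by core sequence
    for start, end in sorted(core):
        if start > cursor:
            accessory.append((cursor, start))  # gap between core matches
        cursor = max(cursor, end)  # handles overlapping or nested core intervals
    if cursor < genome_length:
        accessory.append((cursor, genome_length))
    # Drop very short leftovers, which are unlikely to be meaningful elements.
    return [(s, e) for s, e in accessory if e - s >= min_size]


# Toy input (assumed): a 1,000 bp genome with three core matches, two overlapping.
core_hits = [(0, 300), (250, 600), (700, 900)]
accessory = subtract_core(1000, core_hits)
accessory_fraction = sum(e - s for s, e in accessory) / 1000
print(accessory, accessory_fraction)
```

For a draft genome, the same subtraction would be applied per contig, with `genome_length` replaced by each contig's length.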
Project description:Analysis of microbial epidemics has been revolutionized by whole-genome sequencing. We recently sequenced the genomes of 601 type emm59 Group A Streptococcus (GAS) organisms responsible for an ongoing epidemic of invasive infections in Canada and some of the United States. The epidemic has been caused by the emergence of a genetically distinct, hypervirulent clone that has genetically diversified. The ease of obtaining genomic data contrasts with the relatively difficult task of translating them into insightful epidemiological information. Here, we sequenced the genomes of 90 additional invasive Canadian emm59 GAS organisms, including 80 isolated recently in 2010-2011. We used an improved bioinformatics pipeline designed to rapidly process and analyze whole-genome data and integrate strain metadata. We discovered that emm59 GAS organisms are undergoing continued multiclonal evolutionary expansion. Previously identified geographic patterns of strain dissemination are being diluted as mixing of subclones over time and space occurs. Our integrated data analysis strategy permits prompt and accurate mapping of the dissemination of bacterial organisms in an epidemic wave, permitting rapid generation of hypotheses that inform public health and virulence studies.
Project description:BACKGROUND: Orthologs are genes derived from the same ancestor gene loci after speciation events. Orthologous proteins usually have similar sequences and perform comparable biological functions. Therefore, ortholog identification is useful in annotations of newly sequenced genomes. With the rapidly increasing number of sequenced genomes, constructing or updating ortholog relationships between all genomes requires considerable effort and computation time. In addition, elucidating ortholog relationships between distantly related genomes is challenging because of the lower sequence similarity. Therefore, an efficient ortholog detection method that can deal with a large number of distantly related genomes is desired. RESULTS: In this study, an efficient ortholog detection pipeline, DODO (DOmain based Detection of Orthologs), was created on the basis of domain architectures. Supported by domain composition, which is usually directly related to protein function, DODO can facilitate ortholog detection across distantly related genomes. DODO works in two main steps. Starting from domain information, it first assigns protein groups according to their domain architectures and then identifies orthologs within those groups with much reduced complexity. Here, DODO is shown to detect orthologs between two genomes in a considerably shorter period of time than traditional reciprocal best hit methods, and the time savings are more pronounced when a large number of genomes are analyzed. The output results of DODO are highly comparable with those of other known ortholog databases. CONCLUSIONS: DODO provides a new efficient pipeline for detection of orthologs in a large number of genomes. In addition, a database established with DODO is also easier to maintain and could be updated relatively effortlessly. The DODO pipeline can be downloaded from http://220.127.116.11:16080/dodo_web/home.htm.
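The first DODO step, grouping proteins by domain architecture so that ortholog searches run only within a group, can be sketched as below. This is an illustration under assumptions, not DODO's code: each protein is assumed to come with an ordered tuple of domain labels, the labels `D1` to `D4` are placeholders, and `group_by_architecture` and `candidate_pairs` are hypothetical names. The within-group ortholog identification itself is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Architecture = Tuple[str, ...]  # ordered domain labels of one protein


def group_by_architecture(proteins: Dict[str, Architecture]) -> Dict[Architecture, List[str]]:
    """Group protein identifiers by their exact domain architecture."""
    groups: Dict[Architecture, List[str]] = defaultdict(list)
    for protein_id, architecture in proteins.items():
        groups[architecture].append(protein_id)
    return dict(groups)


def candidate_pairs(
    genome_a: Dict[str, Architecture], genome_b: Dict[str, Architecture]
) -> List[Tuple[str, str]]:
    """List cross-genome protein pairs that share a domain architecture."""
    groups_a = group_by_architecture(genome_a)
    groups_b = group_by_architecture(genome_b)
    pairs = []
    # Only architectures seen in both genomes can yield candidate orthologs.
    for architecture in groups_a.keys() & groups_b.keys():
        for a in groups_a[architecture]:
            for b in groups_b[architecture]:
                pairs.append((a, b))
    return sorted(pairs)


# Toy input (assumed): placeholder domain labels, three and four proteins per genome.
genome_a = {"a1": ("D1",), "a2": ("D1", "D2"), "a3": ("D3",)}
genome_b = {"b1": ("D1",), "b2": ("D3",), "b3": ("D3",), "b4": ("D4",)}

pairs = candidate_pairs(genome_a, genome_b)
print(pairs, f"{len(pairs)} of {len(genome_a) * len(genome_b)} all-vs-all pairs")
```

In the toy case only 3 of the 12 possible cross-genome pairs remain to be compared, which is the kind of complexity reduction the abstract attributes to grouping by domain architecture.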
Project description:The Mycobacterium tuberculosis complex causes tuberculosis (TB) in humans and other animal species, but Mycobacterium tuberculosis has a distinct host preference for humans. The present study aimed to determine whether a bovine M. tb strain, 1458, has evolved genetic properties in its genome that might be associated with its bovine adaptation. The genome of M. tb strain 1458 was sequenced and subjected to an extensive comparative genomic analysis. A phylogenetic analysis showed that strain 1458 is most closely related to a Chinese M. tb strain, CCDC5079, of the same Beijing family. Compared with three human M. tb Beijing family strains, strain 1458 has the fewest unique genes. However, the strain 1458 genome contains the most IS6110 insertion sequences (21), at either intragenic or intergenic sites, resulting in the interruption of 11 genes including three PPE family-encoding genes (PPE16, PPE38, and PPE59). Only the strain 1458 genome has an upstream insertion in the esxS and phoP genes. PCR confirmed four upstream insertions, and qPCR determined that transcription of the esxS, phoP, dnaN, and ctpD genes differed significantly between M. tb strain 1458 and H37Rv or M. bovis. A Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis revealed that the genes affected by non-synonymous SNPs are enriched in RNA polymerase. Moreover, 127 of the 133 unique SNPs in strain 1458 are different from those in the M. bovis genome. In conclusion, some critical genes responsible for bacterial virulence and immunogenicity were interrupted in the genome of bovine M. tb strain 1458 by IS insertions and non-synonymous SNPs, which might contribute to its bovine adaptation and to the modification of its virulence and immunogenicity in cattle.