ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data.
ABSTRACT: Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However, IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, directly from paired-end short read data. ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 97% of known IS insertions in the analysis of simulated reads, and 98% in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was able to correctly detect insertions for average genome-wide read depths >20x, although read depths >50x were required to obtain confident calls that were highly supported by evidence from reads. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins.
The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots. ISMapper provides a rapid and robust method for identifying IS insertion sites directly from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.
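The depth thresholds reported above (>20x to detect insertions, >50x for confident calls) can be turned into a quick planning calculation. The following is a minimal sketch and not part of ISMapper: the function names and the category labels are hypothetical, and mean depth is estimated with the usual approximation of total sequenced bases divided by genome size.

```python
def mean_depth(n_read_pairs: int, read_length: int, genome_size: int) -> float:
    """Approximate genome-wide depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return 2 * n_read_pairs * read_length / genome_size


def subsample_fraction(current_depth: float, target_depth: float) -> float:
    """Fraction of read pairs to keep so mean depth drops to target_depth."""
    if current_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return min(1.0, target_depth / current_depth)


def depth_category(depth: float) -> str:
    """Hypothetical labels based on the >20x and >50x thresholds in the abstract."""
    if depth > 50:
        return "confident calls expected"
    if depth > 20:
        return "insertions detectable"
    return "below reported detection depth"


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 pairs of 100 bp reads, 4 Mb genome.
    depth = mean_depth(1_000_000, 100, 4_000_000)
    print(depth, depth_category(depth))
    print(subsample_fraction(depth, 20))
```

Because the thresholds are strict inequalities, a data set sitting at exactly 50x falls into the lower category in this sketch.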
Project description:BACKGROUND: Short-read sequencing technologies have made microbial genome sequencing cheap and accessible. However, closing genomes is often costly and assembling short reads from genomes that are repetitive and/or have extreme %GC content remains challenging. Long-read, single-molecule sequencing technologies such as the Oxford Nanopore MinION have the potential to overcome these difficulties, although the best approach for harnessing their potential remains poorly evaluated. RESULTS: We sequenced nine bacterial genomes spanning a wide range of GC contents using Illumina MiSeq and Oxford Nanopore MinION sequencing technologies to determine the advantages of each approach, both individually and combined. Assemblies using only MiSeq reads were highly accurate but lacked contiguity, a deficiency that was partially overcome by adding MinION reads to these assemblies. Even more contiguous genome assemblies were generated by using MinION reads for initial assembly, but these assemblies were more error-prone and required further polishing. This was especially pronounced when Illumina libraries were biased, as was the case for our strains with both high and low GC content. Increased genome contiguity dramatically improved the annotation of insertion sequences and secondary metabolite biosynthetic gene clusters, likely because long reads can disambiguate these highly repetitive but biologically important genomic regions. CONCLUSIONS: Genome assembly using short reads is challenged by repetitive sequences and extreme GC contents. Our results indicate that these difficulties can be largely overcome by using single-molecule, long-read sequencing technologies such as the Oxford Nanopore MinION.
Using MinION reads for assembly followed by polishing with Illumina reads generated the most contiguous genomes with sufficient accuracy to enable the accurate annotation of important but difficult to sequence genomic features such as insertion sequences and secondary metabolite biosynthetic gene clusters. The combination of Oxford Nanopore and Illumina sequencing can therefore cost-effectively advance studies of microbial evolution and genome-driven drug discovery.
Project description:Short-read sequencing can provide detection of multiple genomic determinants of antimicrobial resistance from single bacterial genomes and metagenomic samples. Despite its increasing application in human, animal, and environmental microbiology, including human clinical trials, the performance of short-read Illumina sequencing for antimicrobial resistance gene (ARG) detection, including resistance-conferring single nucleotide polymorphisms (SNPs), has not been systematically characterized. Using paired-end 2 × 150 bp (base pair) Illumina sequencing and an assembly-based method for ARG prediction, we determined sensitivity, positive predictive value (PPV), and sequencing depths required for ARG detection in an Escherichia coli isolate of sequence type (ST) 38 spiked into a synthetic microbial community at varying abundances. Approximately 300,000 reads or 15× genome coverage was sufficient to detect ARGs in E. coli ST38, with comparable sensitivity and PPV to ~100× genome coverage. Using metagenome assembly of mixed microbial communities, ARG detection at E. coli relative abundances of 1% would require assembly of approximately 30 million reads to achieve 15× target coverage. The minimum sequencing depths were validated using public data sets of 948 E. coli genomes and 10 metagenomic rectal swab samples. A read-based approach using k-mer alignment (KMA) for ARG prediction did not substantially improve minimum sequencing depths for ARG detection compared to assembly of the E. coli ST38 genome or the combined metagenomic samples. Analysis of sequencing depths from recent studies assessing ARG content in metagenomic samples demonstrated that sequencing depths had a median estimated detection frequency of 84% (interquartile range: 30%-92%) for a relative abundance of 1%.
IMPORTANCE: Systematically determining Illumina sequencing performance characteristics for detection of ARGs in metagenomic samples is essential to inform study design and appraisal of human, animal, and environmental metagenomic antimicrobial resistance studies. In this study, we quantified the performance characteristics of ARG detection in E. coli genomes and metagenomes and established a benchmark of ~15× coverage for ARG detection for E. coli in metagenomes. We demonstrate that for low relative abundances, sequencing depths of ~30 million reads or more may be required for adequate sensitivity for many applications.
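The step from about 300,000 reads for an isolate to about 30 million reads at 1% relative abundance is a simple proportional scaling. The sketch below illustrates it, assuming that reads are drawn from each organism in proportion to its relative abundance; the function name is hypothetical and this is not the authors' code.

```python
import math


def metagenome_reads_needed(reads_for_isolate: int, relative_abundance: float) -> int:
    """Total metagenomic reads needed so the target organism alone receives
    `reads_for_isolate` reads, assuming reads are sampled in proportion to
    relative abundance (a simplifying assumption)."""
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    return math.ceil(reads_for_isolate / relative_abundance)


if __name__ == "__main__":
    # ~300,000 reads gave ~15x coverage of the E. coli ST38 isolate (from the abstract).
    for abundance in (1.0, 0.10, 0.01):
        total = metagenome_reads_needed(300_000, abundance)
        print(f"relative abundance {abundance:.0%}: ~{total:,} reads")
```

Under this assumption the required depth grows inversely with abundance, so halving the relative abundance doubles the number of reads needed.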
Project description:Our knowledge of the diversity and frequency of genomic structural variation segregating in populations of large double-stranded (ds) DNA viruses is limited. Here, we sequenced the genome of a baculovirus (Autographa californica multiple nucleopolyhedrovirus [AcMNPV]) purified from beet armyworm (Spodoptera exigua) larvae at depths >195,000× using both short- (Illumina) and long-read (PacBio) technologies. Using a pipeline relying on hierarchical clustering of structural variants (SVs) detected in individual short and long reads by six variant callers, we identified a total of 1,141 SVs in AcMNPV, including 464 deletions, 443 inversions, 160 duplications, and 74 insertions. These variants are considered robust and unlikely to result from technical artifacts because they were independently detected in at least three long reads as well as at least three short reads. SVs are distributed along the entire AcMNPV genome and may involve large genomic regions (30,496 bp on average). We show that no less than 39.9 per cent of genomes carry at least one SV in AcMNPV populations, that the vast majority of SVs (75%) segregate at very low frequency (<0.01%) and that very few SVs persist after ten replication cycles, consistent with a negative impact of most SVs on AcMNPV fitness. Using short-read sequencing datasets, we then show that populations of two iridoviruses and one herpesvirus are also full of SVs, as they contain between 426 and 1,102 SVs carried by 52.4-80.1 per cent of genomes. Finally, AcMNPV long reads allowed us to identify 1,757 transposable element (TE) insertions, 895 of which are truncated and occur at one extremity of the reads. This further supports the role of baculoviruses as possible vectors of horizontal transfer of TEs. Altogether, we found that SVs, which evolve mostly under rapid dynamics of gain and loss in viral populations, represent an important feature in the biology of large dsDNA viruses.
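The robustness criterion described above (support from at least three long reads and at least three short reads) amounts to a two-sided filter on clustered SV calls. A minimal sketch follows; the `SVCluster` record and its field names are hypothetical and do not reflect the pipeline's actual data model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SVCluster:
    """Hypothetical record for one clustered structural variant."""
    sv_type: str             # e.g. "deletion", "inversion"
    long_read_support: int   # number of long reads carrying the SV
    short_read_support: int  # number of short reads carrying the SV


def robust_svs(clusters: Iterable[SVCluster],
               min_long: int = 3, min_short: int = 3) -> List[SVCluster]:
    """Keep SVs supported independently by both sequencing technologies."""
    return [c for c in clusters
            if c.long_read_support >= min_long
            and c.short_read_support >= min_short]


if __name__ == "__main__":
    calls = [
        SVCluster("deletion", 5, 4),      # kept
        SVCluster("inversion", 3, 2),     # dropped: too few short reads
        SVCluster("duplication", 1, 40),  # dropped: too few long reads
    ]
    print([c.sv_type for c in robust_svs(calls)])
```

Requiring both technologies to agree trades some sensitivity for protection against artifacts that are specific to one platform or one caller.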
Project description:MOTIVATION: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants. RESULTS: We propose here an original method, called MindTheGap, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads. MindTheGap showed high recall and precision on simulated datasets of various genome complexities. When applied to real Caenorhabditis elegans and human NA12878 datasets, MindTheGap detected and correctly assembled insertions >1 kb, using at most 14 GB of memory.
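To illustrate the idea of k-mer-based insertion site detection, the toy sketch below scans a reference for runs of k-mers that never occur in the donor reads. It assumes a clean homozygous insertion, error-free reads covering the whole donor, and no repeated k-mers around the junction; under those assumptions exactly k-1 consecutive reference k-mers span the insertion point and go missing. This is an illustration of the principle only, not MindTheGap's implementation, and all names are hypothetical.

```python
from typing import Iterable, List, Set


def kmer_set(seq: str, k: int) -> Set[str]:
    """All k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_insertion_sites(reference: str, reads: Iterable[str], k: int) -> List[int]:
    """Return 0-based positions p such that an insertion is suspected
    between reference[p-1] and reference[p]."""
    donor: Set[str] = set()
    for read in reads:
        donor |= kmer_set(read, k)

    # present[i] is True if the reference k-mer starting at i occurs in the donor.
    present = [reference[i:i + k] in donor
               for i in range(len(reference) - k + 1)]

    sites = []
    i = 0
    while i < len(present):
        if present[i]:
            i += 1
            continue
        j = i
        while j < len(present) and not present[j]:
            j += 1
        # A run of exactly k-1 absent k-mers flanked by present k-mers
        # is the signature of a clean insertion at position j.
        if j - i == k - 1 and i > 0 and j < len(present):
            sites.append(j)
        i = j
    return sites


if __name__ == "__main__":
    ref = "ACGTTGCAAGCTTAGGCTCA"
    donor_genome = ref[:10] + "TTTTTTTT" + ref[10:]
    # Simulate error-free overlapping reads of length 12.
    reads = [donor_genome[i:i + 12] for i in range(len(donor_genome) - 11)]
    print(candidate_insertion_sites(ref, reads, k=5))
```

Real data need more care: sequencing errors call for an abundance threshold on donor k-mers, and heterozygous insertions or repeats at the breakpoint change the gap pattern.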
Project description:Illumina sequencing has allowed for population-level surveys of transposable element (TE) polymorphism via split alignment approaches, which has provided important insight into the population dynamics of TEs. However, such approaches are not able to identify insertions of uncharacterized TEs, nor can they assemble the full sequence of inserted elements. Here, we use nanopore sequencing and Hi-C scaffolding to produce de novo genome assemblies for two wild strains of Drosophila melanogaster from the Drosophila Genetic Reference Panel (DGRP). Ovarian piRNA populations and Illumina split-read TE insertion profiles have been previously produced for both strains. We find that nanopore sequencing with Hi-C scaffolding produces highly contiguous, chromosome-length scaffolds, and we identify hundreds of TE insertions that were missed by Illumina-based methods, including a novel micropia-like element that has recently invaded the DGRP population. We also find hundreds of piRNA-producing loci that are specific to each strain. Some of these loci are created by strain-specific TE insertions, while others appear to be epigenetically controlled. Our results suggest that Illumina approaches reveal only a portion of the repetitive sequence landscape of eukaryotic genomes and that population-level resequencing using long reads is likely to provide novel insight into the evolutionary dynamics of repetitive elements.
Project description:OBJECTIVES: To investigate the genomic context of a novel resistance island (RI) in multiply antibiotic-resistant Acinetobacter baumannii clinical isolates and global isolates. METHODS: Using a combination of long and short reads generated from the Oxford Nanopore and Illumina platforms, contiguous chromosomes and plasmid sequences were determined. BLAST-based analysis was used to identify the RI insertion target. RESULTS: Genomes of four multiply antibiotic-resistant A. baumannii clinical strains, from a US hospital system, belonging to prevalent MLST ST2 (Pasteur scheme) and ST281 (Oxford scheme) clade F isolates were sequenced to completion. A class 1 integron carrying aadB (tobramycin resistance) and aadA2 (streptomycin/spectinomycin resistance) was identified. The class 1 integron was 6.8 kb, bounded by IS26 at both ends, and embedded in a new target location between an α/β-hydrolase and a reductase. Due to its novel insertion site and unique RI composition, we suggest naming this novel RI AbGRI4. Molecular analysis of global A. baumannii isolates identified multiple AbGRI4 RI variants in non-ST2 clonal lineages, including variations in the resistance gene cassettes, integron backbone and insertion breakpoints at the hydrolase gene. CONCLUSIONS: A novel RI insertion target harbouring a class 1 integron was identified in a subgroup of ST2/ST281 clinical isolates. Variants of the RI suggested evolution and horizontal transfer of the RI across clonal lineages. Long- and short-read hybrid assembly technology completely resolved the genomic context of IS-bounded RIs, which was not possible using short reads alone.
Project description:Alu insertions have contributed to >11% of the human genome and ~30-35 Alu subfamilies remain actively mobile, yet the characterization of polymorphic Alu insertions from short-read data remains a challenge. We build on existing computational methods to combine Alu detection and de novo assembly of WGS data as a means to reconstruct the full sequence of insertion events from Illumina paired-end reads. Comparison with published calls obtained using PacBio long reads indicates a false discovery rate below 5%, at the cost of reduced sensitivity due to the colocation of reference and non-reference repeats. We generate a highly accurate call set of 1614 completely assembled Alu variants from 53 samples from the Human Genome Diversity Project (HGDP) panel. We utilize the reconstructed alternative insertion haplotypes to genotype 1010 fully assembled insertions, obtaining >99% agreement with genotypes obtained by PCR. In our assembled sequences, we find evidence of premature insertion mechanisms and observe 5' truncation in 16% of AluYa5 and AluYb8 insertions. The sites of truncation coincide with stem-loop structures and SRP9/14 binding sites in the Alu RNA, implicating L1 ORF2p pausing in the generation of 5' truncations. Additionally, we identified variable AluJ and AluS elements that likely arose due to non-retrotransposition mechanisms.
Project description:Transposon insertion site sequencing (TIS) is a powerful method for associating genotype to phenotype. However, all TIS methods described to date use short nucleotide sequence reads, which cannot uniquely determine the locations of transposon insertions within repeating genomic sequences where the repeat units are longer than the sequence read length. To overcome this limitation, we have developed a TIS method using Oxford Nanopore sequencing technology that generates and uses long nucleotide sequence reads; we have called this method LoRTIS (Long Read Transposon Insertion-site Sequencing). This dataset contains sequence files generated using the Nanopore and Illumina platforms. Biotin1308.fastq.gz and Biotin2508.fastq.gz are FASTQ files generated using Nanopore technology. Rep1-Tn.fastq.gz and Rep1-Tn.fastq.gz are FASTQ files generated using the Illumina platform. In this study, we compared the efficiency of the two methods in identifying transposon insertion sites.
Project description:Rapid cycle breeding uses transgenic early flowering plants as crossbreed parents to facilitate the shortening of breeding programs for perennial crops with long-lasting juvenility. Rapid cycle breeding in apple was established using the transgenic genotype T1190 expressing the BpMADS4 gene of silver birch. In this study, the genomes of T1190 and its non-transgenic wild-type PinS (F1-offspring of 'Pinova' and 'Idared') were sequenced by Illumina short-read sequencing in two separate experiments resulting in a mean sequencing depth of 182× for T1190 and 167× for PinS. The sequencing revealed 8,450 reads that contain sequences of ≥20 bp identical to the plant transformation vector. These reads were assembled into 125 contigs, which were examined using a five-step procedure to determine whether or not they contained transgenic insertions. The sequence of one contig represents the known T-DNA insertion on chromosome 4 of T1190. The sequences of the remaining contigs were either equally present in T1190 and PinS, their part with sequence identity to the vector was equally present in apple reference genomes, or they seem to result from endophytic contaminations rather than from additional transgenic insertions. Therefore, we conclude that the transgenic apple plant T1190 contains only one transgenic insertion, located on chromosome 4, and shows no further partial insertions of the transformation vector. Accession Numbers: JQ974028.1.
Project description:MOTIVATION: Transposable elements (TEs) constitute a significant proportion of the majority of genomes sequenced to date. TEs are responsible for a considerable fraction of the genetic variation within and among species. Accurate genotyping of TEs in genomes is therefore crucial for a complete identification of the genetic differences among individuals, populations and species. RESULTS: In this work, we present a new version of T-lex, a computational pipeline that accurately genotypes and estimates the population frequencies of reference TE insertions using short-read high-throughput sequencing data. In this new version, we have re-designed the T-lex algorithm to integrate the BWA-MEM short-read aligner, which is one of the most accurate short-read mappers and can be launched on longer short reads (e.g. reads >150 bp). We have added new filtering steps to increase the accuracy of the genotyping, and new parameters that allow the user to control both the minimum and maximum number of reads, and the minimum number of strains to genotype a TE insertion. We also showed for the first time that T-lex3 provides accurate TE calls in a plant genome. AVAILABILITY AND IMPLEMENTATION: To test the accuracy of T-lex3, we called 1630 individual TE insertions in Drosophila melanogaster, 1600 individual TE insertions in humans, and 3067 individual TE insertions in the rice genome. We showed that this new version of T-lex is a broadly applicable and accurate tool for genotyping and estimating TE frequencies in organisms with different genome sizes and different TE contents. T-lex3 is available at Github: https://github.com/GonzalezLab/T-lex3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.