The hidden perils of read mapping as a quality assessment tool in genome sequencing.
ABSTRACT: This article provides a comparative analysis of the various methods of genome sequencing focusing on verification of the assembly quality. The results of a comparative assessment of various de novo assembly tools, as well as sequencing technologies, are presented using a recently completed sequence of the genome of Lactobacillus fermentum 3872. In particular, quality of assemblies is assessed by using CLC Genomics Workbench read mapping and Optical mapping developed by OpGen. Over-extension of contigs without prior knowledge of contig location can lead to misassembled contigs, even when commonly used quality indicators such as read mapping suggest that a contig is well assembled. Precautions must also be undertaken when using long read sequencing technology, which may also lead to misassembled contigs.
Project description:A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used stimulated optical mapping data for loblolly pine and F.tularensis and used real optical mapping data for rice and budgerigar.Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F.tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar.misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/.
Project description:BACKGROUND: Whole genome (optical) mapping (WGM), a state-of-the-art mapping technology based on the generation of high resolution restriction maps, has so far been used for typing clinical outbreak strains and for mapping de novo sequence contigs in genome sequencing projects. We employed WGM to assess the genomic stability of previously sequenced Staphylococcus aureus strains that are commonly used in laboratories as reference standards. RESULTS: S. aureus strains (n?=?12) were mapped on the Argus™ Optical Mapping System (Opgen Inc, Gaithersburg, USA). Assembly of NcoI-restricted DNA molecules, visualization, and editing of whole genome maps was performed employing MapManager and MapSolver softwares (Opgen Inc). In silico whole genome NcoI-restricted maps were also generated from available sequence data, and compared to the laboratory-generated maps. Strains showing differences between the two maps were resequenced using Nextera XT DNA Sample Preparation Kit and Miseq Reagent Kit V2 (MiSeq, Illumina) and de novo assembled into sequence contigs using the Velvet assembly tool. Sequence data were correlated with corresponding whole genome maps to perform contig mapping and genome assembly using MapSolver. Of the twelve strains tested, one (USA300_FPR3757) showed a 19-kbp deletion on WGM compared to its in silico generated map and reference sequence data. Resequencing of the USA300_FPR3757 identified the deleted fragment to be a 13 kbp-long integrative conjugative element ICE6013. CONCLUSIONS: Frequent subculturing and inter-laboratory transfers can induce genomic and therefore, phenotypic changes that could compromise the utility of standard reference strains. WGM can thus be used as a rapid genome screening method to identify genomic rearrangements whose size and type can be confirmed by sequencing.
Project description:BACKGROUND: Until recently, read lengths on the Solexa/Illumina system were too short to reliably assemble transcriptomes without a reference sequence, especially for non-model organisms. However, with read lengths up to 100 nucleotides available in the current version, an assembly without reference genome should be possible. For this study we created an EST data set for the common pond snail Radix balthica by Illumina sequencing of a normalized transcriptome. Performance of three different short read assemblers was compared with respect to: the number of contigs, their length, depth of coverage, their quality in various BLAST searches and the alignment to mitochondrial genes. RESULTS: A single sequencing run of a normalized RNA pool resulted in 16,923,850 paired end reads with median read length of 61 bases. The assemblies generated by VELVET, OASES, and SeqMan NGEN differed in the total number of contigs, contig length, the number and quality of gene hits obtained by BLAST searches against various databases, and contig performance in the mt genome comparison. While VELVET produced the highest overall number of contigs, a large fraction of these were of small size (< 200bp), and gave redundant hits in BLAST searches and the mt genome alignment. The best overall contig performance resulted from the NGEN assembly. It produced the second largest number of contigs, which on average were comparable to the OASES contigs but gave the highest number of gene hits in two out of four BLAST searches against different reference databases. A subsequent meta-assembly of the four contig sets resulted in larger contigs, less redundancy and a higher number of BLAST hits. CONCLUSION: Our results document the first de novo transcriptome assembly of a non-model species using Illumina sequencing data. We show that de novo transcriptome assembly using this approach yields results useful for downstream applications, in particular if a meta-assembly of contig sets is used to increase contig quality. These results highlight the ongoing need for improvements in assembly methodology.
Project description:The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs (post-assembly), making subsequent viral analysis difficult. Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.
Project description:Next-generation sequencing projects commonly commence by aligning reads to a reference genome assembly. While improvements in alignment algorithms and computational hardware have greatly enhanced the efficiency and accuracy of alignments, a significant percentage of reads often remain unmapped.We generated de novo assemblies of unmapped reads from the DNA and RNA sequencing of the Bos taurus reference individual and identified the closest matching sequence to each contig by alignment to the NCBI non-redundant nucleotide database using BLAST. As expected, many of these contigs represent vertebrate sequence that is absent, incomplete, or misassembled in the UMD3.1 reference assembly. However, numerous additional contigs represent invertebrate species. Most prominent were several species of Spirurid nematodes and a blood-borne parasite, Babesia bigemina. These species are either not present in the US or are not known to infect taurine cattle and the reference animal appears to have been host to unsequenced sister species.We demonstrate the importance of exploring unmapped reads to ascertain sequences that are either absent or misassembled in the reference assembly and for detecting sequences indicative of parasitic or commensal organisms.
Project description:Viral communities of two different salt pans located in the Namib Desert, Hosabes and Eisfeld, were investigated using a combination of multiple displacement amplification of metaviromic DNA and deep sequencing, and provided comprehensive sequence data on both ssDNA and dsDNA viral community structures. Read and contig annotations through online pipelines showed that the salt pans harbored largely unknown viral communities. Through network analysis, we were able to assign a large portion of the unknown reads to a diverse group of ssDNA viruses. Contigs belonging to the subfamily Gokushovirinae were common in both environmental datasets. Analysis of haloarchaeal virus contigs revealed the presence of three contigs distantly related with His1, indicating a possible new lineage of salterproviruses in the Hosabes playa. Based on viral richness and read mapping analyses, the salt pan metaviromes were novel and most closely related to each other while showing a low degree of overlap with other environmental viromes.
Project description:The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc .
Project description:Assembly of metagenomic samples is a very complex process, with algorithms designed to address sequencing platform-specific issues, (read length, data volume, and/or community complexity), while also faced with genomes that differ greatly in nucleotide compositional biases and in abundance. To address these issues, we have developed a post-assembly process: MetaGenomic Assembly by Merging (MeGAMerge). We compare this process to the performance of several assemblers, using both real, and in-silico generated samples of different community composition and complexity. MeGAMerge consistently outperforms individual assembly methods, producing larger contigs with an increased number of predicted genes, without replication of data. MeGAMerge contigs are supported by read mapping and contig alignment data, when using synthetically-derived and real metagenomic data, as well as by gene prediction analyses and similarity searches. MeGAMerge is a flexible method that generates improved metagenome assemblies, with the ability to accommodate upcoming sequencing platforms, as well as present and future assembly algorithms.
Project description:BACKGROUND:Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. FINDINGS:We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. CONCLUSION:The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
Project description:BACKGROUND: Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome. RESULTS: Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from an Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously. CONCLUSIONS: This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.