Completing bacterial genome assemblies: strategy and performance comparisons.
ABSTRACT: Determining the genomic sequences of microorganisms is the basis and prerequisite for understanding their biology and functional characterization. While the advent of low-cost, extremely high-throughput second-generation sequencing technologies and the parallel development of assembly algorithms have generated rapid and cost-effective genome assemblies, such assemblies are often unfinished, fragmented draft genomes as a result of short read lengths and long repeats present in multiple copies. Third-generation, PacBio sequencing technologies circumvented this problem by greatly increasing read length. Hybrid approaches including ALLPATHS-LG, PacBio corrected reads pipeline, SPAdes, and SSPACE-LongRead, and non-hybrid approaches--hierarchical genome-assembly process (HGAP) and PacBio corrected reads pipeline via self-correction--have therefore been proposed to utilize the PacBio long reads that can span many thousands of bases to facilitate the assembly of complete microbial genomes. However, standardized procedures that aim at evaluating and comparing these approaches are currently insufficient. To address the issue, we herein provide a comprehensive comparison by collecting datasets for the comparative assessment on the above-mentioned five assemblers. In addition to offering explicit and beneficial recommendations to practitioners, this study aims to aid in the design of a paradigm positioned to complete bacterial genome assembly.
Project description:The development of long-read sequencing technologies, such as single-molecule real-time (SMRT) sequencing by PacBio, has produced a revolution in the sequencing of small genomes. Sequencing organelle genomes using PacBio long-read data is a cost-effective, straightforward approach. Nevertheless, the availability of simple-to-use software to perform the assembly from raw reads is limited at present. We present Organelle-PBA, a Perl program designed specifically for the assembly of chloroplast and mitochondrial genomes. For chloroplast genomes, the program selects the chloroplast reads from a whole genome sequencing pool, maps the reads to a reference sequence from a closely related species, and then performs read correction and de novo assembly using Sprai. Organelle-PBA completes the assembly process with the additional step of scaffolding by SSPACE-LongRead. The program then detects the chloroplast inverted repeats and reassembles and re-orients the assembly based on the organelle origin of the reference. We have evaluated the performance of the software using PacBio reads from different species, read coverage, and reference genomes. Finally, we present the assembly of two novel chloroplast genomes from the species Picea glauca (Pinaceae) and Sinningia speciosa (Gesneriaceae). Organelle-PBA is an easy-to-use Perl-based software pipeline that was written specifically to assemble mitochondrial and chloroplast genomes from whole genome PacBio reads. The program is available at https://github.com/aubombarely/Organelle_PBA .
Project description:Despite the ever-increasing output of next-generation sequencing data along with developing assemblers, dozens to hundreds of gaps still exist in de novo microbial assemblies due to uneven coverage and large genomic repeats. Third-generation single-molecule, real-time (SMRT) sequencing technology avoids amplification artifacts and generates kilobase-long reads with the potential to complete microbial genome assembly. However, due to the low accuracy (~85%) of third-generation sequences, a considerable number of long reads (>50X) are required for self-correction and for subsequent de novo assembly. Recently developed hybrid approaches, using next-generation sequencing data and as few as 5X long reads, have been proposed to improve the completeness of microbial assembly. In this study we have evaluated the contemporary hybrid approaches and demonstrated that assembling corrected long reads (by runCA) produced the best assembly compared to long-read scaffolding (e.g., AHA, Cerulean and SSPACE-LongRead) and gap-filling (SPAdes). For generating corrected long reads, we further examined long-read correction tools, such as ECTools, LSC, LoRDEC, PBcR pipeline and proovread. We have demonstrated that three microbial genomes including Escherichia coli K12 MG1655, Meiothermus ruber DSM1279 and Pedobacter heparinus DSM2366 were successfully hybrid assembled by runCA into near-perfect assemblies using ECTools-corrected long reads. In addition, we developed a tool, Patch, which takes corrected long reads and pre-assembled contigs as inputs, to enhance microbial genome assemblies. With the additional 20X long reads, short reads of S. cerevisiae W303 were hybrid assembled into 115 contigs using the verified strategy, ECTools + runCA. Patch was subsequently applied to upgrade the assembly to a 35-contig draft genome.
Our evaluation of the hybrid approaches shows that assembling the ECTools-corrected long reads via runCA generates near complete microbial genomes, suggesting that genome assembly could benefit from re-analyzing the available hybrid datasets that were not assembled in an optimal fashion.
Project description:BACKGROUND:The recent introduction of the Pacific Biosciences RS single molecule sequencing technology has opened new doors to scaffolding genome assemblies in a cost-effective manner. The long read sequence information promises to enhance the quality of incomplete and inaccurate draft assemblies constructed from Next Generation Sequencing (NGS) data. RESULTS:Here we propose a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone. On a test set comprising six bacterial draft genomes, assembled using either a single Illumina MiSeq or Roche 454 library, we show that even a 50× coverage of uncorrected PacBio RS long reads is sufficient to drastically reduce the number of contigs. Comparisons to the AHA scaffolder indicate our strategy is better capable of producing (nearly) complete bacterial genomes. CONCLUSIONS:The current work describes our SSPACE-LongRead software, which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow genomes to be scaffolded in a fast and reliable manner.
Project description:Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads for constructing an overlap graph where only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read bacterial data sets from the NCTC project. HINGE produces more finished assemblies than Miniasm and the manual pipeline of NCTC based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 data sets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.
Project description:Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial and while there are many computational tools partially solving the problem, several have shortcomings as to the reliability and correctness of the output, i.e. the gap filled draft genome. SSPACE-LongRead is a scaffolding tool that utilizes long reads from multiple third-generation sequencing platforms in finding links between contigs and combining them. The long reads potentially contain sequence information to fill the gaps created in the scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher to process SSPACE-LongRead output to fill gaps after the scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool FGAP and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint.
Project description:BACKGROUND:The advent of third-generation sequencing (TGS) technologies opens the door to improved genome assembly. Long reads are promising for enhancing the quality of fragmented draft assemblies constructed from next-generation sequencing (NGS) technologies. To date, a few algorithms that are capable of improving draft assemblies have been released. These include SSPACE-LongRead, OPERA-LG, SMIS, npScarf, DBG2OLC, Unicycler, and LINKS. Hybrid assembly on large genomes remains challenging, however. RESULTS:We develop a scalable and computationally efficient scaffolder, Long Reads Scaffolder (LRScaf, https://github.com/shingocat/lrscaf), that is capable of significantly boosting assembly contiguity using long reads. In this study, we summarise a comprehensive performance assessment for state-of-the-art scaffolders and LRScaf on seven organisms, i.e., E. coli, S. cerevisiae, A. thaliana, O. sativa, S. pennellii, Z. mays, and H. sapiens. LRScaf significantly improves the contiguity of draft assemblies, e.g., increasing the NGA50 value of CHM1 from 127.1 kbp to 9.4 Mbp using a 20-fold coverage PacBio dataset and the NGA50 value of NA12878 from 115.3 kbp to 12.9 Mbp using a 35-fold coverage Nanopore dataset. In addition, LRScaf generates the most contiguous NGA50 on A. thaliana, S. pennellii, Z. mays, and H. sapiens. Moreover, LRScaf has the shortest run time compared with other scaffolders, and the peak RAM of LRScaf remains practical for large genomes (e.g., 20.3 and 62.6 GB on CHM1 and NA12878, respectively). CONCLUSIONS:The new algorithm, LRScaf, yields the best or, at least, moderate scaffold contiguity and accuracy in the shortest run time compared with other scaffolding algorithms. Furthermore, LRScaf provides a cost-effective way to improve contiguity of draft assemblies on large genomes.
Project description:PacBio long reads sequencing presents several potential advantages for DNA assembly, including being able to provide more complete gene profiling of metagenomic samples. However, lower single-pass accuracy can make gene discovery and assembly for low-abundance organisms difficult. To evaluate the application and performance of PacBio long reads and Illumina HiSeq short reads in metagenomic analyses, we directly compared various assemblies involving PacBio and Illumina sequencing reads based on two anaerobic digestion microbiome samples from a biogas fermenter. Using a PacBio platform, 1.58 million long reads (19.6 Gb) were produced with an average length of 7,604 bp. Using an Illumina HiSeq platform, 151.2 million read pairs (45.4 Gb) were produced. Hybrid assemblies using PacBio long reads and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length, contig N50 size, and number of large contigs. Interestingly, depth-based hybrid assemblies generated a higher percentage of complete genes (98.86%) compared to those based on HiSeq contigs only (40.29%), because the PacBio reads were long enough to cover many repeating short elements and capture multiple genes in a single read. Additionally, the incorporation of PacBio long reads led to considerable advantages regarding reducing contig numbers and increasing the completeness of the genome reconstruction, which was poorly assembled and binned when using HiSeq data alone. From this comparison of PacBio long reads with Illumina HiSeq short reads related to complex microbiome samples, we conclude that PacBio long reads can produce longer contigs, more complete genes, and better genome binning, thereby offering more information about metagenomic samples.
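Contig N50, one of the assembly statistics compared above, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal Python sketch of that standard definition; the function name `n50` and the example lengths are illustrative and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

Merging short contigs into fewer, longer ones raises this value even when the total assembly length is unchanged, which is why hybrid assemblies tend to report a higher N50 alongside a lower contig count.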
Project description:BACKGROUND:De novo RNA-Seq assembly is a powerful method for analysing transcriptomes when the reference genome is not available or poorly annotated. However, due to the short length of Illumina reads it is usually impossible to reconstruct complete sequences of complex genes and alternative isoforms. The recently emerged possibility of generating long RNA reads, with technologies such as PacBio and Oxford Nanopore, may dramatically improve the assembly quality, and thus the subsequent analysis. While reference-based tools for analysing long RNA reads were recently developed, there is no established pipeline for de novo assembly of such data. RESULTS:In this work we present a novel method that enables high-quality de novo transcriptome assemblies by combining the accuracy and reliability of short reads with exon structure information derived from long error-prone reads. The algorithm is designed by incorporating the existing hybridSPAdes approach into the rnaSPAdes pipeline and adapting it for transcriptomic data. CONCLUSION:To evaluate the benefit of using long RNA reads we selected several datasets containing both Illumina and Iso-seq or Oxford Nanopore Technologies (ONT) reads. Using existing quality assessment software, we show that hybrid assemblies performed with rnaSPAdes contain more full-length genes and alternative isoforms compared to the case when only short-read data is used.
Project description:Plasmodium coatneyi is a protozoan parasite species that causes simian malaria and is an excellent model for studying disease caused by the human malaria parasite, P. falciparum. Here we report the complete (nontelomeric) genome sequence of P. coatneyi Hackeri generated by the application of only Pacific Biosciences RS II (PacBio RS II) single-molecule real-time (SMRT) high-resolution sequence technology and assembly using the Hierarchical Genome Assembly Process (HGAP). This is the first Plasmodium genome sequence reported to use only PacBio technology. This approach has proven to be superior to short-read only approaches for this species.
Project description:Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (~15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ~95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ~5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.
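The k-mer validation idea described above can be illustrated with a small Python sketch. It is a toy illustration under stated assumptions, not the authors' implementation or the MHAP code: the function names (`kmers`, `trusted_kmers`, `seed_kmers`), the `max_count` repeat threshold, and the example reads are all hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def trusted_kmers(short_reads, k, max_count=None):
    """Collect k-mers seen in the accurate short reads.

    If max_count is given (a hypothetical repeat threshold), k-mers that
    occur more often than that are treated as repetitive and excluded.
    """
    counts = Counter(km for read in short_reads for km in kmers(read, k))
    return {km for km, c in counts.items()
            if max_count is None or c <= max_count}


def seed_kmers(long_read, trusted, k):
    """Keep only long-read k-mers validated by the short reads.

    Unvalidated k-mers are presumed to contain sequencing errors and are
    skipped, so only validated positions would be used as alignment seeds.
    """
    return [(i, km) for i, km in enumerate(kmers(long_read, k))
            if km in trusted]


if __name__ == "__main__":
    short_reads = ["ACGTAC", "CGTACG"]   # toy short reads
    noisy_long_read = "ACGTTACG"         # toy long read with an inserted T
    trusted = trusted_kmers(short_reads, k=4)
    print(seed_kmers(noisy_long_read, trusted, k=4))
```

In this toy input, the k-mers overlapping the inserted base are absent from the short reads and are dropped, leaving only the flanking k-mers as seed candidates. Passing `max_count` additionally discards high-frequency k-mers, mirroring the extension to repetitive k-mers mentioned in the abstract.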