Evaluating de Bruijn graph assemblers on 454 transcriptomic data.
ABSTRACT: Next generation sequencing (NGS) technologies have greatly changed the landscape of transcriptomic studies of non-model organisms. Since there is no reference genome available, de novo assembly methods play key roles in the analysis of these data sets. Because of the huge amount of data generated by NGS technologies for each run, many assemblers, e.g., ABySS, Velvet and Trinity, are developed based on a de Bruijn graph due to its time- and space-efficiency. However, most of these assemblers were developed initially for the Illumina/Solexa platform. The performance of these assemblers on 454 transcriptomic data is unknown. In this study, we evaluated and compared the relative performance of these de Bruijn graph based assemblers on both simulated and real 454 transcriptomic data. The results suggest that Trinity, the Illumina/Solexa-specialized transcriptomic assembler, performs the best among the multiple de Bruijn graph assemblers, comparable to or even outperforming the standard 454 assembler Newbler which is based on the overlap-layout-consensus algorithm. Our evaluation is expected to provide helpful guidance for researchers to choose assemblers when analyzing 454 transcriptomic data.
Project description:The advent of next-generation sequencing technologies is accompanied with the development of many whole-genome sequence assembly methods and software, especially for de novo fragment assembly. Due to the poor knowledge about the applicability and performance of these software tools, choosing a befitting assembler becomes a tough task. Here, we provide the information of adaptivity for each program, then above all, compare the performance of eight distinct tools against eight groups of simulated datasets from Solexa sequencing platform. Considering the computational time, maximum random access memory (RAM) occupancy, assembly accuracy and integrity, our study indicate that string-based assemblers, overlap-layout-consensus (OLC) assemblers are well-suited for very short reads and longer reads of small genomes respectively. For large datasets of more than hundred millions of short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, of which SOAPdenovo is complex for the creation of configuration file. Our comparison study will assist researchers in selecting a well-suited assembler and offer essential information for the improvement of existing assemblers or the developing of novel assemblers.
Project description:While the number of sequenced diploid genomes have been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from the next generation sequencing (NGS) data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. To address this limitation, we developed the first de Bruijn graph assembler, dipSPAdes, for HP genomes that significantly improves on the state-of-the-art assemblers for HP diploid genomes.
Project description:An important step in 'metagenomics' analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as 'Velvet', to metagenome assembly, which we called 'MetaVelvet', for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assemblers. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.
Project description:BACKGROUND: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). RESULTS: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. CONCLUSIONS: An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
Project description:Intimately tied to assembly quality is the complexity of the de Bruijn graph built by the assembler. Thus, there have been many paradigms developed to decrease the complexity of the de Bruijn graph. One obvious combinatorial paradigm for this is to allow the value of k to vary; having a larger value of k where the graph is more complex and a smaller value of k where the graph would likely contain fewer spurious edges and vertices. One open problem that affects the practicality of this method is how to predict the value of k prior to building the de Bruijn graph. We show that optimal values of k can be predicted prior to assembly by using the information contained in a phylogenetically-close genome and therefore, help make the use of multiple values of k practical for genome assembly.We present HyDA-Vista, which is a genome assembler that uses homology information to choose a value of k for each read prior to the de Bruijn graph construction. The chosen k is optimal if there are no sequencing errors and the coverage is sufficient. Fundamental to our method is the construction of the maximal sequence landscape, which is a data structure that stores for each position in the input string, the largest repeated substring containing that position. In particular, we show the maximal sequence landscape can be constructed in O(n+n log n)-time and O(n)-space. HyDA-Vista first constructs the maximal sequence landscape for a homologous genome. The reads are then aligned to this reference genome, and values of k are assigned to each read using the maximal sequence landscape and the alignments. Eventually, all the reads are assembled by an iterative de Bruijn graph construction method. Our results and comparison to other assemblers demonstrate that HyDA-Vista achieves the best assembly of E. coli before repeat resolution or scaffolding.HyDA-Vista is freely available 1. The code for constructing the maximal sequence landscape and choosing the optimal value of k for each read is also separately available on the website and could be incorporated into any genome assembler.
Project description:Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads for constructing an overlap graph where only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read bacterial data sets from the NCTC project. HINGE produces more finished assemblies than Miniasm and the manual pipeline of NCTC based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 data sets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.
Project description:De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
Project description:Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka').We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a firstname.lastname@example.org.Supplementary data are available at Bioinformatics online.
Project description:Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in.Availability https://github.com/pmelsted/bifrost.
Project description:The "Graphical Fragment Assembly" (GFA) is an emerging format for the representation of sequence assembly graphs, which can be adopted by both de Bruijn graph- and string graph-based assemblers. Here we present RGFA, an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. Complex operations such as the separation of the implicit instances of repeats and the merging of linear paths can be performed. A typical application of RGFA is the editing of a graph, to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case, in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA. Furthermore, we show how the API provided by RGFA can be employed to design complex graph editing algorithms. As an example, we developed a detection algorithm for CRISPRs in a de Bruijn graph. Finally, RGFA can be used for comparing assembly graphs, e.g., to document the changes in a graph after applying a GUI editor. A program, GFAdiff is provided, which compares the information in two graphs, and generate a report or a Ruby script documenting the transformation steps between the graphs.