Dataset Information

HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly.

ABSTRACT:

Background

The rapid advancement of sequencing technologies has made it possible to routinely produce millions of high-quality reads from DNA samples in sequencing laboratories. The de Bruijn graph is a popular data structure in the genome assembly literature for representing and processing these reads efficiently. Because a de Bruijn graph contains a very large number of nodes, memory consumption and running time are the main barriers. This area has therefore received significant attention in contemporary literature.

Results

In this paper, we present an approach called HaVec that attempts to achieve a balance between the memory consumption and the running time. HaVec uses a hash table along with an auxiliary vector data structure to store the de Bruijn graph thereby improving the total memory usage and the running time. A critical and noteworthy feature of HaVec is that it exhibits no false positive error.

Conclusions

In general, the graph construction procedure takes the major share of the time involved in an assembly process. HaVec can be seen as a significant advancement in this aspect. We anticipate that HaVec will be extremely useful in the de Bruijn graph-based genome assembly.
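
To make the idea in the Results paragraph concrete, the following is a minimal sketch of a hash table whose slots each hold a small vector of k-mers. It is not HaVec's actual memory layout, which this record does not describe. The class name HashVectorGraph, the helper build_graph, the table_size parameter, and the choice to store each full k-mer in its slot are all assumptions made for illustration. Storing the full k-mer is what keeps membership queries exact here, which mirrors the "no false positive" property only in spirit.

```python
def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class HashVectorGraph:
    """Hypothetical node-centric de Bruijn graph: a fixed-size hash table
    in which each slot is a small vector of [kmer, count] entries."""

    def __init__(self, k, table_size=1 << 10):
        self.k = k
        self.table_size = table_size
        self.slots = [[] for _ in range(table_size)]

    def _slot(self, kmer):
        # Assumption: Python's built-in string hash stands in for
        # whatever hash function a real implementation would use.
        return self.slots[hash(kmer) % self.table_size]

    def add(self, kmer):
        slot = self._slot(kmer)
        for entry in slot:
            if entry[0] == kmer:      # exact match: bump the count
                entry[1] += 1
                return
        slot.append([kmer, 1])        # collision or new k-mer: extend the vector

    def __contains__(self, kmer):
        # The stored k-mer is compared in full, so two k-mers that hash
        # to the same slot are never confused with each other.
        return any(entry[0] == kmer for entry in self._slot(kmer))

    def successors(self, kmer):
        # Edges are implicit: a successor shares a (k-1)-overlap with kmer.
        suffix = kmer[1:]
        return [suffix + base for base in "ACGT" if (suffix + base) in self]


def build_graph(reads, k, table_size=1 << 10):
    graph = HashVectorGraph(k, table_size)
    for read in reads:
        for kmer in kmers(read, k):
            graph.add(kmer)
    return graph


if __name__ == "__main__":
    g = build_graph(["ACGTAC", "CGTACG"], k=3)
    print("ACG" in g, "TTT" in g)
    print(g.successors("ACG"))
```

Setting table_size to 1 forces every k-mer into the same slot, which is a quick way to check that collisions slow lookups down but never produce a wrong answer in this sketch.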

SUBMITTER: Rahman MM

PROVIDER: S-EPMC5591975 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature


Publications

HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly.

Md Mahfuzer Rahman, Ratul Sharker, Sajib Biswas, M Sohel Rahman

International Journal of Genomics, 2017-08-27

PMID: 28929105

Similar Datasets

Project description:

Motivation

Indexing reference sequences for search (both individual genomes and collections of genomes) is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly repetitive sequence regions. Yet, much less attention has been given to how best to index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.

Results

We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.

Availability and implementation

pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

Supplementary information

Supplementary data are available at Bioinformatics online.
