Dataset Information

kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections.


ABSTRACT:

Summary

When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset.
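The two ideas in the summary (counting hashes instead of k-mers, and using joint counts to keep low-abundance k-mers that recur across samples) can be illustrated with a small sketch. This is not the kmtricks implementation: the single-hash Bloom filter, the BLAKE2b hash, the in-memory counters, and the parameter names `hard_min`, `soft_min` and `share_min` are all assumptions made for illustration, and the rescue rule shown is only one plausible reading of "present in several samples".

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no canonical form, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_hash(kmer, n_bits):
    """Map a k-mer to a position in [0, n_bits) with a stable hash (assumed choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_bits


def count_hashes(reads, k, n_bits):
    """Count hash values directly, so k-mers themselves are never stored."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            counts[kmer_hash(km, n_bits)] += 1
    return counts


def build_filters(samples, k, n_bits, hard_min=2, soft_min=1, share_min=2):
    """Build one single-hash Bloom filter per sample from jointly counted hashes.

    Hypothetical rule: a hash is kept in a sample if its count there is at
    least hard_min, or if its count is at least soft_min and it reaches
    soft_min in at least share_min samples overall.
    """
    per_sample = [count_hashes(reads, k, n_bits) for reads in samples]

    # Joint view: in how many samples does each hash reach soft_min?
    support = Counter()
    for counts in per_sample:
        for h, c in counts.items():
            if c >= soft_min:
                support[h] += 1

    filters = []
    for counts in per_sample:
        bits = bytearray(n_bits)  # one byte per position, for readability
        for h, c in counts.items():
            solid = c >= hard_min
            rescued = c >= soft_min and support[h] >= share_min
            if solid or rescued:
                bits[h] = 1
        filters.append(bits)
    return filters


def query(bits, kmer):
    """Return True if kmer may be in the sample (false positives are possible)."""
    return bits[kmer_hash(kmer, len(bits))] == 1
```

With three toy samples such as `[["ACGTA", "ACGTA"], ["TTGCA"], ["TTGCC"]]` and `k=4`, the k-mer `TTGC` occurs only once in each of the last two samples, so a plain per-sample threshold of 2 would drop it, while the joint rule above keeps it in both filters. A real pipeline would partition and sort hashes on disk rather than hold counters in memory.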

Availability and implementation

https://github.com/tlemane/kmtricks.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

SUBMITTER: Lemane T 

PROVIDER: S-EPMC9710589 | biostudies-literature | 2022

REPOSITORIES: biostudies-literature

Publications

kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections.

Téo Lemane, Paul Medvedev, Rayan Chikhi, Pierre Peterlongo

Bioinformatics Advances, 2022-04-29, 1


Similar Datasets

S-EPMC5467106 | biostudies-literature
S-EPMC7552494 | biostudies-literature
S-EPMC7849385 | biostudies-literature
S-EPMC4816029 | biostudies-literature
S-EPMC10080837 | biostudies-literature
S-EPMC2887045 | biostudies-literature
S-EPMC7992843 | biostudies-literature
S-EPMC4657956 | biostudies-literature
S-EPMC10318387 | biostudies-literature
S-EPMC5411771 | biostudies-literature