
Dataset Information


Efficient minimizer orders for large values of k using minimum decycling sets.


ABSTRACT: Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
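As an illustration of the minimizer scheme defined in the abstract, the following minimal Python sketch selects the minimum k-mer in every L-long window of a sequence. The function name `minimizers`, its parameters, and the default lexicographic order are assumptions made for this example only; the decycling-set-based order introduced by the authors is not implemented here, and any alternative order would be passed in through the hypothetical `order` argument.

```python
def minimizers(seq, k, L, order=None):
    """Return the (position, k-mer) pairs selected as window minimizers.

    seq   : target sequence (string)
    k     : k-mer length
    L     : window length in characters (each window holds L - k + 1 k-mers)
    order : function mapping a k-mer to a sortable key; lexicographic
            order is used by default (an assumption for illustration,
            not the decycling-set-based order from the paper)
    """
    if order is None:
        order = lambda kmer: kmer  # plain lexicographic order
    if k > L or L > len(seq):
        return []

    selected = set()
    # Slide an L-long window over the sequence.
    for start in range(len(seq) - L + 1):
        candidates = range(start, start + L - k + 1)
        # Pick the smallest k-mer under the order; ties go to the leftmost.
        best = min(candidates, key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)

    return [(i, seq[i:i + k]) for i in sorted(selected)]


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGT", k=3, L=5):
        print(pos, kmer)
```

Consecutive windows often share the same minimizer, so the number of distinct selected positions is typically much smaller than the number of windows. The choice of `order` determines how many k-mers end up selected, which is the quantity the abstract says the new orders aim to reduce.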

SUBMITTER: Pellow D 

PROVIDER: S-EPMC10538483 | biostudies-literature | 2023 Jul

REPOSITORIES: biostudies-literature


Publications

Efficient minimizer orders for large values of k using minimum decycling sets.

David Pellow, Lianrong Pu, Bariş Ekim, Lior Kotlar, Bonnie Berger, Ron Shamir, Yaron Orenstein

Genome Research, 2023 Jul 1, issue 7



Similar Datasets

| S-EPMC7015965 | biostudies-literature
| S-EPMC10659450 | biostudies-literature
| S-EPMC10541625 | biostudies-literature
| S-EPMC3185442 | biostudies-literature
| S-EPMC4697868 | biostudies-literature
| S-EPMC10538364 | biostudies-literature
| S-EPMC3439324 | biostudies-literature
| S-EPMC3636516 | biostudies-literature
| S-EPMC6096449 | biostudies-literature
| S-EPMC10373352 | biostudies-literature