Dataset Information

SAKE: Strobemer-assisted k-mer extraction.


ABSTRACT: K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Long k-mers are often preferred because they can be uniquely associated with a specific genomic region. Unfortunately, standard exact k-mer counting methods cannot reliably extract long k-mers from high-error-rate reads. We propose SAKE, a method that extracts long k-mers from high-error-rate reads by using strobemers and generating consensus k-mers through partial order alignment. Our experiments show that on simulated data with up to 6% error rate, SAKE can extract 97-mers with over 90% recall. In contrast, the recall of DSK, an exact k-mer counter, drops to less than 20%. Furthermore, the precision of SAKE remains similar to that of DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, while the recall of DSK drops to 50%. We show that SAKE can extract more k-mers from uncorrected high-error-rate reads than exact k-mer counting can. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.
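To illustrate why exact counting struggles with long k-mers, the following minimal sketch computes the chance that a k-base window of a read contains no errors, and evaluates recall and precision of an extracted k-mer set against a reference set. It is not taken from SAKE or DSK: the independent per-base error model and all function names are assumptions made for illustration only. Under that simplified model, a 97-base window at a 6% error rate is error-free with a probability of about 0.25%.

```python
from __future__ import annotations


def error_free_kmer_probability(k: int, error_rate: float) -> float:
    """Probability that k consecutive bases are all correct.

    Assumption (not from the paper): errors are independent per base.
    """
    return (1.0 - error_rate) ** k


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all exact k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recall_precision(extracted: set[str], truth: set[str]) -> tuple[float, float]:
    """Recall and precision of an extracted k-mer set against a reference set."""
    true_positives = len(extracted & truth)
    recall = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(extracted) if extracted else 0.0
    return recall, precision


if __name__ == "__main__":
    # Longer k-mers are far less likely to survive a read intact.
    for k in (21, 51, 97):
        print(k, error_free_kmer_probability(k, 0.06))

    # Toy example: one substitution in a short read destroys most exact 3-mers.
    truth = kmers("ACGTAC", 3)
    observed = kmers("ACGAAC", 3)  # hypothetical read with one substitution
    print(recall_precision(observed, truth))
```

Because every k-mer overlapping an error is lost to an exact counter, a single error removes up to k true k-mers from a read, and the effect grows with k.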

SUBMITTER: Leinonen M 

PROVIDER: S-EPMC10686461 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature

Publications

SAKE: Strobemer-assisted k-mer extraction.

Miika Leinonen, Leena Salmela

PLoS One, 2023-11-29, issue 11


Similar Datasets

| S-EPMC9346074 | biostudies-literature
| S-EPMC8401966 | biostudies-literature
| S-EPMC10014295 | biostudies-literature
| S-EPMC10486403 | biostudies-literature
| S-EPMC6080726 | biostudies-literature
| S-EPMC8949413 | biostudies-literature
| S-EPMC5412662 | biostudies-literature
| S-EPMC9502848 | biostudies-literature
| S-EPMC7727962 | biostudies-literature
| S-EPMC6826910 | biostudies-literature