Dataset Information

How to optimally sample a sequence for rapid analysis.


ABSTRACT:

Motivation

We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal.
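
To make the idea of sampling a subset of positions concrete, here is a minimal sketch of one of the earlier methods named above, minimizers. It is not the approach presented in this paper. The function name `minimizer_positions`, the parameters `k` (word length) and `w` (window size in k-mers), and the use of plain lexicographic order to rank k-mers are all assumptions made for illustration; practical tools typically rank k-mers by a hash instead.

```python
def minimizer_positions(seq, k, w):
    """Return sorted start positions of the k-mers sampled as minimizers.

    Illustrative assumption: k-mers are ranked lexicographically, and
    ties within a window go to the leftmost k-mer.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []  # sequence too short to hold one full window
    chosen = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer in this window (leftmost on ties).
        best = min(window, key=lambda i: (seq[i:i + k], i))
        chosen.add(best)
    return sorted(chosen)


if __name__ == "__main__":
    print(minimizer_positions("ACGTACGT", k=3, w=3))
```

Because every window contributes one position, this scheme samples at least once per window, but neighbouring windows often pick the same k-mer, so the sampled set is usually much smaller than the full set of positions.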

Results

We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.

Availability and implementation

Source code is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Frith MC 

PROVIDER: S-EPMC9907223 | biostudies-literature | 2023 Feb

REPOSITORIES: biostudies-literature

Publications

How to optimally sample a sequence for rapid analysis.

Martin C Frith, Jim Shaw, John L Spouge

Bioinformatics (Oxford, England), 2023-02-01, issue 2


Similar Datasets

| S-EPMC7984032 | biostudies-literature
| S-EPMC1180355 | biostudies-literature
| PRJNA725596 | ENA
| S-EPMC4133582 | biostudies-literature
| S-EPMC6533916 | biostudies-literature
| S-EPMC3371582 | biostudies-literature
| S-EPMC8382918 | biostudies-literature
| S-EPMC10104307 | biostudies-literature
| S-EPMC9882026 | biostudies-literature
| S-EPMC8721559 | biostudies-literature