
Dataset Information


Minimally-overlapping words for sequence similarity search.


ABSTRACT: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary data are available at Bioinformatics online.
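The word-based sparse seeding described in the abstract can be illustrated with a minimal sketch. This is not the authors' noverlap software: the function names `word_positions` and `gaps`, the comparison word set, and the random test sequence are all assumptions made for illustration. The sketch selects positions where one of the words ac, at, gc, or gt starts, and contrasts the spacing of those positions with that of a hypothetical word set of the same size whose words can overlap each other.

```python
import random


def word_positions(seq, words):
    """Return start positions in seq where any of the equal-length words occurs."""
    k = len(next(iter(words)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in words]


def gaps(positions):
    """Return distances between consecutive selected positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Words from the abstract: the first letter is a or g, the second is c or t,
# so no word can start at the letter where another word ends.
NON_OVERLAPPING = {"ac", "at", "gc", "gt"}

# Hypothetical comparison set (an assumption, not from the source): also 4 of
# the 16 dinucleotides, but its words can overlap (e.g. "aa" then "ac").
OVERLAPPING = {"aa", "ac", "ca", "cc"}

# Uniform random DNA, seeded for reproducibility (illustrative data only).
rng = random.Random(0)
seq = "".join(rng.choice("acgt") for _ in range(10_000))

pos_a = word_positions(seq, NON_OVERLAPPING)
pos_b = word_positions(seq, OVERLAPPING)

print("non-overlapping words: count, min gap, max gap =",
      len(pos_a), min(gaps(pos_a)), max(gaps(pos_a)))
print("overlapping words:     count, min gap, max gap =",
      len(pos_b), min(gaps(pos_b)), max(gaps(pos_b)))
```

With the non-overlapping set, two selected positions can never be adjacent, so the minimum gap is at least 2. Both sets select about a quarter of positions on average in uniform random sequence, so the comparison isolates how evenly the selected positions are spread, which is the anti-clumping property the abstract refers to.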

SUBMITTER: Frith MC 

PROVIDER: S-EPMC8016470 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature


Publications

Minimally overlapping words for sequence similarity search.

Frith MC, Noé L, Kucherov G

Bioinformatics (Oxford, England), 2021-04-01, 22-23



Similar Datasets

| S-EPMC4699916 | biostudies-literature
| S-EPMC3213098 | biostudies-literature
| S-EPMC5666806 | biostudies-literature
| S-EPMC2587480 | biostudies-literature
| S-EPMC4460465 | biostudies-literature
| S-EPMC8570820 | biostudies-literature
| S-EPMC5274646 | biostudies-literature
| S-EPMC1421445 | biostudies-literature
| S-EPMC2194796 | biostudies-literature
| S-EPMC2796334 | biostudies-literature