Dataset Information

microTaboo: a general and practical solution to the k-disjoint problem.

ABSTRACT: A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can efficiently be achieved by k-mer (k consecutive nucleotides) counting. However, there are several areas that would benefit from a more stringent definition of "unique", requiring that these sub-sequences of length W differ by more than k mismatches (i.e. a Hamming distance greater than k) from any other sub-sequence, which we term the k-disjoint problem. Examples include finding sequences unique to a pathogen for probe-based infection diagnostics; reducing off-target hits for re-sequencing or genome editing; detecting sequence (e.g. phage or viral) insertions; and detecting multiple substitution mutations. Since both sensitivity and specificity are critical, an exhaustive yet efficient solution is desirable. We present microTaboo, a method that allows for efficient and extensive sequence mining of unique (k-disjoint) sequences of up to 100 nucleotides in length. On a number of simulated and real data sets ranging from microbe- to mammalian-size genomes, we show that microTaboo is able to efficiently find all sub-sequences of a specified length W that do not occur within a threshold of k mismatches in any other sub-sequence. We exemplify that microTaboo has many practical applications, including point substitution detection, sequence insertion detection, padlock probe target search, and candidate CRISPR target mining. microTaboo implements a solution to the k-disjoint problem in an alignment- and assembly-free manner. microTaboo is available for Windows, Mac OS X, and Linux, running Java 7 and higher, under the GNU GPLv3 license, at: https://MohammedAlJaff.github.io/microTaboo.
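
To make the k-disjoint definition in the abstract concrete, here is a minimal brute-force sketch in Python. It is not microTaboo's algorithm (the abstract does not describe the implementation) and it would not scale to genome-size inputs. The function names `hamming` and `k_disjoint_windows`, and the simplification of comparing one target sequence against a list of background sequences, are assumptions made for illustration only.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def k_disjoint_windows(
    target: str, background: List[str], w: int, k: int
) -> List[Tuple[int, str]]:
    """Return (offset, window) for each length-w window of `target` whose
    Hamming distance to every length-w window of every background
    sequence is greater than k.

    Hypothetical helper: a naive all-against-all comparison, shown only
    to illustrate the definition of k-disjoint.
    """
    # Collect every distinct length-w window found in the background.
    bg_windows = {
        seq[i:i + w] for seq in background for i in range(len(seq) - w + 1)
    }
    hits = []
    for i in range(len(target) - w + 1):
        window = target[i:i + w]
        # Keep the window only if no background window is within k mismatches.
        if all(hamming(window, other) > k for other in bg_windows):
            hits.append((i, window))
    return hits


if __name__ == "__main__":
    # k = 0 reduces to plain "exact k-mer not present" uniqueness.
    print(k_disjoint_windows("ACGTACGT", ["ACGAACGT"], w=4, k=0))
    # k = 1 is stricter: windows one mismatch away from the background are rejected.
    print(k_disjoint_windows("ACGTACGT", ["ACGAACGT"], w=4, k=1))
```

With k = 0 the check is equivalent to exact-match uniqueness, which is what plain k-mer counting provides; raising k discards windows that have near-matches in the background, which is the stricter notion of "unique" the abstract argues for.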

SUBMITTER: Al-Jaff M

PROVIDER: S-EPMC5414201 | biostudies-other | 2017 May

REPOSITORIES: biostudies-other

Publications

microTaboo: a general and practical solution to the k-disjoint problem.

Mohammed Al-Jaff, Eric Sandström, Manfred Grabherr

BMC Bioinformatics, 2017-05-02, issue 1

PMID: 28464826

Similar Datasets

A general solution for the 2-pyridyl problem. (S-EPMC3433254, biostudies-literature)

General solution for rapid scan EPR deconvolution problem. (S-EPMC7575242, biostudies-literature)

Promising general solution to the problem of ligating peptides and glycopeptides. (S-EPMC3020898, biostudies-literature)

A practical solution for preserving single cells for RNA sequencing (2017-05-11, GSE98734, GEO)

Practical structure solution with ARCIMBOLDO. (S-EPMC3322593, biostudies-literature)

Preemption versus Entrenchment: Towards a Construction-General Solution to the Problem of the Retreat from Verb Argument Structure Overgeneralization. (S-EPMC4412412, biostudies-literature)

Genetic algorithm solution for double digest problem. (S-EPMC3374354, biostudies-literature)

Solution of the Kirchhoff-Plateau Problem. (S-EPMC5479363, biostudies-other)

Hemodialysis in children: general practical guidelines. (S-EPMC1766474, biostudies-other)

A cellular solution to an information-processing problem. (S-EPMC4151762, biostudies-literature)