Dataset Information

How to optimally sample a sequence for rapid analysis.

ABSTRACT:

Motivation

We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal.
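
To make the idea of sampling positions concrete, here is a minimal sketch of window minimizers, one of the earlier heuristics named above (it is not the approach this paper presents). It assumes lexicographic k-mer order with leftmost tie-breaking. The function name `minimizer_positions` and the parameters `k` and `w` are illustrative choices, and practical tools usually order k-mers by a hash instead.

```python
def minimizer_positions(seq, k, w):
    """Return the sorted start positions of k-mers chosen as window minimizers.

    Assumptions (not from the paper): k-mers are compared lexicographically,
    and ties within a window go to the leftmost k-mer.
    """
    # All k-mers of the sequence, indexed by their start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Keep the position of the smallest k-mer in this window.
        chosen.add(min(window, key=lambda i: (kmers[i], i)))
    return sorted(chosen)
```

For example, `minimizer_positions("ACGTACGTAC", k=3, w=4)` keeps only a small subset of the eight k-mer start positions, while every window of four consecutive k-mers still contains a sampled position.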

Results

We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.

Availability and implementation

Source code is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Frith MC

PROVIDER: S-EPMC9907223 | biostudies-literature | 2023 Feb

REPOSITORIES: biostudies-literature

Publications

How to optimally sample a sequence for rapid analysis.

Frith MC, Shaw J, Spouge JL

Bioinformatics (Oxford, England), 2023-02-01, issue 2

PMID: 36702468

Similar Datasets

Cognitive ageing is premature among a community sample of optimally treated people living with HIV. | S-EPMC7984032 | biostudies-literature

Rapid direct sequence analysis of the dystrophin gene. | S-EPMC1180355 | biostudies-literature

Power analysis and sample size estimation for sequence-based association studies. | S-EPMC4133582 | biostudies-literature

CRISPResso2 provides accurate and rapid genome editing sequence analysis. | S-EPMC6533916 | biostudies-literature

Sample Sequence Analysis Uncovers Recurrent Horizontal Transfers of Transposable Elements among Grasses. | S-EPMC8382918 | biostudies-literature

A comprehensive performance analysis of sequence-based within-sample testing NIPT methods. | S-EPMC10104307 | biostudies-literature

A hepatitis B virus (HBV) sequence variation graph improves sequence alignment and sample-specific consensus sequence construction for genetic analysis of HBV. | S-EPMC9882026 | biostudies-literature

Sample-to-analysis platform for rapid intracellular mass spectrometry from small numbers of cells. | S-EPMC8721559 | biostudies-literature