Dataset Information

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes.

ABSTRACT:

Motivation

Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount.

Results

Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, which works with both short- and long-read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools, Metagraph and Bifrost, were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets.

Availability and implementation

Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.

SUBMITTER: Alanko JN

PROVIDER: S-EPMC10311346 | biostudies-literature | 2023 Jun

REPOSITORIES: biostudies-literature

Publications

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes.

Alanko JN, Vuohtoniemi J, Mäklin T, Puglisi SJ

Bioinformatics (Oxford, England), 2023 Jun 1; 39(Suppl 1)

PMID: 37387143

Similar Datasets

Robust and scalable inference of population history from hundreds of unphased whole genomes.
| S-EPMC5470542 | biostudies-literature

Assembly of hundreds of novel bacterial genomes from the chicken caecum.
| S-EPMC7014784 | biostudies-literature

Human contamination in bacterial genomes has created thousands of spurious proteins.
| S-EPMC6581058 | biostudies-literature

From hundreds to thousands: Widening the normal human Urinome.
| S-EPMC4459867 | biostudies-literature

Author Correction: Assembly of hundreds of novel bacterial genomes from the chicken caecum.
| S-EPMC7879605 | biostudies-literature

A scalable analytical approach from bacterial genomes to epidemiology.
| S-EPMC9393561 | biostudies-literature

Thousands of missed genes found in bacterial genomes and their analysis with COMBREX.
| S-EPMC3534567 | biostudies-literature

Genotype imputation with thousands of genomes.
| S-EPMC3276165 | biostudies-literature

Scalable colored Janus fabric scheme for dynamic thermal management.
| S-EPMC11471193 | biostudies-literature

Centromere Landscapes Resolved from Hundreds of Human Genomes.
| S-EPMC11652271 | biostudies-literature