Dataset Information

Catwalk: identifying closely related sequences in large microbial sequence databases.


ABSTRACT: There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other viral or bacterial sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software for detecting highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance, and can be accessed via command-line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8% of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
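
The idea of reference-based compression mentioned in the abstract can be illustrated with a minimal Python sketch. This is not Catwalk's implementation (which is written in Nim) and is not taken from the paper; the function names `compress`, `snp_distance` and `neighbours` are hypothetical, and the treatment of unknown bases (positions that are not A, C, G or T are ignored in comparisons) is an assumption made for illustration. Each sequence is stored only as the set of positions where it differs from the reference, and two sequences are compared by looking at where those sets disagree.

```python
from typing import Dict, List, Set, Tuple

BASES = "ACGT"
Compressed = Dict[str, Set[int]]


def compress(reference: str, sequence: str) -> Compressed:
    """Keep only positions that differ from the reference, grouped by observed base.

    Positions holding anything other than A, C, G or T are recorded under "N"
    (assumption: such positions are treated as unknown).
    """
    if len(reference) != len(sequence):
        raise ValueError("sequence must be mapped to the same length as the reference")
    diffs: Compressed = {key: set() for key in BASES + "N"}
    for pos, (ref_base, seq_base) in enumerate(zip(reference, sequence)):
        if seq_base not in BASES:
            diffs["N"].add(pos)
        elif seq_base != ref_base:
            diffs[seq_base].add(pos)
    return diffs


def snp_distance(a: Compressed, b: Compressed) -> int:
    """Count positions where two compressed sequences carry different known bases."""
    differing: Set[int] = set()
    for base in BASES:
        # A position in exactly one of the two sets means the sequences disagree there.
        differing |= a[base] ^ b[base]
    # Assumption: a position unknown in either sequence does not count as a difference.
    differing -= a["N"] | b["N"]
    return len(differing)


def neighbours(query: Compressed, store: Dict[str, Compressed], cutoff: int) -> List[Tuple[str, int]]:
    """Return (name, distance) for every stored sequence within `cutoff` of the query."""
    hits = []
    for name, other in store.items():
        distance = snp_distance(query, other)
        if distance <= cutoff:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

Because most closely related genomes differ from the reference at only a few positions, the stored sets stay small, which is what makes both storage and comparison cheap in this style of approach. A production tool would add early termination once the cut-off is exceeded and a far more compact in-memory representation.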

SUBMITTER: Volk D 

PROVIDER: S-EPMC9455716 | biostudies-literature | 2022 Jun

REPOSITORIES: biostudies-literature

Publications

Catwalk: identifying closely related sequences in large microbial sequence databases.

Denis Volk, Fan Yang-Turner, Xavier Didelot, Derrick W Crook, David Wyllie

Microbial Genomics, 2022-06-01, issue 6


Similar Datasets

| S-EPMC5469481 | biostudies-literature
| S-EPMC549560 | biostudies-literature
| S-EPMC2718670 | biostudies-literature
| S-EPMC419798 | biostudies-literature
| S-EPMC2719781 | biostudies-literature
| S-EPMC5943621 | biostudies-literature
| S-EPMC7821992 | biostudies-literature
| S-EPMC9202424 | biostudies-literature
| S-EPMC3843501 | biostudies-literature
| S-EPMC1762357 | biostudies-literature