Dataset Information

MBGC: Multiple Bacteria Genome Compressor.


ABSTRACT: Genomes within the same species are highly similar, a property exploited by specialized multiple genome compressors. Existing algorithms and tools, however, target large (e.g., mammalian) genomes, and their performance on bacterial strains is rather moderate. In this work, we propose MBGC, a specialized genome compressor that exploits the specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches, and careful management of a reference buffer in a multi-threaded implementation. Our tool is both compression-efficient and fast. On a collection of 168,311 bacterial genomes, totalling 587 GB, we achieve a compression ratio of approximately 1,265 and a compression (respectively decompression) speed of ∼1,580 MB/s (respectively 780 MB/s) using 8 hardware threads, on a computer with a 14-core/28-thread CPU and a fast SSD. The result is almost 3 times more compact, and compression is >6 times faster, than with the next best competitor.
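To illustrate what "direct and reverse-complemented LZ-matches" means, the following is a minimal sketch, not MBGC's actual implementation. It assumes a simple k-mer hash index over a reference sequence and a greedy parse; the function names (`find_matches`, `decode`, and so on), the seed length `k`, and the token layout are hypothetical choices made only for this example. Positions of reverse-complemented matches are given in the coordinates of the reverse-complemented reference.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def longest_match(reference: str, index: dict, query: str, pos: int, k: int):
    """Return (ref_pos, length) of the longest seeded match, or (-1, 0)."""
    seed = query[pos:pos + k]
    best = (-1, 0)
    if len(seed) < k:
        return best
    for r in index.get(seed, ()):
        length = k
        # Extend the seed to the right while both sequences agree.
        while (pos + length < len(query)
               and r + length < len(reference)
               and query[pos + length] == reference[r + length]):
            length += 1
        if length > best[1]:
            best = (r, length)
    return best


def find_matches(reference: str, query: str, k: int = 4) -> list:
    """Greedily parse the query into literals, direct and revcomp matches."""
    ref_rc = reverse_complement(reference)
    fwd_index = build_kmer_index(reference, k)
    rc_index = build_kmer_index(ref_rc, k)
    tokens, pos = [], 0
    while pos < len(query):
        fwd = longest_match(reference, fwd_index, query, pos, k)
        rc = longest_match(ref_rc, rc_index, query, pos, k)
        if fwd[1] == 0 and rc[1] == 0:
            tokens.append(("literal", query[pos]))
            pos += 1
        elif fwd[1] >= rc[1]:
            tokens.append(("direct", fwd[0], fwd[1]))
            pos += fwd[1]
        else:
            # Position refers to the reverse-complemented reference.
            tokens.append(("revcomp", rc[0], rc[1]))
            pos += rc[1]
    return tokens


def decode(reference: str, tokens: list) -> str:
    """Rebuild the query from the reference and the token list."""
    ref_rc = reverse_complement(reference)
    out = []
    for token in tokens:
        if token[0] == "literal":
            out.append(token[1])
        elif token[0] == "direct":
            out.append(reference[token[1]:token[1] + token[2]])
        else:
            out.append(ref_rc[token[1]:token[1] + token[2]])
    return "".join(out)
```

Calling `find_matches(reference, query)` and passing the result to `decode(reference, tokens)` should reproduce `query`. A production compressor would additionally bound the reference buffer, encode the tokens compactly, and parallelize the search, none of which this sketch attempts.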

SUBMITTER: Grabowski S 

PROVIDER: S-EPMC8848312 | biostudies-literature | 2022 Jan

REPOSITORIES: biostudies-literature

Publications

MBGC: Multiple Bacteria Genome Compressor.

Szymon Grabowski, Tomasz M. Kowalski

GigaScience, 2022-01-01

Similar Datasets

| S-EPMC5399254 | biostudies-literature
| S-EPMC5054319 | biostudies-literature
| S-EPMC8388020 | biostudies-literature
| S-EPMC6662292 | biostudies-literature
| S-EPMC7592040 | biostudies-literature
| S-EPMC7660259 | biostudies-literature
| PRJEB42478 | ENA
| S-EPMC9308261 | biostudies-literature
| S-EPMC21091 | biostudies-other
| PRJNA605091 | ENA