
Dataset Information


Optimal compressed representation of high throughput sequence data via light assembly.


ABSTRACT: The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method closely approximates. Our method significantly improves on the compression performance of alternatives without compromising speed.
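
The idea of storing each read as a node that hangs off another read can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes error-free reads, exact suffix-prefix overlaps only, no reverse complements, and a greedy choice of parent among earlier reads. The function names (`overlap_shift`, `encode`, `decode`) and the `min_len` parameter are hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple

# One encoded read: (parent index or None, shift into the parent, new suffix).
Encoded = Tuple[Optional[int], int, str]


def overlap_shift(parent: str, child: str, min_len: int = 3) -> int:
    """Smallest shift s >= 1 such that parent[s:] is a prefix of child.

    Only overlaps of at least min_len bases are accepted. Returns -1 if none.
    """
    for shift in range(1, len(parent) - min_len + 1):
        if child.startswith(parent[shift:]):
            return shift
    return -1


def encode(reads: List[str], min_len: int = 3) -> List[Encoded]:
    """Greedy toy encoder: attach each read to the earlier read it overlaps most."""
    encoded: List[Encoded] = []
    for i, read in enumerate(reads):
        best: Optional[Tuple[int, int]] = None  # (parent index, shift)
        for j in range(i):  # parents are earlier reads, so the result is a forest
            s = overlap_shift(reads[j], read, min_len)
            if s != -1 and (best is None or s < best[1]):
                best = (j, s)
        if best is None:
            encoded.append((None, 0, read))  # root node: stored verbatim
        else:
            j, s = best
            shared = len(reads[j]) - s  # bases already implied by the parent
            encoded.append((j, s, read[shared:]))
    return encoded


def decode(encoded: List[Encoded]) -> List[str]:
    """Rebuild the reads in order; every parent is decoded before its children."""
    reads: List[str] = []
    for parent, shift, suffix in encoded:
        if parent is None:
            reads.append(suffix)
        else:
            reads.append(reads[parent][shift:] + suffix)
    return reads


reads = ["ACGTACGT", "GTACGTTT", "CGTTTAAA"]
encoded = encode(reads)
print(encoded)
print(decode(encoded) == reads)
```

In this toy example only the first read is stored in full; the other two are reduced to a parent pointer, a shift, and a few new bases, which is where the compression comes from. Handling sequencing errors, choosing parents optimally, and entropy-coding the resulting fields are all outside the scope of this sketch.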

SUBMITTER: Ginart AA 

PROVIDER: S-EPMC5805770 | biostudies-literature | 2018 Feb

REPOSITORIES: biostudies-literature


Publications

Optimal compressed representation of high throughput sequence data via light assembly.

Antonio A Ginart, Joseph Hui, Kaiyuan Zhu, Ibrahim Numanagić, Thomas A Courtade, S Cenk Sahinalp, David N Tse

Nature Communications, 2018-02-08, 1



Similar Datasets

S-EPMC3908319 | biostudies-literature
S-EPMC3706340 | biostudies-literature
S-EPMC3728768 | biostudies-literature
S-EPMC2911117 | biostudies-literature
S-EPMC8758712 | biostudies-literature
S-EPMC3290790 | biostudies-literature
S-EPMC7671303 | biostudies-literature
S-EPMC3295828 | biostudies-literature
S-EPMC3494082 | biostudies-literature