Dataset Information

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase.


ABSTRACT:

Motivation

Rapid developments in sequencing technologies have led to the generation of large volumes of sequence data. A primary step in archiving and analyzing these data is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice the k-mer frequency vectors for the large values of k of practical interest consume excessive memory and storage.
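To make the storage concern concrete, the following is a minimal Python sketch of a k-mer frequency profile and of the size of the corresponding dense vector, which has 4^k entries for the DNA alphabet. It is an illustration only, not CRAFT's implementation; the function names `kmer_frequencies` and `dense_vector_bytes` and the choice of 4 bytes per entry are assumptions made for this example.

```python
from collections import Counter

DNA = set("ACGT")


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Return normalized k-mer frequencies of a DNA sequence.

    Windows containing characters other than A, C, G, T are skipped.
    (Hypothetical helper for illustration; not part of CRAFT.)
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= DNA
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def dense_vector_bytes(k: int, bytes_per_entry: int = 4) -> int:
    """Size in bytes of one dense k-mer frequency vector (4**k entries).

    bytes_per_entry=4 assumes 32-bit floats; this is an assumption.
    """
    return 4 ** k * bytes_per_entry


if __name__ == "__main__":
    print(kmer_frequencies("ACGTAC", 2))
    for k in (6, 10, 14):
        print(k, dense_vector_bytes(k))
```

Under these assumptions, a single dense vector for k = 14 already occupies 4^14 × 4 bytes = 2^30 bytes (1 GiB), which is the growth in memory that motivates a more compact representation.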

Results

We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best-matching genome in massive archived sequence repositories. With a 10^2- to 10^4-fold reduction in storage space, CRAFT queries gigabytes of data within seconds or minutes, achieving performance comparable to that of six state-of-the-art alignment-free measures.

Availability and implementation

CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, and is freely available at https://github.com/jiaxingbai/CRAFT.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Lu YY 

PROVIDER: S-EPMC9431648 | biostudies-literature | 2021 Apr

REPOSITORIES: biostudies-literature

Publications

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase.

Yang Young Lu, Jiaxing Bai, Yiwen Wang, Ying Wang, Fengzhu Sun

Bioinformatics (Oxford, England), 2021-04-01, issue 2


Similar Datasets

| S-EPMC2770639 | biostudies-literature
2012-08-29 | GSE40429 | GEO
2012-08-29 | E-GEOD-40429 | biostudies-arrayexpress
| S-EPMC3491420 | biostudies-literature
| S-EPMC4015147 | biostudies-literature
| S-EPMC4271471 | biostudies-literature
| S-EPMC8794728 | biostudies-literature
| S-EPMC7355300 | biostudies-literature
2005-06-01 | GSE2409 | GEO
| S-EPMC7056612 | biostudies-literature