Genomics

Dataset Information

A systematic evaluation of pattern discovery algorithms


ABSTRACT: Pattern discovery algorithms, which identify recurrent, non-random motifs, are widely used in the analysis of biological sequences. Many algorithms exist, but few comparisons have been made among them. We systematically profiled eight representative methods at multiple parameter settings across 174 diverse experimental datasets, including ten novel ChIP-on-chip datasets. We executed 16,777 pattern discovery analyses to assess prediction accuracy, CPU usage, and memory consumption. For 144 datasets we developed a gold standard using machine-learning algorithms; cross-validation was used for the remaining datasets. Performance was highly disparate, with median accuracy ranging from 32% to 96%. Importantly, we were unable to replicate previously reported algorithm rankings, emphasizing the need to use many and diverse experimental datasets. We found that deterministic algorithms such as Projection and Oligo/Dyad had the highest prediction accuracy. Computational efficiency was not linearly related to dataset size and became critical on large datasets, where some algorithms were intractably slow. This work provides the first combined assessment of the CPU usage, memory consumption, and prediction accuracy of pattern discovery algorithms on real experimental datasets.

ORGANISM(S): Homo sapiens

PROVIDER: GSE15370 | GEO | 2009/11/24

SECONDARY ACCESSION(S): PRJNA117101

REPOSITORIES: GEO

Similar Datasets

2010-05-19 | E-GEOD-15370 | biostudies-arrayexpress
| PRJNA117101 | ENA
2022-01-20 | MTBLS587 | MetaboLights
2014-05-15 | E-GEOD-41726 | biostudies-arrayexpress
2022-08-14 | GSE184943 | GEO
2014-05-15 | GSE41726 | GEO
2022-10-01 | GSE200096 | GEO
2016-01-12 | PXD003317 | Pride
2021-08-27 | GSE149438 | GEO
2017-02-21 | GSE93315 | GEO