Dataset Information

Parallel clustering algorithm for large-scale biological data sets.

ABSTRACT: BACKGROUND: The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a serious bottleneck on large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between data pairs, a similarity matrix must be built before it runs, and constructing that matrix is itself time-consuming. METHODS: Two types of parallel architectures are proposed in this paper to accelerate similarity matrix construction and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. An appropriate scheme for data partition and reduction is designed to minimize the global communication cost among processes. RESULTS: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm can handle large-scale data sets effectively. The parallel affinity propagation algorithm also performs well when clustering large-scale gene (microarray) data and detecting families in large protein superfamilies.
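
The two stages named in the abstract (building the similarity matrix, then running affinity propagation on it) can be illustrated with a small serial sketch. This is not the authors' parallel implementation, which this record does not include. It assumes the commonly used negative squared Euclidean distance as the similarity and the median similarity as the preference, and the function names `similarity_matrix` and `affinity_propagation` are hypothetical. The update rules are the standard responsibility and availability messages of affinity propagation.

```python
import numpy as np

def similarity_matrix(X):
    """Assumed similarity: negative squared Euclidean distance between rows.
    Each row of the result depends only on X, so rows can be computed
    independently of one another."""
    sq = np.sum(X * X, axis=1)
    S = -(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T))
    return np.minimum(S, 0.0)  # clamp tiny positive rounding errors

def affinity_propagation(S, damping=0.5, max_iter=200):
    """Serial affinity propagation on a dense n x n similarity matrix."""
    n = S.shape[0]
    S = S.copy()
    # Assumed preference: median of the off-diagonal similarities.
    np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = np.max(AS, axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1.0 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, np.diag(R))
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0.0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1.0 - damping) * A_new
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Assign every point to its most similar exemplar.
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars  # an exemplar always labels itself
    return exemplars, labels

# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.3], [-0.2, 0.1],
              [10.0, 10.0], [10.3, 9.9], [9.8, 10.2]])
S = similarity_matrix(X)
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

Both the similarity matrix and the two message matrices are dense n x n arrays, which is the quadratic memory cost the abstract describes as the bottleneck for large data sets.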

SUBMITTER: Wang M

PROVIDER: S-EPMC3976248 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

Parallel clustering algorithm for large-scale biological data sets.

Wang Minchao, Zhang Wu, Ding Wang, Dai Dongbo, Zhang Huiran, Xie Hao, Chen Luonan, Guo Yike, Xie Jiang

PLoS One, 2014-04-04, issue 4

PMID: 24705246

Similar Datasets

SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. | S-EPMC10164572 | biostudies-literature

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. | S-EPMC5888241 | biostudies-literature

Algorithm for large-scale clustering across multiple genomes. | S-EPMC3218420 | biostudies-literature

SPICi: a fast clustering algorithm for large biological networks. | S-EPMC2853685 | biostudies-literature

Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. | S-EPMC4426842 | biostudies-literature

MCbiclust: a novel algorithm to discover large-scale functionally related gene sets from massive transcriptomics data collections. | S-EPMC5587796 | biostudies-literature

V3D enables real-time 3D visualization and quantitative analysis of large-scale biological image data sets. | S-EPMC2857929 | biostudies-literature

A parallel computational framework for ultra-large-scale sequence clustering analysis. | S-EPMC6931356 | biostudies-literature

CLAG: an unsupervised non hierarchical clustering algorithm handling biological data. | S-EPMC3519615 | biostudies-literature

Large-scale clustering of CAGE tag expression data. | S-EPMC1890301 | biostudies-literature