Dataset Information

Clustering biological sequences with dynamic sequence similarity threshold.


ABSTRACT:

Background

Biological sequence clustering is a difficult data clustering problem: computing pairwise sequence distances through sequence alignment is computationally expensive, and it is hard to choose parameters that yield robust clusters. Current approaches succeed in reducing the number of sequence alignments performed, but they generate clusters from a single sequence identity threshold applied to every cluster. A poorly chosen identity threshold therefore leads to low-quality clusters. However, users receive little support in selecting thresholds that are well matched to the input sequences.

Results

We present a novel sequence clustering approach called ALFATClust that combines rapid pairwise alignment-free sequence distance calculations with graph community detection to generate clusters. Instead of applying a single threshold to every generated cluster, ALFATClust dynamically determines the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches on the benchmark datasets by simultaneously maintaining cluster robustness and substantial cluster separation. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained.
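
The general idea of per-cluster thresholds can be illustrated with a minimal sketch. This is not ALFATClust's implementation: it assumes k-mer Jaccard similarity as the alignment-free measure, uses connected components as a simple stand-in for graph community detection, and introduces a hypothetical threshold ladder (`thresholds`) and acceptance rule (`min_intra`) purely for illustration.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Alignment-free similarity: shared k-mers divided by all k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def components(n, edges):
    """Connected components via union-find.

    Assumption: a stand-in for a real community detection algorithm.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def cluster(seqs, thresholds=(0.2, 0.5, 0.8), min_intra=0.5, k=3):
    """Return (members, threshold) pairs, one threshold per cluster.

    Hypothetical rule: a cluster is accepted once its lowest pairwise
    similarity reaches min_intra; otherwise it is re-split at the next
    (stricter) threshold in the ladder.
    """
    profiles = [kmers(s, k) for s in seqs]
    sim = {(i, j): jaccard(profiles[i], profiles[j])
           for i, j in combinations(range(len(seqs)), 2)}

    def refine(members, level):
        t = thresholds[level]
        # Keep only edges at least as similar as the current threshold.
        edges = [(a, b) for a, b in combinations(range(len(members)), 2)
                 if sim[(members[a], members[b])] >= t]
        result = []
        for comp in components(len(members), edges):
            group = [members[i] for i in comp]
            worst = min((sim[p] for p in combinations(group, 2)), default=1.0)
            if worst >= min_intra or level + 1 == len(thresholds):
                result.append((group, t))
            else:
                result.extend(refine(group, level + 1))
        return result

    return refine(list(range(len(seqs))), 0)


example = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "TTTTGACG"]
for members, threshold in cluster(example):
    print(threshold, [example[i] for i in members])
```

In this toy input, a tight group is accepted at the loosest threshold, while a group containing a weakly linked sequence is re-split at a stricter one, so different clusters end up with different cut-offs.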

Conclusions

ALFATClust generates sequence clusters with high intra-cluster sequence similarity and substantial separation between clusters, without requiring users to choose precise similarity cut-off thresholds.

SUBMITTER: Chiu JKH 

PROVIDER: S-EPMC8969259 | biostudies-literature | 2022 Mar

REPOSITORIES: biostudies-literature

Publications

Clustering biological sequences with dynamic sequence similarity threshold.

Chiu Jimmy Ka Ho (JKH), Ong Rick Twee-Hee (RT)

BMC Bioinformatics, 2022-03-30, 1


Similar Datasets

| S-EPMC1261163 | biostudies-literature
| S-EPMC3110597 | biostudies-literature
| S-EPMC6554434 | biostudies-literature
| S-EPMC6705769 | biostudies-literature
| S-EPMC1976428 | biostudies-literature
| S-EPMC5524321 | biostudies-literature
| S-EPMC10313010 | biostudies-literature
| S-EPMC6403383 | biostudies-literature
| S-EPMC11001989 | biostudies-literature
| S-EPMC2901495 | biostudies-literature