Dataset Information

Multiple alignment-free sequence comparison.


ABSTRACT:

Motivation

Recently, a range of new statistics has become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, $C_1^{*}$ and $C_1^{S}$, which extend pairwise comparison statistics to the joint k-tuple content of all the sequences, and second, $C_2^{*}$, $C_2^{S}$ and $C_2^{geo}$, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences.
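
To make the idea of a word-count comparison concrete, here is a minimal Python sketch of the "average of pairwise statistics" pattern. It is an illustration under stated assumptions, not the statistics defined in the paper: it uses the plain, uncentred D2 inner product of k-tuple counts as a stand-in for the pairwise statistic, whereas the star-type and Shepp-type statistics additionally centre and normalise the counts. The function names `ktuple_counts`, `d2` and `average_pairwise_d2` are hypothetical and are not taken from the 'multiAlignFree' package.

```python
from collections import Counter
from itertools import combinations


def ktuple_counts(seq, k):
    """Count overlapping k-tuples (words of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(counts_a, counts_b):
    """Plain D2: sum over words of the product of the two word counts.

    Assumption: this uncentred inner product stands in for the centred,
    normalised pairwise statistics that the abstract refers to.
    """
    return sum(c * counts_b[w] for w, c in counts_a.items() if w in counts_b)


def average_pairwise_d2(seqs, k):
    """Average the pairwise D2 value over all unordered pairs of sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    counts = [ktuple_counts(s, k) for s in seqs]
    pairs = list(combinations(counts, 2))
    return sum(d2(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy input; real applications would use much longer sequences.
    print(average_pairwise_d2(["ACGACG", "ACGT", "TTTT"], k=2))
```

Counting words first and comparing count vectors afterwards keeps the cost of each pairwise comparison proportional to the number of distinct words rather than to the product of the sequence lengths.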

Results

Our investigation uses both simulated data and cis-regulatory module data, where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that, although all of our statistics perform similarly on real data, on simulated data the Shepp-type statistics are in some instances outperformed by the star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics.

Availability

Our implementation of the five statistics is available as an R package named 'multiAlignFree' at http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html.

Contact

reinert@stats.ox.ac.uk.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Ren J 

PROVIDER: S-EPMC3799466 | biostudies-literature | 2013 Nov

REPOSITORIES: biostudies-literature

Publications

Multiple alignment-free sequence comparison.

Jie Ren, Kai Song, Fengzhu Sun, Minghua Deng, Gesine Reinert

Bioinformatics (Oxford, England) 20130829 21

Similar Datasets

| S-EPMC6659240 | biostudies-literature
| S-EPMC3123933 | biostudies-literature
| S-EPMC5627421 | biostudies-literature
| S-EPMC2818754 | biostudies-literature
| S-EPMC3704055 | biostudies-literature
| S-EPMC4080745 | biostudies-literature
| S-EPMC3581251 | biostudies-literature
| S-EPMC6937637 | biostudies-literature
| S-EPMC3133551 | biostudies-literature
| S-EPMC3375188 | biostudies-literature