Dataset Information

Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling.


ABSTRACT: Computing the distance between two genomes without alignments, or even without access to assemblies, supports many downstream analyses. However, alignment-free methods, including those in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many of them k-mer-based) exist for assembly-free distance calculation, measuring the uncertainty of the estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (without replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances obtained with our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across the alignment-free and alignment-based phylogenies reported in the literature.
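
The subsampling idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper or its software: the function names, the use of a plain Jaccard distance over k-mer sets as the distance estimator, and the `frac` parameter are all hypothetical. In particular, the `rescale` step is the generic textbook subsampling adjustment (shrinking deviations by the square root of the subsample fraction), which stands in for the correction step mentioned in the abstract; the abstract does not specify the authors' correction.

```python
import math
import random


def kmer_set(reads, k):
    """Collect the distinct k-mers observed across a list of reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def jaccard_distance(reads_a, reads_b, k):
    """Toy assembly-free distance: 1 minus the Jaccard index of the k-mer sets."""
    a, b = kmer_set(reads_a, k), kmer_set(reads_b, k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def subsample_distances(reads_a, reads_b, k, frac, n_rep, seed=0):
    """Recompute the distance on n_rep subsamples drawn WITHOUT replacement.

    Each replicate keeps a fraction `frac` of the reads of each genome
    (both read lists are assumed to be non-empty).
    """
    rng = random.Random(seed)
    size_a = max(1, int(frac * len(reads_a)))
    size_b = max(1, int(frac * len(reads_b)))
    estimates = []
    for _ in range(n_rep):
        sub_a = rng.sample(reads_a, size_a)  # sample() never repeats an item
        sub_b = rng.sample(reads_b, size_b)
        estimates.append(jaccard_distance(sub_a, sub_b, k))
    return estimates


def rescale(sub_estimates, full_estimate, frac):
    """Generic variance adjustment (an assumption, not the paper's step).

    Subsample estimates vary more than the full-data estimate, so their
    deviations from the subsample mean are shrunk by sqrt(frac) and then
    re-centred on the full-data estimate.
    """
    mean_sub = sum(sub_estimates) / len(sub_estimates)
    shrink = math.sqrt(frac)
    return [full_estimate + shrink * (d - mean_sub) for d in sub_estimates]
```

A caller would compute the full-data distance once with `jaccard_distance`, draw replicates with `subsample_distances`, and pass both to `rescale` to obtain an approximate distribution of the distance. Unlike a bootstrap, no read is duplicated within a replicate, so k-mer frequency profiles are not distorted by repeated sampling of the same read.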

SUBMITTER: Rachtman E 

PROVIDER: S-EPMC9589918 | biostudies-literature | 2022 Oct

REPOSITORIES: biostudies-literature

Publications

Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling.

Eleonora Rachtman, Shahab Sarmashghi, Vineet Bafna, Siavash Mirarab

Cell Systems. 2022-10-01; 10


Similar Datasets

S-EPMC9383262 | biostudies-literature
S-EPMC6582606 | biostudies-literature
S-EPMC4382342 | biostudies-literature
S-EPMC4521294 | biostudies-literature
S-EPMC2804295 | biostudies-literature
GSE93879 | GEO | 2017-01-24
S-EPMC6961477 | biostudies-literature
S-EPMC6741136 | biostudies-literature
S-EPMC2048840 | biostudies-literature
PRJEB56667 | ENA