Dataset Information

An improved framework for detecting discrete epidemiologically meaningful partitions in hierarchically clustered genetic data.


ABSTRACT:

Motivation

Hierarchical clustering of microbial genotypes has a limitation: the clusters are nested, with smaller groups of related isolates sitting within larger groups that grow progressively larger as relationships become more distant. In an epidemiologic context, investigators must dissect hierarchical trees into discrete groupings that are epidemiologically meaningful. We recently described a statistical framework (Method A) for dissecting hierarchical trees that attempts to minimize investigator bias. Here, we apply a modified version of that framework (Method B) to a hierarchical tree constructed from 2111 genotypes of the foodborne parasite Cyclospora, including 639 genotypes linked to epidemiologically defined outbreaks. To evaluate Method B's performance, we examined the concordance between these epidemiologically defined groupings and the genetic partitions identified. We used the same epidemiologic clusters to evaluate Method A, two tree-dissection methods (cutreeHybrid and cutreeDynamic) available within the Dynamic Tree Cut R package, the TreeCluster method, and PARNAS.
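To illustrate why the choice of cut point matters, the following is a minimal sketch of dissecting a tree at a fixed distance threshold. It is not Method A, Method B, or any of the tools named above. It assumes single-linkage clustering, for which cutting the tree at height t is equivalent to grouping genotypes connected by chains of pairwise distances no greater than t. The function name `cut_single_linkage` and the distance values in `toy_dist` are hypothetical.

```python
from itertools import combinations


def cut_single_linkage(dist, threshold):
    """Return one cluster label per item for a single-linkage tree cut at `threshold`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Items joined by a chain of distances <= threshold share a label.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest, one node per genotype

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Hypothetical pairwise distances among five genotypes (toy values only).
toy_dist = [
    [0.00, 0.02, 0.03, 0.30, 0.32],
    [0.02, 0.00, 0.04, 0.31, 0.33],
    [0.03, 0.04, 0.00, 0.29, 0.30],
    [0.30, 0.31, 0.29, 0.00, 0.05],
    [0.32, 0.33, 0.30, 0.05, 0.00],
]

tight = cut_single_linkage(toy_dist, 0.025)  # [0, 0, 1, 2, 3]
loose = cut_single_linkage(toy_dist, 0.10)   # [0, 0, 0, 1, 1]
print(tight, loose)
```

On these toy values the tighter threshold splits the five genotypes into four groups while the looser one merges them into two, which is the nesting behavior that makes an objective choice of threshold necessary.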

Results

Method B, TreeCluster, and PARNAS were the most accurate of the methods compared (99.4%) in identifying genetic groups that reflected the epidemiologic groupings; TreeCluster and PARNAS performed identically on our dataset. CutreeHybrid identified groups reflecting patterns in the wider Cyclospora population structure but lacked finer, strain-level discrimination (Simpson's D = 0.785). CutreeDynamic displayed good strain discrimination (Simpson's D = 0.933) but lacked sensitivity (77%). At two different threshold/radius settings, TreeCluster/PARNAS displayed similar utility to Method B. However, Method B computes a tree-dissection threshold automatically, whereas the threshold/radius settings used when executing TreeCluster/PARNAS here were computed using Method B. With a TreeCluster threshold of 0.045, as recommended in the TreeCluster documentation, epidemiologic utility dropped markedly below that of Method B.
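For readers unfamiliar with Simpson's D as a measure of strain discrimination, a minimal sketch of how it can be computed from cluster assignments follows. It assumes the form commonly used for typing methods, D = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where n_j is the size of cluster j and N is the total number of genotypes; the abstract does not state which form the authors used. The function name `simpsons_d` and the toy labels are hypothetical.

```python
from collections import Counter


def simpsons_d(labels):
    """Simpson's index of diversity for a list of cluster labels.

    Interpreted as the probability that two items drawn at random
    without replacement fall into different clusters.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("at least two items are required")
    same_cluster_pairs = sum(c * (c - 1) for c in Counter(labels).values())
    return 1.0 - same_cluster_pairs / (n * (n - 1))


# Toy partitions of five genotypes (hypothetical labels).
coarse = simpsons_d([0, 0, 0, 1, 1])  # two large groups
fine = simpsons_d([0, 0, 1, 2, 3])    # mostly singletons
print(round(coarse, 3), round(fine, 3))
```

Under this assumed definition, a partition with many small groups scores closer to 1 than a partition with a few large groups, consistent with the higher value reported for cutreeDynamic than for cutreeHybrid.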

Availability and implementation

Relevant code and data are publicly available. Source code (Method B) and instructions for its use are available here: https://github.com/Joel-Barratt/Hierarchical-tree-dissection-framework.

SUBMITTER: Jacobson DK 

PROVIDER: S-EPMC10517639 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature

Publications

An improved framework for detecting discrete epidemiologically meaningful partitions in hierarchically clustered genetic data.

Jacobson DK, Low R, Plucinski MM, Barratt JLN

Bioinformatics Advances, 2023-09-01, 1


Similar Datasets

| S-EPMC10165878 | biostudies-literature
| S-EPMC10695330 | biostudies-literature
| S-EPMC2975584 | biostudies-literature
| S-EPMC88014 | biostudies-literature
| S-EPMC4701012 | biostudies-literature
| S-EPMC10248541 | biostudies-literature
| S-EPMC1636450 | biostudies-literature
| S-EPMC9534885 | biostudies-literature
| S-EPMC4587485 | biostudies-other
| S-EPMC5869657 | biostudies-literature