A program to compute the soft Robinson-Foulds distance between phylogenetic networks.
ABSTRACT: Over the past two decades, phylogenetic networks have been studied to model reticulate evolutionary events. The relationships among phylogenetic networks, phylogenetic trees and clusters serve as the basis for reconstruction and comparison of phylogenetic networks. To understand these relationships, two problems are raised: the tree containment problem, which asks whether a phylogenetic tree is displayed in a phylogenetic network, and the cluster containment problem, which asks whether a cluster is represented at a node in a phylogenetic network. Both the problems are NP-complete.A fast exponential-time algorithm for the cluster containment problem on arbitrary networks is developed and implemented in C. The resulting program is further extended into a computer program for fast computation of the Soft Robinson-Foulds distance between phylogenetic networks.Two computer programs are developed for facilitating reconstruction and validation of phylogenetic network models in evolutionary and comparative genomics. Our simulation tests indicated that they are fast enough for use in practice. Additionally, the distribution of the Soft Robinson-Foulds distance between phylogenetic networks is demonstrated to be unlikely normal by our simulation data.
Project description:When a phylogenetic reconstruction does not result in one tree but in several, tree metrics permit finding out how far the reconstructed trees are from one another. They also permit to assess the accuracy of a reconstruction if a true tree is known. TreeCmp implements eight metrics that can be calculated in polynomial time for arbitrary (not only bifurcating) trees: four for unrooted (Matching Split metric, which we have recently proposed, Robinson-Foulds, Path Difference, Quartet) and four for rooted trees (Matching Cluster, Robinson-Foulds cluster, Nodal Splitted and Triple). TreeCmp is the first implementation of Matching Split/Cluster metrics and the first efficient and convenient implementation of Nodal Splitted. It allows to compare relatively large trees. We provide an example of the application of TreeCmp to compare the accuracy of ten approaches to phylogenetic reconstruction with trees up to 5000 external nodes, using a measure of accuracy based on normalized similarity between trees.
Project description:HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson–Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.
Project description:The estimation of phylogenetic trees is a major part of many biological dataset analyses, but maximum likelihood approaches are NP-hard and Bayesian MCMC methods do not scale well to even moderate-sized datasets. Supertree methods, which are used to construct trees from trees computed on subsets, are critically important tools for enabling the statistical estimation of phylogenies for large and potentially heterogeneous datasets. Supertree estimation is itself NP-hard, and no current supertree method has sufficient accuracy and scalability to provide good accuracy on the large datasets that supertree methods were designed for, containing thousands of species and many subset trees.We present FastRFS, a new method based on a dynamic programming method we have developed to find an exact solution to the Robinson-Foulds Supertree problem within a constrained search space. FastRFS has excellent accuracy in terms of criterion scores and topological accuracy of the resultant trees, substantially improving on competing methods on a large collection of biological and simulated data. In addition, FastRFS is extremely fast, finishing in minutes on even very large datasets, and in under an hour on a biological dataset with 2228 species.FastRFS is available on github at https://github.com/pranjalv123/FastRFS.firstname.lastname@example.org.Supplementary data are available at Bioinformatics online.
Project description:Supertree methods reconcile a set of phylogenetic trees into a single structure that is often interpreted as a branching history of species. A key challenge is combining conflicting evolutionary histories that are due to artifacts of phylogenetic reconstruction and phenomena such as lateral gene transfer (LGT). Many supertree approaches use optimality criteria that do not reflect underlying processes, have known biases, and may be unduly influenced by LGT. We present the first method to construct supertrees by using the subtree prune-and-regraft (SPR) distance as an optimality criterion. Although calculating the rooted SPR distance between a pair of trees is NP-hard, our new maximum agreement forest-based methods can reconcile trees with hundreds of taxa and>50 transfers in fractions of a second, which enables repeated calculations during the course of an iterative search. Our approach can accommodate trees in which uncertain relationships have been collapsed to multifurcating nodes. Using a series of benchmark datasets simulated under plausible rates of LGT, we show that SPR supertrees are more similar to correct species histories than supertrees based on parsimony or Robinson-Foulds distance criteria. We successfully constructed an SPR supertree from a phylogenomic dataset of 40,631 gene trees that covered 244 genomes representing several major bacterial phyla. Our SPR-based approach also allowed direct inference of highways of gene transfer between bacterial classes and genera. A Small number of these highways connect genera in different phyla and can highlight specific genes implicated in long-distance LGT. [Lateral gene transfer; matrix representation with parsimony; phylogenomics; prokaryotic phylogeny; Robinson-Foulds; subtree prune-and-regraft; supertrees.].
Project description:Ability to quantify dissimilarity of different phylogenetic trees describing the relationship between the same group of taxa is required in various types of phylogenetic studies. For example, such metrics are used to assess the quality of phylogeny construction methods, to define optimization criteria in supertree building algorithms, or to find horizontal gene transfer (HGT) events. Among the set of metrics described so far in the literature, the most commonly used seems to be the Robinson-Foulds distance. In this article, we define a new metric for rooted trees-the Matching Pair (MP) distance. The MP metric uses the concept of the minimum-weight perfect matching in a complete bipartite graph constructed from partitions of all pairs of leaves of the compared phylogenetic trees. We analyze the properties of the MP metric and present computational experiments showing its potential applicability in tasks related to finding the HGT events.
Project description:BACKGROUND: Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation. RESULTS: We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem.We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees. CONCLUSIONS: Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets.
Project description:Sequencing is important for understanding the molecular epidemiology and viral evolution of hepatitis C virus (HCV) infection. To date, there is little standardisation among sequencing protocols, in-part due to the high genetic diversity that is observed within HCV. This study aimed to develop a novel, practical sequencing protocol that covered both conserved and variable regions of the viral genome and assess the influence of each subregion, sequence concatenation and unrelated reference sequences on phylogenetic clustering analysis. The Core to the hypervariable region 1 (HVR1) of envelope-2 (E2) and non-structural-5B (NS5B) regions of the HCV genome were amplified and sequenced from participants from the Australian Trial in Acute Hepatitis C (ATAHC), a prospective study of the natural history and treatment of recent HCV infection. Phylogenetic trees were constructed using a general time-reversible substitution model and sensitivity analyses were completed for every subregion. Pairwise distance, genetic distance and bootstrap support were computed to assess the impact of HCV region on clustering results as measured by the identification and percentage of participants falling within all clusters, cluster size, average patristic distance, and bootstrap value. The Robinson-Foulds metrics was also used to compare phylogenetic trees among the different HCV regions. Our results demonstrated that the genomic region of HCV analysed influenced phylogenetic tree topology and clustering results. The HCV Core region alone was not suitable for clustering analysis; NS5B concatenation, the inclusion of reference sequences and removal of HVR1 all influenced clustering outcome. The Core-E2 region, which represented the highest genetic diversity and longest sequence length in this study, provides an ideal method for clustering analysis to address a range of molecular epidemiological questions.
Project description:BACKGROUND: Phylogenetic tree comparison metrics are an important tool in the study of evolution, and hence the definition of such metrics is an interesting problem in phylogenetics. In a paper in Taxon fifty years ago, Sokal and Rohlf proposed to measure quantitatively the difference between a pair of phylogenetic trees by first encoding them by means of their half-matrices of cophenetic values, and then comparing these matrices. This idea has been used several times since then to define dissimilarity measures between phylogenetic trees but, to our knowledge, no proper metric on weighted phylogenetic trees with nested taxa based on this idea has been formally defined and studied yet. Actually, the cophenetic values of pairs of different taxa alone are not enough to single out phylogenetic trees with weighted arcs or nested taxa. RESULTS: For every (rooted) phylogenetic tree T, let its cophenetic vector?(T) consist of all pairs of cophenetic values between pairs of taxa in T and all depths of taxa in T. It turns out that these cophenetic vectors single out weighted phylogenetic trees with nested taxa. We then define a family of cophenetic metrics d?,p by comparing these cophenetic vectors by means of Lp norms, and we study, either analytically or numerically, some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics. CONCLUSIONS: The cophenetic metrics can be safely used on weighted phylogenetic trees with nested taxa and no restriction on degrees, and they can be computed in O(n2) time, where n stands for the number of taxa. The metrics d?,1 and d?,2 have positive skewed distributions, and they show a low rank correlation with the Robinson-Foulds metric and the nodal metrics, and a very high correlation with each other and with the splitted nodal metrics. The diameter of d?,p, for p?1 , is in O(n(p+2)/p), and thus for low p they are more discriminative, having a wider range of values.
Project description:BACKGROUND: A goal of systems biology is to analyze large-scale molecular networks including gene expressions and protein-protein interactions, revealing the relationships between network structures and their biological functions. Dividing a protein-protein interaction (PPI) network into naturally grouped parts is an essential way to investigate the relationship between topology of networks and their functions. However, clear modular decomposition is often hard due to the heterogeneous or scale-free properties of PPI networks. METHODOLOGY/PRINCIPAL FINDINGS: To address this problem, we propose a diffusion model-based spectral clustering algorithm, which analytically solves the cluster structure of PPI networks as a problem of random walks in the diffusion process in them. To cope with the heterogeneity of the networks, the power factor is introduced to adjust the diffusion matrix by weighting the transition (adjacency) matrix according to a node degree matrix. This algorithm is named adjustable diffusion matrix-based spectral clustering (ADMSC). To demonstrate the feasibility of ADMSC, we apply it to decomposition of a yeast PPI network, identifying biologically significant clusters with approximately equal size. Compared with other established algorithms, ADMSC facilitates clear and fast decomposition of PPI networks. CONCLUSIONS/SIGNIFICANCE: ADMSC is proposed by introducing the power factor that adjusts the diffusion matrix to the heterogeneity of the PPI networks. ADMSC effectively partitions PPI networks into biologically significant clusters with almost equal sizes, while being very fast, robust and appealing simple.
Project description:Tree reconstruction methods are often judged by their accuracy, measured by how close they get to the true tree. Yet, most reconstruction methods like maximum likelihood (ML) do not explicitly maximize this accuracy. To address this problem, we propose a Bayesian solution. Given tree samples, we propose finding the tree estimate that is closest on average to the samples. This "median" tree is known as the Bayes estimator (BE). The BE literally maximizes posterior expected accuracy, measured in terms of closeness (distance) to the true tree. We discuss a unified framework of BE trees, focusing especially on tree distances that are expressible as squared euclidean distances. Notable examples include Robinson-Foulds (RF) distance, quartet distance, and squared path difference. Using both simulated and real data, we show that BEs can be estimated in practice by hill-climbing. In our simulation, we find that BEs tend to be closer to the true tree, compared with ML and neighbor joining. In particular, the BE under squared path difference tends to perform well in terms of both path difference and RF distances.