Bacterial phylogenetic reconstruction from whole genomes is robust to recombination but demographic inference is not.
ABSTRACT: Phylogenetic inference in bacterial genomics is fundamental to understanding problems such as population history, antimicrobial resistance, and transmission dynamics. The field has been plagued by an apparent state of contradiction since the distorting effects of recombination on phylogeny were discovered more than a decade ago. Researchers persist with detailed phylogenetic analyses while simultaneously acknowledging that recombination seriously misleads inference of population dynamics and selection. Here we resolve this paradox by showing that phylogenetic tree topologies based on whole genomes robustly reconstruct the clonal frame topology but that branch lengths are badly skewed. Surprisingly, removing recombining sites can exacerbate branch length distortion caused by recombination.Phylogenetic tree reconstruction is a popular approach for understanding the relatedness of bacteria in a population from differences in their genome sequences. However, bacteria frequently exchange regions of their genomes by a process called homologous recombination, which violates a fundamental assumption of phylogenetic methods. Since many researchers continue to use phylogenetics for recombining bacteria, it is important to understand how recombination affects the conclusions drawn from these analyses. We find that whole-genome sequences afford great accuracy in reconstructing evolutionary relationships despite concerns surrounding the presence of recombination, but the branch lengths of the phylogenetic tree are indeed badly distorted. Surprisingly, methods to reduce the impact of recombination on branch lengths can exacerbate the problem.
Project description:Although an increasing number of phylogenetic data sets are incomplete, the effect of ambiguous data on phylogenetic accuracy is not well understood. We use 4-taxon simulations to study the effects of ambiguous data (i.e., missing characters or gaps) in maximum likelihood (ML) and Bayesian frameworks. By introducing ambiguous data in a way that removes confounding factors, we provide the first clear understanding of 1 mechanism by which ambiguous data can mislead phylogenetic analyses. We find that in both ML and Bayesian frameworks, among-site rate variation can interact with ambiguous data to produce misleading estimates of topology and branch lengths. Furthermore, within a Bayesian framework, priors on branch lengths and rate heterogeneity parameters can exacerbate the effects of ambiguous data, resulting in strongly misleading bipartition posterior probabilities. The magnitude and direction of the ambiguous data bias are a function of the number and taxonomic distribution of ambiguous characters, the strength of topological support, and whether or not the model is correctly specified. The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.
Project description:We present a series of simulation studies that explore the relative performance of several phylogenetic network approaches (statistical parsimony, split decomposition, union of maximum parsimony trees, neighbor-net, simulated history recombination upper bound, median-joining, reduced median joining and minimum spanning network) compared to standard tree approaches, (neighbor-joining and maximum parsimony) in the presence and absence of recombination.In the absence of recombination, all methods recovered the correct topology and branch lengths nearly all of the time when the substitution rate was low, except for minimum spanning networks, which did considerably worse. At a higher substitution rate, maximum parsimony and union of maximum parsimony trees were the most accurate. With recombination, the ability to infer the correct topology was halved for all methods and no method could accurately estimate branch lengths.Our results highlight the need for more accurate phylogenetic network methods and the importance of detecting and accounting for recombination in phylogenetic studies. Furthermore, we provide useful information for choosing a network algorithm and a framework in which to evaluate improvements to existing methods and novel algorithms developed in the future.
Project description:Evolutionary relationships among species are often assumed to be fundamentally unambiguous, where genes within a genome are thought to evolve in concert and phylogenetic incongruence between individual orthologs is attributed to idiosyncrasies in their evolution. We have identified substantial incongruence between the phylogenies of orthologous genes in Escherichia, Salmonella, and Citrobacter, or E. coli, E. fergusonii, and E. albertii. The source of incongruence was inferred to be recombination, because individual genes support conflicting topology more robustly than expected from stochastic sequence homoplasies. Clustering of phylogenetically informative sites on the genome indicated that the regions of recombination extended over several kilobases. Analysis of phylogenetically distant taxa resulted in consensus among individual gene phylogenies, suggesting that recombination is not ongoing; instead, conflicting relationships among genes in descendent taxa reflect recombination among their ancestors. Incongruence could have resulted from random assortment of ancestral polymorphisms if species were instantly created from the division of a recombining population. However, the estimated branch lengths in alternative phylogenies would require ancestral populations with far more diversity than is found in extant populations. Rather, these and previous data collectively suggest that genome-wide recombination rates decreased gradually, with variation in rate among loci, leading to pluralistic relationships among their descendent taxa.
Project description:The hypothesis of wide spread reticulate evolution in Tick-Borne Encephalitis virus (TBEV) has recently gained momentum with several publications describing past recombination events involving various TBEV clades. Despite a large body of work, no consensus has yet emerged on TBEV evolutionary dynamics. Understanding the occurrence and frequency of recombination in TBEV bears significant impact on epidemiology, evolution, and vaccination with live vaccines. In this study, we investigated the possibility of detecting recombination events in TBEV by simulating recombinations at several locations on the virus' phylogenetic tree and for different lengths of recombining fragments. We derived estimations of rates of true and false positive for the detection of past recombination events for seven recombination detection algorithms. Our analytical framework can be applied to any investigation dealing with the difficult task of distinguishing genuine recombination signal from background noise. Our results suggest that the problem of false positives associated with low detection P-values in TBEV, is more insidious than generally acknowledged. We reappraised the recombination signals present in the empirical data, and showed that reliable signals could only be obtained in a few cases when highly genetically divergent strains were involved, whereas false positives were common among genetically similar strains. We thus conclude that recombination among wild-type TBEV strains may occur, which has potential implications for vaccination with live vaccines, but that these events are surprisingly rare.
Project description:Several popular methods for phylogenetic inference (or hierarchical clustering) are based on a matrix of pairwise distances between taxa (or any kind of objects): The objective is to construct a tree with branch lengths so that the distances between the leaves in that tree are as close as possible to the input distances. If we hold the structure (topology) of the tree fixed, in some relevant cases (e.g., ordinary least squares) the optimal values for the branch lengths can be expressed using simple combinatorial formulae. Here we define a general form for these formulae and show that they all have two desirable properties: First, the common tree reconstruction approaches (least squares, minimum evolution), when used in combination with these formulae, are guaranteed to infer the correct tree when given enough data (consistency); second, the branch lengths of all the simple (nearest neighbor interchange) rearrangements of a tree can be calculated, optimally, in quadratic time in the size of the tree, thus allowing the efficient application of hill climbing heuristics. The study presented here is a continuation of that by Mihaescu and Pachter on branch length estimation [Mihaescu R, Pachter L (2008) Proc Natl Acad Sci USA 105:13206-13211]. The focus here is on the inference of the tree itself and on providing a basis for novel algorithms to reconstruct trees from distances.
Project description:BACKGROUND:A frequent event in the evolution of prokaryotic genomes is homologous recombination, where a foreign DNA stretch replaces a genomic region similar in sequence. Recombination can affect the relative position of two genomes in a phylogenetic reconstruction in two different ways: (i) one genome can recombine with a DNA stretch that is similar to the other genome, thereby reducing their pairwise sequence divergence; (ii) one genome can recombine with a DNA stretch from an outgroup genome, increasing the pairwise divergence. While several recombination-aware phylogenetic algorithms exist, many of these cannot account for both types of recombination; some algorithms can, but do so inefficiently. Moreover, many of them reconstruct the ancestral recombination graph (ARG) to help infer the genome tree, and require that a substantial portion of each genome has not been affected by recombination, a sometimes unrealistic assumption. METHODS:Here, we propose a Coarse-Graining approach for Phylogenetic reconstruction (CGP), which is recombination-aware but forgoes ARG reconstruction. It accounts for the tendency of a higher effective recombination rate between genomes with a lower phylogenetic distance. It is applicable even if all genomic regions have experienced substantial amounts of recombination, and can be used on both nucleotide and amino acid sequences. CGP considers the local density of substitutions along pairwise genome alignments, fitting a model to the empirical distribution of substitution density to infer the pairwise coalescent time. Given all pairwise coalescent times, CGP reconstructs an ultrametric tree representing vertical inheritance. RESULTS:Based on simulations, we show that the proposed approach can reconstruct ultrametric trees with accurate topology, branch lengths, and root positioning. Applied to a set of E. coli strains, the reconstructed trees are most consistent with gene distributions when inferred from amino acid sequences, a data type that cannot be utilized by many alternative approaches. CONCLUSIONS:The CGP algorithm is more accurate than alternative recombination-aware methods for ultrametric phylogenetic reconstructions.
Project description:BACKGROUND: Bayesian phylogenetic analysis generates a set of trees which are often condensed into a single tree representing the whole set. Many methods exist for selecting a representative topology for a set of unrooted trees, few exist for assigning branch lengths to a fixed topology, and even fewer for simultaneously setting the topology and branch lengths. However, there is very little research into locating a good representative for a set of rooted time trees like the ones obtained from a BEAST analysis. RESULTS: We empirically compare new and known methods for generating a summary tree. Some new methods are motivated by mathematical constructions such as tree metrics, while the rest employ tree concepts which work well in practice. These use more of the posterior than existing methods, which discard information not directly mapped to the chosen topology. Using results from a large number of simulations we assess the quality of a summary tree, measuring (a) how well it explains the sequence data under the model and (b) how close it is to the "truth", i.e to the tree used to generate the sequences. CONCLUSIONS: Our simulations indicate that no single method is "best". Methods producing good divergence time estimates have poor branch lengths and lower model fit, and vice versa. Using the results presented here, a user can choose the appropriate method based on the purpose of the summary tree.
Project description:Recent years have seen an increasing effort to incorporate phylogenetic hypotheses to the study of community assembly processes. The incorporation of such evolutionary information has been eased by the emergence of specialized software for the automatic estimation of partially resolved supertrees based on published phylogenies. Despite this growing interest in the use of phylogenies in ecological research, very few studies have attempted to quantify the potential biases related to the use of partially resolved phylogenies and to branch length accuracy, and no work has examined how tree shape may affect inference of community phylogenetic metrics. In this study, using a large plant community and elevational dataset, we tested the influence of phylogenetic resolution and branch length information on the quantification of phylogenetic structure; and also explored the impact of tree shape (stemminess) on the loss of accuracy in phylogenetic structure quantification due to phylogenetic resolution. For this purpose, we used 9 sets of phylogenetic hypotheses of varying resolution and branch lengths to calculate three indices of phylogenetic structure: the mean phylogenetic distance (NRI), the mean nearest taxon distance (NTI) and phylogenetic diversity (stdPD) metrics. The NRI metric was the less sensitive to phylogenetic resolution, stdPD showed an intermediate sensitivity, and NTI was the most sensitive one; NRI was also less sensitive to branch length accuracy than NTI and stdPD, the degree of sensitivity being strongly dependent on the dating method and the sample size. Directional biases were generally towards type II errors. Interestingly, we detected that tree shape influenced the accuracy loss derived from the lack of phylogenetic resolution, particularly for NRI and stdPD. We conclude that well-resolved molecular phylogenies with accurate branch length information are needed to identify the underlying phylogenetic structure of communities, and also that sensitivity of phylogenetic structure measures to low phylogenetic resolution can strongly differ depending on phylogenetic tree shape.
Project description:Sampling tree space is the most challenging aspect of Bayesian phylogenetic inference. The sheer number of alternative topologies is problematic by itself. In addition, the complex dependency between branch lengths and topology increases the difficulty of moving efficiently among topologies. Current tree proposals are fast but sample new trees using primitive transformations or re-mappings of old branch lengths. This reduces acceptance rates and presumably slows down convergence and mixing. Here, we explore branch proposals that do not rely on old branch lengths but instead are based on approximations of the conditional posterior. Using a diverse set of empirical data sets, we show that most conditional branch posteriors can be accurately approximated via a [Formula: see text] distribution. We empirically determine the relationship between the logarithmic conditional posterior density, its derivatives, and the characteristics of the branch posterior. We use these relationships to derive an independence sampler for proposing branches with an acceptance ratio of ~90% on most data sets. This proposal samples branches between 2× and 3× more efficiently than traditional proposals with respect to the effective sample size per unit of runtime. We also compare the performance of standard topology proposals with hybrid proposals that use the new independence sampler to update those branches that are most affected by the topological change. Our results show that hybrid proposals can sometimes noticeably decrease the number of generations necessary for topological convergence. Inconsistent performance gains indicate that branch updates are not the limiting factor in improving topological convergence for the currently employed set of proposals. However, our independence sampler might be essential for the construction of novel tree proposals that apply more radical topology changes.
Project description:Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined "true tree" using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons.