Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets.
ABSTRACT: Phylogenetic analysis aims to establish the true relationships between taxa. Different analytical methods, however, can reach different conclusions. In order to establish which approach best reconstructs true relationships, previous studies have simulated datasets from known tree topologies, and identified the method that reconstructs the generative tree most accurately. On this basis, researchers have argued that morphological datasets should be analysed by Bayesian approaches, which employ an explicit probabilistic model of evolution, rather than parsimony methods, with implied weights parsimony sometimes identified as particularly inaccurate. Accuracy alone, however, is an inadequate measure of a tree's utility: a fully unresolved tree is perfectly accurate, yet contains no phylogenetic information. The highly resolved trees recovered by implied weights parsimony in fact contain as much useful information as the more accurate, but less resolved, trees recovered by Bayesian methods. By collapsing poorly supported groups, this superior resolution can be traded for accuracy, resulting in trees as accurate as those obtained by a Bayesian approach. By contrast, equally weighted parsimony analysis produces trees that are less resolved and less accurate, leading to less reliable evolutionary conclusions.
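The point that a fully unresolved tree is "perfectly accurate" yet uninformative can be illustrated with a minimal sketch. It assumes rooted trees written as nested Python tuples of taxon names, and it scores a tree by counting its non-trivial clades; the function names and the toy five-taxon trees are hypothetical and are not taken from the study.

```python
def clades(tree):
    """Return (non-trivial clades, all taxa) for a rooted tree of nested tuples."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a tip: just its own name
            return frozenset([node])
        members = frozenset()
        for child in node:                 # an internal node: union of its children
            members |= leaves(child)
        found.add(members)
        return members

    taxa = leaves(tree)
    found.discard(taxa)                    # the root clade holds every taxon
    return found, taxa


def score(inferred, true):
    """Count correct and incorrect clades, and the maximum possible resolution."""
    inferred_clades, taxa = clades(inferred)
    true_clades, _ = clades(true)
    return {
        "correct": len(inferred_clades & true_clades),
        "incorrect": len(inferred_clades - true_clades),
        "max_clades": len(taxa) - 2,       # fully resolved rooted tree
    }


true_tree = ((("A", "B"), "C"), ("D", "E"))
star_tree = ("A", "B", "C", "D", "E")            # fully unresolved
resolved_tree = ((("A", "C"), "B"), ("D", "E"))  # resolved, one wrong clade

print(score(star_tree, true_tree))      # no errors, but no information either
print(score(resolved_tree, true_tree))  # one error, two informative correct clades
```

The star tree makes no incorrect claims only because it makes no claims at all, which is why a count of errors alone cannot rank methods.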
Project description:Morphological data provide the only means of classifying the majority of life's history, but the choice between competing phylogenetic methods for the analysis of morphology is unclear. Traditionally, parsimony methods have been favoured but recent studies have shown that these approaches are less accurate than the Bayesian implementation of the Mk model. Here we expand on these findings in several ways: we assess the impact of tree shape and maximum-likelihood estimation using the Mk model, as well as analysing data composed of both binary and multistate characters. We find that all methods struggle to correctly resolve deep clades within asymmetric trees, and when analysing small character matrices. The Bayesian Mk model is the most accurate method for estimating topology, but with lower resolution than other methods. Equal weights parsimony is more accurate than implied weights parsimony, and maximum-likelihood estimation using the Mk model is the least accurate method. We conclude that the Bayesian implementation of the Mk model should be the default method for phylogenetic estimation from phenotype datasets, and we explore the implications of our simulations in reanalysing several empirical morphological character matrices. A consequence of our finding is that high levels of resolution or the ability to classify species or groups with much confidence should not be expected when using small datasets. It is now necessary to depart from the traditional parsimony paradigms of constructing character matrices, towards datasets constructed explicitly for Bayesian methods.
Project description:Different analytical methods can yield competing interpretations of evolutionary history and, currently, there is no definitive method for phylogenetic reconstruction using morphological data. Parsimony has been the primary method for analysing morphological data, but there has been a resurgence of interest in the likelihood-based Mk-model. Here, we test the performance of the Bayesian implementation of the Mk-model relative to both equal and implied-weight implementations of parsimony. Using simulated morphological data, we demonstrate that the Mk-model outperforms equal-weights parsimony in terms of topological accuracy, and implied-weights performs the most poorly. However, the Mk-model produces phylogenies that have less resolution than parsimony methods. This difference in the accuracy and precision of parsimony and Bayesian approaches to topology estimation needs to be considered when selecting a method for phylogeny reconstruction.
Project description:As a result of their plastic body plan, the relationships of the annelid worms and even the taxonomic makeup of the phylum have long been contentious. Morphological cladistic analyses have typically recovered a monophyletic Polychaeta, with the simple-bodied forms assigned to an early-diverging clade or grade. This is in stark contrast to molecular trees, in which polychaetes are paraphyletic and include clitellates, echiurans and sipunculans. Cambrian stem group annelid body fossils are complex-bodied polychaetes that possess well-developed parapodia and paired head appendages (palps), suggesting that the root of annelids is misplaced in morphological trees. We present a reinvestigation of the morphology of key fossil taxa and include them in a comprehensive phylogenetic analysis of annelids. Analyses using probabilistic methods and both equal- and implied-weights parsimony recover paraphyletic polychaetes and support the conclusion that echiurans and clitellates are derived polychaetes. Morphological trees including fossils depict two main clades of crown-group annelids that are similar, but not identical, to Errantia and Sedentaria, the fundamental groupings in transcriptomic analyses. Removing fossils yields trees that are often less resolved and/or root the tree in greater conflict with molecular topologies. While there are many topological similarities between the analyses herein and recent phylogenomic hypotheses, differences include the exclusion of Sipuncula from Annelida and the taxa forming the deepest crown-group divergences.
Project description:A recent study of early dinosaur evolution using equal-weights parsimony recovered a scheme of dinosaur interrelationships and classification that differed from historical consensus in a single, but significant, respect; Ornithischia and Saurischia were not recovered as monophyletic sister-taxa, but rather Ornithischia and Theropoda formed a novel clade named Ornithoscelida. However, these analyses only used maximum parsimony, and numerous recent simulation studies have questioned the accuracy of parsimony under equal weights. Here, we provide additional support for this alternative hypothesis using a Bayesian implementation of the Mkv model, as well as through a number of additional parsimony analyses, including implied weighting. Using Bayesian inference and implied weighting, we recover the same fundamental topology for Dinosauria as the original study, with a monophyletic Ornithoscelida, demonstrating that the main suite of methods used in morphological phylogenetics recover this novel hypothesis. This result was further scrutinized through the systematic exclusion of different character sets. Novel characters from the original study (those not taken or adapted from previous phylogenetic studies) were found to be more important for resolving the relationships within Dinosauromorpha than the relationships within Dinosauria. Reanalysis of a modified version of the character matrix that supports the Ornithischia-Saurischia dichotomy under maximum parsimony also supports this hypothesis under implied weighting, but not under the Mkv model, with both Theropoda and Sauropodomorpha becoming paraphyletic with respect to Ornithischia.
Project description:Fossil taxa are critical to inferences of historical diversity and the origins of modern biodiversity, but realizing their evolutionary significance is contingent on restoring fossil species to their correct position within the tree of life. For most fossil species, morphology is the only source of data for phylogenetic inference; this has traditionally been analysed using parsimony, the predominance of which is currently challenged by the development of probabilistic models that achieve greater phylogenetic accuracy. Here, based on simulated and empirical datasets, we explore the relative efficacy of competing phylogenetic methods in terms of clade support. We characterize clade support using bootstrapping for parsimony and Maximum Likelihood, and intrinsic Bayesian posterior probabilities, collapsing branches that exhibit less than 50% support. Ignoring node support, Bayesian inference is the most accurate method in estimating the tree used to simulate the data. After assessing clade support, Bayesian and Maximum Likelihood exhibit comparable levels of accuracy, and parsimony remains the least accurate method. However, Maximum Likelihood is less precise than Bayesian phylogeny estimation, and Bayesian inference recaptures more correct nodes with higher support compared to all other methods, including Maximum Likelihood. We assess the effects of these findings on empirical phylogenies. Our results indicate probabilistic methods should be favoured over parsimony.
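The step of collapsing branches with less than 50% support can be sketched as a simple filter over clades. This is an illustration only: the clade sets, support values and function name below are invented for the example, and support is assumed to be a proportion between 0 and 1 (a bootstrap frequency or a posterior probability).

```python
def collapse_low_support(clade_support, threshold=0.5):
    """Keep only clades whose support meets the threshold; the rest are collapsed."""
    return {clade: s for clade, s in clade_support.items() if s >= threshold}


# Hypothetical true clades and hypothetical support values for an inferred tree.
true_clades = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
inferred_support = {
    frozenset("AB"): 0.92,   # correct, well supported
    frozenset("ABD"): 0.31,  # incorrect, poorly supported
    frozenset("DE"): 0.50,   # correct, exactly at the threshold (retained)
}

kept = collapse_low_support(inferred_support)
correct = sum(1 for clade in kept if clade in true_clades)
incorrect = len(kept) - correct
print(f"retained {len(kept)} clades: {correct} correct, {incorrect} incorrect")
```

In this toy case the filter removes the single incorrect clade at the cost of one unit of resolution, which is the kind of trade between precision and accuracy the abstract describes.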
Project description:Reconstructing evolutionary histories requires accurate phylogenetic trees. Recent simulation studies suggest that probabilistic phylogenetic analyses of morphological data are more accurate than traditional parsimony techniques. Here, we use empirical data to compare Bayesian and parsimony phylogenies in terms of their congruence with the distribution of age ranges of the component taxa. Analysis of 167 independent morphological data matrices of fossil tetrapods finds that Bayesian trees exhibit significantly lower stratigraphic congruence than the equivalent parsimony trees. As such, taking stratigraphic data as an independent benchmark indicates that parsimony analyses are more accurate for phylogenetic reconstruction of morphological data. The discrepancy between simulated and empirical studies may result from historic data peeking practices or some complexities of empirical data as yet unaccounted for.
Project description:BACKGROUND:Cetacea (dolphins, porpoises, and whales) is a clade of aquatic species that includes the most massive, deepest diving, and largest brained mammals. Understanding the temporal pattern of diversification in the group as well as the evolution of cetacean anatomy and behavior requires a robust and well-resolved phylogenetic hypothesis. Although a large body of molecular data has accumulated over the past 20 years, DNA sequences of cetaceans have not been directly integrated with the rich, cetacean fossil record to reconcile discrepancies among molecular and morphological characters. RESULTS:We combined new nuclear DNA sequences, including segments of six genes (~2800 basepairs) from the functionally extinct Yangtze River dolphin, with an expanded morphological matrix and published genomic data. Diverse analyses of these data resolved the relationships of 74 taxa that represent all extant families and 11 extinct families of Cetacea. The resulting supermatrix (61,155 characters) and its sub-partitions were analyzed using parsimony methods. Bayesian and maximum likelihood (ML) searches were conducted on the molecular partition, and a molecular scaffold obtained from these searches was used to constrain a parsimony search of the morphological partition. Based on analysis of the supermatrix and model-based analyses of the molecular partition, we found overwhelming support for 15 extant clades. When extinct taxa are included, we recovered trees that are significantly correlated with the fossil record. These trees were used to reconstruct the timing of cetacean diversification and the evolution of characters shared by "river dolphins," a non-monophyletic set of species according to all of our phylogenetic analyses. CONCLUSIONS:The parsimony analysis of the supermatrix and the analysis of morphology constrained to fit the ML/Bayesian molecular tree yielded broadly congruent phylogenetic hypotheses. 
In trees from both analyses, all Oligocene taxa included in our study fell outside crown Mysticeti and crown Odontoceti, suggesting that these two clades radiated in the late Oligocene or later, contra some recent molecular clock studies. Our trees also imply that many character states shared by river dolphins evolved in their oceanic ancestors, contradicting the hypothesis that these characters are convergent adaptations to fluvial habitats.
Project description:The vast majority of phylogenetic models focus on resolution of gene trees, despite the fact that phylogenies of species in which gene trees are embedded are of primary interest. We analyze a Bayesian model for estimating species trees that accounts for the stochastic variation expected for gene trees from multiple unlinked loci sampled from a single species history after a coalescent process. Application of the model to a 106-gene data set from yeast shows that the set of gene trees recovered by statistically acknowledging the shared but unknown species tree from which gene trees are sampled is much reduced compared with treating the history of each locus independently of an overarching species tree. The analysis also yields a concentrated posterior distribution of the yeast species tree whose mode is congruent with the concatenated gene tree but can do so with less than half the loci required by the concatenation method. Using simulations, we show that, with large numbers of loci, highly resolved species trees can be estimated under conditions in which concatenation of sequence data will positively mislead phylogeny, and when the proportion of gene trees matching the species tree is <10%. However, when gene tree/species tree congruence is high, species trees can be resolved with just two or three loci. These results make accessible an alternative paradigm for combining data in phylogenomics that focuses attention on the singularity of species histories and away from the idiosyncrasies and multiplicities of individual gene histories.
Project description:BACKGROUND: Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation. RESULTS: We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem. We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees. CONCLUSIONS: Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets.
Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets.
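For orientation, the ordinary Robinson-Foulds distance that MulRF generalizes can be sketched as follows. This is the standard singly-labeled case only, not the mul-tree generalization or the SPR heuristic described above. The sketch assumes both trees carry the same taxa and are written as nested tuples of taxon names; the function names are hypothetical.

```python
def leaf_sets(node, collected):
    """Return the taxa below a node, recording the leaf set of every internal node."""
    if isinstance(node, str):
        return frozenset([node])
    members = frozenset()
    for child in node:
        members |= leaf_sets(child, collected)
    collected.append(members)
    return members


def splits(tree):
    """Non-trivial bipartitions, each stored as the side lacking a reference taxon."""
    collected = []
    taxa = leaf_sets(tree, collected)
    reference = min(taxa)
    result = set()
    for side in collected:
        if reference in side:
            side = taxa - side             # normalise to the side without the reference
        if 2 <= len(side) <= len(taxa) - 2:
            result.add(side)               # skip trivial and empty splits
    return result


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(tree_1, tree_2))
```

Because the distance is a symmetric difference of split sets, it treats the trees as unrooted and is zero only when both trees imply the same set of bipartitions.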
Project description:Despite the introduction of likelihood-based methods for estimating phylogenetic trees from phenotypic data, parsimony remains the most widely-used optimality criterion for building trees from discrete morphological data. However, it has been known for decades that there are regions of solution space in which parsimony is a poor estimator of tree topology. Numerous software implementations of likelihood-based models for the estimation of phylogeny from discrete morphological data exist, especially for the Mk model of discrete character evolution. Here we explore the efficacy of Bayesian estimation of phylogeny, using the Mk model, under conditions that are commonly encountered in paleontological studies. Using simulated data, we describe the relative performances of parsimony and the Mk model under a range of realistic conditions that include common scenarios of missing data and rate heterogeneity.
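As background to the Mk model referred to throughout these abstracts, its transition probabilities have a closed form because every change between the k states occurs at the same rate. The sketch below assumes one particular parameterization, in which `rate` is the instantaneous rate of change to each specific other state; software implementations often rescale this, so the function is illustrative rather than a reproduction of any program's internals.

```python
import math


def mk_transition_probability(k, rate, t, same_state):
    """Probability of ending in the same (or one given different) state after time t.

    Assumes a k-state symmetric model in which `rate` is the rate of change
    to each specific other state (an assumption of this sketch).
    """
    decay = math.exp(-k * rate * t)
    if same_state:
        return 1.0 / k + (k - 1.0) / k * decay
    return 1.0 / k - decay / k


# Binary character (k = 2) on a branch of length 0.5 with unit rate.
stay = mk_transition_probability(2, 1.0, 0.5, same_state=True)
change = mk_transition_probability(2, 1.0, 0.5, same_state=False)
print(stay, change)
```

On very long branches both probabilities approach 1/k, so the character retains no signal about its ancestral state. This is one way to see why high rates of change erode phylogenetic information.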