Bayesian phylodynamic inference with complex models.
ABSTRACT: Population genetic modeling can enhance Bayesian phylogenetic inference by providing a realistic prior on the distribution of branch lengths and times of common ancestry. The parameters of a population genetic model may also have intrinsic importance, and simultaneous estimation of a phylogeny and model parameters has enabled phylodynamic inference of population growth rates, reproduction numbers, and effective population size through time. Phylodynamic inference based on pathogen genetic sequence data has emerged as useful supplement to epidemic surveillance, however commonly-used mechanistic models that are typically fitted to non-genetic surveillance data are rarely fitted to pathogen genetic data due to a dearth of software tools, and the theory required to conduct such inference has been developed only recently. We present a framework for coalescent-based phylogenetic and phylodynamic inference which enables highly-flexible modeling of demographic and epidemiological processes. This approach builds upon previous structured coalescent approaches and includes enhancements for computational speed, accuracy, and stability. A flexible markup language is described for translating parametric demographic or epidemiological models into a structured coalescent model enabling simultaneous estimation of demographic or epidemiological parameters and time-scaled phylogenies. We demonstrate the utility of these approaches by fitting compartmental epidemiological models to Ebola virus and Influenza A virus sequence data, demonstrating how important features of these epidemics, such as the reproduction number and epidemic curves, can be gleaned from genetic data. These approaches are provided as an open-source package PhyDyn for the BEAST2 phylogenetics platform.
Project description:Within-host genetic diversity and large transmission bottlenecks confound phylodynamic inference of epidemiological dynamics. Conventional phylodynamic approaches assume that nodes in a time-scaled pathogen phylogeny correspond closely to the time of transmission between hosts that are ancestral to the sample. However, when hosts harbor diverse pathogen populations, node times can substantially pre-date infection times. Imperfect bottlenecks can cause lineages sampled in different individuals to coalesce in unexpected patterns. To address realistic violations of standard phylodynamic assumptions we developed a new inference approach based on a multi-scale coalescent model, accounting for nonlinear epidemiological dynamics, heterogeneous sampling through time, non-negligible genetic diversity of pathogens within hosts, and imperfect transmission bottlenecks. We apply this method to HIV-1 and Ebola virus (EBOV) outbreak sequence data, illustrating how and when conventional phylodynamic inference may give misleading results. Within-host diversity of HIV-1 causes substantial upwards bias in the number of infected hosts using conventional coalescent models, but estimates using the multi-scale model have greater consistency with reported number of diagnoses through time. In contrast, we find that within-host diversity of EBOV has little influence on estimated numbers of infected hosts or reproduction numbers, and estimates are highly consistent with the reported number of diagnoses through time. The multi-scale coalescent also enables estimation of within-host effective population size using single sequences from a random sample of patients. We find within-host population genetic diversity of HIV-1 p17 to be 2N?=0.012 (95% CI 0.0066-0.023), which is lower than estimates based on HIV envelope serial sequencing of individual patients.
Project description:One of the central objectives in the field of phylodynamics is the quantification of population dynamic processes using genetic sequence data or in some cases phenotypic data. Phylodynamics has been successfully applied to many different processes, such as the spread of infectious diseases, within-host evolution of a pathogen, macroevolution and even language evolution. Phylodynamic analysis requires a probability distribution on phylogenetic trees spanned by the genetic data. Because such a probability distribution is not available for many common stochastic population dynamic processes, coalescent-based approximations assuming deterministic population size changes are widely employed. Key to many population dynamic models, in particular epidemiological models, is a period of exponential population growth during the initial phase. Here, we show that the coalescent does not well approximate stochastic exponential population growth, which is typically modelled by a birth-death process. We demonstrate that introducing demographic stochasticity into the population size function of the coalescent improves the approximation for values of R0 close to 1, but substantial differences remain for large R0. In addition, the computational advantage of using an approximation over exact models vanishes when introducing such demographic stochasticity. These results highlight that we need to increase efforts to develop phylodynamic tools that correctly account for the stochasticity of population dynamic models for inference.
Project description:Phylodynamics - the field aiming to quantitatively integrate the ecological and evolutionary dynamics of rapidly evolving populations like those of RNA viruses - increasingly relies upon coalescent approaches to infer past population dynamics from reconstructed genealogies. As sequence data have become more abundant, these approaches are beginning to be used on populations undergoing rapid and rather complex dynamics. In such cases, the simple demographic models that current phylodynamic methods employ can be limiting. First, these models are not ideal for yielding biological insight into the processes that drive the dynamics of the populations of interest. Second, these models differ in form from mechanistic and often stochastic population dynamic models that are currently widely used when fitting models to time series data. As such, their use does not allow for both genealogical data and time series data to be considered in tandem when conducting inference. Here, we present a flexible statistical framework for phylodynamic inference that goes beyond these current limitations. The framework we present employs a recently developed method known as particle MCMC to fit stochastic, nonlinear mechanistic models for complex population dynamics to gene genealogies and time series data in a Bayesian framework. We demonstrate our approach using a nonlinear Susceptible-Infected-Recovered (SIR) model for the transmission dynamics of an infectious disease and show through simulations that it provides accurate estimates of past disease dynamics and key epidemiological parameters from genealogies with or without accompanying time series data.
Project description:Coalescent methods are widely used to infer the demographic history of populations from gene genealogies. These approaches-often referred to as phylodynamic methods-have proven especially useful for reconstructing the dynamics of rapidly evolving viral pathogens. Yet, population dynamics inferred from viral genealogies often differ widely from those observed from other sources of epidemiological data, such as hospitalization records. We demonstrate how a modeling framework that allows for the direct fitting of mechanistic epidemiological models to genealogies can be used to test different hypotheses about what ecological factors cause phylodynamic inferences to differ from observed dynamics. We use this framework to test different hypotheses about why dengue serotype 1 (DENV-1) population dynamics in southern Vietnam inferred using existing phylodynamic methods differ from hospitalization data. Specifically, we consider how factors such as seasonality, vector dynamics, and spatial structure can affect inferences drawn from genealogies. The coalescent models we derive to take into account vector dynamics and spatial structure reveal that these ecological complexities can substantially affect coalescent rates among lineages. We show that incorporating these additional ecological complexities into coalescent models can also greatly improve estimates of historical population dynamics and lead to new insights into the factors shaping viral genealogies.
Project description:BACKGROUND: Coalescent theory is a general framework to model genetic variation in a population. Specifically, it allows inference about population parameters from sampled DNA sequences. However, most currently employed variants of coalescent theory only consider very simple demographic scenarios of population size changes, such as exponential growth. RESULTS: Here we develop a coalescent approach that allows Bayesian non-parametric estimation of the demographic history using genealogies reconstructed from sampled DNA sequences. In this framework inference and model selection is done using reversible jump Markov chain Monte Carlo (MCMC). This method is computationally efficient and overcomes the limitations of related non-parametric approaches such as the skyline plot. We validate the approach using simulated data. Subsequently, we reanalyze HIV-1 sequence data from Central Africa and Hepatitis C virus (HCV) data from Egypt. CONCLUSIONS: The new method provides a Bayesian procedure for non-parametric estimation of the demographic history. By construction it additionally provides confidence limits and may be used jointly with other MCMC-based coalescent approaches.
Project description:Coalescent theory is routinely used to estimate past population dynamics and demographic parameters from genealogies. While early work in coalescent theory only considered simple demographic models, advances in theory have allowed for increasingly complex demographic scenarios to be considered. The success of this approach has lead to coalescent-based inference methods being applied to populations with rapidly changing population dynamics, including pathogens like RNA viruses. However, fitting epidemiological models to genealogies via coalescent models remains a challenging task, because pathogen populations often exhibit complex, nonlinear dynamics and are structured by multiple factors. Moreover, it often becomes necessary to consider stochastic variation in population dynamics when fitting such complex models to real data. Using recently developed structured coalescent models that accommodate complex population dynamics and population structure, we develop a statistical framework for fitting stochastic epidemiological models to genealogies. By combining particle filtering methods with Bayesian Markov chain Monte Carlo methods, we are able to fit a wide class of stochastic, nonlinear epidemiological models with different forms of population structure to genealogies. We demonstrate our framework using two structured epidemiological models: a model with disease progression between multiple stages of infection and a two-population model reflecting spatial structure. We apply the multi-stage model to HIV genealogies and show that the proposed method can be used to estimate the stage-specific transmission rates and prevalence of HIV. Finally, using the two-population model we explore how much information about population structure is contained in genealogies and what sample sizes are necessary to reliably infer parameters like migration rates.
Project description:Modern phylodynamic methods interpret an inferred phylogenetic tree as a partial transmission chain providing information about the dynamic process of transmission and removal (where removal may be due to recovery, death, or behavior change). Birth-death and coalescent processes have been introduced to model the stochastic dynamics of epidemic spread under common epidemiological models such as the SIS and SIR models and are successfully used to infer phylogenetic trees together with transmission (birth) and removal (death) rates. These methods either integrate analytically over past incidence and prevalence to infer rate parameters, and thus cannot explicitly infer past incidence or prevalence, or allow such inference only in the coalescent limit of large population size. Here, we introduce a particle filtering framework to explicitly infer prevalence and incidence trajectories along with phylogenies and epidemiological model parameters from genomic sequences and case count data in a manner consistent with the underlying birth-death model. After demonstrating the accuracy of this method on simulated data, we use it to assess the prevalence through time of the early 2014 Ebola outbreak in Sierra Leone.
Project description:Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingman's coalescent theory. Here, we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent susceptible-infected-removed (SIR) tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the performance of the two coalescent models and also juxtapose results obtained with a recently published birth-death-sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. Additionally, we analyze influenza A (H1N1) sequence data sampled in the Canterbury region of New Zealand and HIV-1 sequence data obtained from known United Kingdom infection clusters. We show that both coalescent SIR models are effective at estimating epidemiological parameters from data with large fundamental reproductive number [Formula: see text] and large population size [Formula: see text]. Furthermore, we find that the stochastic variant generally outperforms its deterministic counterpart in terms of error, bias, and highest posterior density coverage, particularly for smaller [Formula: see text] and [Formula: see text]. However, each of these inference models is shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with [Formula: see text] close to one or with small effective susceptible populations.
Project description:Recent studies have linked demographic changes and epidemiological patterns in bacterial populations using coalescent-based approaches. We identified 26 studies using skyline plots and found that 21 inferred overall population expansion. This surprising result led us to analyze the impact of natural selection, recombination (gene conversion), and sampling biases on demographic inference using skyline plots and site frequency spectra (SFS). Forward simulations based on biologically relevant parameters from Escherichia coli populations showed that theoretical arguments on the detrimental impact of recombination and especially natural selection on the reconstructed genealogies cannot be ignored in practice. In fact, both processes systematically lead to spurious interpretations of population expansion in skyline plots (and in SFS for selection). Weak purifying selection, and especially positive selection, had important effects on skyline plots, showing patterns akin to those of population expansions. State-of-the-art techniques to remove recombination further amplified these biases. We simulated three common sampling biases in microbiological research: uniform, clustered, and mixed sampling. Alone, or together with recombination and selection, they further mislead demographic inferences producing almost any possible skyline shape or SFS. Interestingly, sampling sub-populations also affected skyline plots and SFS, because the coalescent rates of populations and their sub-populations had different distributions. This study suggests that extreme caution is needed to infer demographic changes solely based on reconstructed genealogies. We suggest that the development of novel sampling strategies and the joint analyzes of diverse population genetic methods are strictly necessary to estimate demographic changes in populations where selection, recombination, and biased sampling are present.
Project description:The shapes of phylogenetic trees relating virus populations are determined by the adaptation of viruses within each host, and by the transmission of viruses among hosts. Phylodynamic inference attempts to reverse this flow of information, estimating parameters of these processes from the shape of a virus phylogeny reconstructed from a sample of genetic sequences from the epidemic. A key challenge to phylodynamic inference is quantifying the similarity between two trees in an efficient and comprehensive way. In this study, I demonstrate that a new distance measure, based on a subset tree kernel function from computational linguistics, confers a significant improvement over previous measures of tree shape for classifying trees generated under different epidemiological scenarios. Next, I incorporate this kernel-based distance measure into an approximate Bayesian computation (ABC) framework for phylodynamic inference. ABC bypasses the need for an analytical solution of model likelihood, as it only requires the ability to simulate data from the model. I validate this "kernel-ABC" method for phylodynamic inference by estimating parameters from data simulated under a simple epidemiological model. Results indicate that kernel-ABC attained greater accuracy for parameters associated with virus transmission than leading software on the same data sets. Finally, I apply the kernel-ABC framework to study a recent outbreak of a recombinant HIV subtype in China. Kernel-ABC provides a versatile framework for phylodynamic inference because it can fit a broader range of models than methods that rely on the computation of exact likelihoods.