Project description: Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low-count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or to group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
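A minimal sketch of the aggregation-plus-compositional-regression idea, assuming hypothetical inputs (a `counts` matrix, a `genera` taxonomy map, and a response `y`). The actual trac estimator learns the aggregation level from the full taxonomic tree via a sparse log-contrast model with a zero-sum constraint; here a single fixed aggregation level is shown for illustration only.

```python
# Sketch only: aggregate sparse OTU counts up one level of the taxonomic tree
# and fit a sparse linear model on CLR-transformed compositions. All data and
# names below are simulated/hypothetical, not the trac implementation.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_otus = 50, 200
counts = rng.poisson(2.0, size=(n_samples, n_otus))            # sparse OTU counts
genera = rng.choice([f"genus_{g}" for g in range(20)], n_otus)  # OTU -> genus map
y = rng.normal(size=n_samples)                                  # response of interest

# Aggregate OTU counts to genus level (one set of nodes of the taxonomic tree).
levels = np.unique(genera)
agg = np.column_stack([counts[:, genera == g].sum(axis=1) for g in levels])

# Compositional treatment: relative abundances with a pseudocount, then a
# CLR-style centering so that only log-ratios between aggregates matter.
rel = (agg + 0.5) / (agg + 0.5).sum(axis=1, keepdims=True)
log_rel = np.log(rel)
clr = log_rel - log_rel.mean(axis=1, keepdims=True)

# Sparse linear model on the aggregated, CLR-transformed features.
model = Lasso(alpha=0.1).fit(clr, y)
print("selected genus-level aggregates:", levels[model.coef_ != 0])
```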
Project description: Collecting complete network data is expensive, time-consuming, and often infeasible. Aggregated Relational Data (ARD), which ask respondents questions of the form "How many people with trait X do you know?", provide a low-cost option when collecting complete network data is not possible. Rather than asking about connections between each pair of individuals directly, ARD collect the number of contacts the respondent knows with a given trait. Despite widespread use and a growing literature on ARD methodology, there is still no systematic understanding of when and why ARD should accurately recover features of the unobserved network. This paper provides such a characterization by deriving conditions under which statistics about the unobserved network (or functions of these statistics, such as regression coefficients) can be consistently estimated using ARD. We first provide consistent estimates of network model parameters for three commonly used probabilistic models: the beta-model with node-specific unobserved effects, the stochastic block model with unobserved community structure, and latent geometric space models with unobserved latent locations. A key observation is that cross-group link probabilities for a collection of (possibly unobserved) groups identify the model parameters, meaning ARD are sufficient for parameter estimation. With these estimated parameters, it is possible to simulate graphs from the fitted distribution and analyze the distribution of network statistics. We can then characterize conditions under which the simulated networks based on ARD allow for consistent estimation of statistics of the unobserved network, such as eigenvector centrality, or of response functions by or of the unobserved network, such as regression coefficients.
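The following Python sketch illustrates the core identification idea under simplifying assumptions: respondents' block memberships and group sizes are treated as known (the paper also covers the unobserved-community case), the ARD traits coincide with the blocks, and all data are simulated. It is not the paper's estimator.

```python
# Sketch: recover stochastic block model link probabilities from ARD-style
# counts, then simulate graphs from the fitted model. Simplified and simulated.
import numpy as np

rng = np.random.default_rng(1)
n, K = 300, 3
blocks = rng.integers(0, K, n)                       # (here, known) block labels
P_true = np.array([[0.10, 0.02, 0.01],
                   [0.02, 0.08, 0.02],
                   [0.01, 0.02, 0.12]])              # cross-block link probabilities

# Simulate the unobserved graph, then the ARD it would generate:
# ard[i, k] = "how many people in group k does respondent i know?"
U = rng.random((n, n))
A = ((U + U.T) / 2 < P_true[blocks][:, blocks]).astype(int)
np.fill_diagonal(A, 0)
ard = np.column_stack([A[:, blocks == k].sum(axis=1) for k in range(K)])

# Cross-group link probabilities from ARD alone: average reported contacts in
# group k per member of group k, split by the respondent's own block.
group_sizes = np.bincount(blocks, minlength=K)
P_hat = np.vstack([ard[blocks == b].mean(axis=0) / group_sizes for b in range(K)])
P_hat = (P_hat + P_hat.T) / 2                        # enforce symmetry

# With the parameters identified, graphs can be simulated from the fitted SBM
# and network statistics studied without ever observing the true graph.
U2 = rng.random((n, n))
A_sim = ((U2 + U2.T) / 2 < P_hat[blocks][:, blocks]).astype(int)
np.fill_diagonal(A_sim, 0)
print("estimated block probabilities:\n", P_hat.round(3))
print("mean degree, true vs simulated:", A.sum(1).mean(), A_sim.sum(1).mean())
```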
Project description: Social network data are often prohibitively expensive to collect, limiting empirical network research. We propose an inexpensive and feasible strategy for network elicitation using Aggregated Relational Data (ARD): responses to questions of the form "how many of your links have trait k?" Our method uses ARD to recover parameters of a network formation model, which permits sampling from a distribution over node- or graph-level statistics. We replicate the results of two field experiments that used network data and draw similar conclusions with ARD alone.
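Assuming link probabilities have already been recovered from ARD (the `P_hat` and `blocks` objects below are hypothetical placeholders), a short sketch of the sampling step: draw graphs from the fitted model and summarize the distribution of a node-level statistic.

```python
# Sketch of sampling node-level statistics from a fitted network formation
# model; P_hat and blocks are hypothetical fitted quantities, not real output.
import numpy as np

def sample_graph(P, blocks, rng):
    """Draw one symmetric adjacency matrix from node-pair link probabilities."""
    prob = P[blocks][:, blocks]
    U = rng.random(prob.shape)
    A = ((U + U.T) / 2 < prob).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def eigenvector_centrality(A):
    """Leading eigenvector of the adjacency matrix, normalized to sum to 1."""
    vals, vecs = np.linalg.eigh(A)
    v = np.abs(vecs[:, -1])
    return v / v.sum()

rng = np.random.default_rng(2)
P_hat = np.array([[0.10, 0.02], [0.02, 0.12]])   # hypothetical fitted values
blocks = rng.integers(0, 2, 200)

draws = [eigenvector_centrality(sample_graph(P_hat, blocks, rng)) for _ in range(100)]
centrality_mean = np.mean(draws, axis=0)         # per-node summary over draws
print("average centrality by block:",
      [centrality_mean[blocks == b].mean().round(4) for b in (0, 1)])
```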
Project description: Jury deliberations provide a quintessential example of collective decision-making, but few studies have probed the available data to explore how juries reach verdicts. We examine how features of jury dynamics can be better understood from the joint distribution of final votes and deliberation time. To do this, we fit several different decision-making models to jury datasets from different places and times. In our best-fit model, jurors influence each other and have an increasing tendency to stick to their opinion of the defendant's guilt or innocence. We also show that this model can explain spikes in mean deliberation times when juries are hung, sub-linear scaling between mean deliberation times and trial duration, and unexpected final vote and deliberation time distributions. Our findings suggest that both stubbornness and herding play an important role in collective decision-making, providing nuanced insight into how juries reach verdicts and, more generally, how group decisions emerge.
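A toy simulation of the two mechanisms highlighted here, herding and growing stubbornness; it is not the fitted model from the paper, and all parameter values are made up.

```python
# Toy jury dynamics: jurors copy others (herding) but become increasingly
# likely to stick with an opinion the longer they have held it (stubbornness).
import numpy as np

def simulate_jury(n_jurors=12, p_guilty=0.7, stubbornness_rate=0.02,
                  max_steps=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    votes = rng.random(n_jurors) < p_guilty          # initial leanings (True = guilty)
    held_for = np.zeros(n_jurors)                    # steps each opinion has been held
    for t in range(1, max_steps + 1):
        i = rng.integers(n_jurors)
        stick_prob = 1 - np.exp(-stubbornness_rate * held_for[i])
        if rng.random() > stick_prob:                # juror i is open to influence
            j = rng.integers(n_jurors)
            if votes[j] != votes[i]:
                votes[i] = votes[j]                  # herd toward juror j
                held_for[i] = 0
        held_for += 1
        if votes.all() or not votes.any():           # unanimous verdict reached
            return votes.sum(), t, False
    return votes.sum(), max_steps, True              # hung jury at the time limit

rng = np.random.default_rng(3)
results = [simulate_jury(rng=rng) for _ in range(200)]
hung = [r for r in results if r[2]]
print("fraction hung:", len(hung) / len(results))
print("mean deliberation time (unanimous juries):",
      np.mean([t for _, t, h in results if not h]))
```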
Project description: Spatial data are often aggregated from a finer (smaller) to a coarser (larger) geographical level. The process of data aggregation induces a scaling effect which smooths the variation in the data. To address the scaling problem, multiscale models that link the convolution models at different scale levels via a shared random effect have been proposed. One of the main goals in analyzing aggregated health data is to investigate the relationship between predictors and an outcome at different geographical levels. In this paper, we extend multiscale models to examine whether a predictor effect at a finer level holds true at a coarser level. To adjust for predictor uncertainty due to aggregation, we apply measurement error models within the multiscale framework. To assess the benefit of using multiscale measurement error models, we compare the performance of multiscale models with and without measurement error on both real and simulated data. We find that ignoring the measurement error in multiscale models underestimates the regression coefficient while overestimating the variance of the spatially structured random effect. On the other hand, accounting for the measurement error in multiscale models provides a better model fit and unbiased parameter estimates.
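As a toy illustration of the attenuation phenomenon described above, the sketch below uses a classical errors-in-variables setup rather than the spatial multiscale convolution model; the error-free predictor is kept only so the correction can be shown explicitly.

```python
# Toy errors-in-variables illustration: an error-prone predictor attenuates
# the OLS slope toward zero, and the classical reliability-ratio correction
# recovers it. Not the paper's Bayesian multiscale measurement error model.
import numpy as np

rng = np.random.default_rng(4)
n, beta, sigma_u = 2000, 2.0, 0.8

x_true = rng.normal(0, 1, n)                   # fine-level (error-free) predictor
x_obs = x_true + rng.normal(0, sigma_u, n)     # error-prone aggregated version
y = beta * x_true + rng.normal(0, 1, n)

def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

naive = ols_slope(x_obs, y)                    # attenuated toward zero
# In practice the error variance is modeled; the known truth is used here
# only to compute the reliability ratio for illustration.
reliability = np.var(x_true, ddof=1) / np.var(x_obs, ddof=1)
corrected = naive / reliability

print(f"true beta      : {beta:.2f}")
print(f"naive estimate : {naive:.2f}")         # noticeably below 2.0
print(f"corrected      : {corrected:.2f}")
```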
Project description: Advances in Geographical Information Systems (GIS) have led to a recent burgeoning of spatial-temporal databases and associated statistical modeling. Here we depart from the rather rich literature on space-time modeling by considering the setting where space is discrete (e.g., data aggregated over regions) but time is continuous. Our major objective in this application is to carry out inference on gradients of a temporal process in our data set of monthly county-level asthma hospitalization rates in the state of California, while at the same time accounting for spatial similarities of the temporal process across neighboring counties. Use of continuous-time models here allows inference at a finer resolution than that at which the data are sampled. Rather than use parametric forms to model time, we opt for a more flexible stochastic process embedded within a dynamic Markov random field framework. Through the matrix-valued covariance function we can ensure that the temporal process realizations are mean-square differentiable, and we may thus carry out inference on temporal gradients in a posterior predictive fashion. We use this approach to evaluate temporal gradients, where we are concerned with temporal changes in the residual and fitted rate curves after accounting for seasonality, spatiotemporal ozone levels, and several important spatially resolved sociodemographic covariates.
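A single-series sketch of gradient inference under a mean-square differentiable Gaussian process (squared-exponential covariance), with simulated monthly data; the paper embeds this machinery in a dynamic Markov random field across counties, which is not reproduced here.

```python
# Sketch: posterior mean of the temporal gradient of a Gaussian process with a
# squared-exponential covariance, using the cross-covariance Cov(f'(t*), f(t)).
import numpy as np

def sq_exp(t1, t2, sigma2=1.0, ell=2.0):
    d = t1[:, None] - t2[None, :]
    return sigma2 * np.exp(-d**2 / (2 * ell**2))

def d_sq_exp(tstar, t, sigma2=1.0, ell=2.0):
    """Cross-covariance Cov(f'(t*), f(t)) for the squared-exponential kernel."""
    d = tstar[:, None] - t[None, :]
    return -sigma2 * d / ell**2 * np.exp(-d**2 / (2 * ell**2))

rng = np.random.default_rng(5)
t = np.arange(0, 36, dtype=float)                  # 36 simulated monthly observations
y = np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.1, t.size)

tau2 = 0.1**2                                      # observation noise variance
K = sq_exp(t, t) + tau2 * np.eye(t.size)
alpha = np.linalg.solve(K, y)

tstar = np.linspace(0, 35, 200)                    # finer grid than the data
grad_mean = d_sq_exp(tstar, t) @ alpha             # posterior mean of df/dt
print("max |gradient| estimate:", np.abs(grad_mean).max().round(3))
# For this simulated signal the true derivative is (2*pi/12) * cos(2*pi*t/12).
```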
Project description: Background: Mapping malaria risk is an integral component of efficient resource allocation. Routine health facility data are convenient to collect, but without information on the locations at which transmission occurred, their utility for predicting variation in risk at a sub-catchment level is presently unclear. Methods: Using routinely collected health facility level case data in Swaziland between 2011 and 2013, together with fine-scale environmental and ecological variables, this study explores the use of a hierarchical Bayesian modelling framework for downscaling risk maps from the health facility catchment level to a fine scale (1 km x 1 km). Fine-scale predictions were validated using known household locations of cases and a random sample of points acting as pseudo-controls. Results: Fine-scale predictions were able to discriminate between cases and pseudo-controls with an AUC value of 0.84. When scaled up to catchment level, predicted numbers of cases per health facility showed broad correspondence with observed numbers of cases and little bias, with 84 of the 101 health facilities with zero cases correctly predicted as having zero cases. Conclusions: This method holds promise for helping countries in the pre-elimination and elimination stages use health facility level data to produce accurate risk maps at finer scales. Further validation in other transmission settings and an evaluation of the operational value of the approach are necessary.
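A schematic of the validation and scale-up steps with a simulated risk surface standing in for the model output; the AUC computation against cases and pseudo-controls and the catchment-level aggregation mirror the checks described above, but none of the study data are used.

```python
# Sketch of validating a fine-scale (pixel) risk map against case locations and
# pseudo-controls, then aggregating expected cases back to catchments.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
n_pixels = 10_000
catchment = rng.integers(0, 101, n_pixels)              # pixel -> health facility
risk = rng.beta(0.5, 20, n_pixels)                      # hypothetical 1 km risk map

# Cases tend to fall in higher-risk pixels; pseudo-controls are random pixels.
case_pixels = rng.choice(n_pixels, 300, p=risk / risk.sum())
control_pixels = rng.choice(n_pixels, 300)
labels = np.r_[np.ones(300), np.zeros(300)]
scores = np.r_[risk[case_pixels], risk[control_pixels]]
print("AUC:", round(roc_auc_score(labels, scores), 3))

# Scale pixel risk back up: expected cases per catchment vs. observed counts.
expected = np.bincount(catchment, weights=risk, minlength=101)
observed = np.bincount(catchment[case_pixels], minlength=101)
print("corr(expected, observed):", round(np.corrcoef(expected, observed)[0, 1], 3))
```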
Project description: Our primary focus is to study the spatial distribution of disease incidence at different geographical levels. Often, spatial data are available in aggregated form at multiple scale levels, such as census tract, county, state, and so on. When data are aggregated from a fine (e.g., county) to a coarse (e.g., state) geographical level, there is a loss of information. The problem is more challenging when excess zeros are present at the fine level. After data aggregation, the excess zeros at the fine level are reduced at the coarse level. If we ignore the zero inflation and the aggregation effect, we can obtain inconsistent risk estimates at the fine and coarse levels. Hence, in this paper, we address these problems using zero-inflated multiscale models that jointly describe the risk variations at different geographical levels. For the excess zeros at the fine level, we use a zero-inflated convolution model, whereas we use a regular convolution model for the smoothed data at the coarse level. These methods provide consistent risk estimates at the fine and coarse levels when high percentages of structural zeros are present in the data.
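To illustrate the fine-level issue, a minimal maximum-likelihood fit of a zero-inflated Poisson to simulated counts; the paper's model is a Bayesian zero-inflated convolution model linked across scales, so this sketch only shows why ignoring structural zeros biases the estimated rate.

```python
# Sketch: zero-inflated Poisson fit by maximum likelihood on simulated counts,
# compared with the naive Poisson rate that ignores structural zeros.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(7)
pi_true, lam_true, n = 0.4, 3.0, 2000          # 40% structural zeros
structural_zero = rng.random(n) < pi_true
y = np.where(structural_zero, 0, rng.poisson(lam_true, n))

def zip_negloglik(params, y):
    logit_pi, log_lam = params                 # unconstrained parametrization
    pi, lam = 1 / (1 + np.exp(-logit_pi)), np.exp(log_lam)
    pois_logpmf = -lam + y * np.log(lam) - gammaln(y + 1)
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))   # structural or sampling zero
    ll_pos = np.log(1 - pi) + pois_logpmf            # positive counts
    return -np.sum(np.where(y == 0, ll_zero, ll_pos))

fit = minimize(zip_negloglik, x0=[0.0, 0.0], args=(y,))
pi_hat = 1 / (1 + np.exp(-fit.x[0]))
lam_hat = np.exp(fit.x[1])
print(f"zero-inflation: {pi_hat:.2f} (true {pi_true}), rate: {lam_hat:.2f} (true {lam_true})")
print(f"naive Poisson rate (ignores zero inflation): {y.mean():.2f}")
```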
Project description: Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors 'at the source.' We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the sources to the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between individual data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies (frequently called 'backbones') that they generate, which are in effect novel classification theories operating at the core of the data-structuring process. The Darwin Core standard for sharing occurrence records plays an under-appreciated role in maintaining the accountability gap, because this standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, potentially leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e. unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.
Project description: Transient exposures are difficult to measure in epidemiologic studies, especially when both the status of being at risk for an outcome and the exposure change over time and space, as when measuring built-environment risk on transportation injury. Contemporary "big data" generated by mobile sensors can improve measurement of transient exposures. Exposure information generated by these devices typically only samples the experience of the target cohort, so a case-control framework may be useful. However, for anonymity, the data may not be available by individual, precluding a case-crossover approach. We present a method called at-risk-measure sampling. Its goal is to estimate the denominator of an incidence rate ratio (exposed to unexposed measure of the at-risk experience) given an aggregated summary of the at-risk measure from a cohort. Rather than sampling individuals or locations, the method samples the measure of the at-risk experience. Specifically, the method as presented samples person-distance and person-events summarized by location. It is illustrated with data from a mobile app used to record bicycling. The method extends an established case-control sampling principle: sample the at-risk experience of a cohort study such that the sampled exposure distribution approximates that of the cohort. It is distinct from density sampling in that the sample remains in the form of the at-risk measure, which may be continuous, such as person-time or person-distance. This aspect may be both logistically and statistically efficient if such a sample is already available, for example from big-data sources like aggregated mobile-sensor data.
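A sketch of the sampling idea with hypothetical location-level data: units of the at-risk measure (person-kilometres aggregated by location) are sampled so that the exposed-to-unexposed split of the sample estimates the denominator of the incidence rate ratio. The location table, exposure labels, and rates below are all made up.

```python
# Sketch of at-risk-measure sampling: sample units of the at-risk measure
# (person-km aggregated by location) rather than individuals or locations,
# and combine the sampled exposure split with case counts to form an IRR.
import numpy as np

rng = np.random.default_rng(8)
n_locations = 500
person_km = rng.gamma(2.0, 50.0, n_locations)          # at-risk measure per location
exposed = rng.random(n_locations) < 0.3                # e.g. has a protected bike lane

# Simulated injury (case) counts, with a higher rate on unexposed segments.
rate = np.where(exposed, 0.002, 0.004)                 # cases per person-km
cases = rng.poisson(rate * person_km)

# Sample the at-risk measure: draw locations with probability proportional to
# person-km, so each draw represents one unit of at-risk travel.
draws = rng.choice(n_locations, size=2000, p=person_km / person_km.sum())
denom_ratio = exposed[draws].mean() / (~exposed[draws]).mean()

# Numerator from observed cases; the sampled split supplies the denominator.
case_ratio = cases[exposed].sum() / cases[~exposed].sum()
irr = case_ratio / denom_ratio
print(f"estimated IRR (exposed vs unexposed): {irr:.2f}  (true {0.002 / 0.004:.2f})")
```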