Project description:Purpose: To facilitate use of timely, granular, and publicly available data on COVID-19 mortality, we provide a method for imputing suppressed COVID-19 death counts in the National Center for Health Statistics' 2020 provisional mortality data by quarter, county, and age. Methods: We used a Bayesian approach to impute suppressed COVID-19 death counts by quarter, county, and age in provisional data for 3,138 US counties. Our model accounts for multilevel data structures; numerous zero death counts among persons aged <50 years, in rural counties, and in early quarters of 2020; highly right-skewed distributions; and different levels of data granularity (county, state or locality, and national levels). We compared three models with different prior assumptions of suppressed COVID-19 deaths, including noninformative priors (M1), the same weakly informative priors for all age groups (M2), and weakly informative priors that differ by age (M3) to impute the suppressed death counts. After the imputed suppressed counts were available, we assessed the three prior assumptions at the national, state/locality, and county level, respectively. Finally, we compared US counties by two types of COVID-19 death rates, crude (CDR) and age-standardized death rates (ASDR), which can be estimated only through imputing suppressed death counts. Results: Without imputation, the total COVID-19 death counts estimated from the raw data underestimated the reported national COVID-19 deaths by 18.60%. Using imputed data, we overestimated the national COVID-19 deaths by 3.57% (95% CI: 3.37%-3.80%) in model M1, 2.23% (95% CI: 2.04%-2.43%) in model M2, and 2.96% (95% CI: 2.76%-3.16%) in model M3 compared with the national report. The top 20 counties that were most affected by COVID-19 mortality were different between CDR and ASDR. Conclusions: Bayesian imputation of suppressed county-level, age-specific COVID-19 deaths in US provisional data can improve county ASDR estimates and aid public health officials in identifying disparities in deaths from COVID-19.
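The difference between CDR and ASDR rankings can be illustrated with a minimal sketch of direct age standardization. The two counties, the two age groups, and the standard-population weights below are hypothetical values chosen for illustration; they are not the data or the standard population used in the study.

```python
def crude_rate(deaths, population, per=100_000):
    """Crude death rate: total deaths over total population."""
    return per * sum(deaths) / sum(population)


def age_standardized_rate(deaths, population, std_weights, per=100_000):
    """Directly standardized rate: age-specific rates weighted by a
    standard population's age distribution (weights are normalized here)."""
    total_weight = sum(std_weights)
    rate = 0.0
    for d, n, w in zip(deaths, population, std_weights):
        rate += (w / total_weight) * (d / n)
    return per * rate


# Hypothetical counties with identical age-specific death rates
# (1 per 10,000 in the younger group, 1 per 100 in the older group)
# but very different age structures.
young_county = {"deaths": [1, 10], "population": [10_000, 1_000]}
old_county = {"deaths": [1, 1_000], "population": [10_000, 100_000]}
std_weights = [0.8, 0.2]  # assumed standard population shares

for name, county in [("young", young_county), ("old", old_county)]:
    cdr = crude_rate(county["deaths"], county["population"])
    asdr = age_standardized_rate(county["deaths"], county["population"], std_weights)
    print(name, round(cdr, 1), round(asdr, 1))
```

In this toy case the crude rates differ roughly ninefold while the standardized rates coincide, which is why a ranking by CDR need not match a ranking by ASDR. Computing the ASDR requires every age-specific count, so a suppressed cell must be imputed first.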
Project description:Unobserved heterogeneity causing overdispersion and an excessive number of zeros take a prominent place in the methodological development of count modeling. Insight into the mechanisms that induce heterogeneity is required for a better understanding of the phenomenon of overdispersion. When the heterogeneity arises from the stochastic component of the model, using a heterogeneous Poisson distribution for this part offers an elegant solution. The hierarchical design of a study also contributes to heterogeneity, because unobservable effects at various levels add to the overdispersion. Zero inflation, heterogeneity, and multilevel structure in count data each present special challenges; the presence of all three in one study makes modeling harder still. This study is therefore designed to merge the attractive features of these separate strands of solutions to address the combined challenge. This study differs from previous attempts in its choice of two recently developed heterogeneous distributions, namely Poisson-Lindley (PL) and Poisson-Ailamujia (PA), for the truncated part. Using generalized linear mixed modeling settings, the predictive performances of the multilevel PL and PA models and their hurdle counterparts were assessed within a comprehensive simulation study in terms of bias, precision, and accuracy measures. The multilevel models were applied to two separate real-world examples to assess the practical implications of the new models proposed in this study.
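As a rough illustration of the hurdle construction with a Poisson-Lindley count part, the sketch below uses the standard one-parameter Poisson-Lindley probability mass function, P(X = x) = θ²(x + θ + 2)/(θ + 1)^(x+3). The function names and parameter values are hypothetical, and the sketch omits the covariates and multilevel random effects of the study's models.

```python
import math


def poisson_lindley_pmf(x, theta):
    """Poisson-Lindley pmf, computed on the log scale to avoid overflow."""
    log_p = (2 * math.log(theta) + math.log(x + theta + 2)
             - (x + 3) * math.log(theta + 1))
    return math.exp(log_p)


def hurdle_pl_pmf(y, pi_zero, theta):
    """Hurdle model: a point mass pi_zero at zero, and a zero-truncated
    Poisson-Lindley distribution for the positive counts."""
    if y == 0:
        return pi_zero
    p0 = poisson_lindley_pmf(0, theta)
    return (1 - pi_zero) * poisson_lindley_pmf(y, theta) / (1 - p0)


theta = 0.5      # hypothetical PL parameter
pi_zero = 0.6    # hypothetical probability of a zero count
print(sum(hurdle_pl_pmf(y, pi_zero, theta) for y in range(2000)))
```

In a hurdle model the zeros and the positive counts are governed by separate parameters, so the zero probability can be larger or smaller than the untruncated count distribution would imply.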
Project description:Background: Spatial transcriptomics are a set of new technologies that profile gene expression on tissues with spatial localization information. With technological advances, recent spatial transcriptomics data are often in the form of sparse counts with an excessive amount of zero values. Results: We perform a comprehensive analysis on 20 spatial transcriptomics datasets collected from 11 distinct technologies to characterize the distributional properties of the expression count data and understand the statistical nature of the zero values. Across datasets, we show that a substantial fraction of genes displays overdispersion and/or zero inflation that cannot be accounted for by a Poisson model, with genes displaying overdispersion substantially overlapped with genes displaying zero inflation. In addition, we find that either the Poisson or the negative binomial model is sufficient for modeling the majority of genes across most spatial transcriptomics technologies. We further show major sources of overdispersion and zero inflation in spatial transcriptomics including gene expression heterogeneity across tissue locations and spatial distribution of cell types. In particular, when we focus on a relatively homogeneous set of tissue locations or control for cell type compositions, the number of detected overdispersed and/or zero-inflated genes is substantially reduced, and a simple Poisson model is often sufficient to fit the gene expression data there. Conclusions: Our study provides the first comprehensive evidence that excessive zeros in spatial transcriptomics are not due to zero inflation, supporting the use of count models without a zero inflation component for modeling spatial transcriptomics.
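The claim that many zeros need not imply zero inflation can be made concrete with a small check: compare the observed fraction of zeros for one gene with the fraction expected under a Poisson model and under a negative binomial model. The sketch below uses method-of-moments fits and a hypothetical vector of counts; it is not the testing procedure used in the study.

```python
import math


def zero_fraction_summary(counts):
    """Observed zero fraction versus the zero fraction implied by
    moment-matched Poisson and negative binomial models."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    poisson_zero = math.exp(-mean)
    if var > mean > 0:
        # NB with variance = mean + mean^2 / r, so r = mean^2 / (var - mean)
        r = mean ** 2 / (var - mean)
        nb_zero = (r / (r + mean)) ** r
    else:
        # No overdispersion detected: fall back to the Poisson value
        nb_zero = poisson_zero
    return {"observed": observed, "poisson": poisson_zero, "nb": nb_zero}


# Hypothetical counts for one gene across ten tissue locations
counts = [0, 0, 0, 0, 0, 0, 1, 1, 4, 8]
print(zero_fraction_summary(counts))
```

For these counts the observed zero fraction is far above the Poisson expectation but close to the negative binomial one, so overdispersion alone accounts for most of the zeros without a separate zero-inflation component.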
Project description:In 2013, Thailand was ranked second in the world in road accident fatalities (RAFs), with 36.2 per 100,000 people. During the Songkran festival, which takes place during the traditional Thai New Year in April, the numbers of road traffic accidents (RTAs) and RAFs are markedly higher than on regular days, but few studies have investigated this issue as an effect of festivity. This study investigated the factors that contribute to RAFs using various count regression models. Data on 20,229 accidents in 2015 were collected from the Department of Disaster Prevention and Mitigation in Thailand. The Poisson and Conway-Maxwell-Poisson (CMP) distributions and their zero-inflated (ZI) versions were applied to fit the data. The results showed that RAFs in Thailand follow a count distribution with underdispersion and excessive zeros, which is rare. The ZICMP model marginally outperformed the CMP model, suggesting that having many zeros does not necessarily mean that the ZI model is required. The model choice depends on the question of interest, and a separate set of predictors highlights the distinct aspects of the data. Using ZICMP, road, weather, and environmental factors affected the differences in RAFs among all accidents, whereas month distinguished actual non-fatal accidents and crashes with or without deaths. As expected, actual non-fatal accidents were 2.37 times higher in April than in January. Using CMP, these variables were significant predictors of zeros and frequent deaths in each accident. The RAF average was surprisingly higher in other months than in January, except for April, which was unexpectedly lower. Thai authorities have invested considerable effort and resources to improve road safety during festival weeks to no avail. However, our study results indicate that people's risk perceptions and public awareness of RAFs are misleading. 
Therefore, nationwide road safety should instead be advocated by the authorities to raise society's awareness of everyday personal safety and the safety of others.
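For readers unfamiliar with the Conway-Maxwell-Poisson distribution, the sketch below evaluates its probability mass function, P(Y = y) = λ^y / (y!)^ν / Z(λ, ν), and a zero-inflated version. The normalizing constant Z is approximated by a truncated sum; the truncation point `j_max`, the function names, and the parameter values are assumptions made for illustration, and no regression structure is included.

```python
import math


def cmp_pmf(lam, nu, j_max=200):
    """CMP probabilities for y = 0..j_max, with the normalizing constant
    approximated by truncating the infinite sum at j_max."""
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(j_max + 1)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]  # rescaled for stability
    z = sum(weights)
    return [w / z for w in weights]


def zicmp_pmf(pi, lam, nu, j_max=200):
    """Zero-inflated CMP: extra mass pi at zero, CMP otherwise."""
    probs = [(1 - pi) * p for p in cmp_pmf(lam, nu, j_max)]
    probs[0] += pi
    return probs


def mean_and_variance(probs):
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


# nu = 1 gives the Poisson distribution; nu > 1 gives underdispersion
print(mean_and_variance(cmp_pmf(2.0, 1.0)))
print(mean_and_variance(cmp_pmf(2.0, 2.0)))
```

With ν = 2 the variance falls below the mean, the underdispersed case reported for the fatality counts. The truncation is adequate for the values shown but would need a larger `j_max` when ν is close to zero and λ is near or above 1.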
Project description:Reporting discrepancies between officially confirmed COVID-19 death counts and unreported COVID-19-like illness (CLI) death counts have been evident across the world, including Bangladesh. Publicly available data were used to explore the differences between confirmed COVID-19 death counts and deaths with possible COVID-19 symptoms between March 2, 2020 and August 22, 2020. Unreported CLI death counts totaled more than half of the confirmed COVID-19 death counts during the study period. However, the reporting authority did not consider CLI deaths, which might produce incomplete and unreliable COVID-19 data and respective mortality rates. All deaths with possible COVID-19 symptoms need to be included in provisional death counts to better estimate the COVID-19 mortality rate and to develop data-driven COVID-19 response strategies. An urgent initiative is needed to prepare a comprehensive guideline for reporting COVID-19 deaths.
Project description:Biological and medical researchers often collect count data in clusters at multiple time points. The data can exhibit excessive zeros and a wide range of dispersion levels. In particular, our research was motivated by a dental dataset with such complex data features: the Iowa Fluoride Study (IFS). The study was designed to investigate the effects of various dietary and nondietary factors on the caries development of a cohort of Iowa school children at the ages of 5, 9, and 13. To analyze the multiyear IFS data, we propose a novel longitudinal method of a generalized estimating equations based marginal regression model. We use a zero-inflated model with a Conway-Maxwell-Poisson (CMP) distribution, which has the flexibility to account for all levels of dispersion. The parameters of interest are estimated through a modified expectation-solution algorithm to account for the clustered and temporal correlation structure. We fit the proposed zero-inflated CMP model and perform a comprehensive secondary analysis of the IFS dataset. It resulted in a number of notable conclusions that also make clinical sense. Additionally, we demonstrated the superiority of this modeling approach over two other popular competing models: the zero-inflated Poisson and negative binomial models. In the simulation studies, we further evaluate the performance of our point estimators, the variance estimators, and that of the large sample confidence intervals for the parameters of interest. It is also demonstrated that our longitudinal CMP model can correctly identify the time-varying dispersion patterns.
Project description:Background: A novel coronavirus disease (COVID-19) outbreak has now spread to a number of countries worldwide. While sustained transmission chains of human-to-human transmission suggest high basic reproduction number R0, variation in the number of secondary transmissions (often characterised by so-called superspreading events) may be large as some countries have observed fewer local transmissions than others. Methods: We quantified individual-level variation in COVID-19 transmission by applying a mathematical model to observed outbreak sizes in affected countries. We extracted the number of imported and local cases in the affected countries from the World Health Organization situation report and applied a branching process model where the number of secondary transmissions was assumed to follow a negative-binomial distribution. Results: Our model suggested a high degree of individual-level variation in the transmission of COVID-19. Within the current consensus range of R0 (2-3), the overdispersion parameter k of a negative-binomial distribution was estimated to be around 0.1 (median estimate 0.1; 95% CrI: 0.05-0.2 for R0 = 2.5), suggesting that 80% of secondary transmissions may have been caused by a small fraction of infectious individuals (~10%). A joint estimation yielded likely ranges for R0 and k (95% CrIs: R0 1.4-12; k 0.04-0.2); however, the upper bound of R0 was not well informed by the model and data, which did not notably differ from that of the prior distribution. Conclusions: Our finding of a highly-overdispersed offspring distribution highlights a potential benefit to focusing intervention efforts on superspreading. As most infected individuals do not contribute to the expansion of an epidemic, the effective reproduction number could be drastically reduced by preventing relatively rare superspreading events.
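The link between the overdispersion parameter k and the share of individuals responsible for most transmission can be sketched numerically. The code below assumes a negative-binomial offspring distribution with mean R0 and dispersion k, ranks individuals from most to fewest secondary cases, and finds the fraction needed to account for 80% of transmissions. It is a discrete approximation with a hypothetical truncation point `x_max`, not the authors' estimation procedure.

```python
import math


def nb_pmf(x, mean, k):
    """Negative binomial pmf parameterized by mean and dispersion k."""
    log_p = (math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
             + k * math.log(k / (k + mean))
             + x * math.log(mean / (k + mean)))
    return math.exp(log_p)


def fraction_causing(share, mean, k, x_max=5000):
    """Smallest fraction of infectious individuals that accounts for
    `share` of all secondary transmissions (top transmitters first)."""
    cum_share = 0.0    # cumulative share of transmissions
    cum_people = 0.0   # cumulative share of individuals
    for x in range(x_max, 0, -1):
        p = nb_pmf(x, mean, k)
        class_share = x * p / mean
        if cum_share + class_share >= share:
            # Only part of this class is needed to reach the target share
            needed = (share - cum_share) / class_share
            return cum_people + needed * p
        cum_share += class_share
        cum_people += p
    return cum_people


print(fraction_causing(0.8, mean=2.5, k=0.1))
```

With R0 = 2.5 and k = 0.1 the result is on the order of 10%, in line with the figure quoted in the abstract; a large k (close to a Poisson offspring distribution) gives a much larger fraction.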
Project description:Repeated measures are often collected in longitudinal follow-up from clinical trials and observational studies. In many situations, these measures are tied to a specific event and are only available when it occurs; an example is serum creatinine from laboratory tests for hospitalized acute kidney injuries. The frequency of event recurrences is potentially correlated with overall health condition and hence may influence the distribution of the outcome measure of interest, leading to informative cluster size. In particular, there may be a large portion of subjects without any events, and thus without any longitudinal measures, which may be due to insusceptibility to such events or censoring before any events; this zero-inflated nature of the data needs to be taken into account. In addition, there often exists a terminal event that may be correlated with the recurrent events. Previous work in this area suffered from the limitation that not all these issues were handled simultaneously. To address this deficiency, we propose a novel joint modeling approach for longitudinal data adjusting for zero-inflated and informative cluster size as well as a terminal event. A three-stage semiparametric likelihood-based approach is applied for parameter estimation and inference. Extensive simulations are conducted to evaluate the performance of our proposal. Finally, we utilize the Assessment, Serial Evaluation, and Subsequent Sequelae of Acute Kidney Injury (ASSESS-AKI) study for illustration.
Project description:The bicycle is a low-cost means of transport linked to low risk of transmission of infectious disease. During the COVID-19 crisis, governments have therefore incentivized cycling by provisionally redistributing street space. We evaluate the impact of this new bicycle infrastructure on cycling traffic using a generalized difference-in-differences design. We scrape daily bicycle counts from 736 bicycle counters in 106 European cities. We combine these with data on announced and completed pop-up bike lane road work projects. Within 4 months, an average of 11.5 km of provisional pop-up bike lanes have been built per city and the policy has increased cycling between 11 and 48% on average. We calculate that the new infrastructure will generate between $1 and $7 billion in health benefits per year if cycling habits are sticky.
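The idea behind a difference-in-differences estimate can be shown with the simplest two-group, two-period case on log counts. The study uses a generalized design with many cities and staggered projects, so the sketch below is only the basic building block, and all counts in it are hypothetical.

```python
import math


def did_log_effect(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on mean log counts.
    All counts must be positive."""
    def mean_log(values):
        return sum(math.log(v) for v in values) / len(values)

    treated_change = mean_log(treated_post) - mean_log(treated_pre)
    control_change = mean_log(control_post) - mean_log(control_pre)
    return treated_change - control_change


# Hypothetical daily counts at counters in a city that built pop-up lanes
# (treated) and a city that did not (control), before and after
effect = did_log_effect(
    treated_pre=[1000, 1000], treated_post=[1320, 1320],
    control_pre=[800, 800], control_post=[880, 880],
)
print(f"estimated change in cycling: {100 * (math.exp(effect) - 1):.1f}%")
```

Subtracting the control city's change removes growth shared by both cities, such as seasonal effects, under the assumption that the two would otherwise have followed parallel trends.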
Project description:Coronavirus disease 2019 (COVID-19) caused by the SARS-CoV-2 virus has spread seriously throughout the world. Predicting the spread, or the number of cases, in the future can facilitate preparation for, and prevention of, a worst-case scenario. To achieve these purposes, statistical modeling using past data is one feasible approach. This paper describes spatio-temporal modeling of COVID-19 case counts in 47 prefectures of Japan using a nonlinear random effects model, where random effects are introduced to capture the heterogeneity of a number of model parameters associated with the prefectures. The negative binomial distribution is frequently used with the Paul-Held random effects model to account for overdispersion in count data; however, the negative binomial distribution is known to be incapable of accommodating extreme observations such as those found in the COVID-19 case count data. We therefore propose use of the beta-negative binomial distribution with the Paul-Held model. This distribution is a generalization of the negative binomial distribution that has attracted much attention in recent years because it can model extreme observations with analytical tractability. The proposed beta-negative binomial model was applied to multivariate count time series data of COVID-19 cases in the 47 prefectures of Japan. Evaluation by one-step-ahead prediction showed that the proposed model can accommodate extreme observations without sacrificing predictive performance.
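To show what the beta-negative binomial distribution looks like in practice, the sketch below evaluates its probability mass function in a common parameterization, P(X = x) = Γ(r + x) / (x! Γ(r)) · B(α + r, β + x) / B(α, β), which arises when the success probability of a negative binomial is given a Beta(α, β) prior. The parameter values are hypothetical, and the sketch does not include the Paul-Held model structure or random effects.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_negative_binomial_pmf(x, r, alpha, beta):
    """Beta-negative binomial pmf, evaluated on the log scale."""
    log_p = (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
             + log_beta(alpha + r, beta + x) - log_beta(alpha, beta))
    return math.exp(log_p)


r, alpha, beta = 3.0, 5.0, 2.0  # hypothetical parameters
mean = sum(x * beta_negative_binomial_pmf(x, r, alpha, beta) for x in range(5000))
print(mean, r * beta / (alpha - 1))  # numerical mean vs closed form (alpha > 1)
```

The tail of this distribution decays polynomially rather than geometrically, which is what allows it to accommodate extreme counts that a negative binomial would treat as nearly impossible.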