Project description: Overdispersion is a common feature of models of biological data, but researchers often fail to model the excess variation driving the overdispersion, resulting in biased parameter estimates and standard errors. Quantifying and modelling overdispersion when it is present is therefore critical for robust biological inference. One means to account for overdispersion is to add an observation-level random effect (OLRE) to a model, where each data point receives a unique level of a random effect that can absorb the extra-parametric variation in the data. Although some studies have investigated the utility of OLRE for modelling overdispersion in Poisson count data, studies doing so for Binomial proportion data are scarce. Here I use a simulation approach to investigate the ability of both OLRE models and Beta-Binomial models to recover unbiased parameter estimates in mixed effects models of Binomial data under various degrees of overdispersion. In addition, as ecologists often fit random intercept terms to models when the random effect sample size is low (<5 levels), I investigate the performance of both model types under a range of random effect sample sizes when overdispersion is present. Simulation results revealed that the efficacy of OLRE depends on the process that generated the overdispersion: OLRE failed to cope with overdispersion generated from a Beta-Binomial mixture model, leading to biased slope and intercept estimates, but performed well for overdispersion generated by adding random noise to the linear predictor. Comparing parameter estimates from an OLRE model with those from its corresponding Beta-Binomial model readily identified, through disagreement between effect sizes, when OLRE were performing poorly; this strategy should be employed whenever OLRE are used for Binomial data, to assess their reliability. Beta-Binomial models performed well across all contexts, but showed a tendency to underestimate effect sizes when modelling non-Beta-Binomial data. Finally, both OLRE and Beta-Binomial models performed poorly when models contained <5 levels of the random intercept term, especially for estimating variance components, and this effect appeared independent of total sample size. These results suggest that OLRE are a useful tool for modelling overdispersion in Binomial data, but they do not perform well in all circumstances, and researchers should take care to verify the robustness of parameter estimates from OLRE models.
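As a minimal sketch of the comparison strategy described above, assuming a hypothetical data frame d with columns succ, fail, x, and group, the OLRE model can be fit with lme4 and its Beta-Binomial counterpart with glmmTMB, and their fixed-effect estimates compared directly:

```r
# Minimal sketch; d, succ, fail, x, and group are hypothetical names
library(lme4)
library(glmmTMB)

# OLRE: one random-intercept level per observation absorbs extra variation
d$obs_id <- factor(seq_len(nrow(d)))
m_olre <- glmer(cbind(succ, fail) ~ x + (1 | group) + (1 | obs_id),
                family = binomial, data = d)

# Corresponding Beta-Binomial model
m_bb <- glmmTMB(cbind(succ, fail) ~ x + (1 | group),
                family = betabinomial(link = "logit"), data = d)

# Marked disagreement in effect sizes flags an unreliable OLRE fit
cbind(OLRE = fixef(m_olre), BetaBinomial = fixef(m_bb)$cond)
```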
Project description: Allelic imbalance occurs when the two alleles of a gene are differentially expressed within a diploid organism and can indicate important differences in cis-regulation and epigenetic state across the two chromosomes. Because of this, the ability to accurately quantify the proportion at which each allele of a gene is expressed is of great interest to researchers. This becomes challenging in the presence of small read counts and/or sample sizes, which can cause estimators of allelic expression proportions to have high variance. Investigators have traditionally dealt with this problem by filtering out genes with small counts and samples. However, this may inadvertently remove important genes that have truly large allelic imbalances. Another option is to use pseudocounts or Bayesian estimators to reduce the variance. To this end, we evaluated the accuracy of four different estimators, the latter two of which are Bayesian shrinkage estimators: maximum likelihood, adding a pseudocount to each allele, approximate posterior estimation of GLM coefficients (apeglm), and adaptive shrinkage (ash). We also wrote C++ code to quickly calculate ML and apeglm estimates and integrated it into the apeglm package. The four methods were evaluated on two simulations and one real data set. Apeglm consistently performed better than ML according to a variety of criteria, and generally outperformed use of pseudocounts as well. Ash also performed better than ML in one of the simulations, but in the other its performance was more mixed. Finally, when compared with five other packages that also fit beta-binomial models, the apeglm package was substantially faster and more numerically reliable, making our package useful for quick and reliable analyses of allelic imbalance. Apeglm is available as an R/Bioconductor package at http://bioconductor.org/packages/apeglm.
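As a small illustration of why shrinkage helps, the sketch below (hypothetical counts, base R only) contrasts the ML estimator of the allelic proportion with a simple pseudocount estimator; apeglm and ash replace the fixed pseudocount with data-adaptive priors.

```r
# Hypothetical allele-specific counts: a = reads from allele A, n = total reads
a <- c(3, 0, 12, 5)
n <- c(10, 4, 20, 6)

# Maximum-likelihood estimator: unbiased but high variance at small n
p_ml <- a / n

# Pseudocount estimator: one extra read per allele shrinks estimates toward 0.5
p_pc <- (a + 1) / (n + 2)

cbind(p_ml, p_pc)
```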
Project description: Background/aims: This work is motivated by the HEALing Communities Study, which is a post-test only cluster randomized trial in which communities are randomized to two different trial arms. The primary interest is in reducing opioid overdose fatalities, which will be collected as a count outcome at the community level. Communities range in size from thousands to over one million residents, and fatalities are expected to be rare. Traditional marginal modeling approaches in the cluster randomized trial literature include the use of generalized estimating equations with an exchangeable correlation structure when utilizing subject-level data, or analogously quasi-likelihood based on an over-dispersed binomial variance when utilizing community-level data. These approaches account for and estimate the intra-cluster correlation coefficient, which should be provided in the results from a cluster randomized trial. Alternatively, the coefficient of variation or R coefficient could be reported. In this article, we show that negative binomial regression can also be utilized when communities are large and events are rare. The objectives of this article are (1) to show that the negative binomial regression approach targets the same marginal regression parameter(s) as an over-dispersed binomial model and to explain why the estimates may differ; (2) to derive formulas relating the negative binomial overdispersion parameter k to the intra-cluster correlation coefficient, coefficient of variation, and R coefficient; and (3) to analyze pre-intervention data from the HEALing Communities Study to demonstrate and contrast models and to show how to report the intra-cluster correlation coefficient, coefficient of variation, and R coefficient when utilizing negative binomial regression. Methods: Negative binomial and over-dispersed binomial regression modeling are contrasted in terms of model setup, regression parameter estimation, and formulation of the overdispersion parameter. Three specific models are used to illustrate concepts and address the third objective. Results: The negative binomial regression approach targets the same marginal regression parameter(s) as an over-dispersed binomial model, although estimates may differ. Practical differences arise in regard to how overdispersion, and hence the intra-cluster correlation coefficient, is modeled. The negative binomial overdispersion parameter is approximately equal to the ratio of the intra-cluster correlation coefficient and the marginal probability, to the square of the coefficient of variation, and to the R coefficient minus 1. As a result, estimates corresponding to all four of these different types of overdispersion parameterizations can be reported when utilizing negative binomial regression. Conclusion: Negative binomial regression provides a valid, practical alternative approach to the analysis of count data, and the corresponding reporting of overdispersion parameters, from community randomized trials in which communities are large and events are rare.
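As a minimal sketch of the negative binomial approach and the overdispersion conversions stated in the Results, assuming a hypothetical community-level data frame d with columns deaths, pop, and arm:

```r
# Minimal sketch; data frame d and its column names are hypothetical
library(MASS)

# Negative binomial regression with a log-population offset targets the same
# marginal rate parameters as an over-dispersed binomial model
m_nb <- glm.nb(deaths ~ arm + offset(log(pop)), data = d)

# glm.nb estimates theta, where the variance is mu + mu^2/theta, so k = 1/theta
k <- 1 / m_nb$theta

# Approximate conversions described above
p_hat <- exp(coef(m_nb)[1])  # marginal probability in the reference arm
icc   <- k * p_hat           # k ~ ICC / p  =>  ICC ~ k * p
cv2   <- k                   # k ~ CV^2
Rcoef <- k + 1               # k ~ R - 1
```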
Project description: Metagenomic sequencing data provide valuable resources for investigating the associations between the microbiome and host environmental/clinical factors and the dynamic changes of microbial abundance over time. The distinct properties of microbiome measurements include varied total sequence reads across samples, over-dispersion, and zero-inflation. Additionally, microbiome studies usually collect samples longitudinally, which introduces time dependence and correlation among the samples and thus further complicates the analysis and interpretation of microbiome count data. In this article, we propose negative binomial mixed models (NBMMs) for longitudinal microbiome studies. The proposed NBMMs can efficiently handle over-dispersion and varying total reads, and can account for the dynamic trend and correlation among longitudinal samples. We develop an efficient and stable algorithm to fit the NBMMs. We evaluate and demonstrate the NBMMs method via extensive simulation studies and application to a longitudinal microbiome dataset. The results show that the proposed method has desirable properties and outperforms the previously used methods in terms of its flexible framework for modeling correlation structures and detecting dynamic effects. We have developed an R package, NBZIMM, to implement the proposed method; it is freely available from the public GitHub repository http://github.com/nyiuab/NBZIMM and provides a useful tool for analyzing longitudinal microbiome data.
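The NBZIMM package implements the proposed NBMMs; purely as an illustrative stand-in with a widely known interface, a comparable negative binomial mixed model for a single taxon could be sketched with lme4 (hypothetical column names):

```r
# Illustrative stand-in, not the NBZIMM implementation itself; count, time,
# total_reads, and subject are hypothetical column names in data frame d
library(lme4)

# Offset handles varying total reads; random intercept captures
# within-subject correlation across longitudinal samples
m_nbmm <- glmer.nb(count ~ time + offset(log(total_reads)) + (1 | subject),
                   data = d)
summary(m_nbmm)  # the time coefficient captures the dynamic trend
```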
Project description: Background: Previous studies on the relationship of neighborhood disadvantage with alcohol use or misuse have often controlled for individual characteristics on the causal pathway, such as income, thus potentially underestimating the relationship between disadvantage and alcohol consumption. Methods: We used data from the Coronary Artery Risk Development in Young Adults study of 5115 adults aged 18-30 years at baseline who were interviewed 7 times between 1985 and 2006. We estimated marginal structural models using inverse probability-of-treatment and censoring weights to assess the association between point-in-time/cumulative exposure to neighborhood poverty (proportion of census tract residents living in poverty) and alcohol use/binging, after accounting for time-dependent confounders including income, education, and occupation. Results: A log-normal model was used to estimate treatment weights while accounting for the highly skewed continuous neighborhood poverty data. In the weighted model, a one-unit increase in neighborhood poverty at the prior examination was associated with an 86% increase in the odds of binging (OR = 1.86 [95% confidence interval = 1.14-3.03]); the estimate from a standard generalized-estimating-equations model controlling for baseline and time-varying covariates was 1.47 (0.96-2.25). The inverse probability-of-treatment and censoring weighted estimate of the relative increase in the number of weekly drinks in the past year associated with cumulative neighborhood poverty was 1.53 (1.02-2.27); the estimate from a standard model was 1.16 (0.83-1.62). Conclusions: Cumulative and point-in-time measures of neighborhood poverty are important predictors of alcohol consumption. Estimators that more closely approximate a causal effect of neighborhood poverty on alcohol use provided stronger estimates than those from traditional regression models.
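A minimal sketch of the weighting step, assuming a hypothetical data frame d with exposure poverty, time-varying confounders income and educ, and outcome binge; the log-normal treatment model is approximated by modelling log(poverty) with normal errors (the Jacobian of the log transform cancels in the stabilized-weight ratio):

```r
# Stabilized inverse probability-of-treatment weights; all names hypothetical
num_fit <- lm(log(poverty) ~ 1, data = d)               # numerator model
den_fit <- lm(log(poverty) ~ income + educ, data = d)   # denominator model

num <- dnorm(log(d$poverty), fitted(num_fit), sigma(num_fit))
den <- dnorm(log(d$poverty), fitted(den_fit), sigma(den_fit))
d$sw <- num / den

# Weighted outcome model approximates the marginal structural model for binging
m_msm <- glm(binge ~ poverty, family = binomial, data = d, weights = sw)
exp(coef(m_msm)["poverty"])  # odds ratio per one-unit increase in poverty
```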
Project description: Background: Log-binomial and robust (modified) Poisson regression models are popular approaches for estimating risk ratios for binary response variables. Previous studies have shown that the two produce similar point estimates and standard errors. However, their performance under model misspecification is poorly understood. Methods: In this simulation study, the statistical performance of the two models was compared when the log link function was misspecified or the response depended on predictors through a non-linear relationship (i.e. truncated response). Results: Point estimates from log-binomial models were biased when the link function was misspecified or when the probability distribution of the response variable was truncated at the right tail. The percentage of truncated observations was positively associated with the presence of bias, and the bias was larger when the observations came from a population with a lower response rate, holding the other parameters under examination fixed. In contrast, point estimates from the robust Poisson models were unbiased. Conclusion: Under model misspecification, the robust Poisson model is generally preferable because it provides unbiased estimates of risk ratios.
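As a minimal sketch of the two estimators on a hypothetical data frame d with binary outcome y and predictor x (robust standard errors via the sandwich and lmtest packages):

```r
# Minimal sketch; d, y, and x are hypothetical
library(sandwich)
library(lmtest)

# Log-binomial model: risk ratios directly, but sensitive to link misspecification
m_lb <- glm(y ~ x, family = binomial(link = "log"), data = d)

# Robust (modified) Poisson model: same target, with sandwich standard errors
m_rp <- glm(y ~ x, family = poisson(link = "log"), data = d)
coeftest(m_rp, vcov = vcovHC(m_rp, type = "HC0"))

# Exponentiated coefficients are the estimated risk ratios
exp(cbind(log_binomial = coef(m_lb), robust_poisson = coef(m_rp)))
```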
Project description: This tutorial describes single-step low-dimensional simultaneous inference, with a focus on the availability of adjusted p values and compatible confidence intervals for more than just the usual mean-value comparisons. The basic idea is, first, to exploit the influence of correlation on the quantile of the multivariate t-distribution: the higher the correlation, the less conservative the quantile. Second, the correlation matrix can be estimated with the multiple marginal models (mmm) approach, which combines multiple models from the class of linear models up to generalized linear mixed models. The underlying maxT-test using mmm is discussed by means of several real data scenarios using selected R packages. A surprising variety of features is highlighted, among them: (i) analyzing different-scaled, correlated, multiple endpoints, (ii) analyzing multiple correlated binary endpoints, (iii) modeling dose as a qualitative factor and/or quantitative covariate, (iv) joint consideration of several tuning parameters within the poly-k trend test, (v) joint testing of dose and time, (vi) considering several effect sizes, (vii) joint testing of subgroups and the overall population in multiarm randomized clinical trials with correlated primary endpoints, (viii) multiple linear mixed effect models, (ix) generalized estimating equations, and (x) nonlinear regression models.
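A minimal sketch of feature (i), simultaneous inference for a dose effect on two different-scaled endpoints, assuming a hypothetical data frame d with a binary endpoint y1, a continuous endpoint y2, and a quantitative covariate dose, using the multcomp package's mmm() and mlf():

```r
# Minimal sketch; d, y1, y2, and dose are hypothetical
library(multcomp)

m1 <- glm(y1 ~ dose, family = binomial, data = d)  # binary endpoint
m2 <- lm(y2 ~ dose, data = d)                      # continuous endpoint

# mmm() stacks the marginal models; the estimated correlation between the two
# dose estimates feeds the multivariate t quantile of the maxT-test
g <- glht(mmm(binary = m1, continuous = m2), mlf("dose = 0"))
summary(g)   # single-step adjusted p values
confint(g)   # compatible simultaneous confidence intervals
```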
Project description: It is of great interest for a biomedical analyst or an investigator to correctly model the CD4 cell count or disease biomarkers of a patient in the presence of covariates or factors determining the disease progression over time. Poisson mixed-effects models (PMM) can be an appropriate choice for repeated count data. However, this model is often unrealistic because of its restriction that the mean and variance are equal. The PMM can therefore be replaced by a negative binomial mixed-effects model (NBMM), which effectively manages the over-dispersion of the longitudinal data. We evaluate and compare the proposed models and their application to the number of CD4 cells of HIV-infected patients recruited in the CAPRISA 002 Acute Infection Study. The results show that the NBMM has appropriate properties and outperforms the PMM in terms of handling over-dispersion of the data. Multiple imputation techniques are also used to handle missing values in the dataset in order to obtain valid inferences for parameter estimates. In addition, the results imply that baseline BMI, HAART initiation, baseline viral load, and the number of sexual partners were significantly associated with the patient's CD4 count in both fitted models. A comparison and discussion of the results of the fitted models, together with conclusions, complete the study.
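A minimal sketch of the model comparison, assuming a hypothetical long-format data frame d with columns cd4, time, and patient:

```r
# Minimal sketch; d, cd4, time, and patient are hypothetical
library(lme4)

m_pmm  <- glmer(cd4 ~ time + (1 | patient), family = poisson, data = d)
m_nbmm <- glmer.nb(cd4 ~ time + (1 | patient), data = d)

# With over-dispersed counts, information criteria typically favour the NBMM
AIC(m_pmm, m_nbmm)
```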
Project description: Background: Epidemiologists often analyse binary outcomes in cohort and cross-sectional studies using multivariable logistic regression models, yielding estimates of adjusted odds ratios. It is widely known that the odds ratio closely approximates the risk or prevalence ratio when the outcome is rare, but not when the outcome is common. Consequently, investigators may decide to directly estimate the risk or prevalence ratio using a log binomial regression model. Methods: We describe the use of a marginal structural binomial regression model to estimate standardized risk or prevalence ratios and differences. We illustrate the proposed approach using data from a cohort study of coronary heart disease status in Evans County, Georgia, USA. Results: The approach reduces the problems with model convergence typical of log binomial regression by shifting all explanatory variables except the exposures of primary interest from the linear predictor of the outcome regression model to a model for the standardization weights. The approach also facilitates evaluation of departures from additivity in the joint effects of two exposures. Conclusions: Epidemiologists should consider reporting standardized risk or prevalence ratios and differences in cohort and cross-sectional studies. These are readily obtained using the SAS, Stata, and R statistical software packages. The proposed approach estimates the exposure effect in the total population.
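A minimal sketch of the weighting idea, assuming a hypothetical data frame d with binary exposure expo, binary outcome chd, and covariates age and smk:

```r
# Minimal sketch; all column names are hypothetical
ps_fit <- glm(expo ~ age + smk, family = binomial, data = d)
ps <- fitted(ps_fit)
d$w <- ifelse(d$expo == 1, 1 / ps, 1 / (1 - ps))

# Weighted log-binomial model keeps only the exposure in the linear predictor;
# moving covariates into the weight model eases the usual convergence problems
# (R warns about non-integer weights; the point estimate is still valid)
m_std <- glm(chd ~ expo, family = binomial(link = "log"), data = d, weights = w)
exp(coef(m_std)["expo"])  # standardized risk ratio

# Standardized risks and risk difference from weighted means
p1 <- with(d, weighted.mean(chd[expo == 1], w[expo == 1]))
p0 <- with(d, weighted.mean(chd[expo == 0], w[expo == 0]))
p1 - p0
```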
Project description: Observational data are increasingly used with the aim of estimating causal effects of treatments, through careful control for confounding. Marginal structural models estimated using inverse probability weighting (MSM-IPW), like other methods to control for confounding, assume that confounding variables are measured without error. The average treatment effect in an MSM-IPW may, however, be biased when a confounding variable is error prone. Using the potential outcomes framework, we derive expressions for the bias due to confounder misclassification in analyses that aim to estimate the average treatment effect with an MSM-IPW, and we compare this bias with the bias due to confounder misclassification in analyses based on a conditional regression model. The focus is on a point-treatment study with a continuous outcome. Compared with the bias in the average treatment effect in a conditional model, the bias in an MSM-IPW can differ in magnitude but is equal in sign. We also use a simulation study to investigate the finite sample performance of MSM-IPW and conditional models when a confounding variable is misclassified. Simulation results indicate that confidence intervals for the treatment effect obtained from MSM-IPW are generally wider, and coverage of the true treatment effect is higher, compared with a conditional model, ranging from overcoverage if there is no confounder misclassification to undercoverage when there is confounder misclassification. Further, we illustrate, in a study of blood pressure-lowering therapy, how the bias expressions can be used to inform a quantitative bias analysis of the impact of confounder misclassification, supported by an online tool.
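A minimal simulation sketch of the phenomenon (all parameter values hypothetical): a binary confounder L drives treatment, but the weights are built from a misclassified copy, leaving residual confounding in the MSM-IPW estimate:

```r
# Minimal simulation sketch; all values are hypothetical
set.seed(1)
n  <- 1e5
L  <- rbinom(n, 1, 0.4)                         # true confounder
Ls <- ifelse(rbinom(n, 1, 0.9) == 1, L, 1 - L)  # misclassified copy (10% error)
A  <- rbinom(n, 1, plogis(-0.5 + L))            # treatment depends on true L
Y  <- 1 + 2 * A + 1.5 * L + rnorm(n)            # true average treatment effect = 2

# IPW built from the misclassified confounder
ps <- fitted(glm(A ~ Ls, family = binomial))
w  <- ifelse(A == 1, 1 / ps, 1 / (1 - ps))
coef(lm(Y ~ A, weights = w))["A"]               # biased away from 2
```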