Project description: Inverse probability weights are increasingly used in epidemiological analyses, and the estimation and application of weights to address a single bias are well discussed in the literature. Weights to address multiple biases simultaneously (i.e., a combination of weights) have been discussed almost exclusively in the context of marginal structural models in longitudinal settings, where treatment weights (estimated first) are combined with censoring weights (estimated second). In this work, we examine two examples of combined weights for confounding and missingness in a time-fixed setting in which outcome or confounder data are missing and the estimand is the marginal expectation of the outcome under a time-fixed treatment. We discuss the identification conditions, the construction of combined weights, and how assumptions about the missing data mechanisms affect this construction. We use a simulation to illustrate the estimation and application of the weights in the two examples. Notably, when only outcome data are missing, construction of combined weights is straightforward; however, when confounder data are missing, we show that in general we must follow a specific estimation procedure: first estimate the missingness weights, then estimate the treatment probabilities from the data with the missingness weights applied. However, if treatment and missingness are conditionally independent, the treatment probabilities can instead be estimated among the complete cases.
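A minimal sketch in R of the ordered procedure described above, for the missing-confounder case. The variable names (treatment A, confounder L, outcome Y, observation indicator R) and the data-generating values are illustrative assumptions, not the authors' simulation design.

```r
## Hypothetical illustration: combined weights when confounder data are missing.
set.seed(42)
n <- 5000
L <- rnorm(n)                              # confounder
A <- rbinom(n, 1, plogis(0.5 * L))         # treatment depends on confounder
Y <- rnorm(n, mean = A + L)                # outcome
R <- rbinom(n, 1, plogis(1 + 0.5 * A))     # R = 1: confounder observed
dat <- data.frame(A, L = ifelse(R == 1, L, NA), Y, R)

## Step 1: missingness weights from a model for P(R = 1 | A, Y)
m_fit <- glm(R ~ A + Y, family = binomial, data = dat)
dat$w_R <- 1 / predict(m_fit, type = "response")

## Step 2: treatment probabilities fit on the complete cases, weighted by
## the missingness weights (quasibinomial avoids the non-integer warning)
cc <- subset(dat, R == 1)
t_fit <- glm(A ~ L, family = quasibinomial, data = cc, weights = w_R)
cc$pA  <- predict(t_fit, type = "response")
cc$w_A <- ifelse(cc$A == 1, 1 / cc$pA, 1 / (1 - cc$pA))

## Step 3: combined weights -> weighted outcome means under each treatment
cc$w <- cc$w_R * cc$w_A
with(subset(cc, A == 1), weighted.mean(Y, w))   # estimate of E[Y^(a=1)]
with(subset(cc, A == 0), weighted.mean(Y, w))   # estimate of E[Y^(a=0)]
```

If treatment and missingness were conditionally independent, step 2 could drop the `weights = w_R` argument and fit the treatment model directly on the complete cases.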
Project description: A primary goal of longitudinal studies is to examine trends over time. Reported results from these studies often depend on strong, unverifiable assumptions about the missing data. Although the risk of substantial bias from missing data is widely known, analyses exploring missing-data influences are commonly done either ad hoc or not at all. This article outlines one of the three primary recognized approaches for examining missing-data effects that could be more widely used, the shared-parameter model (SPM), and explains its purpose, use, limitations, and extensions. We additionally provide synthetic data and reproducible research code for running SPMs in the SAS, Stata, and R programming languages to facilitate their use in practice and for teaching purposes in epidemiology, biostatistics, data science, and related fields. Our goals are to increase understanding and use of these methods by providing an introduction to the concepts and access to helpful tools.
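This is not the authors' provided code; as one hedged sketch of the idea in R, a shared-parameter model can be fit with the JM package, which links a mixed-effects model for the longitudinal outcome and a survival model for dropout through shared random effects. The dataset names (`dat`, `surv_dat`) and variables (`y`, `time`, `id`, `w1`, `drop_time`, `dropped`) are assumptions for illustration.

```r
## Sketch of a shared-parameter model, assuming the JM package and
## hypothetical long-format data `dat` plus one-row-per-subject `surv_dat`.
library(nlme)
library(survival)
library(JM)

# Longitudinal submodel: linear time trend, random intercept and slope
lme_fit <- lme(y ~ time, random = ~ time | id, data = dat)

# Dropout submodel: time to dropout with a baseline covariate w1
# (x = TRUE is required by jointModel)
cox_fit <- coxph(Surv(drop_time, dropped) ~ w1, data = surv_dat, x = TRUE)

# Joint fit: shared random effects tie the dropout hazard to each
# subject's latent trajectory, the defining feature of an SPM
spm_fit <- jointModel(lme_fit, cox_fit, timeVar = "time")
summary(spm_fit)
```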
Project description: Motivation: Metabolomics data generated from liquid chromatography-mass spectrometry platforms often contain missing values. Existing imputation methods do not consider underlying feature relations or metabolic network information, so the imputation results may not be optimal. Results: We propose an imputation algorithm that incorporates the existing metabolic network, adduct-ion relations (even for unknown compounds), and linear and nonlinear associations between feature intensities to build a feature-level network. The algorithm uses support vector regression to impute missing values based on a feature's neighborhood on the network. We compared the proposed method with widely used methods. As judged by the normalized root mean squared error in real-data-based simulations, the proposed method achieves better accuracy. Availability and implementation: The R package is available at http://web1.sph.emory.edu/users/tyu8/MINMA. Contact: jiankang@umich.edu or tianwei.yu@emory.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
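The following is not the MINMA implementation itself, only a hedged sketch of the core step: imputing a feature's missing intensities with support vector regression on its network neighbours, using e1071. The matrix `X` (samples by features) and the `neighbours` index vector are hypothetical.

```r
## Sketch: SVR imputation of feature j from its network neighbourhood.
## Assumes neighbour features are fully observed, for simplicity.
library(e1071)

impute_feature <- function(X, j, neighbours) {
  miss <- is.na(X[, j])
  if (!any(miss)) return(X)
  train <- data.frame(y = X[!miss, j], X[!miss, neighbours, drop = FALSE])
  test  <- data.frame(X[miss, neighbours, drop = FALSE])
  names(test) <- names(train)[-1]
  fit <- svm(y ~ ., data = train)            # epsilon-SVR by default
  X[miss, j] <- predict(fit, newdata = test)
  X
}
```

In the published algorithm the neighbourhood comes from the metabolic-network and adduct-relation graph; here any index set stands in for it.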
Project description: BACKGROUND: Contact tracing data from the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic are used to estimate basic epidemiological parameters. Contact tracing data could also be used to assess the heterogeneity of transmission at the individual patient level. Characterizing individuals by their level of infectiousness could better inform contact tracing interventions in the field. METHODS: Standard social network analysis methods for exploring infectious disease transmission dynamics were employed to analyze contact tracing data of 1959 diagnosed SARS-CoV-2 patients from a large state of India. A relational network dataset was created with diagnosed patients as "nodes" and their epidemiological contacts as "edges". A directed network perspective was used, in which the direction of infection runs from a "source patient" towards a "target patient". The network measures "degree centrality" and "betweenness centrality" were calculated to identify influential patients in the transmission of infection. Component analysis was conducted to identify patients connected as sub-groups. Descriptive statistics were used to summarise the network measures, and percentile ranks were used to categorize influencers. RESULTS: Out-degree centrality measures identified that, of the 1959 patients, 11.27% (221) acted as a source of infection for 40.19% (787) of the other patients. Among these source patients, 0.65% (12) had a high out-degree centrality (≥10) and collectively infected 37.61% (296 of 787) of the secondary patients. Betweenness centrality measures highlighted that 7.50% (93) of patients had a non-zero betweenness (range 0.5 to 135) and thus bridged transmission between other patients. Network component analysis identified nineteen connected components comprising influential patients, which together accounted for 26.95% of the 1959 patients and 68.74% of the epidemiological contacts in the network. CONCLUSIONS: Social network analysis of SARS-CoV-2 contact tracing data is useful for measuring individual patient-level variation in disease transmission. The network metrics identified individual patients and patient components that disproportionately contributed to transmission. The network measures and graphical tools could complement existing contact tracing indicators and could help improve contact tracing activities.
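A short sketch of the network measures described above, using the igraph package. The edge list is a made-up toy example standing in for the real contact-tracing data; the percentile cut-off for flagging influencers is also an assumption.

```r
## Directed contact network: edges run from source patient to target patient.
library(igraph)

edges <- data.frame(source = c("P1", "P1", "P2", "P3"),
                    target = c("P2", "P3", "P4", "P5"))
g <- graph_from_data_frame(edges, directed = TRUE)

out_deg <- degree(g, mode = "out")          # how many patients each case infected
btw     <- betweenness(g, directed = TRUE)  # bridging role in transmission chains
comp    <- components(g, mode = "weak")     # connected sub-groups of patients

# Flag influencers, e.g. patients in the top percentile of out-degree
influencers <- names(out_deg)[out_deg >= quantile(out_deg, 0.99)]
```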
Project description: Mathematical models are often used to explore network-driven cellular processes from a systems perspective. However, a dearth of quantitative data suitable for model calibration leads to models with parameter unidentifiability and questionable predictive power. Here we introduce a combined Bayesian and Machine Learning Measurement Model approach to explore how quantitative and non-quantitative data constrain models of apoptosis execution within a missing data context. We find that model prediction accuracy and certainty strongly depend on rigorous, data-driven formulations of the measurement and on the size and make-up of the datasets. For instance, two orders of magnitude more ordinal (e.g., immunoblot) data are necessary to achieve accuracy comparable to quantitative (e.g., fluorescence) data for calibration of an apoptosis execution model. Notably, ordinal and nominal (e.g., cell fate observations) non-quantitative data synergize to reduce model uncertainty and improve accuracy. Finally, we demonstrate the potential of a data-driven Measurement Model approach to identify model features that could lead to informative experimental measurements and improve model predictive power.
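As one hedged illustration of what a measurement model for ordinal data can look like (not the authors' formulation), the sketch below maps a model-simulated continuous quantity to immunoblot-style ordinal categories through latent cutpoints and scores the observations with an ordered-probit likelihood. All values, including the cutpoints and noise scale, are illustrative assumptions.

```r
## Ordinal measurement model sketch: P(category k | simulated value) is the
## probability mass between consecutive cutpoints under Gaussian noise.
ordinal_loglik <- function(sim, obs_cat, cutpoints, sigma = 0.2) {
  cuts <- c(-Inf, cutpoints, Inf)
  p <- pnorm((cuts[obs_cat + 1] - sim) / sigma) -
       pnorm((cuts[obs_cat]     - sim) / sigma)
  sum(log(pmax(p, 1e-12)))   # guard against log(0)
}

sim_values <- c(0.1, 0.4, 0.8)   # model-simulated values at measured times
obs_cats   <- c(1, 2, 3)         # ordinal immunoblot readings (1 = low)
ordinal_loglik(sim_values, obs_cats, cutpoints = c(0.25, 0.6))
```

A term like this can enter a Bayesian calibration alongside quantitative likelihood terms, which is the general sense in which non-quantitative data constrain the model.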
Project description: Background: The potential value of large-scale datasets is constrained by the ubiquitous problem of missing data, arising in either a structured or unstructured fashion. When imputation methods are proposed for large-scale data, one limitation is the simplicity of existing evaluation methods. Specifically, most evaluations create synthetic data with only a simple, unstructured missing-data mechanism, which does not resemble the missing-data patterns found in real data. For example, in the UK Biobank, missing data tend to appear in blocks, because non-participation in one of the sub-studies leads to missingness for all sub-study variables. Methods: We propose a tool for generating mixed-type missing data that mimics key properties of a given real large-scale epidemiological dataset, with both structured and unstructured missingness, while accounting for informative missingness. The process involves identifying sub-studies using hierarchical clustering of missingness patterns and modelling the dependence between inter-variable correlation and co-missingness patterns. Results: On the UK Biobank brain imaging cohort, we identify several large blocks of missing data. We demonstrate the use of our tool for evaluating several imputation methods, showing modest accuracy of imputation overall, with iterative imputation having the best performance. We compare our evaluations based on synthetic data to an exemplar study that includes variable selection on a single real imputed dataset, finding only small differences between the imputation methods, though with iterative imputation leading to the most informative selection of variables. Conclusions: We have created a framework for simulating large-scale data that captures the complexities of inter-variable dependence as well as structured and unstructured informative missingness. Evaluations using this framework highlight the immense challenge of data imputation in this setting and the need for improved missing-data methods.
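A minimal sketch of the block-missingness idea, not the full tool: cluster variables by their co-missingness patterns, then knock out whole blocks per simulated subject to mimic sub-study non-participation. The data frame `dat`, the number of blocks `k = 3`, and the skip probability are assumptions.

```r
## Identify "sub-study" blocks from co-missingness, then simulate
## structured missingness by dropping whole blocks per subject.
M  <- is.na(dat)                              # missingness indicator matrix
d  <- dist(t(M), method = "binary")           # co-missingness distance, per variable
hc <- hclust(d, method = "average")
blocks <- cutree(hc, k = 3)                   # assumed number of blocks

simulate_block_missing <- function(complete_dat, blocks, p_skip = 0.2) {
  out <- complete_dat
  for (b in unique(blocks)) {
    skip <- runif(nrow(out)) < p_skip         # subjects skipping this sub-study
    out[skip, blocks == b] <- NA
  }
  out
}
```

The published framework additionally models informative missingness and inter-variable correlation; here the skip indicator is independent of the data purely to keep the sketch short.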
Project description: In dementia research it is often of interest to study relationships among regional brain measures; however, it is often necessary to adjust for covariates. Partial correlations are frequently used to correlate two variables while adjusting for other variables. Complete case analysis is typically the analysis of choice for partial correlations with missing data. However, complete case analysis will lead to biased and inefficient results when the data are missing at random. We extended the partial correlation coefficient to the presence of missing data using the expectation-maximization (EM) algorithm and compared it with a multiple imputation method and complete case analysis in simulation studies. The EM approach performed the best of all methods, with multiple imputation performing almost as well. These methods are illustrated with regional imaging data from an Alzheimer's disease study.
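One way to sketch the EM-based estimate in R (assuming, rather than reproducing, the authors' implementation) is to obtain the EM estimate of the covariance matrix with the classical norm package and convert its inverse to partial correlations. `X` stands for a hypothetical numeric matrix of regional brain measures plus covariates, with NAs.

```r
## EM covariance under multivariate normality, then partial correlations
## from the precision matrix: pcor(i,j) = -P[i,j] / sqrt(P[i,i] * P[j,j]).
library(norm)

s        <- prelim.norm(X)        # preprocess the missing-data pattern
thetahat <- em.norm(s)            # EM estimates of the mean and covariance
params   <- getparam.norm(s, thetahat)

P <- solve(params$sigma)                       # precision matrix
partial_cor <- -P / sqrt(diag(P) %o% diag(P))  # all pairwise partial correlations
diag(partial_cor) <- 1
partial_cor[1, 2]  # e.g. regions 1 and 2, adjusted for the remaining columns
```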
Project description: Appropriate handling of aggregate missing outcome data is necessary to minimise bias in the conclusions of systematic reviews. The two-stage pattern-mixture model has already been proposed to address aggregate missing continuous outcome data. While this approach is more appropriate than excluding missing continuous outcome data or using simple imputation methods, it does not offer flexible modelling of missing continuous outcome data with which to thoroughly investigate their implications for the conclusions. We therefore propose a one-stage pattern-mixture model approach under the Bayesian framework to address missing continuous outcome data in a network of interventions and to gain knowledge about the missingness process in different trials and interventions. We extend the hierarchical network meta-analysis model for one aggregate continuous outcome to incorporate a missingness parameter that measures departure from the missing-at-random assumption. We consider various effect size estimates for continuous data and two informative missingness parameters: the informative missingness difference of means and the informative missingness ratio of means. We incorporate our prior beliefs about the missingness parameters while allowing for several prior structures to account for the fact that the missingness process may differ across the network. The method is exemplified in two networks from published reviews comprising different amounts of missing continuous outcome data.
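An illustrative sketch (not the authors' full one-stage Bayesian model) of the pattern-mixture adjustment behind the informative missingness difference of means (IMDoM): the missing-participant mean is taken to equal the observed mean plus a parameter delta, so the arm mean becomes a mixture of the two. All numbers below are made up.

```r
## Pattern-mixture adjusted arm mean:
##   mu_adj = p_obs * mean_obs + (1 - p_obs) * (mean_obs + delta)
pm_adjusted_mean <- function(mean_obs, n_obs, n_miss, delta) {
  p_obs <- n_obs / (n_obs + n_miss)
  p_obs * mean_obs + (1 - p_obs) * (mean_obs + delta)
}

# Encode a prior belief delta ~ N(0, 1) and propagate it by Monte Carlo,
# mimicking how the Bayesian model carries missingness uncertainty forward
delta_draws <- rnorm(10000, mean = 0, sd = 1)
adj_draws   <- pm_adjusted_mean(mean_obs = 5.2, n_obs = 80, n_miss = 20,
                                delta = delta_draws)
quantile(adj_draws, c(0.025, 0.5, 0.975))
```

Setting delta to 0 recovers the missing-at-random analysis; widening its prior widens the interval around the adjusted mean, which is exactly the sensitivity the one-stage model formalises across trials and interventions.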
Project description: Secondary respondent data are underutilized because researchers avoid using these data in the presence of substantial missingness. We reviewed, critically evaluated, and tested potential solutions to this problem. Five strategies for dealing with missing partner data are reviewed: complete case analysis, inverse probability weighting, correction with a Heckman selection model, maximum likelihood estimation, and multiple imputation. Two approaches were used to evaluate the performance of these methods. First, we used data from the National Survey of Fertility Barriers (N = 1,666) to estimate a model predicting marital quality from characteristics of women and their husbands. Second, we conducted a simulation based on these data testing the five methods and compared the results with estimates for which the true value was known. We found that the maximum likelihood and multiple imputation methods were advantageous because they allow researchers to utilize all of the available information and produce less biased and more efficient estimates.
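As a sketch of two of the five strategies, the snippet below contrasts complete case analysis with multiple imputation via the mice package. The data frame `nsfb` and its variables (`marital_quality`, `wife_edu`, `husb_edu`) are hypothetical stand-ins for the survey data.

```r
## Complete case analysis versus multiple imputation for missing partner data.
library(mice)

# Complete cases: rows with any missing husband data are simply dropped
cc_fit <- lm(marital_quality ~ wife_edu + husb_edu, data = na.omit(nsfb))

# Multiple imputation: m = 20 completed datasets, model fit in each, then
# pooled with Rubin's rules so the extra imputation uncertainty is retained
imp    <- mice(nsfb, m = 20, printFlag = FALSE)
fits   <- with(imp, lm(marital_quality ~ wife_edu + husb_edu))
summary(pool(fits))
```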
Project description: Background: The availability of linked biomedical and social science data has risen dramatically in past decades, facilitating holistic and systems-based analyses. Among these, Bayesian networks have great potential for tackling complex interdisciplinary problems, because they can easily model inter-relations between variables. They work by encoding conditional independence relationships discovered via advanced inference algorithms. One challenge is dealing with missing data, ubiquitous in survey and biomedical datasets. Missing data are rarely addressed in an advanced way in Bayesian networks; the most common approach is to discard all samples containing missing measurements, which can lead to biased estimates. Here, we examine how Bayesian network structure learning can incorporate missing data. Methods: We use a simulation approach to compare a method commonly used in frequentist statistics, multiple imputation by chained equations (MICE), with one specific to Bayesian network learning, structural expectation-maximization (SEM). We simulate multiple incomplete categorical (discrete) datasets with different missingness mechanisms, numbers of variables, amounts of data, and missingness proportions. We evaluate the performance of MICE and SEM in capturing network structure. We then apply SEM combined with community analysis to a real-world dataset of linked biomedical and social data to investigate associations between socio-demographic factors and multiple chronic conditions in the US elderly population. Results: We find that applying either method (MICE or SEM) provides better structure recovery than doing nothing, and SEM in general outperforms MICE. This finding is robust across missingness mechanisms, numbers of variables, amounts of data, and missingness proportions. We also find that data imputed by SEM are more accurate than data imputed by MICE. Our real-world application recovers known inter-relationships among socio-demographic factors and common multimorbidities. The network analysis also highlights potential areas of investigation, such as links between cancer and cognitive impairment and a disconnect between self-assessed memory decline and standard measures of cognitive impairment. Conclusion: Our simulation results suggest that taking advantage of the additional information provided by the network structure during SEM improves the performance of Bayesian networks; this might be especially useful for social science and other interdisciplinary analyses. Our case study shows that comorbidities of different diseases interact with each other and are closely associated with socio-demographic factors.
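A hedged sketch of the two strategies being compared, for a hypothetical discrete data frame `dat` (all columns factors) containing NAs: impute-then-learn with mice followed by hill climbing, versus bnlearn's structural EM, which interleaves imputation and structure learning. This is a simplified rendering of the comparison, not the study's simulation protocol.

```r
## MICE + hill climbing versus structural EM for Bayesian network learning.
library(mice)
library(bnlearn)

# Strategy 1: impute with chained equations, then learn a structure from one
# completed dataset (in practice several could be learned and compared)
imp      <- mice(dat, m = 1, printFlag = FALSE)
dag_mice <- hc(complete(imp))

# Strategy 2: structural EM learns the structure and the expected sufficient
# statistics jointly, using the partially observed data directly
dag_sem <- structural.em(dat, maximize = "hc")

# Compare the recovered structures, e.g. by their arc sets
all.equal(arcs(dag_mice), arcs(dag_sem))
```

In a simulation like the one described above, each learned arc set would be scored against the known true network (e.g., structural Hamming distance) across missingness mechanisms and proportions.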