Case study in evaluating time series prediction models using the relative mean absolute error.
ABSTRACT: Statistical prediction models inform decision-making processes in many real-world settings. Prior to using predictions in practice, one must rigorously test and validate candidate models to ensure that the proposed predictions are sufficiently accurate to support those decisions. In this paper, we present a framework for evaluating time series predictions that emphasizes computational simplicity and an intuitive interpretation using the relative mean absolute error metric. For a single time series, this metric enables comparisons of candidate model predictions against naïve reference models, a method that can provide useful and standardized performance benchmarks. Additionally, in applications with multiple time series, this framework facilitates comparisons of one or more models' predictive performance across different sets of data. We illustrate the use of this metric with a case study comparing predictions of dengue hemorrhagic fever incidence in two provinces of Thailand. This example demonstrates the utility and interpretability of the relative mean absolute error metric in practice, and underscores the practical advantages of using relative performance metrics when evaluating predictions.
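The relative mean absolute error is simply the MAE of the candidate model divided by the MAE of a reference model computed on the same observations, so values below one favor the candidate. Below is a minimal sketch of that calculation, assuming a last-observation-carried-forward naïve reference and an illustrative moving-average candidate; the series and both forecasters are hypothetical, not the models from the case study.

import numpy as np

def relative_mae(y_true, y_model, y_reference):
    # Relative MAE: candidate-model MAE divided by reference-model MAE.
    # Values below 1 mean the candidate beats the reference forecasts.
    mae_model = np.mean(np.abs(y_true - y_model))
    mae_reference = np.mean(np.abs(y_true - y_reference))
    return mae_model / mae_reference

# Hypothetical monthly incidence series.
rng = np.random.default_rng(0)
y = 50 + 20 * np.sin(2 * np.pi * np.arange(48) / 12) + rng.normal(0, 5, 48)

# Naive reference: forecast each month with the previous month's observation.
naive_forecast = y[:-1]
# Illustrative candidate: forecast each month with the mean of the previous
# three observations (a crude moving-average model).
model_forecast = np.array([y[max(0, t - 3):t].mean() for t in range(1, len(y))])

print(relative_mae(y[1:], model_forecast, naive_forecast))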
Project description: BACKGROUND AND PURPOSE: MR imaging-based modeling of tumor cell density can substantially improve targeted treatment of glioblastoma. Unfortunately, interpatient variability limits the predictive ability of many modeling approaches. We present a transfer learning method that generates individualized patient models, grounded in the wealth of population data, while also detecting and adjusting for interpatient variabilities based on each patient's own histologic data. MATERIALS AND METHODS: We recruited patients with primary glioblastoma undergoing image-guided biopsies and preoperative imaging, including contrast-enhanced MR imaging, dynamic susceptibility contrast MR imaging, and diffusion tensor imaging. We calculated relative cerebral blood volume from DSC-MR imaging and mean diffusivity and fractional anisotropy from DTI. Following image coregistration, we assessed tumor cell density for each biopsy and identified corresponding localized MR imaging measurements. We then explored a range of univariate and multivariate predictive models of tumor cell density based on MR imaging measurements in a generalized one-model-fits-all approach. We then implemented both univariate and multivariate individualized transfer learning predictive models, which harness the available population-level data but allow individual variability in their predictions. Finally, we compared Pearson correlation coefficients and mean absolute error between the individualized transfer learning and generalized one-model-fits-all models. RESULTS: Tumor cell density significantly correlated with relative CBV (r = 0.33, P < .001) and T1-weighted postcontrast (r = 0.36, P < .001) on univariate analysis after correcting for multiple comparisons. With single-variable modeling (using relative CBV), transfer learning increased predictive performance (r = 0.53, mean absolute error = 15.19%) compared with one-model-fits-all (r = 0.27, mean absolute error = 17.79%). With multivariate modeling, transfer learning further improved performance (r = 0.88, mean absolute error = 5.66%) compared with one-model-fits-all (r = 0.39, mean absolute error = 16.55%). CONCLUSIONS: Transfer learning significantly improves predictive modeling performance for quantifying tumor cell density in glioblastoma.
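The contrast between a pooled, one-model-fits-all regression and an individualized model that borrows strength from population data can be illustrated with a toy example. The sketch below is not the authors' implementation: it simply shrinks each simulated patient's own regression toward pooled coefficients (a ridge-style stand-in for the transfer learning step) and compares Pearson r and MAE; all data, parameters and the shrinkage weight are hypothetical.

import numpy as np

rng = np.random.default_rng(1)

# Simulated data: per-patient linear relationships between one MR feature
# (think relative CBV) and tumor cell density, with interpatient variability.
n_patients, n_biopsies = 20, 6
X, y, patient = [], [], []
for p in range(n_patients):
    slope = 30.0 + rng.normal(0, 10)       # patient-specific slope
    intercept = 20.0 + rng.normal(0, 8)    # patient-specific intercept
    x = rng.uniform(0.5, 3.0, n_biopsies)
    X.append(x)
    y.append(intercept + slope * x + rng.normal(0, 3, n_biopsies))
    patient.append(np.full(n_biopsies, p))
X, y, patient = np.concatenate(X), np.concatenate(y), np.concatenate(patient)

# One-model-fits-all: a single regression pooled over all patients.
A = np.column_stack([np.ones_like(X), X])
beta_pool, *_ = np.linalg.lstsq(A, y, rcond=None)

# Individualized model: shrink each patient's own fit toward the pooled
# coefficients (a ridge-style stand-in for the transfer learning step).
lam = 2.0
pred_pool, pred_indiv = np.empty_like(y), np.empty_like(y)
for p in range(n_patients):
    idx = patient == p
    Ap, yp = A[idx], y[idx]
    beta_p = np.linalg.solve(Ap.T @ Ap + lam * np.eye(2),
                             Ap.T @ yp + lam * beta_pool)
    pred_pool[idx] = Ap @ beta_pool
    pred_indiv[idx] = Ap @ beta_p

# In-sample comparison, purely for illustration.
for name, pred in [("one-model-fits-all", pred_pool), ("individualized", pred_indiv)]:
    r = np.corrcoef(y, pred)[0, 1]
    mae = np.mean(np.abs(y - pred))
    print(f"{name}: r = {r:.2f}, MAE = {mae:.2f}")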
Project description: A new validation metric is proposed that combines the use of a threshold based on the uncertainty in the measurement data with a normalized relative error, and that is robust in the presence of large variations in the data. The outcome from the metric is the probability that a model's predictions are representative of the real world based on the specific conditions and confidence level pertaining to the experiment from which the measurements were acquired. Relative error metrics are traditionally designed for use with a series of data values, but orthogonal decomposition has been employed to reduce the dimensionality of data matrices to feature vectors so that the metric can be applied to fields of data. Three previously published case studies are employed to demonstrate the efficacy of this quantitative approach to the validation process in the discipline of structural analysis, for which historical data were available; however, the concept could be applied to a wide range of disciplines and sectors where modelling and simulation play a pivotal role.
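A rough sketch of the ingredients described above, assuming hypothetical strain-field data: the fields are reduced to feature vectors by an orthogonal (singular value) decomposition, a normalized relative error is computed per component, and the reported "probability" is the fraction of components falling within an uncertainty-based threshold. The specific formulas below are illustrative stand-ins, not the published metric.

import numpy as np

def validation_probability(pred_field, meas_field, rel_uncertainty=0.1, n_modes=5):
    # Orthogonal decomposition: project both data fields onto the leading
    # singular vectors of the measured field to obtain feature vectors.
    _, _, Vt = np.linalg.svd(meas_field, full_matrices=False)
    basis = Vt[:n_modes]
    f_meas, f_pred = meas_field @ basis.T, pred_field @ basis.T
    # Normalized relative error per feature component, with a floor so we do
    # not divide by values far smaller than the typical measurement scale.
    scale = np.maximum(np.abs(f_meas), np.abs(f_meas).mean())
    rel_err = np.abs(f_pred - f_meas) / scale
    # "Probability" that the predictions represent the measurements: fraction
    # of components whose relative error lies within the uncertainty threshold.
    return float(np.mean(rel_err <= rel_uncertainty))

rng = np.random.default_rng(2)
meas = rng.normal(0.0, 1.0, (30, 200))            # e.g. measured strain fields
pred = meas + rng.normal(0.0, 0.05, meas.shape)   # model close to measurement
print(validation_probability(pred, meas, rel_uncertainty=0.1))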
Project description: Evaluation of harvest data remains one of the most important sources of information in the development of strategies to manage regional populations of white-tailed deer. While descriptive statistics and simple linear models are utilized extensively, the use of artificial neural networks for this type of data analysis is unexplored. A linear model was compared to artificial neural network (ANN) models trained with the Levenberg-Marquardt (L-M), Bayesian Regularization (BR) and Scaled Conjugate Gradient (SCG) learning algorithms, to evaluate their relative accuracy in predicting antler beam diameter and length from age and dressed body weight in white-tailed deer. Data utilized for this study were obtained from male animals harvested by hunters between 1977 and 2009 at the Berry College Wildlife Management Area. Metrics for evaluating model performance indicated a close match and good agreement between predicted and observed values, and thus good performance, for all models. However, Mean Absolute Error and Root Mean Squared Error values for the linear model and the ANN-BR model indicated smaller error and lower deviation relative to the mean values of antler beam diameter and length than the other ANN models, demonstrating better agreement between predicted and observed values of antler beam diameter and length. The ANN-SCG model produced the highest error among the models. Overall, performance metrics for the ANN model with the BR learning algorithm and the linear model indicated better agreement between predicted and observed values of antler beam diameter and length. Results of this study suggest that ANNs generate results comparable to linear models of harvest data and can aid in the development of strategies to manage white-tailed deer.
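The comparison described above can be sketched in a few lines: fit a linear regression and a small neural network to predict antler beam diameter from age and dressed body weight, then score both with MAE and RMSE. The data below are simulated, and scikit-learn's lbfgs solver stands in for the MATLAB-style L-M, BR and SCG training algorithms (which scikit-learn does not provide), so this illustrates the evaluation workflow rather than reproducing the study.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
n = 500
age = rng.integers(1, 8, n)                       # years (hypothetical)
weight = 30 + 10 * age + rng.normal(0, 8, n)      # dressed body weight (hypothetical)
beam_diam = 5 + 2.2 * age + 0.15 * weight + rng.normal(0, 2, n)  # mm

X = np.column_stack([age, weight])
X_tr, X_te, y_tr, y_te = train_test_split(X, beam_diam, random_state=0)

models = {
    "linear": LinearRegression(),
    "ann": MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs",
                        max_iter=5000, random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(f"{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")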
Project description: Guangxi, a province in southwestern China, has the second highest reported number of HIV/AIDS cases in China. This study aimed to develop an accurate and effective model to describe the trend of HIV and to predict its incidence in Guangxi. HIV incidence data of Guangxi from 2005 to 2016 were obtained from the database of the Chinese Center for Disease Control and Prevention. Long short-term memory (LSTM) neural network models, autoregressive integrated moving average (ARIMA) models, generalised regression neural network (GRNN) models and exponential smoothing (ES) were used to fit the incidence data. Data from 2015 and 2016 were used to validate the most suitable models. Model performance was evaluated using metrics including mean square error (MSE), root mean square error, mean absolute error and mean absolute percentage error. The LSTM model had the lowest MSE when the N value (time step) was 12. The most appropriate ARIMA models for incidence in 2015 and 2016 were ARIMA (1, 1, 2) (0, 1, 2)12 and ARIMA (2, 1, 0) (1, 1, 2)12, respectively. The accuracy of the GRNN and ES models in forecasting HIV incidence in Guangxi was relatively poor. All four performance metrics of the LSTM model were lower than those of the ARIMA, GRNN and ES models. The LSTM model was more effective than the other time-series models, a finding that is important for the monitoring and control of local HIV epidemics.
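Whatever forecasting model is used (LSTM, ARIMA, GRNN or ES), the comparison in this study comes down to computing MSE, RMSE, MAE and MAPE on the validation years. A minimal sketch of that scoring step follows; the incidence values and the two sets of forecasts are hypothetical.

import numpy as np

def forecast_metrics(y_true, y_pred):
    # MSE, RMSE, MAE and MAPE for a set of out-of-sample forecasts.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100 * np.mean(np.abs(err / y_true)),  # y_true must be nonzero
    }

# Hypothetical monthly incidence for the two validation years and two
# competing sets of forecasts.
observed = np.array([1.8, 1.6, 2.0, 2.3, 2.1, 2.4, 2.6, 2.2, 2.0, 1.9, 1.7, 1.8,
                     1.9, 1.7, 2.1, 2.4, 2.2, 2.5, 2.7, 2.3, 2.1, 2.0, 1.8, 1.9])
lstm_like = observed + np.random.default_rng(4).normal(0, 0.08, observed.size)
arima_like = observed + np.random.default_rng(5).normal(0, 0.20, observed.size)

for name, pred in [("LSTM-like", lstm_like), ("ARIMA-like", arima_like)]:
    print(name, {k: round(v, 3) for k, v in forecast_metrics(observed, pred).items()})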
Project description: Estimating and projecting population trends using population viability analysis (PVA) are central to identifying species at risk of extinction and to informing conservation management strategies. Models for PVA generally fall within two categories, scalar (count-based) or matrix (demographic). Model structure, process error, measurement error, and time series length all have known impacts in population risk assessments, but their combined impact has not been thoroughly investigated. We tested the ability of scalar and matrix PVA models to predict percent decline over a ten-year interval, selected to coincide with the IUCN Red List criterion A.3, using data simulated for a hypothetical, short-lived organism with a simple life history and for a threatened snail, Tasmaphena lamproides. PVA performance was assessed across different time series lengths, population growth rates, and levels of process and measurement error. We found that the magnitude of the effects of measurement error, process error, and time series length, and of the interactions between them, depended on context. High process and measurement error reduced the reliability of both models' predictions of percent decline. Both sources of error contributed strongly to biased predictions, with process error tending to contribute to the spread of predictions more than measurement error. Increasing time series length improved precision and reduced bias of predicted population trends, but gains substantially diminished for time series lengths greater than 10-15 years. The simple parameterization scheme we employed contributed strongly to bias in matrix model predictions when both process and measurement error were high, causing scalar models to exhibit similar or greater precision and lower bias than matrix models. Our study provides evidence that, for short-lived species with structured but simple life histories, short time series and simple models can be sufficient for reasonably reliable conservation decision-making, and may be preferable for population projections when unbiased estimates of vital rates cannot be obtained.
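A scalar (count-based) PVA of the kind compared here can be sketched compactly: simulate a log-linear population trajectory with process error, add measurement error to obtain observed counts, estimate the mean and variance of the log growth rate from the observed series, and project ten years ahead to obtain a distribution of percent decline. The parameters below are hypothetical, and this naive estimator deliberately conflates process and measurement error, which is exactly the source of bias the study investigates.

import numpy as np

rng = np.random.default_rng(6)

# Simulate a "true" population trajectory with process error, then add
# measurement error to obtain the observed count time series.
years, r, sd_proc, sd_obs, n0 = 15, -0.02, 0.10, 0.15, 500
log_n = np.empty(years)
log_n[0] = np.log(n0)
for t in range(1, years):
    log_n[t] = log_n[t - 1] + r + rng.normal(0, sd_proc)
obs = np.exp(log_n + rng.normal(0, sd_obs, years))

# Scalar (count-based) PVA: estimate the mean and variance of the log growth
# rate from the observed series (the simple diffusion approximation).
log_lambda = np.diff(np.log(obs))
mu_hat, sigma_hat = log_lambda.mean(), log_lambda.std(ddof=1)

# Project 10 years ahead and summarise the predicted percent decline.
n_sims, horizon = 5000, 10
growth = rng.normal(mu_hat, sigma_hat, (n_sims, horizon)).sum(axis=1)
decline = 100 * (1 - np.exp(growth))
print(f"median predicted 10-yr decline: {np.median(decline):.1f}%")
print(f"P(decline >= 30%): {np.mean(decline >= 30):.2f}")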
Project description: Provider profiling of outcome performance has become increasingly common in pay-for-performance programs. For chronic conditions, a substantial proportion of patients eligible for outcome measures may be lost to follow-up, potentially compromising outcome profiling. In the context of primary care depression treatment, we assess the implications of missing data for the accuracy of alternative approaches to provider outcome profiling. We used data from the Improving Mood-Promoting Access to Collaborative Treatment trial and the Depression Improvement across Minnesota, Offering a New Direction initiative to generate parameters for a Monte Carlo simulation experiment. The patient outcome of interest is the rate of remission of depressive symptoms at 6 months among a panel of patients with major depression at baseline. We considered two alternative approaches to profiling this outcome: (1) a relative, or tournament-style, threshold set at the 80th percentile of remission rate among all providers, and (2) an absolute threshold, evaluating whether providers exceed a specified remission rate (30 percent). We performed a Monte Carlo simulation experiment to evaluate the total error rate (proportion of providers who were incorrectly classified) under each profiling approach. The total error rate was partitioned into error from random sampling variability and error resulting from missing data. We then evaluated the accuracy of alternative profiling approaches under different assumptions about the relationship between missing data and depression remission. Over a range of scenarios, relative profiling approaches had total error rates that were approximately 20 percent lower than absolute profiling approaches, and error due to missing data was approximately 50 percent lower for relative profiling. Most of the profiling error in the simulations was a result of random sampling variability, not missing data: between 11 and 21 percent of total error was attributable to missing data for relative profiling, while between 16 and 33 percent of total error was attributable to missing data for absolute profiling. Finally, compared with relative profiling, absolute profiling was much more sensitive to missing data that was correlated with the remission outcome. Relative profiling approaches for pay-for-performance were more accurate and more robust to missing data than absolute profiling approaches.
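The simulation logic can be illustrated with a stripped-down version: draw true remission rates for a set of providers, generate patient panels with a fraction of outcomes missing, classify providers against an absolute 30 percent threshold and a relative 80th-percentile threshold, and count misclassifications relative to the truth. All parameters below are hypothetical, and the missingness here is purely at random, whereas the study also examines missingness correlated with remission.

import numpy as np

rng = np.random.default_rng(7)
n_providers, panel_size, n_reps = 100, 50, 500
abs_threshold, missing_rate = 0.30, 0.25

err_rel = err_abs = 0.0
for _ in range(n_reps):
    true_rate = rng.beta(6, 14, n_providers)           # true remission rates
    # "Correct" classifications based on the true rates.
    truth_abs = true_rate > abs_threshold
    truth_rel = true_rate > np.quantile(true_rate, 0.80)
    # Observed panels: each patient remits with the provider's true rate, and
    # a fraction of outcomes is missing completely at random.
    remit = rng.random((n_providers, panel_size)) < true_rate[:, None]
    observed = rng.random((n_providers, panel_size)) > missing_rate
    obs_rate = np.where(observed, remit, 0).sum(1) / observed.sum(1)
    err_abs += np.mean((obs_rate > abs_threshold) != truth_abs)
    err_rel += np.mean((obs_rate > np.quantile(obs_rate, 0.80)) != truth_rel)

print(f"total error, absolute threshold: {err_abs / n_reps:.3f}")
print(f"total error, relative threshold: {err_rel / n_reps:.3f}")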
Project description: Prospective validation of methods for computing binding affinities can help assess their predictive power and thus set reasonable expectations for their performance in drug design applications. Supramolecular host-guest systems are excellent model systems for testing such affinity prediction methods, because their small size and limited conformational flexibility, relative to proteins, allow higher throughput and better numerical convergence. The SAMPL4 prediction challenge therefore included a series of host-guest systems, based on two hosts, cucurbituril and octa-acid. Binding affinities in aqueous solution were measured experimentally for a total of 23 guest molecules. Participants submitted 35 sets of computational predictions for these host-guest systems, based on methods ranging from simple docking, to extensive free energy simulations, to quantum mechanical calculations. Over half of the predictions provided better correlations with experiment than two simple null models, but most methods underperformed the null models in terms of root mean squared error and linear regression slope. Interestingly, the overall performance across all SAMPL4 submissions was similar to that for the prior SAMPL3 host-guest challenge, although the experimentalists took steps to simplify the current challenge. While some methods performed fairly consistently across both hosts, no single approach emerged as a consistent top performer, and the nonsystematic nature of the various submissions made it impossible to draw definitive conclusions regarding the best choices of energy models or sampling algorithms. Salt effects emerged as an issue in the calculation of absolute binding affinities of cucurbituril-guest systems, but were not expected to affect the relative affinities significantly. Useful directions for future rounds of the challenge might involve encouraging participants to carry out some calculations that replicate each other's studies, and to systematically explore parameter options.
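The null-model comparison used to benchmark submissions can be sketched as follows, with hypothetical affinities: a trivial fit to a single descriptor (here, molecular weight) serves as the null model, and each prediction set is scored by RMSE, Pearson correlation and linear regression slope against experiment. The specific null model and numbers are illustrative, not those of the SAMPL4 analysis.

import numpy as np

def scorecard(y_exp, y_pred):
    # RMSE, Pearson r and linear regression slope of prediction vs experiment.
    rmse = np.sqrt(np.mean((y_pred - y_exp) ** 2))
    r = np.corrcoef(y_exp, y_pred)[0, 1]
    slope = np.polyfit(y_exp, y_pred, 1)[0]
    return rmse, r, slope

rng = np.random.default_rng(8)
mol_wt = rng.uniform(100, 300, 23)                      # guest molecular weights
dG_exp = -3.0 - 0.02 * mol_wt + rng.normal(0, 1.0, 23)  # "measured" affinities, kcal/mol

# Null model: a simple linear trend fitted to a trivial descriptor.
null_pred = np.polyval(np.polyfit(mol_wt, dG_exp, 1), mol_wt)
# A hypothetical submission: noisy but correlated with experiment.
submission = dG_exp + rng.normal(0, 2.0, dG_exp.size)

for name, pred in [("submission", submission), ("null model", null_pred)]:
    rmse, r, slope = scorecard(dG_exp, pred)
    print(f"{name}: RMSE = {rmse:.2f} kcal/mol, r = {r:.2f}, slope = {slope:.2f}")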
Project description: The surge of interest in personalized and precision medicine during recent years has increased the application of ordinal classification problems in biomedical science. Currently, accuracy, Kendall's τb, and average mean absolute error are three commonly used metrics for evaluating the effectiveness of an ordinal classifier. Although there are benefits to each, no single metric balances the benefits of predictive accuracy with the tradeoffs of misclassification cost. In addition, decision analysis that considers pairwise analysis of the metrics is not trivial due to inconsistent findings. A new cost-sensitive metric is proposed to find the optimal tradeoff between the two most critical performance measures of a classification task, accuracy and cost. The proposed method accounts for an inherent ordinal data structure, total misclassification cost of a classifier, and imbalanced class distribution. The strengths of the new methodology are demonstrated through analyses of three real cancer datasets and four simulation studies. The new cost-sensitive metric demonstrated better performance in its ability to identify the best ordinal classifier for a given analysis. The performance metric devised in this study provides a comprehensive tool for comparative analysis of multiple (and competing) ordinal classifiers. Consideration of the tradeoff between accuracy and misclassification cost in decisions regarding ordinal classification problems is imperative in real-world application. The work presented here is a precursor to the possibility of incorporating the proposed metric into a prediction modeling algorithm for ordinal data as a means of integrating misclassification cost in final model selection.
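The exact cost-sensitive metric proposed in the study is not reproduced here, but the ingredients it combines can be sketched: accuracy, the macro-averaged (per-class) MAE that respects the ordinal structure, and a mean misclassification cost computed from a user-supplied cost matrix. The labels, predictions and default distance-based cost matrix below are illustrative.

import numpy as np

def ordinal_metrics(y_true, y_pred, cost=None):
    # Accuracy, macro-averaged MAE over classes, and mean misclassification
    # cost for ordinal labels coded 0..K-1. By default the cost of predicting
    # class j when the truth is class i grows with the ordinal distance |i - j|.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    if cost is None:
        cost = np.abs(np.subtract.outer(classes, classes)).astype(float)
    acc = np.mean(y_true == y_pred)
    amae = np.mean([np.mean(np.abs(y_pred[y_true == c] - c)) for c in classes])
    mean_cost = np.mean(cost[y_true, y_pred])
    return acc, amae, mean_cost

# Hypothetical predictions for a 4-stage (ordinal) outcome.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 2, 3])
print(ordinal_metrics(y_true, y_pred))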
Project description: Objectives. Uncertainty in survival prediction beyond trial follow-up is highly influential in cost-effectiveness analyses of oncology products. This research provides an empirical evaluation of the accuracy of alternative methods and recommendations for their implementation. Methods. Mature (15-year) survival data were reconstructed from a published database study for "no treatment," radiotherapy, surgery plus radiotherapy, and surgery in early stage non-small cell lung cancer in an elderly patient population. Censored data sets were created from these data to simulate immature trial data (for 1- to 10-year follow-up). A second data set with mature (9-year) survival data for no treatment was used to extrapolate the predictions from models fitted to the first data set. Six methodological approaches were used to fit models to the simulated data and extrapolate beyond trial follow-up. Model performance was evaluated by comparing the relative difference in mean survival estimates and the absolute error in the difference in mean survival v. the control with those from the original mature survival data set. Results. Model performance depended on the treatment comparison scenario. All models performed reasonably well when there was a small short-term treatment effect, with the Bayesian model coping better with shorter follow-up times. However, in other scenarios, the most flexible Bayesian model that could be estimated in practice appeared to fit the data less well than the models that used the external data separately. Where there was a large treatment effect (hazard ratio = 0.4), models that used external data separately performed best. Conclusions. Models that directly use mature external data can improve the accuracy of survival predictions. Recommendations on modeling strategies are made for different treatment benefit scenarios.
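The effect of follow-up length on extrapolated mean survival can be illustrated with the simplest parametric choice, an exponential model, whose censored-data maximum likelihood estimate has a closed form (events divided by total follow-up time). The sketch below censors simulated survival times at increasing follow-up horizons and compares the estimated gain in mean survival with the gain in the "mature" data; it is a toy stand-in for the six modeling approaches evaluated in the study, with hypothetical distributions throughout.

import numpy as np

def exp_mean_survival(time, event):
    # Exponential model with right censoring: rate = events / total follow-up;
    # mean survival = 1 / rate.
    rate = event.sum() / time.sum()
    return 1.0 / rate

rng = np.random.default_rng(9)
n = 300
t_ctrl = rng.exponential(2.0, n)                   # control arm, years
# Treated arm: most patients similar to control plus a long-surviving subgroup,
# so short follow-up gives a misleading picture of the mean-survival gain.
t_trt = np.where(rng.random(n) < 0.2,
                 rng.exponential(12.0, n),
                 rng.exponential(2.5, n))
true_gain = t_trt.mean() - t_ctrl.mean()           # "mature data" benchmark

for follow_up in (3, 5, 10):                       # simulate immature trial data
    means = []
    for t in (t_ctrl, t_trt):
        time = np.minimum(t, follow_up)            # administrative censoring
        event = (t <= follow_up).astype(int)
        means.append(exp_mean_survival(time, event))
    print(f"{follow_up}-yr follow-up: estimated gain = {means[1] - means[0]:.2f} yr "
          f"(mature-data gain = {true_gain:.2f} yr)")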
Project description: The aim of this study is to model the association between weekly time series of dengue case counts and meteorological variables in a high-incidence city of Colombia, applying Bayesian hierarchical dynamic generalized linear models over the period January 2008 to August 2015. Additionally, we evaluate the model's short-term performance for predicting dengue cases. The methodology uses dynamic Poisson log-link models including constant or time-varying coefficients for the meteorological variables. Calendar effects were modeled using constant or first- or second-order random walk time-varying coefficients. The meteorological variables were modeled using constant coefficients and first-order random walk time-varying coefficients. We applied Markov Chain Monte Carlo simulations for parameter estimation, and the deviance information criterion (DIC) for model selection. We assessed the short-term predictive performance of the selected final model at several time points within the study period using the mean absolute percentage error. The best model included first-order random walk time-varying coefficients for the calendar trend and for the meteorological variables. Beyond the computational challenges, interpreting the results requires a complete analysis of the dengue time series with respect to the parameter estimates of the meteorological effects. We found small mean absolute percentage errors for one- or two-week out-of-sample predictions at most prediction points, associated with low-volatility periods in the dengue counts. We discuss the advantages and limitations of dynamic Poisson models for studying the association between time series of dengue disease and meteorological variables. The key conclusion of the study is that dynamic Poisson models account for the dynamic nature of the variables involved in modeling time series of dengue disease, producing useful models for decision-making in public health.
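The out-of-sample evaluation described above reduces to computing the MAPE of one- and two-week-ahead forecasts at a set of prediction points. The sketch below uses a simulated weekly dengue series and a trivial stand-in forecaster (the mean of the last four observed weeks) in place of the fitted dynamic Poisson model, purely to show the rolling evaluation scheme.

import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error (counts must be nonzero).
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

rng = np.random.default_rng(10)
# Hypothetical weekly dengue counts with a seasonal pattern.
weeks = np.arange(400)
cases = rng.poisson(30 + 25 * (1 + np.sin(2 * np.pi * weeks / 52)))

# Rolling evaluation: at several prediction points, forecast one and two weeks
# ahead and score the forecasts with MAPE.
prediction_points = range(300, 396, 12)
for horizon in (1, 2):
    errs = []
    for t in prediction_points:
        forecast = cases[t - 4:t].mean()       # stand-in for the fitted model
        errs.append(mape(cases[t + horizon - 1:t + horizon], forecast))
    print(f"{horizon}-week-ahead MAPE across prediction points: {np.mean(errs):.1f}%")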