Dataset Information

Statistical quantification of confounding bias in machine learning models.

ABSTRACT:

Background

The lack of nonparametric statistical tests for confounding bias significantly hampers the development of robust, valid, and generalizable predictive models in many fields of research. Here I propose the partial confounder test, which, for a given confounder variable, probes the null hypothesis of the model being unconfounded.

Results

The test provides a strict control for type I errors and high statistical power, even for nonnormally and nonlinearly dependent predictions, often seen in machine learning. Applying the proposed test to models trained on large-scale functional brain connectivity data (N = 1,865) (i) reveals previously unreported confounders and (ii) shows that state-of-the-art confound mitigation approaches may fail to prevent confounder bias in several cases.

Conclusions

The proposed test (implemented in the package mlconfound; https://mlconfound.readthedocs.io) can aid the assessment and improvement of the generalizability and validity of predictive models and, thereby, fosters the development of clinically useful machine learning biomarkers.
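
As a rough illustration of the kind of question the test asks, the sketch below checks whether a model's predictions remain associated with a confounder once the target is accounted for. It is not the partial confounder test and does not use the mlconfound API: it swaps in a much simpler linear residual-permutation test. The function names (`residualize`, `pearson`, `naive_partial_permutation_test`), the simulated data, and the reading of "unconfounded" as "predictions carry no confounder information beyond the target" are all assumptions made for illustration.

```python
import random


def residualize(values, covariate):
    """Residuals of a least-squares line fit of `values` on `covariate`."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    var_c = sum((c - mean_c) ** 2 for c in covariate)
    cov = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = cov / var_c
    return [(v - mean_v) - slope * (c - mean_c) for c, v in zip(covariate, values)]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def naive_partial_permutation_test(y, yhat, c, num_perms=1000, seed=0):
    """Hypothetical helper: permutation p-value for a *linear* association
    between predictions `yhat` and confounder `c` after regressing the
    target `y` out of both. A simplification, not the published test."""
    rng = random.Random(seed)
    res_yhat = residualize(yhat, y)
    res_c = residualize(c, y)
    observed = abs(pearson(res_yhat, res_c))
    shuffled = list(res_c)
    exceed = 0
    for _ in range(num_perms):
        rng.shuffle(shuffled)
        if abs(pearson(res_yhat, shuffled)) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (num_perms + 1)


# Simulated data (assumption): the confounder correlates with the target.
rng = random.Random(42)
n = 300
y = [rng.gauss(0, 1) for _ in range(n)]
c = [0.5 * yi + rng.gauss(0, 1) for yi in y]
# Predictions that lean on the confounder beyond what the target explains.
yhat_confounded = [0.5 * yi + 0.8 * ci + 0.3 * rng.gauss(0, 1) for yi, ci in zip(y, c)]
# Predictions that depend on the target only.
yhat_clean = [yi + rng.gauss(0, 1) for yi in y]

p_confounded = naive_partial_permutation_test(y, yhat_confounded, c)
p_clean = naive_partial_permutation_test(y, yhat_clean, c)
print("confounded model p-value:", p_confounded)
print("target-only model p-value:", p_clean)
```

A small p-value for the first model suggests that its predictions carry confounder information not explained by the target. Because this sketch only captures linear dependence, it can miss the nonnormal and nonlinear cases that the proposed test is designed to handle.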

SUBMITTER: Spisak T

PROVIDER: S-EPMC9412867 | biostudies-literature | 2022 Aug

REPOSITORIES: biostudies-literature

Publications

Statistical quantification of confounding bias in machine learning models.

Spisak Tamas T

GigaScience 2022-08-01

PMID: 36017878

Similar Datasets

Project description: Objective: To predict a woman's risk of postpartum hemorrhage at labor admission using machine learning and statistical models. Methods: Predictive models were constructed and compared using data from 10 of 12 sites in the U.S. Consortium for Safe Labor Study (2002-2008) that consistently reported estimated blood loss at delivery. The outcome was postpartum hemorrhage, defined as an estimated blood loss of at least 1,000 mL. Fifty-five candidate risk factors routinely available on labor admission were considered. We used logistic regression with and without lasso regularization (lasso regression) as the two statistical models, and random forest and extreme gradient boosting as the two machine learning models to predict postpartum hemorrhage. Model performance was measured by C statistics (ie, concordance index), calibration, and decision curves. Models were constructed from the first phase (2002-2006) and externally validated (ie, temporally) in the second phase (2007-2008). Further validation was performed combining both temporal and site-specific validation. Results: Of the 152,279 assessed births, 7,279 (4.8%, 95% CI 4.7-4.9) had postpartum hemorrhage. All models had good-to-excellent discrimination. The extreme gradient boosting model had the best discriminative ability to predict postpartum hemorrhage (C statistic: 0.93; 95% CI 0.92-0.93), followed by random forest (C statistic: 0.92; 95% CI 0.91-0.92). The lasso regression model (C statistic: 0.87; 95% CI 0.86-0.88) and logistic regression (C statistic: 0.87; 95% CI 0.86-0.87) had lower-but-good discriminative ability. The above results held with validation across both time and sites.
Decision curve analysis demonstrated that, although all models provided superior net benefit when clinical decision thresholds were between 0% and 80% predicted risk, the extreme gradient boosting model provided the greatest net benefit. Conclusion: Postpartum hemorrhage on labor admission can be predicted with excellent discriminative ability using machine learning and statistical models. Further clinical application is needed, which may assist health care providers to be prepared and triage at-risk women.

Project description: Background: In health research, several chronic diseases are susceptible to competing risks (CRs). Initially, statistical models (SM) were developed to estimate the cumulative incidence of an event in the presence of CRs. As recently there is a growing interest in applying machine learning (ML) for clinical prediction, these techniques have also been extended to model CRs but literature is limited. Here, our aim is to investigate the potential role of ML versus SM for CRs within non-complex data (small/medium sample size, low dimensional setting). Methods: A dataset with 3826 retrospectively collected patients with extremity soft-tissue sarcoma (eSTS) and nine predictors is used to evaluate model-predictive performance in terms of discrimination and calibration. Two SM (cause-specific Cox, Fine-Gray) and three ML techniques are compared for CRs in a simple clinical setting. ML models include an original partial logistic artificial neural network for CRs (PLANNCR original), a PLANNCR with novel specifications in terms of architecture (PLANNCR extended), and a random survival forest for CRs (RSFCR). The clinical endpoint is the time in years between surgery and disease progression (event of interest) or death (competing event). Time points of interest are 2, 5, and 10 years. Results: Based on the original eSTS data, 100 bootstrapped training datasets are drawn. Performance of the final models is assessed on validation data (left out samples) by employing as measures the Brier score and the Area Under the Curve (AUC) with CRs. Miscalibration (absolute accuracy error) is also estimated. Results show that the ML models are able to reach a comparable performance versus the SM at 2, 5, and 10 years regarding both Brier score and AUC (95% confidence intervals overlapped).
However, the SM are frequently better calibrated. Conclusions: Overall, ML techniques are less practical as they require substantial implementation time (data preprocessing, hyperparameter tuning, computational intensity), whereas regression methods can perform well without the additional workload of model training. As such, for non-complex real life survival data, these techniques should only be applied complementary to SM as exploratory tools of model's performance. More attention to model calibration is urgently needed.

Project description: Introduction: Machine learning (ML) algorithms have been heralded as promising solutions to the realization of assistive systems in digital healthcare, due to their ability to detect fine-grain patterns that are not easily perceived by humans. Yet, ML algorithms have also been critiqued for treating individuals differently based on their demography, thus propagating existing disparities. This paper explores gender and race bias in speech-based ML algorithms that detect behavioral and mental health outcomes. Methods: This paper examines potential sources of bias in the data used to train the ML, encompassing acoustic features extracted from speech signals and associated labels, as well as in the ML decisions. The paper further examines approaches to reduce existing bias via using the features that are the least informative of one's demographic information as the ML input, and transforming the feature space in an adversarial manner to diminish the evidence of the demographic information while retaining information about the focal behavioral and mental health state. Results: Results are presented in two domains, the first pertaining to gender and race bias when estimating levels of anxiety, and the second pertaining to gender bias in detecting depression. Findings indicate the presence of statistically significant differences in both acoustic features and labels among demographic groups, as well as differential ML performance among groups. The statistically significant differences present in the label space are partially preserved in the ML decisions.
Although variations in ML performance across demographic groups were noted, results are mixed regarding the models' ability to accurately estimate healthcare outcomes for the sensitive groups. Discussion: These findings underscore the necessity for careful and thoughtful design in developing ML models that are capable of maintaining crucial aspects of the data and perform effectively across all populations in digital healthcare applications.

Project description: Aims: This study aimed to review the performance of machine learning (ML) methods compared with conventional statistical models (CSMs) for predicting readmission and mortality in patients with heart failure (HF) and to present an approach to formally evaluate the quality of studies using ML algorithms for prediction modelling. Methods and results: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we performed a systematic literature search using MEDLINE, EPUB, Cochrane CENTRAL, EMBASE, INSPEC, ACM Library, and Web of Science. Eligible studies included primary research articles published between January 2000 and July 2020 comparing ML and CSMs in mortality and readmission prognosis of initially hospitalized HF patients. Data were extracted and analysed by two independent reviewers. A modified CHARMS checklist was developed in consultation with ML and biostatistics experts for quality assessment and was utilized to evaluate studies for risk of bias. Of 4322 articles identified and screened by two independent reviewers, 172 were deemed eligible for a full-text review. The final set comprised 20 articles and 686 842 patients. ML methods included random forests (n = 11), decision trees (n = 5), regression trees (n = 3), support vector machines (n = 9), neural networks (n = 12), and Bayesian techniques (n = 3). CSMs included logistic regression (n = 16), Cox regression (n = 3), or Poisson regression (n = 3). In 15 studies, readmission was examined at multiple time points ranging from 30 to 180 day readmission, with the majority of studies (n = 12) presenting prediction models for 30 day readmission outcomes. Of a total of 21 time-point comparisons, ML-derived c-indices were higher than CSM-derived c-indices in 16 of the 21 comparisons. In seven studies, mortality was examined at 9 time points ranging from in-hospital mortality to 1 year survival; of these nine, seven reported higher c-indices using ML.
Two of these seven studies reported survival analyses utilizing random survival forests in their ML prediction models. Both reported higher c-indices when using ML compared with CSMs. A limitation of studies using ML techniques was that the majority were not externally validated, and calibration was rarely assessed. In the only study that was externally validated in a separate dataset, ML was superior to CSMs (c-indices 0.913 vs. 0.835). Conclusions: ML algorithms had better discrimination than CSMs in most studies aiming to predict risk of readmission and mortality in HF patients. Based on our review, there is a need for external validation of ML-based studies of prediction modelling. We suggest that ML-based studies should also be evaluated using clinical quality standards for prognosis research. Registration: PROSPERO CRD42020134867.

Project description: Background: Statistically derived cardiovascular risk calculators (CVRC) that use conventional risk factors generally underestimate or overestimate the risk of cardiovascular disease (CVD) or stroke events, primarily due to lack of integration of plaque burden. This study investigates the role of machine learning (ML)-based CVD/stroke risk calculators (CVRCML) and compares them against statistically derived CVRC (CVRCStat) based on (I) conventional factors or (II) combined conventional with plaque burden (integrated factors). Methods: The proposed study is divided into 3 parts: (I) statistical calculator: initially, the 10-year CVD/stroke risk was computed using 13 types of CVRCStat (without and with plaque burden) and binary risk stratification of the patients was performed using the predefined thresholds and risk classes; (II) ML calculator: using the same risk factors (without and with plaque burden), as adopted in 13 different CVRCStat, the patients were again risk-stratified using CVRCML based on support vector machine (SVM); and finally (III) both types of calculators were evaluated using AUC based on ROC analysis, which was computed using combination of predicted class and endpoint equivalent to CVD/stroke events. Results: An Institutional Review Board approved 202 patients (156 males and 46 females) of Japanese ethnicity were recruited for this study with a mean age of 69±11 years. The AUC for 13 different types of CVRCStat calculators were: AECRS2.0 (AUC 0.83, P<0.001), QRISK3 (AUC 0.72, P<0.001), WHO (AUC 0.70, P<0.001), ASCVD (AUC 0.67, P<0.001), FRScardio (AUC 0.67, P<0.01), FRSstroke (AUC 0.64, P<0.001), MSRC (AUC 0.63, P=0.03), UKPDS56 (AUC 0.63, P<0.001), NIPPON (AUC 0.63, P<0.001), PROCAM (AUC 0.59, P<0.001), RRS (AUC 0.57, P<0.001), UKPDS60 (AUC 0.53, P<0.001), and SCORE (AUC 0.45, P<0.001), while the AUC for the CVRCML with integrated risk factors was 0.88 (P<0.001), a 42% increase in performance.
The overall risk-stratification accuracy for the CVRCML with integrated risk factors was 92.52%, which was higher compared with all the other CVRCStat. Conclusions: The ML-based CVD/stroke risk calculator provided a higher predictive ability of 10-year CVD/stroke compared to the 13 different types of statistically derived risk calculators including the integrated model AECRS 2.0.
