Dataset Information

Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records.


ABSTRACT:

Background and aims

The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35-50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors.

Methods

We enrolled 3,116 adults aged 35-50 at average risk for CRC who underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). We constructed four machine learning-based models using a training set (a random sample of 70% of participants): regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (the remaining 30% of participants), we measured predictive performance by comparing each model's C-statistic to that of a reference model (logistic regression).
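The evaluation design described above (a random 70/30 split, then a C-statistic on the held-out 30%) can be illustrated with a minimal sketch. It is not the authors' code: the function names `split_70_30` and `c_statistic`, the seed, and the toy labels and scores are assumptions made for illustration, and the abstract does not say whether the split was stratified by outcome.

```python
import random


def split_70_30(n, seed=0):
    """Randomly split indices 0..n-1 into a 70% training and 30% testing set.

    The seed is an arbitrary assumption; the source does not report one.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(0.7 * n)
    return idx[:cut], idx[cut:]


def c_statistic(labels, scores):
    """C-statistic (area under the ROC curve) by pairwise comparison.

    It is the fraction of (case, non-case) pairs in which the case has the
    higher predicted score; tied pairs count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Split sizes for the cohort size reported in the abstract.
train_idx, test_idx = split_70_30(3116)

# Toy example (made-up labels and scores, not study data).
toy_c = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A C-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which is the scale on which the candidate models and the logistic regression reference are compared.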

Results

The study sample was 55.1% female and 32.8% non-white, and included 16 (0.5%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability than the reference model [e.g., C-statistics (95% CI); neural network: 0.75 (0.48-1.00) vs. reference: 0.43 (0.18-0.67); P = 0.07]. Furthermore, all machine learning approaches except gradient boosting predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95% CI); regularized discriminant analysis: 0.64 (0.59-0.69) vs. reference: 0.55 (0.50-0.59); P<0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles.
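As a quick arithmetic check, the outcome proportions can be recomputed from the reported counts. The sketch below uses only the figures given in the abstract (3,116 participants, 16 CRC cases, 478 cases of CRC or high-risk polyps). The expected number of CRC cases in the testing set assumes cases fell proportionally across the 70/30 split, which the source does not state.

```python
n_total = 3116          # participants, from the abstract
n_crc = 16              # CRC cases, from the abstract
n_crc_or_polyp = 478    # CRC or high-risk polyp cases, from the abstract

# Outcome prevalence as a percentage of the cohort.
pct_crc = 100 * n_crc / n_total
pct_crc_or_polyp = 100 * n_crc_or_polyp / n_total

# Assumption: cases are spread proportionally over the random split,
# so about 30% of them land in the testing set.
expected_test_crc = 0.30 * n_crc

print(f"CRC prevalence: {pct_crc:.2f}%")
print(f"CRC or high-risk polyp prevalence: {pct_crc_or_polyp:.1f}%")
print(f"Expected CRC cases in testing set: {expected_test_crc:.1f}")
```

The counts imply a CRC prevalence of roughly 0.5%. With only about five CRC cases expected in the testing set, the C-statistic for the CRC-only outcome rests on very few events, which would be consistent with the wide confidence intervals reported for that outcome.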

Discussion

Machine learning can predict CRC risk in adults aged 35-50 from EHR data with better discrimination than logistic regression. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application.

SUBMITTER: Hussan H 

PROVIDER: S-EPMC9064446 | biostudies-literature | 2022

REPOSITORIES: biostudies-literature

Publications

Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records.

Hussan H, Zhao J, Badu-Tawiah AK, Stanich P, Tabung F, Gray D, Ma Q, Kalady M, Clinton SK

PLoS One, 2022-03-10, issue 3



Similar Datasets

| S-EPMC6888922 | biostudies-literature
| PRJNA158491 | ENA
| S-EPMC10359402 | biostudies-literature
| S-EPMC10896079 | biostudies-literature
| S-EPMC11429196 | biostudies-literature
| S-EPMC8463452 | biostudies-literature
| S-EPMC10766963 | biostudies-literature
| S-EPMC7820000 | biostudies-literature
| S-EPMC8983688 | biostudies-literature
| S-EPMC11446542 | biostudies-literature