Dataset Information

ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language.

ABSTRACT:

Background

To investigate the feasibility of using the large language model (LLM) ChatGPT to classify liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and to compare classification performance on structured vs. unstructured reports.

Methods

LI-RADS-classifiable liver lesions were included from structured and unstructured MRI reports written in German; a report of size, location, and arterial phase contrast enhancement was the minimum inclusion requirement. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed with the intraclass correlation coefficient (ICC) by passing a subset of n = 50 lesions to ChatGPT five times.
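As a rough illustration of the agreement statistic named above, the following sketch computes unweighted Cohen's kappa between a reference labelling and a model labelling. The function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the study; the authors' actual analysis code is not part of this record.

```python
from collections import Counter


def cohens_kappa(truth, predicted):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(truth)

    # Observed agreement: fraction of lesions where both labels match.
    p_observed = sum(t == p for t, p in zip(truth, predicted)) / n

    # Chance agreement: product of the marginal label frequencies,
    # summed over all categories.
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    p_expected = sum(
        truth_counts[c] * pred_counts[c] for c in truth_counts
    ) / (n * n)

    if p_expected == 1.0:
        # Both sequences consist of one identical label; kappa is
        # undefined, so perfect agreement is reported by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data, NOT values from the study.
ground_truth = ["LR-3", "LR-4", "LR-5", "LR-5", "LR-3", "LR-4"]
model_output = ["LR-3", "LR-5", "LR-5", "LR-4", "LR-3", "LR-4"]
kappa = cohens_kappa(ground_truth, model_output)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be noticeably lower than plain accuracy when a few categories dominate.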

Results

A total of 205 MRIs from 150 patients were included. The accuracy of ChatGPT in determining LI-RADS categories was poor (53% and 44% on unstructured and structured reports, respectively). Compared with structured reports, free-text reports showed higher agreement with the ground truth (κ = 0.51 vs. κ = 0.44), a lower mean absolute error in LI-RADS scores (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and higher test-retest reliability (ICC = 0.81 vs. 0.50), although structured reports contained the minimum required imaging features significantly more frequently (Chi-square test, p < 0.05).
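The two simpler metrics reported here, accuracy and mean absolute error, can be sketched as follows. The sketch assumes that LI-RADS categories are encoded as ordinal integers (for example LR-3 as 3); that encoding, the function names, and the toy data are illustrative assumptions rather than details confirmed by the record.

```python
def accuracy(truth, predicted):
    """Fraction of lesions whose predicted score equals the reference."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def mean_absolute_error(truth, predicted):
    """Average absolute distance between predicted and reference scores."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


# Hypothetical ordinal scores, NOT values from the study.
reference_scores = [3, 4, 5, 5, 3, 4]
chatgpt_scores = [3, 5, 5, 4, 3, 4]

acc = accuracy(reference_scores, chatgpt_scores)
mae = mean_absolute_error(reference_scores, chatgpt_scores)
print(f"accuracy = {acc:.2f}, mean absolute error = {mae:.2f}")
```

Accuracy treats every misclassification equally, whereas the mean absolute error also reflects how far off a wrong score is, so the two can rank report types differently.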

Conclusions

ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. The superior accuracy and consistency on free-text reports might relate to ChatGPT's training process.

Clinical relevance statement

Our study indicates both the necessity of optimization of LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels based on large free-text radiological databases.

SUBMITTER: Fervers P

PROVIDER: S-EPMC11257913 | biostudies-literature | 2024

REPOSITORIES: biostudies-literature

Publications

ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language.

Fervers P, Hahnfeldt R, Kottlors J, Wagner A, Maintz D, Pinto Dos Santos D, Lennartz S, Persigehl T

Frontiers in Radiology, 2024-07-05

PMID: 39036542

Similar Datasets

Project description:

Objective: To develop a domain-specific large language model (LLM) for LI-RADS v2018 categorization of hepatic observations based on free-text descriptions extracted from MRI reports.

Material and methods: This retrospective study included 291 small liver observations, divided into training (n = 141), validation (n = 30), and test (n = 120) datasets. Of these, 120 were fictitious, and 171 were extracted from 175 MRI reports from a single institution. The algorithm's performance was compared to two independent radiologists and one hepatologist in a human replacement scenario, and considering two combined strategies (double reading with arbitration and triage). Agreement on LI-RADS category and dichotomic malignancy (LR-4, LR-5, and LR-M) were estimated using linear-weighted κ statistics and Cohen's κ, respectively. Sensitivity and specificity for LR-5 were calculated. The consensus agreement of three other radiologists served as the ground truth.

Results: The model showed moderate agreement against the ground truth for both LI-RADS categorization (κ = 0.54 [95% CI: 0.42-0.65]) and the dichotomized approach (κ = 0.58 [95% CI: 0.42-0.73]). Sensitivity and specificity for LR-5 were 0.76 (95% CI: 0.69-0.86) and 0.96 (95% CI: 0.91-1.00), respectively. When the chatbot was used as a triage tool, performance improved for LI-RADS categorization (κ = 0.86/0.87 for the two independent radiologists and κ = 0.76 for the hepatologist), dichotomized malignancy (κ = 0.94/0.91 and κ = 0.87) and LR-5 identification (1.00/0.98 and 0.85 sensitivity, 0.96/0.92 and 0.92 specificity), with no statistical significance compared to the human readers' individual performance. Through this strategy, the workload decreased by 45%.

Conclusion: LI-RADS v2018 categorization from unlabelled MRI reports is feasible using our LLM, and it enhances the efficiency of data curation.

Critical relevance statement: Our proof-of-concept study provides novel insights into the potential applications of LLMs, offering a real-world example of how these tools could be integrated into a local workflow to optimize data curation for research purposes.

Key points: Automatic LI-RADS categorization from free-text reports would be beneficial to workflow and data mining. LiverAI, a GPT-4-based model, supported various strategies improving data curation efficiency by up to 60%. LLMs can integrate into workflows, significantly reducing radiologists' workload.
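The sensitivity and specificity figures quoted for LR-5 in this project description follow the usual one-vs-rest definition, which can be sketched as below. The function name, the `positive` parameter, and the toy labels are hypothetical; nothing here reproduces that study's data or code.

```python
def sensitivity_specificity(truth, predicted, positive="LR-5"):
    """One-vs-rest sensitivity and specificity for a single category."""
    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == positive:
            if p == positive:
                tp += 1  # reference LR-5, predicted LR-5
            else:
                fn += 1  # reference LR-5, predicted something else
        else:
            if p == positive:
                fp += 1  # reference not LR-5, predicted LR-5
            else:
                tn += 1  # reference not LR-5, predicted not LR-5
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels, NOT values from the study.
reference = ["LR-5", "LR-5", "LR-4", "LR-3", "LR-5", "LR-M"]
model = ["LR-5", "LR-4", "LR-4", "LR-3", "LR-5", "LR-5"]

sens, spec = sensitivity_specificity(reference, model)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```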

Project description:

Objective: The aim of this study was to assess efficacy of large language models (LLMs) for converting free-text computed tomography (CT) scan reports of head and neck cancer (HNCa) patients into a structured format using a predefined template.

Materials and Methods: A retrospective study was conducted using 150 CT reports of HNCa patients. A comprehensive structured reporting template for HNCa CT scans was developed, and the Generative Pre-trained Transformer 4 (GPT-4) was initially used to convert 50 CT reports into a structured format using this template. The generated structured reports were then evaluated by a radiologist for instances of missing or misinterpreted information and any erroneous additional details added by GPT-4. Following this assessment, the template was refined for improved accuracy. This revised template was then used for conversion of 100 other HNCa CT reports into structured format using GPT-4. These reports were then reevaluated in the same manner.

Results: Initially, GPT-4 successfully converted all 50 free-text reports into structured reports. However, there were 10 places with missing information: tracheostomy tube (n = 3), noninclusion of involvement of sternocleidomastoid muscle (n = 2), extranodal tumor extension (n = 3), and contiguous involvement of the neck structures by nodal mass rather than the primary (n = 2). Few instances of nonsuspicious lung nodules were misinterpreted as metastases (n = 2). GPT-4 did not indicate any erroneous additional findings. Using the revised reporting template, GPT-4 converted all the 100 CT reports into a structured format with no repeated or additional mistakes.

Conclusion: LLMs can be used for structuring free-text radiology reports using plain language prompts and a simple yet comprehensive reporting template.

Key Points: Structured radiology reports in oncological patients, although advantageous, are not used widely in practice due to perceived drawbacks like interference with routine radiology workflow and scan interpretation. We found that GPT-4 is highly efficient in converting conventional CT reports of HNCa patients to structured reports using a predefined template. This application of LLMs in radiology can help in enhancing the acceptability and clinical utility of structured radiology reports in oncological imaging.

Summary Statement: Large language models can successfully and accurately convert conventional radiology reports for oncology scans into a structured format using a comprehensive predefined template and thus can enhance the utility and integration of these reports in routine clinical practice.
