Dataset Information

Evaluating large language models on a highly-specialized topic, radiation oncology physics.


ABSTRACT:

Purpose

We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams such as AP Physics, the LSAT, and the GRE have large test-taker populations and ample test-preparation resources in circulation, they may not permit an accurate assessment of the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs.

Methods

We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by prompting it to explain its reasoning first and then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach: substituting the correct answer with "None of the above choices is the correct answer." A majority-vote analysis was used to approximate how well each group could score when working together.
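To illustrate the majority-vote analysis, here is a minimal Python sketch (not from the paper; the function name, data layout, and example answers are hypothetical): for each question, the most common answer across a group's or model's repeated trials is taken as the consensus answer and graded against the key.

```python
from collections import Counter

def majority_vote_score(trials, answer_key):
    """Grade the per-question consensus answer across repeated trials.

    trials: list of answer lists, one per trial, e.g. [["A", "C", ...], ...]
    answer_key: list of correct choices, one per question.
    Returns the fraction of questions the consensus answer gets right.
    """
    n_questions = len(answer_key)
    correct = 0
    for q in range(n_questions):
        votes = Counter(trial[q] for trial in trials)
        consensus, _ = votes.most_common(1)[0]  # ties break arbitrarily
        if consensus == answer_key[q]:
            correct += 1
    return correct / n_questions

# Hypothetical example: three trials over a five-question exam.
trials = [
    ["A", "B", "C", "D", "A"],
    ["A", "C", "C", "D", "B"],
    ["A", "B", "C", "A", "B"],
]
answer_key = ["A", "B", "C", "D", "B"]
print(majority_vote_score(trials, answer_key))  # 1.0
```

Under this scheme a group can outscore any of its individual members: each wrong answer is outvoted whenever the other trials agree on the correct one, as in every question of the example above.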

Results

ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, and its accuracy improved when it was prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across repeated trials, whether correct or incorrect, a characteristic not observed in the human test groups or Bard (LaMDA). In the deductive reasoning evaluation, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scored by majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.

Conclusion

This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.

SUBMITTER: Holmes J 

PROVIDER: S-EPMC10388568 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature


Publications

Evaluating large language models on a highly-specialized topic, radiation oncology physics.

Jason Holmes, Zhengliang Liu, Lian Zhang, Yuzhen Ding, Terence T. Sio, Lisa A. McGee, Jonathan B. Ashman, Xiang Li, Tianming Liu, Jiajian Shen, Wei Liu

Frontiers in Oncology, 2023-07-17



Similar Datasets

| S-EPMC10831180 | biostudies-literature
| S-EPMC11762905 | biostudies-literature
| S-EPMC10449915 | biostudies-literature
| S-EPMC10168498 | biostudies-literature
| S-EPMC10909174 | biostudies-literature
| S-EPMC10656647 | biostudies-literature
| S-EPMC11551352 | biostudies-literature
| S-EPMC10988356 | biostudies-literature
| S-EPMC11695923 | biostudies-literature
| S-EPMC11770143 | biostudies-literature