Dataset Information

Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries.


ABSTRACT: In light of growing interest in using emerging large language models (LLMs) for self-diagnosis, we systematically assessed the performance of ChatGPT-3.5, ChatGPT-4.0, and Google Bard in responding to 37 common inquiries about ocular symptoms. Responses were masked, randomly shuffled, and then graded by three consultant-level ophthalmologists for accuracy (poor, borderline, good) and comprehensiveness. We also evaluated the chatbots' self-awareness capabilities (their ability to self-check and self-correct). ChatGPT-4.0 received 'good' accuracy ratings for 89.2% of responses, significantly outperforming ChatGPT-3.5 (59.5%) and Google Bard (40.5%) (both p < 0.001). All three chatbots also achieved high mean comprehensiveness scores (ranging from 4.6 to 4.7 out of 5). However, they exhibited only subpar-to-moderate self-awareness. Our study underscores the potential of ChatGPT-4.0 to deliver accurate and comprehensive responses to ocular symptom inquiries; rigorous future validation is crucial to ensure reliability and appropriateness for actual clinical use.

SUBMITTER: Pushpanathan K 

PROVIDER: S-EPMC10616302 | biostudies-literature | 2023 Nov

REPOSITORIES: biostudies-literature

Publications

Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries.

Krithi Pushpanathan, Zhi Wei Lim, Samantha Min Er Yew, David Ziyou Chen, Hazel Anne Hui'En Lin, Jocelyn Hui Lin Goh, Wendy Meihua Wong, Xiaofei Wang, Marcus Chun Jin Tan, Victor Teck Chang Koh, Yih-Chung Tham

iScience, 2023 Oct 10; 26(11)



Similar Datasets

| S-EPMC11806297 | biostudies-literature
| S-EPMC11487020 | biostudies-literature
| S-EPMC6567837 | biostudies-literature
| S-EPMC10410472 | biostudies-literature
| S-EPMC9079685 | biostudies-literature
| S-EPMC10585328 | biostudies-literature
| S-EPMC6123060 | biostudies-literature
| S-EPMC11922739 | biostudies-literature
| S-EPMC10791738 | biostudies-literature
| S-EPMC10976360 | biostudies-literature