Research Review

Benchmarking Large Language Models’ Performances for Myopia Care

January 1, 2024

By Dwight Akerman, OD, MBA, FAAO, FBCLA, FIACLE

The study by Lim et al. (2023) aimed to benchmark the performance of three large language models for myopia care, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard. Artificial intelligence (AI) and natural language processing (NLP) technologies have shown promise in improving myopia care by providing personalized recommendations and increasing patient engagement.

The study used a comparative analysis approach to evaluate the three language models. The researchers curated 31 commonly asked myopia care-related questions spanning six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the chatbots, and their responses were independently graded for accuracy by three consultant-level pediatric ophthalmologists on a three-point scale (poor, borderline, good), with the final rating for each response determined by majority consensus. Responses rated ‘good’ were further scored for comprehensiveness on a five-point scale, while responses rated ‘poor’ were prompted for self-correction and re-graded.
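
To make the grading workflow concrete, the following minimal sketch (in Python) shows how a majority-consensus rating and the share of ‘good’ responses might be tallied. It is illustrative only and not the authors’ code; the sample questions, grader labels, and tie-breaking rule are assumptions made for the example.

```python
# Illustrative sketch only (not the study's actual pipeline): derive a
# majority-consensus rating from three graders' labels and compute the
# share of responses rated 'good'. Questions, labels, and the tie-break
# rule are invented for this example.
from collections import Counter

def consensus(grades):
    """Return the rating assigned by at least two of the three graders."""
    label, count = Counter(grades).most_common(1)[0]
    # Assumption: if all three graders disagree, fall back to 'borderline'.
    return label if count >= 2 else "borderline"

grader_ratings = {
    "What causes myopia?": ["good", "good", "borderline"],
    "Can atropine eye drops slow myopia progression?": ["good", "poor", "poor"],
    "Does screen time worsen myopia?": ["good", "good", "good"],
}

final_ratings = {question: consensus(grades) for question, grades in grader_ratings.items()}
pct_good = 100 * sum(rating == "good" for rating in final_ratings.values()) / len(final_ratings)
print(final_ratings)
print(f"{pct_good:.1f}% of responses rated 'good'")
```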

The results showed that all three chatbots handled most myopia-related questions reasonably well, but ChatGPT-4.0 was the clear leader in accuracy: 80.6% of its responses were rated ‘good’, compared with 61.3% for ChatGPT-3.5 and 54.8% for Google Bard. All three models produced highly comprehensive answers and improved a substantial share of their poorly rated responses when prompted to self-correct. Performance also varied by domain, with accuracy dipping most in the ‘treatment and prevention’ domain, where ChatGPT-4.0 nonetheless remained the strongest of the three.
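
For readers curious how such differences in ‘good’-rating proportions can be tested, the short sketch below runs pairwise Pearson chi-squared tests with SciPy. The counts (25, 19, and 17 ‘good’ ratings out of 31 questions) are back-calculated from the reported percentages; the authors’ exact statistical comparisons may have been set up differently (for example, over the full poor/borderline/good distributions), so the p-values this sketch produces will not necessarily match those reported in the paper.

```python
# Hedged sketch: pairwise Pearson chi-squared tests on the proportion of
# 'good'-rated responses. Counts are back-calculated from the abstract's
# percentages (80.6%, 61.3%, and 54.8% of 31 questions); this is not the
# authors' code, and their exact comparisons may differ.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 31
good_counts = {"ChatGPT-4.0": 25, "ChatGPT-3.5": 19, "Google Bard": 17}

for model in ("ChatGPT-3.5", "Google Bard"):
    table = [
        [good_counts["ChatGPT-4.0"], TOTAL_QUESTIONS - good_counts["ChatGPT-4.0"]],
        [good_counts[model], TOTAL_QUESTIONS - good_counts[model]],
    ]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"ChatGPT-4.0 vs {model}: chi2 = {chi2:.2f}, p = {p:.3f}")
```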

The study has several implications for myopia care. Large language models could help patients and parents obtain accurate answers to common myopia questions quickly, supporting patient education alongside clinical care. With further development, such models may also be able to tailor information to a patient’s symptoms, medical history, and lifestyle factors, which could improve engagement and reduce the burden on eye care professionals.

The study also highlights the importance of evaluating the performance of language models in specific domains, such as health care. Different language models may have varying performance depending on the type of data they are trained on and the specific task they are designed to perform. Therefore, it is essential to carefully evaluate the performance of language models before deploying them in real-world applications.

Overall, the study by Lim et al. (2023) provides valuable insights into the performance of large language models for myopia care. It shows that, of the three chatbots evaluated, ChatGPT-4.0 delivered the most accurate and comprehensive answers to myopia-related questions. The study also highlights the potential of AI and NLP technology to improve myopia care and reduce the burden on eye care professionals.

Abstract

Benchmarking Large Language Models’ Performances for Myopia Care: A Comparative Analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

Zhi Wei Lim, Krithi Pushpanathan, Samantha Min Er Yew, Yien Lai, Chen-Hsin Sun, Janice Sing Harn Lam, David Ziyou Chen, Jocelyn Hui Lin Goh, Marcus Chun Jin Tan, Bin Sheng, Ching-Yu Cheng, Victor Teck Chang Koh, Yih-Chung Tham

Background: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs’ accuracy in specific medical domains has not yet been thoroughly evaluated. Myopia is a frequent topic on which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries.

Methods: We curated thirty-one commonly asked myopia care-related questions, which were categorized into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level pediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. ‘Good’-rated responses were further evaluated for comprehensiveness on a five-point scale. Conversely, ‘poor’-rated responses were further prompted for self-correction and then re-evaluated for accuracy.

Findings: ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated as ‘good’, compared to 61.3% in ChatGPT-3.5 and 54.8% in Google Bard (Pearson’s chi-squared test, all p ≤ 0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum score of 5). All LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 in 3) of ChatGPT-4.0’s, 40% (2 in 5) of ChatGPT-3.5’s, and 60% (3 in 5) of Google Bard’s responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for ‘treatment and prevention’. However, ChatGPT-4.0 still performed superiorly in this domain, receiving 70% ‘good’ ratings, compared to 40% in ChatGPT-3.5 and 45% in Google Bard (Pearson’s chi-squared test, all p ≤ 0.001).

Interpretation: Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs’ accuracy remain crucial.

Lim, Z. W., Pushpanathan, K., Yew, S. M. E., Lai, Y., Sun, C. H., Lam, J. S. H., … & Tham, Y. C. (2023). Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine, 95, 104770.

DOI: https://doi.org/10.1016/j.ebiom.2023.104770
