The study acknowledges that its results are time-sensitive and that the performance of ChatGPT, particularly GPT-4, is expected to improve rapidly. The findings may therefore not generalize to later versions of these models.
Exclusion of Image and Table-Based Questions
Excluding questions with images and tables was necessary for a like-for-like comparison between GPT-3.5 and GPT-4, but it means the evaluation does not reflect real-world use of these models in medical contexts, where such visual information is crucial.
Focus on a Single Model Family
The study focuses solely on ChatGPT and does not consider other large language models (LLMs). This narrows the scope of the findings and prevents broader conclusions about the capabilities of LLMs in medical education and practice.
Single Examination and Language
The study uses a single examination (JMLE) in a single language (Japanese). This limits the generalizability of the findings to other medical examinations and other languages.
Limited Discussion of Hallucinations
The study does not address "hallucinations" in detail, even though they are a significant concern with LLMs, especially for medical information, where accuracy is paramount.