Evaluation of the performance of different large language models on head and neck anatomy questions in the dentistry specialization exam in Turkey
Abstract
Purpose: The aim of this study was to assess the performance of various large language models (LLMs) on head and neck anatomy questions from the Dental Specialization Exam (DUS) administered in Turkey between 2012 and 2021.

Methods: A total of 103 multiple-choice questions were selected from DUS examinations spanning a decade. The questions covered four major topics: Musculoskeletal System; Nervous System and Sensory Organs; Dental Anatomy; and Veins, Arteries, Lymphatic System and Glandular System. Eight LLMs (Gemini 1.5, Gemini 2, Copilot, Deepseek, Claude, ChatGPT 4o, ChatGPT 4, and ChatGPT o1) were tested in their most recent versions. Each model's accuracy was calculated from its numbers of correct and incorrect responses.

Results: ChatGPT o1 achieved the highest accuracy among all tested models, while Gemini 1.5 showed the lowest. The differences among the models were statistically significant (p = 0.027). Post-hoc analysis revealed that the only statistically significant pairwise difference was between ChatGPT o1 and Gemini 1.5 (p < 0.0031). When questions were analyzed by topic, no significant accuracy differences emerged in the Musculoskeletal System section. ChatGPT o1 performed best in the Nervous System and Sensory Organs category; both ChatGPT o1 and Copilot achieved top results on Dental Anatomy questions; and ChatGPT o1 again excelled in the Veins, Arteries, Lymphatic System and Glandular System section.

Conclusion: Overall, the findings show that LLMs can answer DUS head and neck anatomy questions effectively and with largely comparable performance. These insights support future exam-oriented model development and suggest that LLMs can serve as valuable educational tools.
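
The analysis described in the Methods and Results can be illustrated with a minimal sketch. The abstract does not specify the statistical test used; a chi-square test on correct/incorrect counts, with Bonferroni-adjusted pairwise post-hoc comparisons, is one common choice for this kind of design and is assumed here. The per-model correct counts below are hypothetical placeholders for illustration only, not the study's data.

```python
# Sketch of the accuracy comparison, assuming a chi-square test;
# the correct-answer counts are HYPOTHETICAL, not the study's results.
from itertools import combinations
from scipy.stats import chi2_contingency

N_QUESTIONS = 103  # head and neck anatomy questions, DUS 2012-2021

# Hypothetical correct-answer counts per model (out of 103).
correct = {
    "ChatGPT o1": 95, "ChatGPT 4o": 88, "ChatGPT 4": 85, "Claude": 86,
    "Deepseek": 84, "Copilot": 87, "Gemini 2": 83, "Gemini 1.5": 74,
}

# Accuracy per model: correct responses divided by total questions.
for model, c in correct.items():
    print(f"{model}: {c / N_QUESTIONS:.1%}")

# Overall test: do correct/incorrect proportions differ across models?
table = [[c, N_QUESTIONS - c] for c in correct.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"overall chi-square p = {p:.3f}")

# Post-hoc pairwise comparisons with a Bonferroni-adjusted threshold.
pairs = list(combinations(correct.items(), 2))
alpha = 0.05 / len(pairs)  # corrected significance level
for (m1, c1), (m2, c2) in pairs:
    _, p_pair, _, _ = chi2_contingency([[c1, N_QUESTIONS - c1],
                                        [c2, N_QUESTIONS - c2]])
    if p_pair < alpha:
        print(f"{m1} vs {m2}: p = {p_pair:.4f} (significant)")
```

Under this setup, only pairs whose corrected p-value falls below the adjusted threshold are reported as significant, mirroring the abstract's finding that ChatGPT o1 versus Gemini 1.5 was the sole significant pairwise difference.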
















