Can large language models perform clinical anamnesis? Comparative evaluation of ChatGPT, Claude, and Gemini in diagnostic reasoning through case-based questioning in oral and maxillofacial disorders
Abstract
Introduction: This study aimed to evaluate whether large language models (LLMs) can emulate the clinical anamnesis process and diagnostic reasoning of oral and maxillofacial surgeons.
Materials and methods: Twenty-five real clinical cases from five diagnostic categories (maxillary sinus diseases, periapical pathologies, orofacial pain disorders and neuropathic pain syndromes, odontogenic cysts and tumors, and temporomandibular joint disorders) were simulated. Three LLMs (ChatGPT-4o, Claude 4, and Gemini 2.5) were each provided only the patient's chief complaint and instructed to ask up to ten sequential questions to reach a diagnosis. One independent evaluator scored model performance on a 100-point scale, deducting 10 points for each additional question asked. Statistical comparisons were conducted using Kruskal-Wallis and Bonferroni post-hoc tests.
Results: No statistically significant difference was found among the models (p = 0.431). Gemini achieved the highest mean diagnostic score (43.6 ± 40.71), followed by ChatGPT-4o (37.2 ± 36.8) and Claude (31.6 ± 33.0). Diagnostic accuracy was highest in moderately difficult cases (p = 0.021) and markedly decreased in difficult ones (p = 0.016).
Conclusion: LLMs demonstrated the ability to perform structured anamnesis and reach clinically meaningful diagnostic conclusions from limited information. Although no significant difference was observed among the models, Gemini achieved the highest overall mean score. These findings indicate that LLMs hold potential as complementary tools for diagnostic reasoning and as simulation-based educational resources in oral and maxillofacial surgery.
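
To make the scoring rule and statistical workflow concrete, the sketch below illustrates, under stated assumptions, how per-case scores and the between-model comparison described above could be computed in Python. The score lists, the flooring at zero, and the use of pairwise Mann-Whitney U tests for the Bonferroni-corrected post-hoc step are illustrative assumptions, not details reported in the study.

```python
# Illustrative sketch (not the authors' analysis code): scoring and model
# comparison as described in the abstract. Score inputs are placeholders.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def case_score(extra_questions: int) -> int:
    """Hypothetical scoring rule: start at 100, deduct 10 points per
    additional question asked, floored at 0 (flooring is an assumption)."""
    return max(0, 100 - 10 * extra_questions)

# Placeholder per-case question counts for the three models
# (the study used 25 cases per model).
scores = {
    "ChatGPT-4o": [case_score(q) for q in [2, 5, 0, 7, 3]],
    "Claude 4":   [case_score(q) for q in [4, 6, 1, 8, 5]],
    "Gemini 2.5": [case_score(q) for q in [1, 3, 0, 6, 2]],
}

# Omnibus comparison across the three models (Kruskal-Wallis test).
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.3f}")

# Bonferroni-corrected pairwise comparisons (Mann-Whitney U assumed here;
# the abstract does not specify the pairwise test used).
pairs = list(combinations(scores, 2))
for a, b in pairs:
    _, p = mannwhitneyu(scores[a], scores[b])
    print(f"{a} vs {b}: corrected p = {min(1.0, p * len(pairs)):.3f}")
```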
















