ASSESSING THE ADEQUACY OF ARTIFICIAL INTELLIGENCE MODELS IN ANSWERING SPINE SURGERY QUESTIONS FROM THE ORTHOPEDIC RESIDENCY TRAINING AND DEVELOPMENT EXAMINATION
Abstract
Objective: Artificial intelligence (AI) has undergone remarkable advancements in recent years, and its integration across various domains has been transformative. In the field of medicine, AI applications are rapidly expanding, offering novel opportunities for clinical practice, decision-making, and medical education. The present study sought to assess the performance and reliability of state-of-the-art AI models in answering spine surgery questions from the Orthopedic Residency Training and Development Examination conducted in Türkiye between 2010 and 2023.
Materials and Methods: A total of 286 spine surgery questions were systematically analyzed. The reference standard was established using the official correct answers, which were subsequently compared with the outputs generated by three advanced AI models: Chat Generative Pre-trained Transformer-5.0 (ChatGPT-5.0), Gemini-Pro, and DeepSeek-V3. Model performance was evaluated in terms of accuracy, error rate, and non-response rate. Comparative analyses among models were performed using chi-square and McNemar tests with pairwise post-hoc comparisons. Wilson's method was employed to calculate 95% confidence intervals (CIs). In addition, subgroup analyses were conducted according to question categories and temporal strata.
Results: Gemini-Pro achieved the highest accuracy rate (85.3%), demonstrating statistically significant superiority over ChatGPT-5.0 (71.7%, p<0.001). The overall accuracy rates were as follows: Gemini-Pro, 85.3% (95% CI: 80.7-88.9; non-response 1.4%); DeepSeek-V3, 78.0% (95% CI: 72.8-82.4; non-response 3.8%); and ChatGPT-5.0, 71.7% (95% CI: 66.2-76.6; non-response 10.8%). Temporal analyses revealed that both Gemini-Pro and DeepSeek-V3 performed better on questions from earlier years, with Gemini-Pro maintaining superior and stable performance in the later periods. In contrast, ChatGPT-5.0 exhibited persistently lower accuracy across all intervals.
Conclusion: Gemini-Pro demonstrated the most consistent and robust performance across both overall and temporal analyses. These findings underscore the promising role of AI in orthopedic residency education, particularly in examination preparation. Nevertheless, integration of AI into training curricula should be approached with caution, as expert oversight remains indispensable to ensure reliability and clinical applicability.
©Copyright 2025 The Author.
Volume: 36
Issue: 4
URI: https://doi.org/10.4274/jtss.galenos.2025.74436
https://search.trdizin.gov.tr/tr/yayin/detay/1350220
https://hdl.handle.net/20.500.12933/3020
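The statistical results reported in the abstract can be sanity-checked from the published figures alone. The Python sketch below is not the authors' analysis code: it back-calculates correct-answer counts from the reported accuracy rates (e.g., 85.3% of 286 is approximately 244) and reproduces the reported Wilson 95% CIs; the 2x2 McNemar table is illustrative, with cell counts chosen only to be consistent with the reported marginal accuracies of Gemini-Pro and ChatGPT-5.0.

# Sanity-check sketch, not the study's analysis code. Counts are
# back-calculated approximations from the reported percentages.
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar

N = 286  # total spine surgery questions analyzed

# Correct-answer counts inferred from the reported accuracy rates.
models = {
    "Gemini-Pro":  round(0.853 * N),  # ~244 correct
    "DeepSeek-V3": round(0.780 * N),  # ~223 correct
    "ChatGPT-5.0": round(0.717 * N),  # ~205 correct
}

for name, correct in models.items():
    lo, hi = proportion_confint(correct, N, alpha=0.05, method="wilson")
    print(f"{name}: {correct}/{N} = {correct/N:.1%} "
          f"(Wilson 95% CI {lo:.1%}-{hi:.1%})")

# Hypothetical paired-outcome table (rows: Gemini-Pro correct/incorrect;
# columns: ChatGPT-5.0 correct/incorrect). Cell counts are illustrative;
# only the margins (244 vs. 205 correct) come from the abstract.
table = [[192, 52],
         [13, 29]]
result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {result.statistic:.2f}, p = {result.pvalue:.4f}")

Running this reproduces all three reported intervals (80.7-88.9, 72.8-82.4, and 66.2-76.6). Note also that the reported margins fix the discordant-cell difference at 244 − 205 = 39, so any paired table consistent with those margins yields p<0.001 on the McNemar test, in line with the reported Gemini-Pro versus ChatGPT-5.0 comparison.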