Testing LLM diagnostics in endodontics: The impact of linguistic variation on unseen cases
Document Type
Article
Department
Dental-oral, Maxillo-facial Surgery
Abstract
Aim: To assess the diagnostic performance of two language models, GPT-5 Plus and Gemini 2.5 Flash, on a curated benchmark dataset of unseen clinical case scenarios in endodontics and restorative dentistry, and on controlled linguistic variations of that dataset. Additionally, a descriptive qualitative analysis was performed on a subset of cases to evaluate the quality of the reasoning generated by both models.
Methodology: One hundred single-best-answer multiple-choice questions (MCQs) were generated using standardised resources, constituting the benchmark dataset. Three controlled linguistic variations of the original dataset were introduced: paraphrasing (sentence/clause rewording), perturbation (token-level substitutions), and permutation (answer-order shuffle). The case scenarios were presented to both models using a standardised prompt, and performance metrics (accuracy/recall, F1 score) were computed. Agreement between and within models was analysed using Cohen's κ, while paired differences were evaluated using McNemar's test, with p < 0.05 considered significant. Qualitative analysis was performed on a subset of the total sample, and the responses were rated on a 3-point Likert scale.
Results: GPT-5 Plus achieved 80% accuracy on the benchmark dataset compared to 66% for Gemini 2.5 Flash (McNemar's p = 0.0066). When linguistic variations were introduced, the performance of GPT-5 Plus declined, with perturbation having the most significant effect (McNemar's p = 0.003). Gemini 2.5 Flash, on the other hand, despite its inferior initial performance on the benchmark dataset, maintained uniform decision patterns across all transformations, with no further significant drop. The descriptive qualitative analysis showed a higher overall proportion of responses rated as good (8/10, 80% for the original dataset; 7/10, 70% for linguistic variations) for Gemini 2.5 Flash than for GPT-5 Plus.
Conclusion: GPT-5 Plus outperformed Gemini 2.5 Flash on the benchmark dataset; however, it was sensitive to linguistic variations. Perturbation negatively influenced the performance of GPT-5 Plus, emphasising the need to further investigate the linguistic phenomena that may have driven the model's degradation. Additionally, the descriptive qualitative analysis demonstrated relatively higher performance for Gemini 2.5 Flash compared to GPT-5 Plus on the original dataset and across linguistic variations. However, owing to the descriptive nature of the findings and the limited sample size, the results should be interpreted with caution.
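The permutation step and the two statistics named in the Methodology (McNemar's test and Cohen's κ) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the exact (binomial) form of McNemar's test is assumed because the abstract does not say which variant was used, and no study data are reproduced here.

```python
from math import comb
import random


def permute_options(options, answer_index, rng):
    """Shuffle answer options (the 'permutation' variation).

    Returns the shuffled options and the new index of the correct answer.
    Hypothetical helper; the study's own shuffling procedure is not described.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = items condition A got right and condition B got wrong; c = the reverse.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    Assumption: exact variant; the abstract does not specify exact vs chi-square.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (list(labels_a).count(cat) / n) * (list(labels_b).count(cat) / n)
        for cat in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

In use, the answers each model gives to the original and transformed versions of the same question would be paired item by item: the discordant counts feed `mcnemar_exact`, and the two answer sequences feed `cohen_kappa`.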
AKU Student
no
Publication (Name of Journal)
International Endodontic Journal
DOI
10.1111/iej.70109
Recommended Citation
Batool, I., Naved, N., Umer, F. (2026). Testing LLM diagnostics in endodontics: The impact of linguistic variation on unseen cases. International Endodontic Journal, 1-9.
Available at:
https://ecommons.aku.edu/pakistan_fhs_mc_surg_dent_oral_maxillofac/295
Comments
Volume and issue number are not provided by the author/publisher.