Section of Dental-Oral Maxillofacial Surgery

Testing LLM diagnostics in endodontics: The impact of linguistic variation on unseen cases

Document Type

Article

Department

Dental-oral, Maxillo-facial Surgery

Abstract

Aim: To assess the diagnostic performance of two language models, GPT-5 Plus and Gemini 2.5 Flash, on a curated benchmark dataset of unseen clinical case scenarios in endodontics and restorative dentistry, and on controlled linguistic variations of that dataset. Additionally, a descriptive qualitative analysis was performed on a subset of cases to evaluate the quality of the reasoning generated by both models.
Methodology: One hundred single-best-answer MCQs were generated using standardised resources, constituting a benchmark dataset. Controlled linguistic variations of the original dataset were then introduced: paraphrasing (sentence/clause rewording), perturbation (token-level substitutions), and permutation (answer-order shuffle). These case scenarios were presented to both models using a standardised prompt, and performance metrics (accuracy/recall, F1 score) were computed. Agreement between and within models was analysed using Cohen's κ, while paired differences were evaluated using McNemar's test, with statistical significance set at p < 0.05. Qualitative analysis was performed on a subset of the total sample, and the responses were rated on a 3-point Likert scale.
Results: GPT-5 Plus achieved 80% accuracy on the benchmark dataset compared with 66% for Gemini 2.5 Flash (McNemar's p = 0.0066). When linguistic variations were introduced, the performance of GPT-5 Plus declined, with perturbation having the most pronounced effect (McNemar's p = 0.003). Gemini 2.5 Flash, despite its inferior initial performance on the benchmark dataset, maintained uniform decision patterns across all transformations, with no further significant drop. The descriptive qualitative analysis showed an overall higher proportion of responses rated as good for Gemini 2.5 Flash (8/10, 80% for the original dataset; 7/10, 70% for linguistic variations) than for GPT-5 Plus.
Conclusion: GPT-5 Plus outperformed Gemini 2.5 Flash on the benchmark dataset; however, it was sensitive to linguistic variations. Perturbation negatively influenced the performance of GPT-5 Plus, emphasising the need to further investigate the linguistic phenomena that may have contributed to the model's degradation. Additionally, the descriptive qualitative analysis showed relatively higher performance for Gemini 2.5 Flash than for GPT-5 Plus on the original dataset and across linguistic variations. However, owing to the descriptive nature of these findings and the limited sample size, the results should be interpreted with caution.
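
The paired statistics named in the Methodology (McNemar's test and Cohen's κ) can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the choice of the exact (binomial) form of McNemar's test, and the toy correctness vectors are all assumptions made for illustration, and the abstract does not state which variant of the test or which software was used.

```python
from math import comb

def discordant_counts(first, second):
    """Count paired items where exactly one of the two runs was correct."""
    b = sum(1 for x, y in zip(first, second) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(first, second) if x == 0 and y == 1)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(first)
    labels = set(first) | set(second)
    p_observed = sum(x == y for x, y in zip(first, second)) / n
    p_expected = sum(
        (first.count(label) / n) * (second.count(label) / n) for label in labels
    )
    if p_expected == 1:
        return 1.0  # both sequences use a single identical label
    return (p_observed - p_expected) / (1 - p_expected)

# Toy data (hypothetical, NOT taken from the study): 1 = correct, 0 = incorrect.
model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

b, c = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(b, c)
kappa = cohen_kappa(model_a, model_b)
print(f"discordant pairs: b={b}, c={c}; McNemar p={p_value:.3f}; kappa={kappa:.2f}")
```

The same functions apply to a within-model comparison (for example, original versus perturbed items) by passing the two correctness vectors for one model. Here κ is computed on correct/incorrect outcomes; computing it on the answer options chosen would be an equally plausible reading of "agreement".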

Comments

Volume and issue number are not provided by the author/publisher.

Publication (Name of Journal)

International Endodontic Journal

DOI

10.1111/iej.70109

Recommended Citation

Batool, I., Naved, N., Umer, F. (2026). Testing LLM diagnostics in endodontics: The impact of linguistic variation on unseen cases. International Endodontic Journal, 1-9.
Available at: https://ecommons.aku.edu/pakistan_fhs_mc_surg_dent_oral_maxillofac/295
