Testing LLM diagnostics in endodontics: The impact of linguistic variation on unseen cases

Document Type

Article

Department

Dental-oral, Maxillo-facial Surgery

Abstract

Aim: To assess the diagnostic performance of two language models, GPT-5 Plus and Gemini 2.5 Flash, using a curated benchmark dataset of unseen clinical case scenarios in endodontics and restorative dentistry, together with linguistic variations derived from the original dataset. Additionally, a descriptive qualitative analysis was performed on a subset of cases to evaluate the quality of the reasoning generated by both models.
Methodology: One hundred single-best-answer MCQs were generated using standardised resources, constituting a benchmark dataset. Controlled linguistic variations of the original dataset were then introduced: paraphrasing (sentence/clause rewording), perturbation (token-level substitutions), and permutation (answer-order shuffling). These case scenarios were presented to both models using a standardised prompt, and performance metrics (accuracy/recall, F1 score) were computed. Agreement between and within models was analysed using Cohen's κ, while paired differences were evaluated using McNemar's test, with p < 0.05 considered statistically significant. Qualitative analysis was performed on a subset of the total sample, and the responses were rated on a 3-point Likert scale.
Results: GPT-5 Plus achieved 80% accuracy on the benchmark dataset compared with 66% for Gemini 2.5 Flash (McNemar's p-value = 0.0066). When linguistic variations were introduced, the performance of GPT-5 Plus declined, with perturbation having the most pronounced effect (McNemar's p-value = 0.003). Gemini 2.5 Flash, although it performed worse on the benchmark dataset initially, maintained uniform decision patterns across all transformations, with no further significant decline. The descriptive qualitative analysis showed a higher overall proportion of responses rated as good for Gemini 2.5 Flash (8/10, 80% for the original dataset; 7/10, 70% for linguistic variations) than for GPT-5 Plus.
Conclusion: GPT-5 Plus outperformed Gemini 2.5 Flash on the benchmark dataset; however, it was sensitive to linguistic variations. Perturbation negatively influenced the performance of GPT-5 Plus, emphasising the need to further investigate the linguistic phenomena that may have driven the model's degradation. Additionally, the descriptive qualitative analysis showed relatively higher performance for Gemini 2.5 Flash than for GPT-5 Plus, both on the original dataset and across linguistic variations. However, owing to the descriptive nature of these findings and the limited sample size, the results should be interpreted with caution.
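
Illustrative note: the paired comparison and agreement statistics named in the Methodology can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' analysis code. The abstract does not say whether the exact or the chi-square form of McNemar's test was used, so the exact binomial form is assumed here, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (b, c, p) where b counts items only model A got right and
    c counts items only model B got right. Assumption: exact binomial form.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        # No discordant pairs: the two models never disagree on correctness.
        return b, c, 1.0
    # Binomial(n, 0.5) tail up to the smaller discordant count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of chosen answers."""
    n = len(labels_a)
    observed = sum(1 for x, y in zip(labels_a, labels_b) if x == y) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the marginal answer frequencies of each rater.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / n ** 2
    if expected == 1.0:
        # Both raters always gave the same single label.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    model_a_correct = [True, True, True, False, True, False]
    model_b_correct = [True, False, False, False, True, True]
    print(mcnemar_exact(model_a_correct, model_b_correct))

    model_a_answers = ["A", "C", "B", "D", "A", "B"]
    model_b_answers = ["A", "B", "B", "D", "A", "C"]
    print(cohens_kappa(model_a_answers, model_b_answers))
```

McNemar's test uses only the discordant items, which is why it suits a design in which both models answer the same 100 questions. Cohen's κ is computed here on the chosen answer options rather than on correctness, which is one plausible reading of "agreement between and within models".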

Comments

Volume and issue number are not provided by the author/publisher.

AKU Student

no

Publication (Name of Journal)

International Endodontic Journal

DOI

10.1111/iej.70109
