Linguistic disparities in artificial intelligence-generated patient education for total hip arthroplasty: A pilot study of cross-language analysis of leading large language models
Document Type
Article
Department
Orthopaedic Surgery; General Surgery
Abstract
Background: Large Language Models (LLMs) are increasingly used for health information, but concerns exist regarding performance disparities for non-English speakers, potentially exacerbating health inequities. Appropriate information is critical for patients with limited English proficiency undergoing orthopedic procedures such as total hip arthroplasty (THA). This pilot study evaluated differences in the clinical reliability of English and Spanish responses to common THA questions generated by leading LLMs.
Methods: Three widely accessible LLMs (ChatGPT-4o, Gemini 2.0 Flash, and Microsoft Copilot) were evaluated using 10 standardized frequently asked questions on THA, each posed in English and Spanish. Responses were independently graded by language-fluent medical experts using a 4-point rubric (1 = Unsatisfactory to 4 = Excellent) assessing clinical reliability and appropriateness. Comparisons used nonparametric tests (Wilcoxon signed-rank and Kruskal-Wallis) with effect sizes (Cliff's delta, η²).
Results: A statistically significant main effect of language was found (p = 0.014, η² = 0.151), indicating significantly lower clinical reliability scores for Spanish responses when pooled across the LLMs. Within each of the 3 LLMs, scores declined from English to Spanish, but these within-model declines were not statistically significant.
Conclusion: Leading LLMs exhibit a significant difference in clinical reliability when providing THA information, performing less reliably in Spanish than in English. This linguistic gap suggests a risk of differences in how responses are interpreted and could worsen health inequities for Spanish-speaking populations. Efforts are needed to improve multilingual capabilities and manage biases in medical artificial intelligence (AI). Clinicians and patients should exercise caution when using LLMs for health information in languages other than English until cross-lingual reliability is demonstrably improved.
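As an illustration of one effect size named in the Methods, the following is a minimal sketch of Cliff's delta in plain Python. It is not the authors' analysis code. The function name `cliffs_delta` and the score lists are hypothetical placeholders invented for this example and are not data from the study.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return Cliff's delta for two groups of ordinal scores.

    delta = (#pairs with x > y - #pairs with x < y) / (len(xs) * len(ys))
    The value ranges from -1 (every x below every y) to +1 (every x above every y).
    """
    if not xs or not ys:
        raise ValueError("both groups must contain at least one score")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical rubric scores on the 1-4 scale (placeholders, NOT study data).
english_scores = [4, 4, 3, 4]
spanish_scores = [3, 4, 3, 3]

delta = cliffs_delta(english_scores, spanish_scores)
print(f"Cliff's delta (English vs Spanish): {delta:.2f}")
```

A positive value here would mean the English scores tend to exceed the Spanish scores. Cliff's delta compares every cross-group pair and ignores the question-level pairing, so it complements a paired test such as the Wilcoxon signed-rank test and does not replace it.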
AKU Student
no
Publication (Name of Journal)
AOA Critical Issues in Education
DOI
10.2106/JBJS.OA.25.00207
Recommended Citation
Ali, U., Khan, J., Tareen, H., Kanza, S., Pedroza, J. A., Amjad, F., Khan, S., Hasan, A., Malik, M. A. (2026). Linguistic disparities in artificial intelligence-generated patient education for total hip arthroplasty: A pilot study of cross-language analysis of leading large language models. AOA Critical Issues in Education, 11(1).
Available at:
https://ecommons.aku.edu/pakistan_fhs_mc_surg_orthop/183
Comments
Pagination is not provided by author/publisher.