Linguistic disparities in artificial intelligence-generated patient education for total hip arthroplasty: A pilot study of cross-language analysis of leading large language models

Document Type

Article

Department

Orthopaedic Surgery; General Surgery

Abstract

Background: Large language models (LLMs) are increasingly used for health information, but they may perform worse for non-English speakers, which could exacerbate health inequities. Appropriate information is critical for patients with limited English proficiency undergoing orthopedic procedures such as total hip arthroplasty (THA). This pilot study evaluated differences in the clinical reliability of English and Spanish responses to common THA questions generated by leading LLMs.
Methods: Three widely accessible LLMs (ChatGPT-4o, Gemini 2.0 Flash, and Microsoft Copilot) were evaluated using 10 standardized frequently asked questions on THA, each posed in English and in Spanish. Responses were independently graded by language-fluent medical experts using a 4-point rubric (1 = Unsatisfactory to 4 = Excellent) assessing clinical reliability and appropriateness. Comparisons used nonparametric tests (Wilcoxon signed-rank and Kruskal-Wallis) with effect sizes (Cliff's delta, η²).
Results: A statistically significant main effect of language was found (p = 0.014, η² = 0.151), indicating lower clinical reliability scores for Spanish responses across the LLMs overall. Within each of the 3 LLMs, scores declined from English to Spanish, but these within-model declines were not statistically significant.
Conclusion: Leading LLMs exhibit a significant difference in clinical reliability when providing THA information, performing less reliably in Spanish than in English. This linguistic gap suggests a risk that patients will interpret responses differently depending on language, and it could worsen health inequities for Spanish-speaking populations. Efforts are needed to improve multilingual capabilities and manage biases in medical artificial intelligence (AI). Clinicians and patients should exercise caution when using LLMs for health information in languages other than English until cross-lingual reliability is demonstrably improved.
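For readers unfamiliar with the effect size named in the Methods, the following is a minimal sketch of how Cliff's delta can be computed for two groups of ordinal rubric scores. It is not the authors' analysis code; the function name `cliffs_delta` and the example scores are hypothetical and are not data from the study.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    0 means neither group tends to score higher.
    """
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical 4-point rubric scores, for illustration only (NOT study data).
english_scores = [4, 4, 3, 4, 3]
spanish_scores = [3, 3, 3, 4, 2]

delta = cliffs_delta(english_scores, spanish_scores)
print(f"Cliff's delta (English vs Spanish): {delta:.2f}")
```

A positive value here would mean the first group (English) tends to receive higher scores than the second; ties contribute nothing, which suits a coarse 4-point scale.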

Comments

Pagination is not provided by author/publisher.

Publication (Name of Journal)

AOA Critical Issues in Education

DOI

10.2106/JBJS.OA.25.00207

Recommended Citation

Ali, U., Khan, J., Tareen, H., Kanza, S., Pedroza, J. A., Amjad, F., Khan, S., Hasan, A., Malik, M. A. (2026). Linguistic disparities in artificial intelligence-generated patient education for total hip arthroplasty: A pilot study of cross-language analysis of leading large language models. AOA Critical Issues in Education, 11(1).
Available at: https://ecommons.aku.edu/pakistan_fhs_mc_surg_orthop/183

eCommons@AKU

Section of Orthopaedic Surgery
