2025-11-16T17:58:12.985277

Dr. Bias: Social Disparities in AI-Powered Medical Guidance

Kondrup, Imouza

With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly proving to be competitive, and now even surpass professionals in some medical capabilities. They hold particular promise in low-resource settings, considering they provide the possibility of widely accessible, quasi-free healthcare support. However, evaluations that fuel these motivations highly lack insights into the social nature of healthcare, oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and impact users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they generate responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. We observe these trends amplify when intersectional groups are considered. Considering the increasing trust individuals place in these models, we argue for higher AI literacy and for the urgent need for investigation and mitigation by AI developers to ensure these systemic differences are diminished and do not translate to unjust patient support. Our code is publicly available on GitHub.

academic

Basic Information

Paper ID: 2510.09162
Title: Dr. Bias: Social Disparities in AI-Powered Medical Guidance
Authors: Emma Kondrup (Mila - Quebec AI Institute), Anne Imouza (McGill University)
Categories: cs.AI cs.CY
Publication date/venue: Accepted at the Symposium on Model Accountability, Sustainability and Healthcare 2025
Paper link: https://arxiv.org/abs/2510.09162

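Summary

With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications that can answer most health-related questions in a personalized manner. These LLMs are increasingly competitive, and in some medical capabilities even surpass professionals, which makes them particularly promising in low-resource settings. However, the evaluations behind these motivations largely lack insight into the social nature of healthcare: they overlook health disparities between social groups and how bias may carry over into LLM-generated medical advice and affect users. This study presents an exploratory analysis of LLM answers to medical questions spanning key clinical domains, simulating the questions as asked by patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, the study finds that LLMs generating medical advice produce responses that differ systematically between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex.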

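Research Background and Motivation

Problem Definition

The core question of the study is whether large language models exhibit systematic social bias when giving medical advice, and how that bias affects the quality of medical information that different demographic groups receive.

Importance

Social fairness: as LLMs become widely used for medical consultation, it is essential that all groups receive fair, high-quality medical information
Health disparities: existing real-world health disparities may be further widened by AI systems
Growing trust: public trust in AI medical advice keeps rising, which makes the bias problem more urgent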

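Limitations of Existing Methods

Missing social dimension: existing evaluations of LLM medical applications focus mainly on technical performance and overlook social fairness
Insufficient intersectional research: in-depth analysis of intersectional groups (e.g., Indigenous intersex patients) is lacking
No systematic bias detection: systematic methods for detecting and quantifying bias in medical advice are lacking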

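Core Contributions

Developed a systematic bias detection framework: built the "Dr. Bias" experimental pipeline, which systematically probes LLM medical advice for social bias
Revealed clear group differences: found that the medical advice received by Indigenous and intersex groups is less readable and more complex
Showed an intersectional effect: the observed disparities are amplified when intersectional groups are considered
Provided a multi-dimensional analysis framework: analyzes bias along several dimensions, including readability, sentiment, and medical urgency
Open-sourced the research tools: released the experiment code and data on GitHub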

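Method Details

Task Definition

Input: patient profiles with different demographic attributes + medical questions
Output: LLM-generated medical advice
Goal: detect and quantify systematic differences in the quality of medical advice between groups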
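
One way to make the goal above concrete is to score each response with a readability formula and average the scores per group. The sketch below uses the Flesch Reading Ease formula with a rough vowel-run syllable counter. This metric is an assumption chosen for illustration; the summary does not state which readability measures the paper uses. The function names, the group labels `group_a` and `group_b`, and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean


def count_syllables(word):
    """Rough heuristic: count runs of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease score; higher values mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def mean_score_by_group(responses):
    """Average readability per group from (group_label, response_text) pairs."""
    scores = defaultdict(list)
    for group, text in responses:
        scores[group].append(flesch_reading_ease(text))
    return {group: mean(values) for group, values in scores.items()}


# Hypothetical responses, invented only to exercise the functions.
simple_text = "Drink water. Rest well. See a doctor soon."
complex_text = (
    "Persistent dermatological inflammation necessitates comprehensive "
    "evaluation by a qualified specialist."
)
group_means = mean_score_by_group(
    [("group_a", simple_text), ("group_b", complex_text)]
)
```

A lower mean score for one group than another would indicate harder-to-read advice for that group under this particular metric; a real analysis would also need a statistical test across many responses.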

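Experimental Design Architecture

The study uses a two-stage generation pipeline:

Stage 1: Question Generation

Model: Llama-3-8B-Instruct
Patient profile construction:
- Age groups: children, adolescents, adults, older adults (4 classes)
- Sex: male, female, intersex (3 classes)
- Ethnicity: 7 major groups based on U.S. Census Bureau categories
  - American Indian or Alaska Native (AIAN)
  - Asian (A)
  - Black or African American (BAA)
  - Hispanic or Latino (HL)
  - Middle Eastern or North African (MENA)
  - Native Hawaiian or Pacific Islander (NHPI)
  - White or European American (WEA)
Total: 84 patient profiles (4×3×7)
Question categories: skin, respiratory, cardiac, mental health, general medicine (5 categories)
Generation strategy: 500 questions generated per profile (100 per category), with temperature 1.5 to increase diversity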
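
The profile grid and question counts described above can be checked with a short sketch. It only enumerates the combinations and the generation jobs; it does not call any model. The English attribute labels and the helper names `build_profiles` and `generation_jobs` are assumptions made for illustration and are not taken from the authors' code.

```python
from itertools import product

# Attribute values as listed in the summary; English labels are illustrative.
AGE_GROUPS = ["child", "adolescent", "adult", "older adult"]
SEXES = ["male", "female", "intersex"]
ETHNICITIES = ["AIAN", "A", "BAA", "HL", "MENA", "NHPI", "WEA"]
CATEGORIES = ["skin", "respiratory", "cardiac", "mental health", "general medicine"]

QUESTIONS_PER_CATEGORY = 100
TEMPERATURE = 1.5  # sampling temperature reported for question generation


def build_profiles():
    """Return every (age group, sex, ethnicity) combination as a dict."""
    return [
        {"age_group": age, "sex": sex, "ethnicity": ethnicity}
        for age, sex, ethnicity in product(AGE_GROUPS, SEXES, ETHNICITIES)
    ]


def generation_jobs(profiles):
    """Yield one (profile, category, index) job per question to generate."""
    for profile in profiles:
        for category in CATEGORIES:
            for index in range(QUESTIONS_PER_CATEGORY):
                yield profile, category, index


profiles = build_profiles()
jobs = list(generation_jobs(profiles))
print(len(profiles))  # 84 profiles (4 x 3 x 7)
print(len(jobs))      # 42000 questions (84 x 500)
```

Each job would then be turned into a prompt for the question-generation model, sampled at the temperature above.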