Dr. Bias: Social Disparities in AI-Powered Medical Guidance

Kondrup, Imouza
With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly proving to be competitive, and now even surpass professionals in some medical capabilities. They hold particular promise in low-resource settings, considering they provide the possibility of widely accessible, quasi-free healthcare support. However, evaluations that fuel these motivations highly lack insights into the social nature of healthcare, oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and impact users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they generate responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. We observe these trends amplify when intersectional groups are considered. Considering the increasing trust individuals place in these models, we argue for higher AI literacy and for the urgent need for investigation and mitigation by AI developers to ensure these systemic differences are diminished and do not translate to unjust patient support. Our code is publicly available on GitHub.

Basic Information

  • Paper ID: 2510.09162
  • Title: Dr. Bias: Social Disparities in AI-Powered Medical Guidance
  • Authors: Emma Kondrup (Mila - Quebec AI Institute), Anne Imouza (McGill University)
  • Classification: cs.AI cs.CY
  • Publication Date/Venue: Accepted at the Symposium on Model Accountability, Sustainability and Healthcare 2025
  • Paper Link: https://arxiv.org/abs/2510.09162

Abstract

With the rapid development of large language models (LLMs), the public now has convenient, low-cost access to applications that give personalized answers to most health-related questions. These LLMs are increasingly competitive with professionals, and even surpass them in some medical capabilities, which makes them particularly promising in resource-constrained settings. However, the evaluations behind this optimism largely ignore the social nature of healthcare: they overlook health disparities between social groups and how bias may carry over into LLM-generated medical advice and affect users. This study conducts an exploratory analysis of LLM answers to medical questions spanning key clinical domains, simulating questions posed by patient profiles that differ in gender, age, and race. By comparing natural language features of the generated responses, the study finds that LLM medical advice differs systematically across social groups; in particular, Indigenous and non-binary patients receive advice that is less readable and more complex.

Research Background and Motivation

Problem Definition

The core problem this research addresses is: Do large language models exhibit systematic social biases when providing medical advice, and how do these biases affect the quality of medical information received by different demographic groups?

Significance

  1. Social Equity: With the widespread application of LLMs in medical consultation, ensuring equitable and high-quality medical information access for all populations is critical
  2. Health Disparities: Existing real-world health disparities may be further amplified through AI systems
  3. Growing Trust: Increasing public trust in AI medical advice makes bias issues more urgent

Limitations of Existing Approaches

  1. Lack of Social Dimension Analysis: Existing LLM medical application evaluations focus primarily on technical performance, overlooking social equity
  2. Insufficient Intersectional Identity Research: Limited in-depth analysis of intersectional identity groups (e.g., Indigenous non-binary individuals)
  3. Missing Systematic Bias Detection: Absence of systematic methods to detect and quantify biases in medical advice

Core Contributions

  1. Developed a Systematic Bias Detection Framework: Constructed the "Dr. Bias" experimental pipeline capable of systematically detecting social biases in LLM medical advice
  2. Revealed Significant Group Disparities: Discovered that Indigenous and non-binary populations receive medical advice with significant disadvantages in readability and complexity
  3. Demonstrated Intersectional Identity Effects: Showed that the observed disparities are amplified for intersectional identity groups
  4. Provided Multi-Dimensional Analysis Framework: Analyzed bias across multiple dimensions including readability, sentiment analysis, and medical urgency
  5. Released Open-Source Research Tools: Published complete experimental code and data on GitHub

Methodology Details

Task Definition

Input: Patient profiles with different demographic characteristics + health-related questions

Output: LLM-generated medical advice

Objective: Detect and quantify systematic disparities in medical advice quality across different groups

Experimental Design Architecture

The research employs a two-stage generation pipeline:

Stage One: Question Generation

  • Model: Llama-3-8B-Instruct
  • Patient Profile Construction:
    • Age Groups: Children, adolescents, adults, elderly (4 categories)
    • Gender: Male, female, non-binary (3 categories)
    • Race: Seven major racial groups based on U.S. Census Bureau classification
      • American Indian or Alaska Native (AIAN)
      • Asian (A)
      • Black or African American (BAA)
      • Hispanic or Latino (HL)
      • Middle Eastern or North African (MENA)
      • Native Hawaiian or Pacific Islander (NHPI)
      • White or European American (WEA)
  • Total: 84 patient profiles (4×3×7)
  • Question Categories: Skin, respiratory, cardiac, mental health, general medical (5 categories)
  • Generation Strategy: 500 questions per profile (100 per category), using temperature 1.5 for increased diversity
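
The profile grid and the resulting data volume follow directly from the counts above. The sketch below is only an illustration of that arithmetic; the label strings, dictionary keys, and function names are assumptions and are not taken from the authors' code.

```python
from itertools import product

# Labels are illustrative; the category counts (4, 3, 7, 5) come from the text.
AGE_GROUPS = ["child", "adolescent", "adult", "elderly"]
GENDERS = ["male", "female", "non-binary"]
RACES = ["AIAN", "A", "BAA", "HL", "MENA", "NHPI", "WEA"]
CATEGORIES = ["skin", "respiratory", "cardiac", "mental health", "general medical"]
QUESTIONS_PER_CATEGORY = 100


def build_profiles():
    """Return every (age, gender, race) combination as a dict."""
    return [
        {"age": age, "gender": gender, "race": race}
        for age, gender, race in product(AGE_GROUPS, GENDERS, RACES)
    ]


def total_questions(profiles):
    """Number of questions when each profile asks 100 per category."""
    return len(profiles) * len(CATEGORIES) * QUESTIONS_PER_CATEGORY


profiles = build_profiles()
print(len(profiles))              # 84 profiles (4 x 3 x 7)
print(total_questions(profiles))  # 42000 questions (84 x 5 x 100)
```

Enumerating the full Cartesian product keeps every intersectional group equally represented, which is what makes later group-level comparisons balanced.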

Stage Two: Medical Advice Generation

  • Total Data Volume: 42,000 medical recommendations
  • Input Format: Patient profile description + medical question
  • Analysis Dimensions: Readability, sentiment analysis, medical urgency

Technical Innovations

  1. Intersectional Identity Analysis: First systematic cross-analysis of gender, race, and age dimensions
  2. Multi-Dimensional Evaluation Metrics:
    • Flesch Reading Ease Score
    • Flesch-Kincaid Grade Level
    • Recommendation Length
    • Sentiment Polarity and Subjectivity
    • Medical Urgency Assessment
  3. Stratified Sampling Strategy: Incorporated diversity in emotional tone and query types during question generation
  4. Statistical Rigor: All results reported with 95% confidence intervals; only p<0.05 results reported as statistically significant

Experimental Setup

Dataset

  • Scale: 42,000 LLM-generated medical recommendations
  • Coverage: 84 demographic profiles × 5 medical categories × 100 questions/category
  • Quality Control: A high sampling temperature (1.5) and diversified prompt templates are used to increase the variety of generated questions

Evaluation Metrics

Readability Metrics

  • Flesch Reading Ease: Higher scores indicate more readable text
  • Flesch-Kincaid Grade Level: Indicates educational level required to understand text
  • Recommendation Length: Text word count
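
Both readability scores have standard closed-form definitions based on average sentence length and average syllables per word. The sketch below is a minimal illustration rather than the paper's implementation: the syllable counter is a crude vowel-run heuristic and the function names are assumptions, so exact values will differ from those of a dedicated readability library.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text):
    """Return (sentence count, word count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text):
    """Higher is easier: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text):
    """Approximate U.S. grade: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
```

Flesch Reading Ease has no lower bound, so long sentences built from polysyllabic words can push the score below zero while the grade level rises.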

Sentiment Analysis Metrics

  • Sentiment Polarity: Positive/negative emotional orientation
  • Subjectivity: Opinion-based vs. fact-based degree
  • Specific Emotions: Joy, anger, tension levels

Medical-Specific Metrics

  • Medical Urgency Level: Urgency level reflected in recommendations
  • Death Topic Mentions: Whether death-related content is mentioned

Statistical Analysis Methods

  • Significance Testing: p-value < 0.05
  • Confidence Intervals: 95% confidence intervals
  • Effect Size Analysis: Calculate mean differences between groups
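
The summary does not say which statistical test was used. As one possible reading, the sketch below compares two groups with a large-sample normal approximation: the difference in means, a 95% confidence interval, and a two-sided p-value. The function name and the choice of test are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference(group_a, group_b, confidence=0.95):
    """Return (a - b mean difference, confidence interval, two-sided p-value).

    Uses a normal approximation with unpooled variances, which is
    reasonable only when both groups are large.
    """
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        raise ValueError("both groups have zero variance")
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, ci, p_value
```

Here `group_a` and `group_b` would hold one metric (for example, Flesch Reading Ease per response) for two demographic groups. A difference would count as significant under the stated threshold when the p-value is below 0.05, which corresponds to a 95% interval that excludes zero.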

Experimental Results

Main Findings

Gender Dimension Disparities

  • Significant Non-Binary Disadvantage:
    • Flesch Reading Ease: -3.53 (vs. female 4.815, male 5.873)
    • Grade Level: 24.64 (vs. female 22.68, male 22.52)
    • Recommendations longer, more complex, harder to understand

Racial Dimension Disparities

  • Systematic Indigenous Disadvantage:
    • AIAN group shows lowest Flesch Reading Ease across all medical categories
    • Mental health recommendations for AIAN group score as low as -8.7296
    • NHPI and BAA groups face similar issues
  • Advantaged Groups:
    • WEA and A groups consistently receive most concise, readable recommendations
    • HL and MENA groups show intermediate performance

Medical Category Disparities

Consistent group disparity patterns observed across all medical categories, with mental health showing particularly significant disparities.

Medical Urgency Level Disparities

  • NHPI Group: Systematically lower in medical urgency assessment
  • Largest Disparity Pairs: WEA-NHPI (Δ=0.0041), A-NHPI (Δ=0.0034)

Intersectional Identity Effects

Key Finding: Intersectional identity analysis reveals significantly amplified bias effects

  • Effect Multiplication: Intersectional identity group disparities approximately double single-identity disparities
  • Most Disadvantaged Groups: Indigenous non-binary individuals, Black non-binary individuals receive most complex recommendations
  • Most Advantaged Groups: White or Asian male/female individuals receive most concise, understandable recommendations
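
One way to compare single-attribute and intersectional disparities is to group responses by one attribute or by several and measure the spread of the group means. The sketch below assumes each response is a dict with demographic fields and a numeric metric; the field names and helper functions are hypothetical and not taken from the authors' code.

```python
from collections import defaultdict
from statistics import mean


def group_means(records, keys, metric):
    """Average `metric` over records grouped by the tuple of `keys` values."""
    buckets = defaultdict(list)
    for record in records:
        buckets[tuple(record[k] for k in keys)].append(record[metric])
    return {group: mean(values) for group, values in buckets.items()}


def max_gap(means):
    """Largest difference between any two group means."""
    return max(means.values()) - min(means.values())
```

Calling `max_gap(group_means(records, ["race"], "fre"))` gives the single-attribute spread, and passing `["gender", "race"]` gives the intersectional one. Marginal means are weighted averages of the finer-grained cell means, so the intersectional gap can never be smaller than the single-attribute gap. The size of the increase is therefore the informative quantity, and it is best read alongside confidence intervals.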

Statistical Significance

All reported disparities achieve statistical significance (p<0.05) with 95% confidence intervals provided.

Major Research Directions

  1. LLM Medical Bias Research: Zack et al. (2024) found racial and gender stereotypes in GPT-4 clinical decision support
  2. Intersectional AI Bias: Buolamwini & Gebru's (2018) pioneering work, extended to healthcare by Omar et al. (2025)
  3. Algorithmic Fairness: Fairness and bias mitigation strategies in medical AI systems

Relative to this prior work, the paper offers:

  1. More Comprehensive Identity Dimensions: First systematic analysis including non-binary populations
  2. More Granular Intersectional Analysis: In-depth three-dimensional intersectional identity research
  3. Richer Evaluation Metrics: Multi-dimensional assessment from readability to medical urgency
  4. Larger Data Scale: Large-scale analysis of 42,000 medical recommendations

Conclusions and Discussion

Main Conclusions

  1. Systematic Bias Exists: LLMs exhibit significant social group disparities in medical advice generation
  2. Intersectional Identity Effects: Individuals with multiple marginalized identities face more severe bias
  3. Indigenous and Non-Binary Populations Most Vulnerable: These groups systematically receive lower-quality medical advice
  4. Cross-Domain Consistency: Bias patterns remain consistent across different medical categories

Limitations

  1. Geographic Limitation: Uses only U.S. Census classification, lacks international perspective
  2. Classification Coarseness: Racial classification lacks sufficient granularity for fine-grained analysis
  3. Model Limitation: Tests only Llama-3-8B-Instruct; cross-model validation needed
  4. Missing Qualitative Analysis: Lacks in-depth analysis of substantive differences in recommendation content

Future Directions

  1. Multi-Level Classification System: Adopt more granular demographic classification
  2. Qualitative Assessment: Invite medical experts to evaluate recommendation accuracy and appropriateness
  3. Focus Group Research: Conduct in-depth interviews with marginalized communities
  4. Cross-Model Validation: Extend to more LLM families
  5. Bias Mitigation Strategy Development: Develop and test bias mitigation techniques

In-Depth Evaluation

Strengths

  1. Rigorous Research Design: Two-stage generation pipeline cleverly designed to isolate bias sources
  2. Standardized Statistical Methods: Strict statistical testing and confidence interval reporting
  3. Significant Social Relevance: Addresses urgent social issues in medical AI fairness
  4. Reproducible Methodology: Detailed method description and open-source code
  5. Impactful Findings: Reveals concerning systematic bias patterns

Weaknesses

  1. Unclear Causal Relationships: Fails to deeply explore fundamental mechanisms of bias generation
  2. Limited Practical Guidance: Lacks specific bias mitigation recommendations
  3. Pending External Validity: Findings need verification in real medical consultation scenarios
  4. Cultural Context Limitation: U.S.-centered classification system limits global applicability

Impact

  1. Academic Contribution: Provides important benchmark for medical AI fairness research
  2. Policy Significance: Provides scientific evidence for AI medical application regulation
  3. Technology Advancement: Promotes LLM developers' attention to fairness issues
  4. Social Value: Raises public awareness of AI medical bias

Applicable Scenarios

  1. AI Medical Product Development: Provides bias detection framework for developers
  2. Medical Policy Development: Provides evaluation standards for regulatory bodies
  3. Healthcare Professional Training: Increases awareness of AI bias
  4. Patient Education: Enhances critical thinking in AI medical advice usage

References

The paper cites multiple key studies, including:

  • Buolamwini & Gebru (2018): Intersectional accuracy disparities in commercial gender classification
  • Zack et al. (2024): Assessment of GPT-4's potential to perpetuate racial and gender bias in healthcare
  • Omar et al. (2025): Sociodemographic bias in large language model medical decision-making
  • Hanna et al. (2025): Evaluating racial and ethnic bias in large language models on healthcare-related tasks

Overall Assessment: This is research of significant social importance that systematically reveals social bias issues in LLM medical advice. The research methodology is rigorous, findings are concerning, and it makes important contributions to the AI medical fairness field. Despite some limitations, it establishes a solid foundation for future research and practical applications.