
Unmasking Hiring Bias: Platform Data Analysis and Controlled Experiments on Bias in Online Freelance Marketplaces via RAG-LLM Generated Contents

Wugeng Zheng, Guohou Shan
Online freelance marketplaces, a rapidly growing part of the global labor market, are meant to create a fair environment where professional skills are the main factor in hiring. While these platforms can reduce the bias found in traditional hiring, the personal information in user profiles raises concerns about ongoing discrimination. Past studies on this topic have mostly used existing data, which makes it hard to control for other factors and to isolate the effect of attributes such as gender or race. To address these problems, this paper presents a new method that uses Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM) to create realistic, artificial freelancer profiles for controlled experiments. This approach separates individual factors, enabling a clearer statistical analysis of how different variables influence the freelancer project process. In addition to analyzing extracted data with traditional statistical methods for the post-project stage, the research uses a dataset with highly controlled variables, generated by a RAG-LLM, to conduct a simulated hiring experiment for the pre-project stage. The results show that, regarding gender, no significant preference emerged in initial hiring decisions, but female freelancers are substantially more likely to receive imperfect ratings in the post-project stage. Regarding regional bias, there is a strong and consistent preference for US-based freelancers: they are more likely to be selected in the simulated experiments, are perceived as more leader-like, and receive higher ratings on the live platform.

Basic Information

  • Paper ID: 2510.13091
  • Title: Unmasking Hiring Bias: Platform Data Analysis and Controlled Experiments on Bias in Online Freelance Marketplaces via RAG-LLM Generated Contents
  • Authors: Wugeng Zheng, Guohou Shan (Northeastern University)
  • Classification: cs.HC (Human-Computer Interaction)
  • Publication Venue: ACM Conference on Intelligent User Interfaces 2026
  • Paper Link: https://arxiv.org/abs/2510.13091

Abstract

Online freelance marketplaces, as a rapidly growing segment of the global labor market, theoretically should create a fair environment where professional skills are the primary hiring factor. However, personal information in user profiles raises persistent concerns about discrimination. This paper proposes an innovative approach using Retrieval-Augmented Generation (RAG) combined with Large Language Models (LLMs) to create realistic synthetic freelancer profiles for controlled experiments. The findings reveal that regarding gender, while no significant preference emerges in initial hiring decisions, female freelancers are more likely to receive imperfect ratings after project completion. Regarding geographic bias, U.S. freelancers demonstrate strong and consistent advantages.

Research Background and Motivation

Problem Definition

  1. Core Issue: Whether online freelance platforms truly achieve the goal of eliminating hiring bias, and how to accurately measure and analyze such biases.
  2. Significance:
    • Online freelance markets have expanded rapidly post-COVID-19, with 20-30% of working-age populations in Western countries engaged in independent work
    • These platforms theoretically should evaluate candidates based on skills rather than personal background
    • Personally identifiable information in user profiles may lead to conscious or unconscious bias
  3. Limitations of Existing Methods:
    • Traditional research primarily relies on observational data analysis, making it difficult to control confounding variables
    • Freelancers' skills, educational backgrounds, and project experience are typically intertwined with demographic attributes (gender, race)
    • Collecting large-scale datasets to statistically control these variables faces significant challenges
  4. Research Motivation: Develop a novel experimental methodology capable of rigorous variable control to precisely isolate and measure the independent impact of specific demographic factors on hiring decisions.

Core Contributions

  1. Methodological Innovation: First application of RAG-LLM framework to generate highly controlled synthetic data for controlled hiring bias experiments, overcoming confounding factor challenges in traditional observational data.
  2. Multi-Stage Bias Analysis: Proposes a comprehensive analytical framework encompassing pre-hiring stages (through user studies) and post-project evaluation stages (using real-world data), providing a more complete perspective than research limited to post-project data.
  3. Precise Variable Control: Achieves precise variable isolation through RAG-LLM generated profiles, enabling creation of candidate profiles that are nearly identical except for specific research variables.
  4. Empirical Findings: Reveals distinct manifestation patterns of gender and geographic bias across different stages, providing new insights into understanding discrimination mechanisms in online markets.

Methodology Details

Task Definition

Input: Real user data from freelance platforms and specific demographic variable control requirements
Output: Highly controlled synthetic freelancer profiles for measuring the impact of specific variables on hiring decisions
Constraints: Generated profiles must be highly similar in skills, experience, and ratings, differing only in research variables (e.g., gender, location)
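
The constraint above can be illustrated with a minimal sketch. The `Profile` fields and helper names below are hypothetical, chosen only for illustration, and the sketch copies every non-research attribute exactly, whereas the paper describes the generated pairs as highly similar rather than identical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Profile:
    # Hypothetical field names; the paper's actual schema is not specified here.
    tagline: str
    skills: tuple
    rating: float
    review_count: int
    gender: str
    country: str


def make_counterpart(base, **changes):
    """Return a copy of `base` in which only the named research variables change."""
    return replace(base, **changes)


def differing_fields(a, b):
    """List the field names on which two profiles differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


base = Profile("Full-stack web developer", ("Python", "React"), 4.9, 120, "male", "India")
variant = make_counterpart(base, country="United States")

# A valid experimental pair differs only in the variable under study.
print(differing_fields(base, variant))
```

A check like `differing_fields` could serve as a guard before a pair is shown to participants, rejecting any pair that differs on more than the intended variable.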

Model Architecture

1. Data Acquisition and Processing

  • Data Source: Scraped 12,799 freelancer profiles from Freelancer.com
  • Data Preprocessing:
    • Gender classification using a pre-trained facial recognition model from Hugging Face (confidence threshold 0.75)
    • Focus on Indian and U.S. freelancers (two most representative countries in dataset)
    • Extraction of attributes including username, user ID, verification badges, overall ratings, profile taglines
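
As a rough sketch of the confidence filter in the preprocessing step, the snippet below keeps only predictions whose score exceeds 0.75. The tuple layout `(user_id, label, score)` and the sample values are assumptions for illustration; the classifier itself is not reproduced.

```python
CONFIDENCE_THRESHOLD = 0.75  # threshold value stated in the summary


def keep_confident(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Keep (user_id, label) pairs whose classifier confidence exceeds the threshold."""
    return [(uid, label) for uid, label, score in predictions if score > threshold]


# Hypothetical classifier outputs: (user_id, predicted_label, confidence)
sample = [("u1", "female", 0.93), ("u2", "male", 0.61), ("u3", "male", 0.75)]
kept = keep_confident(sample)
print(kept)
```

With a strict inequality, a prediction at exactly 0.75 is dropped; whether the original pipeline treats the boundary this way is not stated.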

2. RAG-LLM Pipeline

  • Vectorization: Vectorized the processed data using Hugging Face embedding models to construct the knowledge base
  • Core Model: Employed Qwen/QwQ-32B large language model
  • Generation Process:
    1. Retrieval: Retrieve most similar profiles from knowledge base as references
    2. Augmentation: Add retrieved documents to LLM context
    3. Generation: Generate coherent profiles consistent with real-world data based on augmented prompts
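
The retrieval and augmentation steps can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: a bag-of-words vector stands in for the embedding model, the knowledge base is three invented strings, and the prompt wording is hypothetical. The generation step (sending the prompt to the LLM) is left out.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, knowledge_base, k=2):
    """Retrieval: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, references, gender, country):
    """Augmentation: place retrieved profiles and the controlled variables in the prompt."""
    refs = "\n".join(f"- {r}" for r in references)
    return (
        f"Reference profiles:\n{refs}\n\n"
        f"Write a freelancer profile for: {query}\n"
        f"Gender: {gender}\nCountry: {country}\n"
    )


# Invented knowledge-base entries for illustration only.
knowledge_base = [
    "python web developer with django and flask experience",
    "logo designer and brand illustrator",
    "data scientist skilled in python and machine learning",
]
query = "python developer"
references = retrieve(query, knowledge_base, k=2)
prompt = build_prompt(query, references, gender="female", country="India")
print(prompt)
```

Holding `query` and `references` fixed while varying only `gender` or `country` is one way such a pipeline could produce profile pairs that share the same grounding.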

3. Experimental Platform

  • Technology Stack: Interactive web interface built using Flask
  • Task Design:
    • Freelancer comparison task: Display two profiles side-by-side, requiring users to select preferred candidate
    • Review comparison task: Display relevant review information and answer questions
  • Data Collection: Record user selections and interaction data

4. Participant Recruitment

  • Platform: Recruited participants through Amazon Mechanical Turk (MTurk)
  • Quality Control: Included attention check questions to filter invalid submissions

Technical Innovations

  1. Precise Variable Control: Compared to traditional methods, the RAG-LLM framework generates profile pairs that are highly similar across all attributes and differ only in the research variables, giving tighter experimental control than observational data allows.
  2. Realism Assurance: Through RAG mechanisms, generated profiles are grounded in real data, avoiding the unrealistic or inconsistent details that manual authoring can introduce.
  3. Efficiency Improvement: Compared to manual profile authoring requiring 10-15 minutes each, the RAG-LLM approach significantly improves generation efficiency while ensuring quality.

Experimental Setup

Dataset

  • Scale: 12,799 real freelancer profiles
  • Source: Freelancer.com platform
  • Features: Username, ID, verification status, ratings, review count, country, AI-inferred gender
  • Synthetic Data: Generated 1,980 highly controlled profile pairs for user studies

Evaluation Metrics

  • Hiring Preference: Profile selection probability and win rate
  • Leadership Perception: Probability of being selected as the more leader-like candidate
  • Rating Bias: Probability of receiving non-5-star ratings (using logistic regression)
  • Review Count: Number of reviews received (using negative binomial regression)

Comparison Methods

  • Traditional observational data analysis methods
  • Statistical regression analysis (with and without interaction terms)

Implementation Details

  • Confidence Threshold: Gender classification model confidence > 0.75
  • Statistical Methods: Logistic regression, negative binomial regression, chi-square tests
  • Significance Levels: p<0.05, p<0.01, p<0.001
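
For readers less familiar with how logistic regression output becomes an odds ratio with a 95% confidence interval, here is a minimal sketch. It uses the standard Wald construction (exponentiating the coefficient and its bounds); the coefficient and standard error are hypothetical values, not figures from the paper, and the paper may compute its intervals differently.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with an approximate 95% Wald confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error, for illustration only.
or_, lo, hi = odds_ratio_ci(beta=0.5, se=0.1)
print(round(or_, 3), round(lo, 3), round(hi, 3))
```

Intervals built this way are symmetric around the estimate on the log scale, not on the raw scale, which is why the upper bound sits farther from the odds ratio than the lower bound does.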

Experimental Results

Main Findings

1. Hiring Decision Analysis

  • Geographic Bias: U.S. freelancers show significant advantages over Indian freelancers
    • U.S. male win rate: 1.212 (95% CI: 1.066, 1.375, p=0.003)
    • U.S. female win rate: 1.158 (95% CI: 1.020, 1.315, p=0.025)
    • Indian male win rate: 0.767 (95% CI: 0.678, 0.869, p<0.001)
  • Gender Bias: Within the same country, gender differences are not significant (p>0.3)

2. Leadership Perception Analysis

  • Strong Geographic Bias:
    • U.S. male vs. Indian male: OR=2.014 (p<0.001)
    • U.S. female vs. Indian female: OR=1.934 (p<0.001)
  • Overall U.S. Candidate Advantage: U.S. candidates of both genders are significantly more often selected as leaders

3. Post-Project Evaluation Analysis

  • Gender Bias: Female freelancers have 51.2% higher odds of receiving non-perfect ratings (OR=1.512, p<0.001)
  • Geographic Bias: U.S. freelancers have 37.9% lower odds of receiving non-perfect ratings (OR=0.621, p=0.019)
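
An odds ratio scales odds rather than probabilities, so the percentage change in probability depends on the baseline rate. The sketch below applies the two reported odds ratios to a hypothetical 10% baseline share of non-perfect ratings; the baseline is an assumption for illustration and is not taken from the paper.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability obtained by scaling the baseline odds by `odds_ratio`."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


baseline = 0.10  # hypothetical baseline probability of a non-perfect rating

p_female = apply_odds_ratio(baseline, 1.512)  # OR reported for female freelancers
p_us = apply_odds_ratio(baseline, 0.621)      # OR reported for U.S. freelancers

print(round(p_female, 4), round(p_us, 4))
```

Under this assumed baseline, the probability rises from 10% to roughly 14.4% for female freelancers and falls to roughly 6.5% for U.S. freelancers, which are smaller relative changes than the 51.2% and 37.9% shifts in odds.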

4. Review Count Analysis

  • Significant Interaction Effect: Gender's impact on review count depends on country (p=0.031)
    • Indian females receive 24% more reviews than Indian males (IRR=1.237)
    • U.S. females receive 22% fewer reviews than U.S. males
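
The interaction can be read multiplicatively. In a log-link count model with a gender-by-country interaction, and assuming Indian freelancers are the reference country (an assumption, since the coding is not stated here), the female-vs-male rate ratio within the U.S. equals the gender IRR times the interaction IRR. The sketch below derives the interaction ratio implied by the two rounded figures above; it is back-of-the-envelope arithmetic, not a value reported by the paper.

```python
# Figures taken from the summary above.
irr_female_india = 1.237      # Indian females vs Indian males
irr_female_us = 1.0 - 0.22    # U.S. females vs U.S. males (22% fewer reviews)

# Within-U.S. gender ratio = gender IRR * interaction IRR (log-link model),
# so the implied interaction ratio is the quotient of the two.
implied_interaction_irr = irr_female_us / irr_female_india

print(round(implied_interaction_irr, 3))
```

A ratio well below 1 is what it means for the gender effect to reverse direction between the two countries.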

Ablation Studies

The paper validates the independent effects and interaction effects of geographic and gender factors through model comparisons with and without interaction terms.

Experimental Insights

  1. Stage Differences: Gender bias is not significant at the hiring stage but becomes significant at the evaluation stage; geographic bias is significant and consistent across both stages.
  2. Universality of Geographic Bias: U.S. freelancers enjoy systematic advantages in selection, leadership perception, and ratings.
  3. Complexity of Gender Bias: Women are not disadvantaged in obtaining work opportunities but face stricter evaluation standards in work assessment.

Related Work

Online Market Discrimination Research

  • Hannak et al. (2017): Found racial and gender bias on TaskRabbit and Fiverr
  • Edelman et al. (2017): Found persistent consumer discrimination on sharing economy platforms like Airbnb
  • Chan & Wang (2018): Found hiring preferences for female applicants in certain contexts

Machine Learning and LLM Applications

  • Traditional Method Limitations: Data scraping and econometric analysis struggle to control all potential confounding variables
  • LLM Applications in Platform Research: Understanding user activities on Stack Overflow, online reviews, search behavior, and other domains
  • RAG Technology: Overcomes standard LLM limitations in factual errors and specialized information processing

Conclusions and Discussion

Main Conclusions

  1. Methodological Breakthrough: The RAG-LLM framework successfully achieves high-precision variable control, providing new methodological tools for online bias research.
  2. Stage-Specific Characteristics of Gender Bias: Women do not face significant disadvantages at the hiring stage but encounter stricter judgment standards in post-project evaluations.
  3. Systemic Nature of Geographic Bias: U.S. freelancers enjoy end-to-end advantages from hiring selection to final evaluation, which may reflect deeper cultural biases and stereotypes.

Limitations

  1. Geographic Scope Limitation: Research primarily focuses on U.S. and Indian freelancers, potentially not fully representing global situations.
  2. Platform Specificity: Based solely on Freelancer.com data; different platforms may exhibit different bias patterns.
  3. Temporal Limitation: Research reflects bias at a specific time point; bias patterns may change over time.
  4. Participant Representativeness: MTurk participants may not fully represent actual employer populations.

Future Directions

  1. Cross-Platform Validation: Verify the generalizability of findings across multiple freelance platforms.
  2. Longitudinal Studies: Track bias trends over time.
  3. Intervention Measures: Design and test platform design interventions to reduce bias based on research findings.
  4. Extended Demographics: Include additional demographic dimensions such as age and educational background.

In-Depth Evaluation

Strengths

  1. Strong Methodological Innovation: The RAG-LLM approach to generating controlled experimental data is pioneering, providing new tools for social science experimental research.
  2. Rigorous Experimental Design: Multi-stage analytical design is comprehensive, considering both pre-hiring decisions and post-project evaluations.
  3. Sound Statistical Analysis: Uses appropriate statistical methods, including interaction-effect analysis, and reports significance levels for its findings.
  4. Significant Practical Implications: Research findings have important policy implications for understanding fairness in online labor markets.
  5. Complete Technical Implementation: Clear and complete technical pathway from data collection to experimental platform construction.

Limitations

  1. Relatively Limited Sample Size: While including 12,799 profiles, the participant scale for user studies may require further expansion.
  2. Insufficient Cultural Factor Analysis: Explanations for geographic bias are primarily speculative, lacking in-depth cultural and psychological mechanism analysis.
  3. Unknown Long-Term Effects: Research is cross-sectional, unable to reveal dynamic changes in bias.
  4. Generation Quality Verification: While manual review of generated profiles is mentioned, systematic quality assessment metrics are lacking.

Impact

  1. Academic Contribution: Provides a new research paradigm for the HCI and social computing fields that later studies can cite and apply.
  2. Practical Value: Research findings can guide platform design improvements and promote fairer online labor markets.
  3. Reproducibility: Clear methodology and reproducible technical implementation facilitate subsequent research verification and extension.
  4. Interdisciplinary Impact: Combines AI technology with social science research, demonstrating the value of interdisciplinary research.

Applicable Scenarios

  1. Online Platform Bias Research: Extensible to other types of online markets and platforms.
  2. Algorithm Fairness Assessment: Provides new data generation methods for AI system fairness testing.
  3. Policy-Making Support: Provides empirical evidence for labor market fairness policy formulation.
  4. Platform Design Optimization: Guides online platform user interface and recommendation algorithm design.

References

The paper cites 35 relevant references covering important research in online market discrimination, machine learning applications, and human-computer interaction, providing solid theoretical foundation and methodological support for this research.


Overall Assessment: This is a high-quality research paper with significant methodological innovation. Through RAG-LLM technology achieving precise variable control, it opens new pathways for online bias research. The research findings have important theoretical and practical significance, contributing positively to promoting fairness in online labor markets. Despite some limitations, it represents an important contribution to the field.