The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution

Gu, Liang, Li et al.
AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks to systematically investigate how AI-assisted programming interacts with the software ecosystem. Our analysis reveals a striking Matthew effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code. The phenomenon suggests that AI systems may reinforce existing popularity hierarchies, accelerating convergence around dominant tools while hindering diversity and innovation. We provide a quantitative characterization of this effect and discuss its implications for the future evolution of programming ecosystems.

Basic Information

  • Paper ID: 2509.23261
  • Title: The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution
  • Authors: Fei Gu, Zi Liang, Hongzong Li, Jiahao Ma
  • Classification: cs.SE (Software Engineering)
  • Publication Date: October 13, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2509.23261

Abstract

AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) spawning new paradigms such as "vibe coding" and "agentic coding." While prior research has primarily focused on prompt engineering and code generation quality, the broader impact of LLM-driven development on software engineering iteration dynamics remains underexplored. This paper systematically investigates how AI-assisted programming interacts with software ecosystems through large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks. The analysis reveals a significant Matthew Effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code. This phenomenon suggests that AI systems may reinforce existing popularity hierarchies, accelerate convergence toward mainstream tools, and simultaneously hinder diversity and innovation. The paper provides quantitative characterization of this effect and discusses its implications for the future evolution of programming ecosystems.

Research Background and Motivation

Problem Definition

The core research question is: Do AI programming assistants inadvertently reinforce the dominance of existing programming languages and frameworks, thereby producing a "Matthew Effect" (a "rich get richer" phenomenon)?

Problem Significance

  1. Ecosystem Impact: As AI programming tools proliferate, their biases may systematically influence which languages, frameworks, and paradigms will thrive or decline
  2. Innovation Suppression: If AI tools are overly biased toward mainstream technologies, they may inhibit technological innovation and ecosystem diversity
  3. Long-term Consequences: Such biases may create lock-in effects, reduce experimentation opportunities, and diminish the likelihood of paradigm-shifting innovations

Limitations of Existing Research

  1. Micro-level Assessment: Existing research primarily focuses on short-term, micro-level evaluations, measuring model performance on narrow benchmarks or single-language datasets
  2. Lack of Ecosystem Perspective: Fails to capture the multifaceted complexity of real-world software engineering
  3. Neglect of Systemic Bias: Lacks investigation into how AI tools affect the trajectory of entire programming ecosystems

Research Motivation

The motivation rests on two observations. First, LLM training data is heavily skewed: Python comprises nearly 40% of the StarCoder dataset, while many other languages occupy marginal proportions. Second, AI programming assistants frequently over-rely on established libraries and languages: NumPy appears in 48% of completions, and Python is selected 58% of the time even in performance-critical tasks where other languages might be more suitable.

Core Contributions

  1. First Large-Scale Benchmark: Constructs the first large-scale benchmark combining algorithmic programming tasks (120,440 code generations in total: 3,011 problems × 8 languages × 5 models) and complex full-stack development tasks to evaluate AI programming assistants' performance across languages and frameworks
  2. Controlled Evaluation Methodology: Designs a controlled evaluation methodology that isolates the effects of language and framework popularity, revealing structural biases beyond overall accuracy metrics
  3. Empirical Evidence of Matthew Effect: Provides the first empirical evidence of Matthew Effect in LLM code generation simultaneously at both language and framework levels, demonstrating how this dual-layer bias shapes software ecosystem trajectories

Methodology Details

Task Definition

The research designs a two-tier experimental pipeline:

  1. Algorithmic Task Layer: Evaluates code generation performance across 8 programming languages on 3,011 LeetCode problems
  2. Framework Task Layer: Evaluates the performance of 6 mainstream full-stack combinations on 17 general CRUD applications and 8 specialized technical pathway differentiation scenarios

Experimental Architecture

Language Selection Strategy

Based on the June 2025 TIOBE Index, 8 languages were selected:

  • Mainstream Languages: Python (Rank 1), C++ (Rank 2), Java (Rank 4), JavaScript (Rank 6)
  • Emerging Languages: Go (Rank 7), Rust (Rank 13)
  • Niche Languages: Erlang (Rank 46), Racket (Unranked)

Framework Selection Strategy

Six full-stack combinations were selected, spanning from popular to emerging technology stacks:

  • Vue + Spring Boot + Hibernate (Java enterprise-level)
  • React + Express.js + Prisma (modern JavaScript)
  • Django REST + Django ORM (Python full-stack)
  • Preact + Gin + GORM (lightweight Go)
  • Svelte + FastAPI + SQLAlchemy (modern Python)
  • SolidJS + Actix Web + SeaORM (emerging Rust)

Technical Implementation

Code Generation Pipeline

  1. Standardized Prompting: Generate consistent prompt templates for each problem and language combination
  2. Multi-stage Code Extraction: Design a multi-stage pipeline to extract pure executable code from mixed text responses
  3. Language-Specific Cleaning: Apply regular expression patterns tailored to the syntactic features of each programming language
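The extraction steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `extract_code`, the `CLEANERS` table, and the specific regular expressions are assumptions, since the paper summary does not list its actual patterns.

```python
import re

# Build the fence marker programmatically so this example nests cleanly in docs.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[ \t]*([\w+#-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Hypothetical per-language cleanup rules (assumed, not taken from the paper).
CLEANERS = {
    "python": [re.compile(r"^>>> ?", re.MULTILINE)],   # strip REPL prompts
    "erlang": [re.compile(r"^\d+> ?", re.MULTILINE)],  # strip shell prompts like "1> "
}


def extract_code(response: str, language: str) -> str:
    """Return the most plausible code payload from a mixed text/code response."""
    lang = language.lower()
    blocks = FENCE_RE.findall(response)
    if blocks:
        # Stage 1: prefer a fenced block tagged with the target language,
        # otherwise fall back to the longest fenced block.
        tagged = [body for tag, body in blocks if tag.lower() == lang]
        code = tagged[0] if tagged else max((body for _, body in blocks), key=len)
    else:
        # Stage 2: no fence found, so treat the whole response as code.
        code = response
    # Stage 3: apply language-specific cleanup patterns.
    for pattern in CLEANERS.get(lang, []):
        code = pattern.sub("", code)
    return code.strip() + "\n"


reply = "Here is the solution:\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE + "\nHope it helps."
print(extract_code(reply, "python"))
```

Keeping the cleanup rules in a per-language table makes it easy to add patterns for a new language without touching the extraction stages.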

VibeCoding Protocol

For framework tasks, a strictly controlled VibeCoding protocol was adopted:

  • Utilize Cursor Pro, CodeBuddy, and GitHub Copilot
  • Experimenters perform no manual coding or architectural input
  • Interactions strictly limited to forwarding raw error messages back to the chat interface
  • Iterate until all core functional requirements are met or a preset attempt limit is reached
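The protocol above amounts to a bounded feedback loop in which the only input after the initial task description is the raw error text. The sketch below expresses that loop under stated assumptions: `ask_assistant` and `build_and_test` are hypothetical callables standing in for the chat interface and the functional check, and the default `max_attempts=10` is an illustrative value, since the summary does not state the preset limit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    attempts: int    # number of assistant turns used
    completed: bool  # True if all core requirements were met


def vibe_coding_loop(
    ask_assistant: Callable[[str], str],              # hypothetical: message -> generated project
    build_and_test: Callable[[str], Optional[str]],   # hypothetical: project -> raw error, or None on success
    task_prompt: str,
    max_attempts: int = 10,                           # assumed limit, not from the paper
) -> RunResult:
    message = task_prompt
    for attempt in range(1, max_attempts + 1):
        project = ask_assistant(message)
        error = build_and_test(project)
        if error is None:
            return RunResult(attempt, True)
        # No manual coding or architectural hints: forward the raw error verbatim.
        message = error
    return RunResult(max_attempts, False)
```

The attempt count returned here corresponds to the "Completion Attempt Count" metric used for the framework tasks.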

Technical Innovations

  1. Dual-Layer Bias Detection: First systematic detection of Matthew Effect simultaneously at both language and framework levels
  2. Controlled Variable Methodology: Isolate popularity effects by maintaining consistent functional requirements while varying only the technology stack
  3. Large-Scale Distributed Evaluation: Implement a distributed submission system supporting 120,440 code generations

Experimental Setup

Dataset

  • LeetCode Benchmark: 3,011 problems (765 easy, 1,526 medium, 720 hard)
  • Framework Tasks: 17 general CRUD applications + 8 technical pathway differentiation scenarios
  • Models: 5 state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Gemini-2.0-Flash, Gemini-2.5-Flash, Qwen3-Turbo)
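As a quick consistency check on these figures, the snippet below enumerates the problem × language × model grid from the counts listed in this summary. The variable names are illustrative only.

```python
from itertools import product

PROBLEMS_BY_DIFFICULTY = {"easy": 765, "medium": 1526, "hard": 720}
LANGUAGES = ["Python", "C++", "Java", "JavaScript", "Go", "Rust", "Erlang", "Racket"]
MODELS = ["GPT-4o-mini", "DeepSeek-V3", "Gemini-2.0-Flash", "Gemini-2.5-Flash", "Qwen3-Turbo"]

total_problems = sum(PROBLEMS_BY_DIFFICULTY.values())

# One code generation per (problem, language, model) combination.
grid = list(product(range(total_problems), LANGUAGES, MODELS))

print(total_problems)  # 3011
print(len(grid))       # 120440
```

The difficulty counts sum to 3,011 problems, and 3,011 × 8 × 5 gives the 120,440 code generations reported for the benchmark.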

Evaluation Metrics

  • Pass@1 Accuracy: Acceptance rate of first submission attempts
  • Error Type Distribution: Compilation errors, runtime errors, incorrect answers, etc.
  • Completion Attempt Count: Number of iterations required to achieve functional completeness in framework tasks
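The first two metrics can be computed from a list of per-submission verdicts. The sketch below assumes verdict strings in the style of an online judge ("Accepted", "Wrong Answer", and so on); these labels and the function names are assumptions for illustration, not the paper's exact categories.

```python
from collections import Counter
from typing import Dict, List


def pass_at_1(verdicts: List[str]) -> float:
    """Fraction of first-attempt submissions that were accepted."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v == "Accepted" for v in verdicts) / len(verdicts)


def error_distribution(verdicts: List[str]) -> Dict[str, float]:
    """Share of each failure type among the failed submissions only."""
    failures = [v for v in verdicts if v != "Accepted"]
    if not failures:
        return {}
    counts = Counter(failures)
    return {kind: n / len(failures) for kind, n in counts.items()}


sample = ["Accepted", "Wrong Answer", "Accepted", "Compile Error"]
print(pass_at_1(sample))           # 0.5
print(error_distribution(sample))  # {'Wrong Answer': 0.5, 'Compile Error': 0.5}
```

Normalizing the error distribution over failures rather than over all submissions keeps it comparable between languages with very different Pass@1 rates.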

Implementation Details

  • API Parameters: temperature=0.5, maxOutputTokens=65535, top_p=0.95
  • Distributed System: 15 LeetCode accounts, exponential backoff strategy, 10 submissions per account per minute limit
  • Error Handling: Implement robust error handling framework including rate limiting and retry mechanisms
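One way to combine a per-account limit of 10 submissions per minute with exponential backoff is sketched below. This is an assumed design, not the authors' system: the class and function names, the sliding-window approach, and the backoff parameters (`base`, `factor`, `cap`, `retries`) are all hypothetical, and `submit` stands in for whatever client actually sends code to the judge.

```python
import time
from collections import deque


class AccountLimiter:
    """Sliding-window limiter: at most `limit` submissions per `window` seconds."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.stamps = deque()  # timestamps of recent submissions

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5):
    """Exponential schedule: base, base*factor, base*factor**2, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def submit_with_retry(submit, limiters, payload, sleep=time.sleep, **backoff):
    """Use the first account with spare quota; back off when all are saturated."""
    for delay in backoff_delays(**backoff):
        for account_id, limiter in enumerate(limiters):
            if limiter.try_acquire():
                return submit(account_id, payload)
        sleep(delay)
    raise RuntimeError("all accounts rate limited after retries")


# Example wiring for 15 accounts, matching the setup described above.
limiters = [AccountLimiter(limit=10, window=60.0) for _ in range(15)]
```

With 15 accounts at 10 submissions per minute each, a pool like this tops out at 150 submissions per minute before backoff starts.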

Experimental Results

Main Findings

Matthew Effect at Language Level

The experiments reveal significant performance gaps between popular and niche languages:

Top Model Performance Comparison:

  • Mainstream Languages: Python, JavaScript, Java, C++ achieve Pass@1 rates exceeding 60%
  • Niche Languages: Erlang and Racket success rates typically below 25%, sometimes approaching zero
  • Best Performance: DeepSeek-V3 achieves 79.81% on Python but only 24.31% on Erlang and 20.82% on Racket
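To make the size of the gap concrete, the short calculation below uses only the three DeepSeek-V3 figures quoted above and expresses the difference in percentage points. The helper name `gap_in_points` is illustrative.

```python
# Pass@1 rates (in percent) for DeepSeek-V3, as quoted in this summary.
DEEPSEEK_V3_PASS_AT_1 = {"Python": 79.81, "Erlang": 24.31, "Racket": 20.82}


def gap_in_points(scores, popular, niche):
    """Difference between two Pass@1 rates, in percentage points."""
    return round(scores[popular] - scores[niche], 2)


for niche in ("Erlang", "Racket"):
    print(niche, gap_in_points(DEEPSEEK_V3_PASS_AT_1, "Python", niche))
```

For the best-performing model, Python therefore leads Erlang by about 55.5 percentage points and Racket by about 59 percentage points.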

Difficulty-Stratified Analysis:

  • Easy Problems: The gap between popular and niche languages ranges from 45 to 82 percentage points
  • Hard Problems: The gap widens to between 58 and 95 percentage points
  • Difficult Task Performance: Top models achieve 50-63% success rates on popular languages but only 0-6% on niche languages

Matthew Effect at Framework Level

Framework experiments similarly demonstrate significant bias patterns:

Success Rate Distribution:

  • Mainstream Frameworks: Vue+Spring, React+Express, Django complete most of the 17 benchmark tasks within 1-3 attempts
  • Niche Frameworks: Svelte+FastAPI and SolidJS+Actix show higher failure rates, with many tasks requiring over 5 attempts or remaining incomplete

Technical Pathway Differentiation Experiment:

  • Mainstream Technology Stacks: Typically converge within 1-2 correction rounds
  • Mid-tier Technology Stacks: Require 2-3 interventions
  • Niche Technology Stacks: Frequently require 5-10 rounds of guidance to produce runnable systems

Statistical Significance Verification

Paired t-tests on Pass@1 rate differences between popular and niche languages:

  • Differences are statistically significant for all models (p < 0.001)
  • Average differences: +49.6% for DeepSeek-V3 and +34.2% for Qwen3-Turbo
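A paired t-test works on the per-pair differences rather than on the two groups separately. The sketch below computes the mean difference and the t statistic with the standard library only. How the paper forms its pairs is not stated in this summary, so the pairing is an assumption, and the numbers in the example are toy values chosen for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(popular, niche):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(popular) != len(niche) or len(popular) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [p - n for p, n in zip(popular, niche)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return mean(diffs), t, n - 1


# Toy Pass@1 values (percent) for four hypothetical pairs; NOT data from the paper.
toy_popular = [70.0, 65.0, 60.0, 72.0]
toy_niche = [20.0, 25.0, 15.0, 22.0]
mean_diff, t_stat, dof = paired_t_statistic(toy_popular, toy_niche)
print(mean_diff, round(t_stat, 2), dof)
```

Converting the t statistic into a p-value requires the Student t distribution; `scipy.stats.ttest_rel` performs the same paired test and returns the p-value directly.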

Error Type Analysis

  • Mainstream Languages: Most failures are incorrect answers or runtime errors, indicating models generate semantically reasonable but incorrect solutions
  • Niche Languages: Failures primarily consist of compilation errors, indicating models struggle to produce syntactically valid code

Related Work

AI Programming Assistant Research

  • Early Evaluations: Evaluations on the HumanEval benchmark show that Copilot produces syntactically valid code but with low correctness rates, and that correctness is highly correlated with a language's prevalence in training data
  • Multilingual Benchmarks: XCODEEVAL and other large-scale multilingual benchmarks demonstrate persistent challenges on less common languages
  • Tool Comparisons: Copilot performs best in Java, ChatGPT maintains strong cross-language consistency, Gemini excels in JavaScript

Programming Ecosystem Evolution

  • Ecosystem Factors: Community size, tools, and industry adoption often supersede intrinsic technical advantages in influencing language adoption
  • Web Framework Research: 15-year longitudinal study shows significant differences in adoption trajectories across different ecosystems
  • Uneven LLM Performance: Existing surveys show LLMs perform unevenly on code tasks, with severe bias toward widely-used languages

Conclusions and Discussion

Main Conclusions

  1. Matthew Effect Confirmed: AI programming assistants indeed exhibit significant Matthew Effect, with popular technologies enjoying systematic advantages
  2. Dual-Layer Bias: This bias exists simultaneously at both programming language and framework levels
  3. Self-Reinforcing Cycle: LLMs generate code for popular frameworks more successfully → developers are steered toward these frameworks → increased adoption further amplifies their online presence → future model iterations are exposed to even more of this material

Limitations

  1. Evaluation Scope: Primarily based on LeetCode algorithmic tasks and specific framework combinations
  2. Time Window: Research based on models and popularity data from a specific point in time
  3. Causality: While correlations are observed, establishing direct causal relationships remains challenging

Future Directions

  1. Benchmark Expansion: Plans to extend benchmarks to broader domains
  2. Multi-Agent Collaboration: Investigate collaborative multi-agent development scenarios
  3. Diversity-Aware Methods: Develop methods to counter ecosystem homogenization through diversity-aware training and inference strategies

In-Depth Evaluation

Strengths

  1. Problem Significance: First systematic investigation of long-term impact of AI programming assistants on software ecosystems, with important theoretical and practical value
  2. Methodological Innovation: Designs a dual-layer experimental pipeline capable of detecting bias simultaneously at language and framework levels
  3. Experimental Scale: Large-scale experiments with 120,440 code generations give the results statistical weight
  4. Controlled Design: Effectively isolates popularity effects through methodology of maintaining consistent functional requirements while varying only technology stacks

Weaknesses

  1. Representativeness Limitations: LeetCode tasks may not fully represent real-world programming scenarios
  2. Time Sensitivity: Technology popularity is dynamically changing, limiting the timeliness of research results
  3. Causal Mechanisms: While Matthew Effect is observed, deeper analysis of its generative mechanisms remains insufficient
  4. Solution Deficiency: Paper primarily identifies problems but lacks specific mitigation strategies

Impact

  1. Academic Contribution: Provides new research perspectives for the intersection of AI and software engineering
  2. Practical Value: Offers important warnings for AI tool developers and policymakers
  3. Reproducibility: Provides complete datasets, code, and experimental settings supporting result reproduction

Applicable Scenarios

  1. AI Tool Evaluation: Provides framework for evaluating fairness of AI programming assistants
  2. Technology Decision-Making: Offers AI compatibility considerations for enterprise technology selection
  3. Educational Policy: Provides reference for policymaking regarding AI tool usage in programming education

References

The paper cites 29 important references spanning multiple related fields including AI programming assistants, programming language adoption, and ecosystem evolution, providing solid theoretical foundation for this research.


Overall Assessment: This paper presents what it describes as the first systematic evidence of a Matthew Effect in AI programming assistants. The methodology is carefully controlled, the experimental scale is substantial, and the conclusions have both theoretical and practical value. While mitigation strategies and mechanism analysis leave room for improvement, the paper opens new research directions at the intersection of AI and software engineering.