The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution

Gu, Liang, Li et al.

AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks to systematically investigate how AI-assisted programming interacts with the software ecosystem. Our analysis reveals a striking Matthew effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code. The phenomenon suggests that AI systems may reinforce existing popularity hierarchies, accelerating convergence around dominant tools while hindering diversity and innovation. We provide a quantitative characterization of this effect and discuss its implications for the future evolution of programming ecosystems.

Basic Information

Paper ID: 2509.23261
Title: The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution
Authors: Fei Gu, Zi Liang, Hongzong Li, Jiahao Ma
Classification: cs.SE (Software Engineering)
Publication Date: October 13, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2509.23261

Abstract

AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) spawning new paradigms such as "vibe coding" and "agentic coding." While prior research has primarily focused on prompt engineering and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. This paper systematically investigates how AI-assisted programming interacts with software ecosystems through large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks. The analysis reveals a significant Matthew Effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code. This phenomenon suggests that AI systems may reinforce existing popularity hierarchies and accelerate convergence toward mainstream tools while hindering diversity and innovation. The paper provides a quantitative characterization of this effect and discusses its implications for the future evolution of programming ecosystems.

Research Background and Motivation

Problem Definition

The core research question is: Do AI programming assistants inadvertently reinforce the dominance of existing programming languages and frameworks, thereby producing a "Matthew Effect"—a "rich get richer" phenomenon?

Problem Significance

Ecosystem Impact: As AI programming tools proliferate, their biases may systematically influence which languages, frameworks, and paradigms will thrive or decline
Innovation Suppression: If AI tools are overly biased toward mainstream technologies, they may inhibit technological innovation and ecosystem diversity
Long-term Consequences: Such biases may create lock-in effects, reduce experimentation opportunities, and diminish the likelihood of paradigm-shifting innovations

Limitations of Existing Research

Micro-level Assessment: Existing research primarily focuses on short-term, micro-level evaluations, measuring model performance on narrow benchmarks or single-language datasets
Lack of Ecosystem Perspective: Fails to capture the multifaceted complexity of real-world software engineering
Neglect of Systemic Bias: Lacks investigation into how AI tools affect the trajectory of entire programming ecosystems

Research Motivation

The study is motivated by imbalances in LLM training data and in model behavior. Python comprises nearly 40% of the StarCoder dataset, while many other languages occupy marginal proportions. AI programming assistants also over-rely on established libraries: NumPy appears in 48% of completions, and Python is selected 58% of the time even in performance-critical tasks where other languages might be more suitable.

Core Contributions

First Large-Scale Benchmark: Constructs the first large-scale benchmark combining algorithmic programming tasks (120,440 code generations in total: 3,011 problems × 8 languages × 5 models) and complex full-stack development tasks to evaluate AI programming assistants' performance across languages and frameworks
Controlled Evaluation Methodology: Designs a controlled evaluation methodology that isolates the effects of language and framework popularity, revealing structural biases beyond overall accuracy metrics
Empirical Evidence of the Matthew Effect: Provides the first empirical evidence of a Matthew Effect in LLM code generation at both the language and framework levels, demonstrating how this dual-layer bias shapes software ecosystem trajectories
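The generation count quoted in the first contribution follows from multiplying the three benchmark dimensions. A quick check, with constant names chosen here purely for illustration:

```python
# Dimensions of the algorithmic benchmark as reported in this summary.
NUM_PROBLEMS = 3011   # LeetCode problems
NUM_LANGUAGES = 8     # programming languages
NUM_MODELS = 5        # LLMs evaluated

# One generation per (problem, language, model) combination.
total_generations = NUM_PROBLEMS * NUM_LANGUAGES * NUM_MODELS
print(total_generations)  # 120440
```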

Methodology Details

Task Definition

The research designs a two-tier experimental pipeline:

Algorithmic Task Layer: Evaluates code generation performance across 8 programming languages on 3,011 LeetCode problems
Framework Task Layer: Evaluates the performance of 6 mainstream full-stack combinations on 17 general CRUD applications and 8 specialized technical pathway differentiation scenarios

Experimental Architecture

Language Selection Strategy

Based on the June 2025 TIOBE Index, 8 languages were selected:

Mainstream Languages: Python (Rank 1), C++ (Rank 2), Java (Rank 4), JavaScript (Rank 6)
Emerging Languages: Go (Rank 7), Rust (Rank 13)
Niche Languages: Erlang (Rank 46), Racket (Unranked)

Framework Selection Strategy

Six full-stack combinations were selected, spanning from popular to emerging technology stacks:

Vue + Spring Boot + Hibernate (Java enterprise-level)
React + Express.js + Prisma (modern JavaScript)
Django REST + Django ORM (Python full-stack)
Preact + Gin + GORM (lightweight Go)
Svelte + FastAPI + SQLAlchemy (modern Python)
SolidJS + Actix Web + SeaORM (emerging Rust)

Technical Implementation

Code Generation Pipeline

Standardized Prompting: Generate consistent prompt templates for each problem and language combination
Multi-stage Code Extraction: Design a multi-stage pipeline to extract pure executable code from mixed text responses
Language-Specific Cleaning: Apply regular expression patterns tailored to the syntactic features of each programming language
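The summary does not give the extraction rules themselves, so the following is only a minimal sketch of what a multi-stage extractor could look like. The stage order, the `extract_code` name, and the single cleanup pattern are assumptions made for illustration, not the authors' pipeline.

```python
import re

# Build the three-backtick fence marker programmatically.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[ \t]*([\w+#-]*)[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL)

# Hypothetical language-specific cleanup rules (assumption: the paper's
# actual patterns are not listed in this summary).
CLEANUP_PATTERNS = {
    # Strip interactive-prompt prefixes copied from a Python REPL session.
    "python": [re.compile(r"^(?:>>>|\.\.\.) ", re.MULTILINE)],
}

def extract_code(response: str, language: str) -> str:
    """Pull executable code out of a mixed prose/code LLM response."""
    blocks = FENCE_RE.findall(response)
    # Stage 1: prefer a fenced block tagged with the requested language.
    tagged = [body for tag, body in blocks if tag.lower() == language.lower()]
    if tagged:
        code = tagged[0]
    elif blocks:
        # Stage 2: otherwise take the longest fenced block.
        code = max((body for _, body in blocks), key=len)
    else:
        # Stage 3: no fences at all, so treat the whole response as code.
        code = response
    # Stage 4: apply language-specific regular-expression cleanup.
    for pattern in CLEANUP_PATTERNS.get(language.lower(), []):
        code = pattern.sub("", code)
    return code.strip()
```

Falling back from tagged blocks to untagged blocks to raw text keeps the extractor usable when a model ignores the requested output format, which is plausibly more common for niche languages.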

VibeCoding Protocol

For framework tasks, a strictly controlled VibeCoding protocol was adopted:

Utilize Cursor Pro, CodeBuddy, and GitHub Copilot
Experimenters perform no manual coding or architectural input
Interactions strictly limited to forwarding raw error messages back to the chat interface
Iterate until all core functional requirements are met or a preset attempt limit is reached
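The protocol above amounts to a bounded feedback loop in which the only human action is relaying error text. A minimal sketch under stated assumptions: `ask_assistant` and `build_and_test` are hypothetical callables standing in for the chat interface and the functional check, and the default attempt limit is a placeholder because the summary does not state the actual limit.

```python
from typing import Callable, Optional, Tuple

def run_vibe_session(
    task_prompt: str,
    ask_assistant: Callable[[str], str],             # hypothetical: send a message, get the generated project back
    build_and_test: Callable[[str], Optional[str]],  # hypothetical: None on success, raw error text otherwise
    max_attempts: int = 10,                          # assumed value; the real limit is not given in the source
) -> Tuple[bool, int]:
    """Return (completed, attempts_used) for one framework task."""
    message = task_prompt
    for attempt in range(1, max_attempts + 1):
        artifact = ask_assistant(message)
        error = build_and_test(artifact)
        if error is None:
            return True, attempt
        # No manual coding or architectural hints: forward the raw error only.
        message = error
    return False, max_attempts
```

The returned attempt count corresponds to the completion attempt metric used for the framework tasks.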

Technical Innovations

Dual-Layer Bias Detection: First systematic detection of Matthew Effect simultaneously at both language and framework levels
Controlled Variable Methodology: Isolate popularity effects by maintaining consistent functional requirements while varying only the technology stack
Large-Scale Distributed Evaluation: Implement a distributed submission system supporting 120,440 code generations

Experimental Setup

Dataset

LeetCode Benchmark: 3,011 problems (765 easy, 1,526 medium, 720 hard)
Framework Tasks: 17 general CRUD applications + 8 technical pathway differentiation scenarios
Models: 5 state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Gemini-2.0-Flash, Gemini-2.5-Flash, Qwen3-Turbo)

Evaluation Metrics

Pass@1 Accuracy: Acceptance rate of first submission attempts
Error Type Distribution: Compilation errors, runtime errors, incorrect answers, etc.
Completion Attempt Count: Number of iterations required to achieve functional completeness in framework tasks
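Pass@1 as defined above is simply the share of first submissions that are accepted, computed per language. A small sketch, assuming each record is a `(language, accepted)` pair; the record layout and the sample values are invented for illustration and are not the paper's data.

```python
from collections import defaultdict

def pass_at_1(records):
    """Return {language: percentage of first submissions accepted}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for language, accepted in records:
        totals[language] += 1
        if accepted:
            passed[language] += 1
    return {lang: 100.0 * passed[lang] / totals[lang] for lang in totals}

# Toy records only, not results from the paper.
toy_records = [
    ("python", True), ("python", True), ("python", False), ("python", True),
    ("erlang", False), ("erlang", True),
]
print(pass_at_1(toy_records))  # {'python': 75.0, 'erlang': 50.0}
```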

Implementation Details

API Parameters: temperature=0.5, maxOutputTokens=65535, top_p=0.95
Distributed System: 15 LeetCode accounts, exponential backoff strategy, 10 submissions per account per minute limit
Error Handling: Implement robust error handling framework including rate limiting and retry mechanisms
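The summary names exponential backoff and a per-account rate limit without giving the details. The sketch below shows one common way to combine them; the `submit` callable, the use of `RuntimeError` to signal a rate-limit rejection, and all delay parameters are assumptions. The throughput lines are a back-of-envelope lower bound derived from the stated limits, assuming every account sustains its full quota.

```python
import random
import time

def submit_with_backoff(submit, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call submit() until it succeeds, doubling the wait after each failure.

    `submit` is a hypothetical zero-argument callable that raises
    RuntimeError when the judge rejects the request (e.g. rate limiting).
    """
    for attempt in range(max_retries + 1):
        try:
            return submit()
        except RuntimeError:
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so accounts do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))

# Peak throughput implied by the stated limits.
peak_per_minute = 15 * 10                          # 15 accounts x 10 submissions/minute
min_hours = 120440 / peak_per_minute / 60          # lower bound for all generations
print(peak_per_minute, round(min_hours, 1))        # 150 13.4
```

Passing `sleep` as a parameter is a design choice that lets the retry logic be tested without real waiting.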

Experimental Results

Main Findings

Matthew Effect at Language Level

The experiments reveal significant performance gaps between popular and niche languages:

Top Model Performance Comparison:

Mainstream Languages: Python, JavaScript, Java, C++ achieve Pass@1 rates exceeding 60%
Niche Languages: Erlang and Racket success rates typically below 25%, sometimes approaching zero
Best Performance: DeepSeek-V3 achieves 79.81% on Python but only 24.31% on Erlang and 20.82% on Racket
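To make the size of the gap concrete, the DeepSeek-V3 figures quoted above can be turned into percentage-point differences. Only the three Pass@1 values come from the summary; the variable names are illustrative.

```python
# DeepSeek-V3 Pass@1 rates (%) as quoted in this summary.
deepseek_v3 = {"Python": 79.81, "Erlang": 24.31, "Racket": 20.82}

gaps = {}
for niche in ("Erlang", "Racket"):
    gaps[niche] = deepseek_v3["Python"] - deepseek_v3[niche]
    print(f"Python vs {niche}: {gaps[niche]:.2f} percentage points")
# Python vs Erlang: 55.50 percentage points
# Python vs Racket: 58.99 percentage points
```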

Difficulty-Stratified Analysis:

Easy Problems: Gap between popular and niche languages ranges from 45-82 percentage points
Hard Problems: Gap expands to 58-95 percentage points
Difficult Task Performance: Top models achieve 50-63% success rates on popular languages but only 0-6% on niche languages

Matthew Effect at Framework Level

Framework experiments similarly demonstrate significant bias patterns:

Success Rate Distribution:

Mainstream Frameworks: Vue+Spring, React+Express, Django complete most of the 17 benchmark tasks within 1-3 attempts
Niche Frameworks: Svelte+FastAPI and SolidJS+Actix show higher failure rates, with many tasks requiring over 5 attempts or remaining incomplete

Technical Pathway Differentiation Experiment:

Mainstream Technology Stacks: Typically converge within 1-2 correction rounds
Mid-tier Technology Stacks: Require 2-3 interventions
Niche Technology Stacks: Frequently require 5-10 rounds of guidance to produce runnable systems

Statistical Significance Verification

Paired t-tests on Pass@1 rate differences between popular and niche languages:

Differences are statistically significant for all models (p < 0.001)
Average differences: +49.6 percentage points for DeepSeek-V3, +34.2 percentage points for Qwen3-Turbo
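For readers unfamiliar with the test, a paired t-test works on the per-pair differences rather than on the two groups separately. The sketch below computes the test statistic with the standard library only. How the paper forms its pairs is not stated in this summary, and the sample values are invented for illustration.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(popular, niche):
    """Return (mean difference, t statistic, degrees of freedom)."""
    diffs = [p - n for p, n in zip(popular, niche)]
    n = len(diffs)
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff, mean_diff / std_err, n - 1

# Toy paired Pass@1 rates (%), not data from the paper.
toy_popular = [70.0, 65.0, 60.0, 55.0]
toy_niche = [20.0, 18.0, 12.0, 10.0]
mean_diff, t_stat, dof = paired_t_statistic(toy_popular, toy_niche)
print(mean_diff, round(t_stat, 2), dof)  # 47.5 45.64 3
```

Converting the statistic into a p-value requires the t distribution; in practice `scipy.stats.ttest_rel` performs the whole computation.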

Error Type Analysis

Mainstream Languages: Most failures are incorrect answers or runtime errors, indicating models generate semantically reasonable but incorrect solutions
Niche Languages: Failures primarily consist of compilation errors, indicating models struggle to produce syntactically valid code
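A breakdown like the one above can be produced by tallying judge verdicts per language and normalizing over failures only. A minimal sketch, assuming `(language, verdict)` records and verdict labels such as "Compile Error"; the labels and sample records are illustrative, not the paper's data.

```python
from collections import Counter, defaultdict

def error_distribution(results):
    """Return {language: {verdict: share of that language's failures}}."""
    by_language = defaultdict(Counter)
    for language, verdict in results:
        if verdict != "Accepted":  # only failed submissions count
            by_language[language][verdict] += 1
    return {
        lang: {v: c / sum(counts.values()) for v, c in counts.items()}
        for lang, counts in by_language.items()
    }

# Toy verdicts only, not results from the paper.
toy_results = [
    ("python", "Wrong Answer"), ("python", "Wrong Answer"),
    ("python", "Runtime Error"), ("python", "Accepted"),
    ("racket", "Compile Error"), ("racket", "Compile Error"),
    ("racket", "Compile Error"), ("racket", "Wrong Answer"),
]
dist = error_distribution(toy_results)
print(dist["racket"])  # {'Compile Error': 0.75, 'Wrong Answer': 0.25}
```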

Related Work

AI Programming Assistant Research

Early Evaluations: Evaluations on the HumanEval benchmark show that Copilot produces syntactically valid code, but with low correctness rates that correlate strongly with a language's prevalence in the training data
Multilingual Benchmarks: XCODEEVAL and other large-scale multilingual benchmarks demonstrate persistent challenges on less common languages
Tool Comparisons: Copilot performs best in Java, ChatGPT maintains strong cross-language consistency, Gemini excels in JavaScript

Programming Ecosystem Evolution

Ecosystem Factors: Community size, tools, and industry adoption often supersede intrinsic technical advantages in influencing language adoption
Web Framework Research: 15-year longitudinal study shows significant differences in adoption trajectories across different ecosystems
Uneven LLM Performance: Existing surveys show LLMs perform unevenly on code tasks, with severe bias toward widely-used languages

Conclusions and Discussion

Main Conclusions

Matthew Effect Confirmed: AI programming assistants indeed exhibit significant Matthew Effect, with popular technologies enjoying systematic advantages
Dual-Layer Bias: This bias exists simultaneously at both programming language and framework levels
Self-Reinforcing Cycle: LLMs generate working code for popular frameworks more reliably → developers are steered toward those frameworks → increased adoption amplifies their online presence → future models see even more of them in training data

Limitations

Evaluation Scope: Primarily based on LeetCode algorithmic tasks and specific framework combinations
Time Window: Research based on models and popularity data from a specific point in time
Causality: While correlations are observed, establishing direct causal relationships remains challenging

Future Directions

Benchmark Expansion: Plans to extend benchmarks to broader domains
Multi-Agent Collaboration: Investigate collaborative multi-agent development scenarios
Diversity-Aware Methods: Develop methods to counter ecosystem homogenization through diversity-aware training and inference strategies

In-Depth Evaluation

Strengths

Problem Significance: First systematic investigation of long-term impact of AI programming assistants on software ecosystems, with important theoretical and practical value
Methodological Innovation: Designs a dual-layer experimental pipeline capable of detecting bias simultaneously at language and framework levels
Experimental Scale: Large-scale experiments with 120,440 code generations give the results statistical weight
Controlled Design: Effectively isolates popularity effects through methodology of maintaining consistent functional requirements while varying only technology stacks

Weaknesses

Representativeness Limitations: LeetCode tasks may not fully represent real-world programming scenarios
Time Sensitivity: Technology popularity is dynamically changing, limiting the timeliness of research results
Causal Mechanisms: While the Matthew Effect is observed, the analysis of the mechanisms that produce it remains insufficient
Solution Deficiency: Paper primarily identifies problems but lacks specific mitigation strategies

Impact

Academic Contribution: Provides new research perspectives for the intersection of AI and software engineering
Practical Value: Offers important warnings for AI tool developers and policymakers
Reproducibility: Provides complete datasets, code, and experimental settings supporting result reproduction

Applicable Scenarios

AI Tool Evaluation: Provides framework for evaluating fairness of AI programming assistants
Technology Decision-Making: Offers AI compatibility considerations for enterprise technology selection
Educational Policy: Provides reference for policymaking regarding AI tool usage in programming education

References

The paper cites 29 important references spanning multiple related fields including AI programming assistants, programming language adoption, and ecosystem evolution, providing solid theoretical foundation for this research.

Overall Assessment: This is a significant research paper that, for the first time, systematically documents the Matthew Effect in AI programming assistants. The methodology is rigorous, the experimental scale is substantial, and the conclusions have both theoretical and practical value. While there is room for improvement in solution development and mechanism analysis, the paper opens new research directions at the intersection of AI and software engineering.