AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks to systematically investigate how AI-assisted programming interacts with the software ecosystem. Our analysis reveals a striking Matthew effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code. The phenomenon suggests that AI systems may reinforce existing popularity hierarchies, accelerating convergence around dominant tools while hindering diversity and innovation. We provide a quantitative characterization of this effect and discuss its implications for the future evolution of programming ecosystems.
- Paper ID: 2509.23261
- Title: The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution
- Authors: Fei Gu, Zi Liang, Hongzong Li, Jiahao Ma
- Classification: cs.SE (Software Engineering)
- Publication Date: October 13, 2025 (arXiv v2)
- Paper Link: https://arxiv.org/abs/2509.23261
AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) spawning new paradigms such as "vibe coding" and "agentic coding." While prior research has primarily focused on prompt engineering and code generation quality, the broader impact of LLM-driven development on software engineering iteration dynamics remains underexplored. This paper systematically investigates how AI-assisted programming interacts with software ecosystems through large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks. The analysis reveals a significant Matthew Effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code. This phenomenon suggests that AI systems may reinforce existing popularity hierarchies, accelerate convergence toward mainstream tools, and simultaneously hinder diversity and innovation. The paper provides quantitative characterization of this effect and discusses its implications for the future evolution of programming ecosystems.
The core research question is: Do AI programming assistants inadvertently reinforce the dominance of existing programming languages and frameworks, thereby producing a "Matthew Effect"—a "rich get richer" phenomenon?
- Ecosystem Impact: As AI programming tools proliferate, their biases may systematically influence which languages, frameworks, and paradigms will thrive or decline
- Innovation Suppression: If AI tools are overly biased toward mainstream technologies, they may inhibit technological innovation and ecosystem diversity
- Long-term Consequences: Such biases may create lock-in effects, reduce experimentation opportunities, and diminish the likelihood of paradigm-shifting innovations
- Micro-level Assessment: Existing research primarily focuses on short-term, micro-level evaluations, measuring model performance on narrow benchmarks or single-language datasets
- Lack of Ecosystem Perspective: Existing evaluations fail to capture the multifaceted complexity of real-world software engineering
- Neglect of Systemic Bias: Little work investigates how AI tools affect the trajectory of entire programming ecosystems
The motivation rests on observations about LLM training data and assistant behavior. Python comprises nearly 40% of the StarCoder dataset, while many other languages occupy marginal shares. AI programming assistants also over-rely on established libraries: NumPy appears in 48% of completions, and Python is selected 58% of the time even in performance-critical tasks where other languages might be more suitable.
- First Large-Scale Benchmark: Constructs the first large-scale benchmark combining algorithmic programming tasks (120,440 code generations in total: 3,011 problems × 8 languages × 5 models) and complex full-stack development tasks to evaluate AI programming assistants' performance across languages and frameworks
- Controlled Evaluation Methodology: Designs a controlled evaluation methodology that isolates the effects of language and framework popularity, revealing structural biases beyond overall accuracy metrics
- Empirical Evidence of Matthew Effect: Provides the first empirical evidence of Matthew Effect in LLM code generation simultaneously at both language and framework levels, demonstrating how this dual-layer bias shapes software ecosystem trajectories
The research designs a two-tier experimental pipeline:
- Algorithmic Task Layer: Evaluates code generation performance across 8 programming languages on 3,011 LeetCode problems
- Framework Task Layer: Evaluates 6 full-stack combinations, ranging from popular to emerging, on 17 general CRUD applications and 8 technical pathway differentiation scenarios
Based on the June 2025 TIOBE Index, 8 languages were selected:
- Mainstream Languages: Python (Rank 1), C++ (Rank 2), Java (Rank 4), JavaScript (Rank 6)
- Emerging Languages: Go (Rank 7), Rust (Rank 13)
- Niche Languages: Erlang (Rank 46), Racket (Unranked)
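The size of the algorithmic benchmark follows directly from this selection. The sketch below is only an illustration of that arithmetic: the tier names and ranks are copied from the list above, the model names from the experimental setup, and the helper names (`all_languages`, `iter_jobs`) are hypothetical rather than taken from the paper's code.

```python
from itertools import product

# Language tiers with TIOBE ranks as listed above (None means unranked).
LANGUAGE_TIERS = {
    "mainstream": {"Python": 1, "C++": 2, "Java": 4, "JavaScript": 6},
    "emerging": {"Go": 7, "Rust": 13},
    "niche": {"Erlang": 46, "Racket": None},
}
MODELS = [
    "GPT-4o-mini", "DeepSeek-V3", "Gemini-2.0-Flash",
    "Gemini-2.5-Flash", "Qwen3-Turbo",
]
NUM_PROBLEMS = 3011  # LeetCode problems in the benchmark


def all_languages(tiers):
    """Flatten the tier mapping into a list of language names."""
    return [lang for group in tiers.values() for lang in group]


def iter_jobs(num_problems, languages, models):
    """Yield one (problem_index, language, model) triple per generation."""
    return product(range(num_problems), languages, models)


languages = all_languages(LANGUAGE_TIERS)
total = NUM_PROBLEMS * len(languages) * len(MODELS)
print(len(languages), total)  # prints: 8 120440
```

Every problem is attempted once per language and model, which is what makes the per-language comparisons paired rather than independent samples.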
Six full-stack combinations were selected, spanning from popular to emerging technology stacks:
- Vue + Spring Boot + Hibernate (Java enterprise-level)
- React + Express.js + Prisma (modern JavaScript)
- Django REST + Django ORM (Python full-stack)
- Preact + Gin + GORM (lightweight Go)
- Svelte + FastAPI + SQLAlchemy (modern Python)
- SolidJS + Actix Web + SeaORM (emerging Rust)
- Standardized Prompting: Generate consistent prompt templates for each problem and language combination
- Multi-stage Code Extraction: Design a multi-stage pipeline to extract pure executable code from mixed text responses
- Language-Specific Cleaning: Apply regular expression patterns tailored to the syntactic features of each programming language
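The paper's extraction pipeline is not reproduced in this summary, so the following is a minimal sketch of what such a multi-stage extractor could look like. The stage boundaries, the function name `extract_code`, and the cleanup patterns in `LABEL_RES` are all assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable

# Stage 1 pattern: a fenced block with an optional language tag.
FENCE_RE = re.compile(FENCE + r"([\w+#-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Stage 3 patterns (hypothetical examples): a stray language label left on
# the first line when the model omits the fence.
LABEL_RES = {
    "python": re.compile(r"\A\s*(?:python|py)\s*\n", re.IGNORECASE),
    "erlang": re.compile(r"\A\s*erlang\s*\n", re.IGNORECASE),
}


def extract_code(response: str, language: str) -> str:
    """Pull executable code out of a mixed text-and-code model response."""
    blocks = FENCE_RE.findall(response)
    if blocks:
        # Stage 1: prefer a block tagged with the target language,
        # otherwise take the longest fenced block.
        tagged = [body for tag, body in blocks if tag.lower() == language]
        code = tagged[0] if tagged else max((b for _, b in blocks), key=len)
    else:
        # Stage 2: no fence found, fall back to the raw response.
        code = response
    # Stage 3: language-specific cleanup.
    label_re = LABEL_RES.get(language)
    if label_re:
        code = label_re.sub("", code, count=1)
    return code.strip() + "\n"
```

A real pipeline would need more patterns per language; the point here is only the ordering of stages, from the most reliable signal (a tagged fence) to the least reliable (the raw response).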
For framework tasks, a strictly controlled VibeCoding protocol was adopted:
- Utilize Cursor Pro, CodeBuddy, and GitHub Copilot
- Experimenters perform no manual coding or architectural input
- Interactions strictly limited to forwarding raw error messages back to the chat interface
- Iterate until all core functional requirements are met or a preset attempt limit is reached
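The protocol above amounts to a feedback loop in which the only permitted input after the initial prompt is the raw error text. A minimal sketch follows; the callables `ask_assistant` and `build_and_check`, the `RunResult` type, and the default limit of 10 attempts are hypothetical stand-ins, since the summary does not state the preset limit or how the tools were driven.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool         # True when all core functional requirements are met
    error_log: str = ""  # raw error output, forwarded verbatim


def vibe_coding_session(
    task_prompt: str,
    ask_assistant: Callable[[str], str],
    build_and_check: Callable[[str], RunResult],
    max_attempts: int = 10,  # assumed value; the actual limit is not given
) -> Optional[int]:
    """Return the attempt number at which the task passed, or None."""
    message = task_prompt
    for attempt in range(1, max_attempts + 1):
        project = ask_assistant(message)
        result = build_and_check(project)
        if result.passed:
            return attempt
        # The only feedback allowed: the raw error text, unedited.
        message = result.error_log
    return None
```

The returned attempt count corresponds to the completion attempt metric reported for the framework tasks, with `None` marking a task that remained incomplete.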
- Dual-Layer Bias Detection: First systematic detection of Matthew Effect simultaneously at both language and framework levels
- Controlled Variable Methodology: Isolate popularity effects by maintaining consistent functional requirements while varying only the technology stack
- Large-Scale Distributed Evaluation: Implement a distributed submission system supporting 120,440 code generations
- LeetCode Benchmark: 3,011 problems (765 easy, 1,526 medium, 720 hard)
- Framework Tasks: 17 general CRUD applications + 8 technical pathway differentiation scenarios
- Models: 5 state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Gemini-2.0-Flash, Gemini-2.5-Flash, Qwen3-Turbo)
- Pass@1 Accuracy: Acceptance rate of first submission attempts
- Error Type Distribution: Compilation errors, runtime errors, incorrect answers, etc.
- Completion Attempt Count: Number of iterations required to achieve functional completeness in framework tasks
- API Parameters: temperature=0.5, maxOutputTokens=65535, top_p=0.95
- Distributed System: 15 LeetCode accounts, an exponential backoff strategy, and a limit of 10 submissions per account per minute
- Error Handling: A robust error-handling layer that includes rate limiting and retry mechanisms
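The retry behavior described in the implementation details can be sketched as follows. This is not the paper's submission system: the `RateLimited` exception, the `submit` callable, and all delay parameters are assumptions chosen to illustrate exponential backoff with jitter.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) submit function when throttled."""


def submit_with_backoff(submit, payload, max_retries=5, base_delay=1.0,
                        max_delay=60.0, sleep=time.sleep):
    """Call submit(payload), doubling the wait after each throttled attempt."""
    for attempt in range(max_retries + 1):
        try:
            return submit(payload)
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the final retry
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Under the stated limits, 15 accounts at 10 submissions per minute give a ceiling of 150 submissions per minute, so 120,440 submissions would take at least about 803 minutes (roughly 13 hours) even if the limit were fully saturated.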
The experiments reveal significant performance gaps between popular and niche languages:
Top Model Performance Comparison:
- Mainstream Languages: Python, JavaScript, Java, C++ achieve Pass@1 rates exceeding 60%
- Niche Languages: Erlang and Racket success rates typically below 25%, sometimes approaching zero
- Best Performance: DeepSeek-V3 achieves 79.81% on Python but only 24.31% on Erlang and 20.82% on Racket
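To make the size of the gap concrete, the short calculation below uses only the three DeepSeek-V3 figures quoted above. The helper names `gap_pp` and `ratio` are illustrative.

```python
# Pass@1 values (%) reported above for DeepSeek-V3.
deepseek_v3 = {"Python": 79.81, "Erlang": 24.31, "Racket": 20.82}


def gap_pp(scores, popular, niche):
    """Absolute gap in percentage points between two languages."""
    return round(scores[popular] - scores[niche], 2)


def ratio(scores, popular, niche):
    """How many times higher the popular language's Pass@1 is."""
    return round(scores[popular] / scores[niche], 2)


for niche in ("Erlang", "Racket"):
    print(niche, gap_pp(deepseek_v3, "Python", niche),
          ratio(deepseek_v3, "Python", niche))
```

For this model, Python leads Erlang by 55.5 percentage points and Racket by 58.99, which is more than a threefold difference in acceptance rate in both cases.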
Difficulty-Stratified Analysis:
- Easy Problems: Gap between popular and niche languages ranges from 45-82 percentage points
- Hard Problems: Gap expands to 58-95 percentage points
- Difficult Task Performance: Top models achieve 50-63% success rates on popular languages but only 0-6% on niche languages
Framework experiments similarly demonstrate significant bias patterns:
Success Rate Distribution:
- Mainstream Frameworks: Vue+Spring, React+Express, Django complete most of the 17 benchmark tasks within 1-3 attempts
- Niche Frameworks: Svelte+FastAPI and SolidJS+Actix show higher failure rates, with many tasks requiring over 5 attempts or remaining incomplete
Technical Pathway Differentiation Experiment:
- Mainstream Technology Stacks: Typically converge within 1-2 correction rounds
- Mid-tier Technology Stacks: Require 2-3 interventions
- Niche Technology Stacks: Frequently require 5-10 rounds of guidance to produce runnable systems
Paired t-tests on Pass@1 rate differences between popular and niche languages:
- Differences are statistically significant for all models (p < 0.001)
- Average differences range from +34.2 percentage points (Qwen3-Turbo) to +49.6 percentage points (DeepSeek-V3)
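For readers unfamiliar with the test, a paired t-test works on the per-pair differences rather than on the two samples separately. The sketch below computes the statistic with the standard library only. How the paper forms its pairs is not stated in this summary, and the numbers in the usage example are placeholders, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(popular, niche):
    """Return (mean difference, t statistic, degrees of freedom).

    Assumes the differences are not all identical; otherwise the
    standard deviation is zero and t is undefined.
    """
    if len(popular) != len(niche) or len(popular) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [p - n for p, n in zip(popular, niche)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err, len(diffs) - 1


# Placeholder Pass@1 values (%) for three hypothetical matched groups.
mean_diff, t, dof = paired_t_statistic([70.0, 60.0, 50.0], [20.0, 15.0, 5.0])
print(round(mean_diff, 2), round(t, 2), dof)
```

Converting the statistic to a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` performs the whole test in one call.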
Mainstream Languages: Most failures are incorrect answers or runtime errors, indicating models generate semantically reasonable but incorrect solutions
Niche Languages: Failures primarily consist of compilation errors, indicating models struggle to produce syntactically valid code
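Both headline metrics, Pass@1 and the error type distribution, can be derived from a list of first-submission verdicts. The sketch below assumes verdict labels such as "Accepted" and "Compile Error" in the style of an online judge; the exact labels used in the paper's data are not given in this summary.

```python
from collections import Counter


def pass_at_1(verdicts):
    """Share of first submissions judged 'Accepted', as a percentage."""
    if not verdicts:
        raise ValueError("no submissions")
    return 100.0 * sum(v == "Accepted" for v in verdicts) / len(verdicts)


def error_distribution(verdicts):
    """Fraction of each failure type among the failed submissions."""
    failures = Counter(v for v in verdicts if v != "Accepted")
    total = sum(failures.values())
    return {kind: n / total for kind, n in failures.items()} if total else {}


# Illustrative verdicts only, not data from the paper.
sample = ["Accepted", "Wrong Answer", "Accepted", "Compile Error"]
print(pass_at_1(sample), error_distribution(sample))
```

Normalizing over failures only, rather than over all submissions, is what lets the error profile of a low-accuracy language be compared with that of a high-accuracy one.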
- Early Evaluations: On the HumanEval benchmark, Copilot produces syntactically valid code but with low correctness rates, and correctness correlates strongly with language prevalence in the training data
- Multilingual Benchmarks: XCODEEVAL and other large-scale multilingual benchmarks demonstrate persistent challenges on less common languages
- Tool Comparisons: Copilot performs best in Java, ChatGPT maintains strong cross-language consistency, Gemini excels in JavaScript
- Ecosystem Factors: Community size, tools, and industry adoption often supersede intrinsic technical advantages in influencing language adoption
- Web Framework Research: 15-year longitudinal study shows significant differences in adoption trajectories across different ecosystems
- Uneven LLM Performance: Existing surveys show LLMs perform unevenly on code tasks, with severe bias toward widely-used languages
- Matthew Effect Confirmed: AI programming assistants indeed exhibit significant Matthew Effect, with popular technologies enjoying systematic advantages
- Dual-Layer Bias: This bias exists simultaneously at both programming language and framework levels
- Self-Reinforcing Cycle: LLMs generate working code more reliably for popular frameworks → developers are steered toward those frameworks → increased adoption further amplifies their online presence → future models see even more of them during training
- Evaluation Scope: Primarily based on LeetCode algorithmic tasks and specific framework combinations
- Time Window: Research based on models and popularity data from a specific point in time
- Causality: While correlations are observed, establishing direct causal relationships remains challenging
- Benchmark Expansion: Plans to extend benchmarks to broader domains
- Multi-Agent Collaboration: Investigate collaborative multi-agent development scenarios
- Diversity-Aware Methods: Develop methods to counter ecosystem homogenization through diversity-aware training and inference strategies
- Problem Significance: First systematic investigation of long-term impact of AI programming assistants on software ecosystems, with important theoretical and practical value
- Methodological Innovation: Designs a dual-layer experimental pipeline capable of detecting bias simultaneously at language and framework levels
- Experimental Scale: Large-scale experiments with 120,440 code generations give the results statistical weight
- Controlled Design: Effectively isolates popularity effects by holding functional requirements constant while varying only the technology stack
- Representativeness Limitations: LeetCode tasks may not fully represent real-world programming scenarios
- Time Sensitivity: Technology popularity is dynamically changing, limiting the timeliness of research results
- Causal Mechanisms: The Matthew Effect is observed, but the mechanisms that produce it are not analyzed in depth
- Solution Deficiency: Paper primarily identifies problems but lacks specific mitigation strategies
- Academic Contribution: Provides new research perspectives for the intersection of AI and software engineering
- Practical Value: Offers important warnings for AI tool developers and policymakers
- Reproducibility: Provides complete datasets, code, and experimental settings supporting result reproduction
- AI Tool Evaluation: Provides framework for evaluating fairness of AI programming assistants
- Technology Decision-Making: Offers AI compatibility considerations for enterprise technology selection
- Educational Policy: Provides reference for policymaking regarding AI tool usage in programming education
The paper cites 29 references spanning AI programming assistants, programming language adoption, and ecosystem evolution, which gives the research a solid theoretical foundation.
Overall Assessment: This paper is the first to systematically document the Matthew Effect in AI programming assistants. The methodology is rigorous, the experimental scale is substantial, and the conclusions have both theoretical and practical value. Although mitigation strategies and mechanism analysis leave room for improvement, the paper opens new research directions at the intersection of AI and software engineering.