
SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

Zhimin Zhao

Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce SWE-Arena, an interactive platform designed to evaluate FMs in SE tasks. SWE-Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. The platform introduces novel metrics, including the model consistency score, which measures the consistency of model outputs through self-play matches, and the conversation efficiency index, which evaluates model performance while accounting for the number of interaction rounds required to reach conclusions. Moreover, SWE-Arena incorporates a new feature called RepoChat, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SWE-Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.


Basic Information

Paper ID: 2502.01860
Title: SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
Author: Zhimin Zhao (Queen's University)
Classification: cs.SE cs.LG
Publication Date: arXiv Preprint (Latest version v5, October 10, 2025)
Paper Link: https://arxiv.org/abs/2502.01860v5

Abstract

Foundation Models (FMs), particularly Large Language Models (LLMs), have demonstrated tremendous potential across various software engineering (SE) tasks, including code generation, debugging, and requirements refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in the iterative, context-rich workflows inherent to SE activities. To address this limitation, this paper introduces SWE-Arena, an interactive platform specifically designed for evaluating FMs in SE tasks. SWE-Arena provides a transparent open-source leaderboard, supports multi-turn dialogue workflows, and enables end-to-end model comparison. The platform introduces novel evaluation metrics, including the Model Consistency Score (MCS), which measures output consistency through self-play matching, and the Conversation Efficiency Index (CEI), which evaluates model performance while considering the number of interaction rounds required to reach conclusions. Furthermore, SWE-Arena integrates a novel feature called RepoChat, which automatically injects repository-related context (such as issues, commits, and pull requests) into conversations, further aligning evaluation with real-world development workflows.

Research Background and Motivation

Core Problems

Existing foundation model evaluation frameworks face the following critical challenges in the software engineering domain:

Lack of Iterative Support: Traditional evaluation methods cannot handle the multi-turn interaction requirements specific to SE tasks
Missing Context: Existing frameworks fail to effectively integrate repository-level contextual information from real development scenarios
Unidimensional Evaluation: Platforms like Chatbot Arena rely solely on Elo ratings and win rates, providing limited evaluation perspectives
Insufficient Transparency: Many existing platforms are not open-source, limiting community-driven innovation

Problem Significance

Software engineering tasks possess the following characteristics that render traditional evaluation methods inadequate:

Multidimensionality: Spanning multiple domains including requirements engineering, release engineering, and project management
Iterative Nature: In a debugging session, for example, a model must refine its solution over several rounds of user feedback
Context Dependency: Real SE workflows require substantial repository-level contextual information

Limitations of Existing Approaches

Static Benchmarks: BigCodeBench, SWE-bench, and others rely on predefined datasets, lacking adaptability
Existing Arena Platforms: Chatbot Arena, WebDev Arena, and similar platforms do not support multi-turn interactions and offer limited evaluation metrics
Insufficient Domain Specificity: General-purpose evaluation platforms cannot capture the unique requirements of SE tasks

Core Contributions

First SE-Specific Interactive Evaluation Platform: SWE-Arena is the first large-scale crowdsourced evaluation platform specifically designed for software engineering tasks
Novel Evaluation Metrics: Proposes two innovative evaluation metrics—Model Consistency Score (MCS) and Conversation Efficiency Index (CEI)
RepoChat Feature: Automatically injects repository-level context, making evaluation more aligned with real development scenarios
Multidimensional Evaluation Framework: Integrates traditional metrics (Elo, win rate) with advanced metrics (eigenvector centrality, PageRank, etc.)
Open-Source Transparent Design: Provides a fully transparent open-source leaderboard and evaluation methodology

Methodology Details

Task Definition

SWE-Arena aims to evaluate foundation models' performance on software engineering tasks through pairwise comparisons based on human preferences. Inputs include user SE-related queries and optional repository URLs, while outputs consist of comparison results between responses from two anonymous models.

Platform Architecture Design

1. RepoChat Feature

RepoChat is the core innovative feature of SWE-Arena:

Automatic Context Extraction: Automatically extracts repository descriptions, programming languages, issue discussions, commit diffs, and other metadata from GitHub/GitLab and similar platforms
Intelligent Context Injection: Merges extracted context with user queries to form comprehensive prompts
Optional Usage: Users can choose whether to provide repository URLs, ensuring backward compatibility
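
The merging step described above can be pictured with a minimal sketch. The RepoContext class, its fields, and build_prompt are hypothetical names chosen for illustration, not SWE-Arena's actual code; fetching the metadata from GitHub or GitLab is left out, and the prompt layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepoContext:
    """Hypothetical container for metadata already fetched from a repository."""
    description: str = ""
    languages: List[str] = field(default_factory=list)
    issue_excerpts: List[str] = field(default_factory=list)
    commit_diffs: List[str] = field(default_factory=list)


def build_prompt(user_query: str, ctx: Optional[RepoContext] = None) -> str:
    """Prepend repository context to the query.

    Without a context object (no repository URL given) the query is
    returned unchanged, which mirrors the optional-usage behaviour.
    """
    if ctx is None:
        return user_query
    parts = []
    if ctx.description:
        parts.append(f"Repository description: {ctx.description}")
    if ctx.languages:
        parts.append("Languages: " + ", ".join(ctx.languages))
    for text in ctx.issue_excerpts:
        parts.append(f"Issue: {text}")
    for diff in ctx.commit_diffs:
        parts.append(f"Commit diff:\n{diff}")
    # The user's question goes last so it stays closest to the model's answer.
    parts.append(f"User query: {user_query}")
    return "\n\n".join(parts)
```

Calling build_prompt("Why does the build fail?") with no context returns the query as-is, while passing a populated RepoContext yields a single prompt with the metadata sections followed by the query.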

2. Multi-turn Dialogue System

Iterative Interaction: Supports multi-turn conversations between users and models, evaluating long-context processing capabilities
Dynamic Voting: Users can vote at any point in the conversation and later revise their vote
Context Management: Employs a first-in, first-out (FIFO) strategy when a conversation exceeds the context window limit
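
A FIFO strategy of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the platform's implementation: trim_fifo and count_tokens are hypothetical names, and counting whitespace-separated words stands in for a real tokenizer.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Message = Dict[str, str]


def count_tokens(message: Message) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(message["content"].split())


def trim_fifo(history: List[Message], max_tokens: int,
              counter: Callable[[Message], int] = count_tokens) -> List[Message]:
    """Drop the oldest messages until the remaining history fits max_tokens."""
    window: Deque[Message] = deque(history)
    total = sum(counter(m) for m in window)
    while window and total > max_tokens:
        # First in, first out: the earliest message is discarded first.
        total -= counter(window.popleft())
    return list(window)
```

The input list is not modified; the function returns the most recent suffix of the conversation that fits the budget.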

3. Quality Assurance Mechanisms

SE Relevance Filtering: Uses GPT-5-nano to automatically filter non-SE-related prompts
Anonymous Evaluation: Model identities remain hidden throughout the session
Response Time Limits: Individual model response time capped at 1 minute

Technical Innovations

1. Model Consistency Score (MCS)

MCS = (D/N) × 100%

Where D is the number of draws in self-play matches and N is the total number of self-play matches. Because both responses in a self-play match come from the same model, a higher share of draws indicates more consistent output.
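
As a minimal sketch of the formula, assuming each self-play match is recorded with an outcome label and that the label "draw" marks a draw (both the function name and the labels are illustrative):

```python
from typing import Iterable


def model_consistency_score(outcomes: Iterable[str]) -> float:
    """Return MCS as a percentage: draws / total self-play matches * 100."""
    results = list(outcomes)
    if not results:
        raise ValueError("MCS is undefined without self-play matches")
    draws = sum(1 for r in results if r == "draw")
    return draws / len(results) * 100.0
```

For example, three draws in four self-play matches give D = 3 and N = 4, so MCS = 75%.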

2. Conversation Efficiency Index (CEI)

CEI = Σ_i (s_i / n_i) / Σ_i (1 / n_i)

Where:

n_i: Number of chat turns in conversation i
s_i: Result score of the user vote for conversation i
Scoring rules: Victory = 1, Draw (both work well) = 0.3, Draw (neither works) = -0.3, Defeat = -1

CEI is a weighted average of vote scores in which each conversation is weighted by the reciprocal of its number of turns, so outcomes reached in fewer rounds count more.
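
The computation can be sketched as below. The outcome labels and the function name are assumptions made for illustration; the scores follow the scoring rules listed above.

```python
from typing import Iterable, Tuple

# Scores from the scoring rules; the string labels are illustrative.
SCORES = {
    "win": 1.0,
    "draw_both_good": 0.3,
    "draw_both_bad": -0.3,
    "loss": -1.0,
}


def conversation_efficiency_index(votes: Iterable[Tuple[str, int]]) -> float:
    """Compute CEI from (outcome, number_of_turns) pairs for one model."""
    numerator = 0.0
    denominator = 0.0
    for outcome, turns in votes:
        if turns < 1:
            raise ValueError("a conversation needs at least one turn")
        weight = 1.0 / turns  # shorter conversations weigh more
        numerator += SCORES[outcome] * weight
        denominator += weight
    if denominator == 0.0:
        raise ValueError("CEI is undefined without votes")
    return numerator / denominator
```

As a worked example, a win after one turn and a loss after four turns give (1/1 - 1/4) / (1/1 + 1/4) = 0.75 / 1.25 = 0.6: the quick win outweighs the slow loss.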

3. Multidimensional Evaluation Metrics Framework

Beyond traditional Elo ratings and win rates, the platform integrates:

Eigenvector Centrality: Measures global dominance
PageRank Score: Evaluates model importance in the comparison network
Newman Modularity Score: Reveals domain-specific capabilities
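
To make the network view concrete, here is a minimal PageRank sketch using plain power iteration. How the comparison graph is built is not specified in this summary; the sketch assumes a directed edge from the losing model to the winning model, weighted by the number of wins, which is one common construction. The function name, damping factor, and iteration count are likewise assumptions.

```python
from typing import Dict, List, Tuple


def pagerank(wins: List[Tuple[str, str, float]], damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """PageRank over a comparison graph.

    wins holds (winner, loser, count) triples; each becomes an edge
    loser -> winner with weight count (an assumed construction).
    """
    nodes = sorted({m for winner, loser, _ in wins for m in (winner, loser)})
    if not nodes:
        return {}
    edges = [(w, l, c) for w, l, c in wins if c > 0]
    out_weight = {m: 0.0 for m in nodes}
    for _, loser, count in edges:
        out_weight[loser] += count
    n = len(nodes)
    rank = {m: 1.0 / n for m in nodes}
    for _ in range(iterations):
        # Models that never lost have no outgoing edges; their rank
        # mass is redistributed uniformly so the ranks still sum to 1.
        dangling = sum(rank[m] for m in nodes if out_weight[m] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {m: base for m in nodes}
        for winner, loser, count in edges:
            new_rank[winner] += damping * rank[loser] * count / out_weight[loser]
        rank = new_rank
    return rank
```

Under this construction a model accumulates rank by beating models that are themselves highly ranked, which is the sense in which the score reflects importance in the comparison network.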

Experimental Setup

Platform Implementation

Deployment Platform: Hugging Face Spaces
Access URL: https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena
Open-Source Nature: Fully open-source, supporting community contributions

User Interface Design

First-Turn Interaction Interface:
- User login and prompt input
- Optional repository URL input
- Random model pairing mechanism
Multi-turn Dialogue Interface:
- Continuous conversation support
- Real-time voting and reassessment functionality
- Anonymous model display

Data Collection Strategy

Crowdsourced Evaluation: Collects preference data through user votes
Real-time Updates: Leaderboard updates immediately after users submit votes
Privacy Protection: Anonymized data collection with user consent requirements
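
An immediate leaderboard update after each vote can be illustrated with the standard Elo update rule. This is a sketch, not the platform's code: the K-factor of 32, the initial rating of 1000, and the function names are assumptions, and self-play matches are skipped here on the assumption that they feed only the consistency metric.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def apply_vote(ratings: Dict[str, float], model_a: str, model_b: str,
               score_a: float, k: float = 32.0, initial: float = 1000.0) -> None:
    """Update ratings in place after one vote.

    score_a is 1.0 if model A was preferred, 0.5 for a draw, 0.0 if
    model B was preferred.
    """
    if model_a == model_b:
        return  # self-play: no rating change (assumption)
    ra = ratings.setdefault(model_a, initial)
    rb = ratings.setdefault(model_b, initial)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
```

Starting from an empty table, apply_vote(ratings, "m1", "m2", 1.0) leaves m1 at 1016 and m2 at 984, since two unrated models each have an expected score of 0.5.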

Experimental Results

Platform Functionality Verification

The paper primarily demonstrates SWE-Arena's design and functional implementation rather than traditional comparative experimental results. Key verifications include:

Multi-turn Dialogue Support: Successfully implements iterative interaction evaluation
RepoChat Feature: Effectively extracts and injects repository context automatically
Real-time Leaderboard: Real-time computation and display of multidimensional metrics
Quality Control: Effectively filters non-SE-related queries

Evaluation Metric Validity

MCS Metric: Effectively quantifies model consistency through self-play
CEI Metric: Successfully balances result quality and efficiency considerations
Multidimensional Metrics: Provides richer evaluation perspectives compared to single Elo ratings

Related Work

Static Benchmarks

BigCodeBench: Code generation benchmark
DevOps-Eval: DevOps-related evaluation
EvalPlus: Code evaluation enhancement framework
SWE-bench: GitHub issue resolution benchmark

Pairwise Comparison Platforms

Chatbot Arena: General-purpose chatbot evaluation platform
WebDev Arena: Web development-specific evaluation platform
Copilot Arena: Code assistant evaluation platform

Technical Differentiation

SWE-Arena's advantages over existing work:

First SE-specific platform supporting multi-turn interactions
Integrates repository-level context through RepoChat feature
Richer multidimensional evaluation metrics framework
Fully open-source and transparent design

Conclusions and Discussion

Main Conclusions

SWE-Arena successfully fills the gap in interactive model evaluation for SE
RepoChat feature effectively enhances evaluation authenticity and practicality
Newly proposed MCS and CEI metrics provide novel perspectives for model evaluation
Multidimensional evaluation framework provides more comprehensive model understanding than single metrics

Limitations

User Engagement Dependency: Platform effectiveness depends on active user community participation
Subjective Bias: Human preference evaluation inherently contains subjectivity
Limited Model Coverage: Currently supported model types are relatively limited
Long-term Maintenance Requirements: Requires continuous technical maintenance and community support

Future Directions

The paper explicitly proposes four development directions:

Real SE Workload Analysis: Analyze patterns in user-submitted requests and develop specialized sub-leaderboards
Enhanced Community Participation: Promote broader research and development community contributions
Expanded FM Coverage: Support domain-specific models and multimodal foundation models
Advanced Context Compression: Integrate techniques such as LongRoPE and SelfExtend for handling long interaction histories

In-Depth Evaluation

Strengths

Strong Innovation: First SE-specific interactive evaluation platform, filling an important gap
Advanced Technology: RepoChat feature and new evaluation metrics demonstrate clear innovation
High Practical Value: Directly serves the practical needs of the SE community
Reasonable Design: Multi-turn interaction, anonymous evaluation, and other design choices follow evaluation best practices
Open-Source Transparency: Fully open-source design promotes community development and academic research

Weaknesses

Lack of Large-Scale Validation: The paper reports little usage data and does not validate the platform's effectiveness at scale
Insufficient Metric Validation: Newly proposed MCS and CEI metrics lack validation against human judgment
Insufficient Scalability Considerations: Limited discussion of technical challenges for large-scale user concurrency and long-term operation
Inadequate Bias Control Mechanisms: Insufficient detail on mechanisms for controlling potential user and model biases

Impact

Academic Contribution: Provides new directions and tools for model evaluation research in SE
Practical Value: Can directly serve industrial needs for model selection and evaluation
Community Building: Has potential to become an important community platform in the SE-AI intersection
Methodological Inspiration: Evaluation methods and metric design can inspire similar research in other domains

Applicable Scenarios

Model Developers: Evaluate and improve SE-related foundation models
Software Engineers: Select optimal models for specific SE tasks
Researchers: Conduct empirical research in the SE-AI intersection
Tool Developers: Integrate evaluation capabilities into SE tool chains

References

The paper cites 18 relevant references, covering:

Theoretical foundations of Elo rating systems and Bradley-Terry models
Human preference learning and reinforcement learning research
Existing code generation and SE benchmarks
Network analysis and ranking algorithms
Context window extension technologies

Overall Assessment: SWE-Arena represents significant progress in SE model evaluation. Through its platform design and evaluation methodology, it offers a valuable response to the limitations of existing evaluation frameworks. Validation at scale and long-term sustainability remain to be demonstrated, but its technical innovation and practical value give it the potential to become an important tool in this field.