Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce \emph{SWE-Arena}, an interactive platform designed to evaluate FMs in SE tasks. SWE-Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. The platform introduces novel metrics, including \emph{model consistency score} that measures the consistency of model outputs through self-play matches, and \emph{conversation efficiency index} that evaluates model performance while accounting for the number of interaction rounds required to reach conclusions. Moreover, SWE-Arena incorporates a new feature called \emph{RepoChat}, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SWE-Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.
- Paper ID: 2502.01860
- Title: SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
- Author: Zhimin Zhao (Queen's University)
- Classification: cs.SE cs.LG
- Publication Date: arXiv Preprint (Latest version v5, October 10, 2025)
- Paper Link: https://arxiv.org/abs/2502.01860v5
Foundation Models (FMs), particularly Large Language Models (LLMs), have demonstrated tremendous potential across various software engineering (SE) tasks, including code generation, debugging, and requirements refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in the iterative, context-rich workflows inherent to SE activities. To address this limitation, this paper introduces SWE-Arena, an interactive platform specifically designed for evaluating FMs in SE tasks. SWE-Arena provides a transparent open-source leaderboard, supports multi-turn dialogue workflows, and enables end-to-end model comparison. The platform introduces novel evaluation metrics, including the Model Consistency Score (MCS), which measures output consistency through self-play matching, and the Conversation Efficiency Index (CEI), which evaluates model performance while considering the number of interaction rounds required to reach conclusions. Furthermore, SWE-Arena integrates a novel feature called RepoChat, which automatically injects repository-related context (such as issues, commits, and pull requests) into conversations, further aligning evaluation with real-world development workflows.
Existing foundation model evaluation frameworks face the following critical challenges in the software engineering domain:
- Lack of Iterative Support: Traditional evaluation methods cannot handle the multi-turn interaction requirements specific to SE tasks
- Missing Context: Existing frameworks fail to effectively integrate repository-level contextual information from real development scenarios
- Unidimensional Evaluation: Platforms like Chatbot Arena rely solely on Elo ratings and win rates, providing limited evaluation perspectives
- Insufficient Transparency: Many existing platforms are not open-source, limiting community-driven innovation
Software engineering tasks possess the following characteristics that render traditional evaluation methods inadequate:
- Multidimensionality: Spanning multiple domains including requirements engineering, release engineering, and project management
- Iterativity: In debugging sessions, for example, models must iteratively optimize solutions based on user feedback
- Context Dependency: Real SE workflows require substantial repository-level contextual information
Existing evaluation approaches have the following limitations:
- Static Benchmarks: BigCodeBench, SWE-bench, and others rely on predefined datasets, lacking adaptability
- Existing Arena Platforms: Chatbot Arena, WebDev Arena, and similar platforms do not support multi-turn interactions and offer limited evaluation metrics
- Insufficient Domain Specificity: General-purpose evaluation platforms cannot capture the unique requirements of SE tasks
The paper makes the following main contributions:
- First SE-Specific Interactive Evaluation Platform: SWE-Arena is the first large-scale crowdsourced evaluation platform specifically designed for software engineering tasks
- Novel Evaluation Metrics: Proposes two new evaluation metrics, the Model Consistency Score (MCS) and the Conversation Efficiency Index (CEI)
- RepoChat Feature: Automatically injects repository-level context, making evaluation more aligned with real development scenarios
- Multidimensional Evaluation Framework: Integrates traditional metrics (Elo, win rate) with advanced metrics (eigenvector centrality, PageRank, etc.)
- Open-Source Transparent Design: Provides a fully transparent open-source leaderboard and evaluation methodology
SWE-Arena aims to evaluate foundation models' performance on software engineering tasks through pairwise comparisons based on human preferences. Inputs include user SE-related queries and optional repository URLs, while outputs consist of comparison results between responses from two anonymous models.
RepoChat is the core innovative feature of SWE-Arena:
- Automatic Context Extraction: Automatically extracts repository descriptions, programming languages, issue discussions, commit diffs, and other metadata from GitHub/GitLab and similar platforms
- Intelligent Context Injection: Merges extracted context with user queries to form comprehensive prompts
- Optional Usage: Users can choose whether to provide repository URLs, ensuring backward compatibility
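The merge step can be illustrated with a small sketch. It assumes the repository metadata has already been fetched; the `RepoContext` container, the `build_prompt` function, and the prompt layout are hypothetical and not taken from the SWE-Arena codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepoContext:
    """Hypothetical container for metadata already fetched from a repository host."""
    description: str = ""
    languages: List[str] = field(default_factory=list)
    issue_excerpt: str = ""
    commit_diff: str = ""


def build_prompt(user_query: str, repo: Optional[RepoContext] = None) -> str:
    """Merge optional repository context with the user's query (assumed layout)."""
    if repo is None:
        # No repository URL was supplied: the query passes through unchanged.
        return user_query

    sections = []
    if repo.description:
        sections.append(f"Repository description: {repo.description}")
    if repo.languages:
        sections.append("Languages: " + ", ".join(repo.languages))
    if repo.issue_excerpt:
        sections.append(f"Issue discussion:\n{repo.issue_excerpt}")
    if repo.commit_diff:
        sections.append(f"Commit diff:\n{repo.commit_diff}")

    if not sections:
        # Nothing useful was extracted, so fall back to the bare query.
        return user_query
    context = "\n\n".join(sections)
    return f"{context}\n\nUser query: {user_query}"
```

Because the repository argument is optional, a query submitted without a URL produces exactly the same prompt as before, which is one simple way to get the backward compatibility described above.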
The platform's interaction and quality-control mechanisms include:
- Iterative Interaction: Supports multi-turn conversations between users and models, evaluating long-context processing capabilities
- Dynamic Voting: Users can vote at any point in the conversation and later revise their vote
- Context Management: Employs a FIFO strategy, discarding the oldest messages first when a conversation exceeds the context window
- SE Relevance Filtering: Uses GPT-5-nano to automatically filter non-SE-related prompts
- Anonymous Evaluation: Model identities remain hidden throughout the session
- Response Time Limits: Individual model response time capped at 1 minute
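The FIFO context management policy can be sketched as follows. This is a minimal illustration under stated assumptions: the whitespace-based `count_tokens` stands in for a real tokenizer, and the function names and the rule of always keeping the newest message are hypothetical choices, not details from the paper.

```python
from collections import deque
from typing import Deque, List, Tuple

Message = Tuple[str, str]  # (role, text)


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_fifo(history: List[Message], max_tokens: int) -> List[Message]:
    """Drop the oldest messages until the history fits within max_tokens."""
    window: Deque[Message] = deque(history)
    total = sum(count_tokens(text) for _, text in window)
    # Keep at least the most recent message, even if it alone is too long.
    while len(window) > 1 and total > max_tokens:
        _, dropped = window.popleft()
        total -= count_tokens(dropped)
    return list(window)
```

For example, with a three-message history of 3, 2, and 4 words and a budget of 6, only the first (oldest) message is dropped.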
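The Model Consistency Score (MCS) is the fraction of a model's self-play matches that end in a draw:

$$\mathrm{MCS} = \frac{D}{N}$$

where D is the number of draws in self-play matches and N is the total number of self-play matches. The metric quantifies output consistency: a model that answers the same prompt consistently should tie with itself often.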
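As a minimal sketch of this computation, assuming the score is simply the draw fraction D / N and that each self-play outcome is recorded as a string label (the label `"draw"` and the function name are illustrative, not taken from the paper):

```python
from typing import Iterable


def model_consistency_score(outcomes: Iterable[str]) -> float:
    """Return D / N, where D counts outcomes labeled 'draw' among N self-play matches."""
    results = list(outcomes)
    if not results:
        raise ValueError("at least one self-play match is required")
    draws = sum(1 for outcome in results if outcome == "draw")
    return draws / len(results)
```

Three draws out of four self-play matches would give a score of 0.75 under this reading.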
The Conversation Efficiency Index (CEI) is computed from the following quantities:
- n_i: Number of chat turns in the i-th conversation
- s_i: Result score for the user vote on the i-th conversation
- Scoring rules: Victory = 1, Draw (both work well) = 0.3, Draw (neither works) = -0.3, Defeat = -1
This metric comprehensively considers result quality and the number of interaction rounds required to reach conclusions.
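The exact CEI formula is not reproduced in this summary, so the sketch below rests on an assumption: each vote score s_i is weighted by 1 / n_i, so that conclusions reached in fewer turns count more, and the result is normalized by the sum of the weights. The outcome labels and function name are hypothetical; only the four score values come from the scoring rules above.

```python
from typing import List, Tuple

# Vote scores as listed in the scoring rules above.
SCORES = {
    "win": 1.0,
    "draw_both_good": 0.3,
    "draw_both_bad": -0.3,
    "loss": -1.0,
}


def conversation_efficiency_index(votes: List[Tuple[str, int]]) -> float:
    """Assumed form: sum(s_i / n_i) / sum(1 / n_i) over one model's votes."""
    if not votes:
        raise ValueError("at least one vote is required")
    weighted = 0.0
    weight_sum = 0.0
    for outcome, turns in votes:
        if turns < 1:
            raise ValueError("a conversation has at least one turn")
        weight = 1.0 / turns  # shorter conversations receive more weight
        weighted += SCORES[outcome] * weight
        weight_sum += weight
    return weighted / weight_sum
```

Under this assumed form the index stays within [-1, 1], and a one-turn win followed by a two-turn loss yields (1 - 0.5) / 1.5, or about 0.33.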
Beyond traditional Elo ratings and win rates, the platform integrates:
- Eigenvector Centrality: Measures global dominance
- PageRank Score: Evaluates model importance in the comparison network
- Newman Modularity Score: Reveals domain-specific capabilities
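To illustrate how a graph-based ranking differs from Elo, the following is a plain power-iteration PageRank over a comparison network. It is a sketch under assumptions: each vote is modeled as a directed edge from the losing model to the winning model, draws are ignored, and the damping factor of 0.85 is the conventional default rather than a value stated in the paper.

```python
from typing import Dict, List, Tuple


def pagerank(wins: List[Tuple[str, str]], damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank; each (loser, winner) pair is an edge loser -> winner."""
    nodes = sorted({model for pair in wins for model in pair})
    out_edges: Dict[str, List[str]] = {node: [] for node in nodes}
    for loser, winner in wins:
        out_edges[loser].append(winner)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by models with no recorded losses is spread over all models.
        dangling = sum(rank[node] for node in nodes if not out_edges[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            for target in out_edges[node]:
                new_rank[target] += damping * rank[node] / len(out_edges[node])
        rank = new_rank
    return rank
```

With this edge direction, a model accumulates rank by beating models that themselves beat others, so the score reflects the strength of the opponents defeated and not only the number of wins.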
The platform is deployed as follows:
- Deployment Platform: Hugging Face Spaces
- Access URL: https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena
- Open-Source Nature: Fully open-source, supporting community contributions
- First-Turn Interaction Interface:
  - User login and prompt input
  - Optional repository URL input
  - Random model pairing mechanism
- Multi-turn Dialogue Interface:
  - Continuous conversation support
  - Real-time voting and reassessment functionality
  - Anonymous model display
Data collection and leaderboard updates work as follows:
- Crowdsourced Evaluation: Collects preference data through user votes
- Real-time Updates: Leaderboard updates immediately after users submit votes
- Privacy Protection: Anonymized data collection with user consent requirements
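A per-vote leaderboard update can be sketched with the standard Elo formula. The K-factor of 32 and the 400-point scale are the usual Elo conventions and are assumptions here; the paper's actual rating configuration is not given in this summary, and the function names are illustrative.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one vote; score_a is 1 (A wins), 0.5 (draw), or 0 (A loses)."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

Because each vote touches only the two models involved, an update of this kind is cheap enough to run as soon as a vote is submitted.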
The paper demonstrates SWE-Arena's design and functionality rather than reporting comparative experimental results. The capabilities it demonstrates include:
- Multi-turn Dialogue Support: Successfully implements iterative interaction evaluation
- RepoChat Feature: Effectively extracts and injects repository context automatically
- Real-time Leaderboard: Real-time computation and display of multidimensional metrics
- Quality Control: Effectively filters non-SE-related queries
- MCS Metric: Effectively quantifies model consistency through self-play
- CEI Metric: Successfully balances result quality and efficiency considerations
- Multidimensional Metrics: Provides richer evaluation perspectives compared to single Elo ratings
Related benchmarks and evaluation platforms include:
- BigCodeBench: Code generation benchmark
- DevOps-Eval: DevOps-related evaluation
- EvalPlus: Code evaluation enhancement framework
- SWE-bench: GitHub issue resolution benchmark
- Chatbot Arena: General-purpose chatbot evaluation platform
- WebDev Arena: Web development-specific evaluation platform
- Copilot Arena: Code assistant evaluation platform
SWE-Arena's advantages over existing work:
- First SE-specific platform supporting multi-turn interactions
- Integrates repository-level context through RepoChat feature
- Richer multidimensional evaluation metrics framework
- Fully open-source and transparent design
The main conclusions are:
- SWE-Arena fills a gap in interactive model evaluation for SE
- RepoChat feature effectively enhances evaluation authenticity and practicality
- Newly proposed MCS and CEI metrics provide novel perspectives for model evaluation
- Multidimensional evaluation framework provides more comprehensive model understanding than single metrics
The platform has the following limitations:
- User Engagement Dependency: Platform effectiveness depends on active user community participation
- Subjective Bias: Human preference evaluation inherently contains subjectivity
- Limited Model Coverage: Currently supported model types are relatively limited
- Long-term Maintenance Requirements: Requires continuous technical maintenance and community support
The paper explicitly proposes four development directions:
- Real SE Workload Analysis: Analyze patterns in user-submitted requests and develop specialized sub-leaderboards
- Enhanced Community Participation: Promote broader research and development community contributions
- Expanded FM Coverage: Support domain-specific models and multimodal foundation models
- Advanced Context Compression: Integrate technologies like LongRope and SelfExtend for handling long interaction histories
Strengths of the work:
- Strong Innovation: First SE-specific interactive evaluation platform, filling an important gap
- Advanced Technology: RepoChat feature and new evaluation metrics demonstrate clear innovation
- High Practical Value: Directly serves the practical needs of the SE community
- Reasonable Design: Multi-turn interaction, anonymous evaluation, and other design choices follow evaluation best practices
- Open-Source Transparency: Fully open-source design promotes community development and academic research
Weaknesses of the work:
- Lack of Large-Scale Validation: The paper reports little user usage data or effectiveness validation
- Insufficient Metric Validation: Newly proposed MCS and CEI metrics lack validation against human judgment
- Insufficient Scalability Considerations: Limited discussion of technical challenges for large-scale user concurrency and long-term operation
- Inadequate Bias Control Mechanisms: Insufficient detail on mechanisms for controlling potential user and model biases
Potential impact:
- Academic Contribution: Provides new directions and tools for model evaluation research in SE
- Practical Value: Can directly serve industrial needs for model selection and evaluation
- Community Building: Has potential to become an important community platform in the SE-AI intersection
- Methodological Inspiration: Evaluation methods and metric design can inspire similar research in other domains
Intended users and use cases:
- Model Developers: Evaluate and improve SE-related foundation models
- Software Engineers: Select optimal models for specific SE tasks
- Researchers: Conduct empirical research in the SE-AI intersection
- Tool Developers: Integrate evaluation capabilities into SE tool chains
The paper cites 18 relevant references, covering:
- Theoretical foundations of Elo rating systems and Bradley-Terry models
- Human preference learning and reinforcement learning research
- Existing code generation and SE benchmarks
- Network analysis and ranking algorithms
- Context window extension technologies
Overall Assessment: SWE-Arena represents significant progress in SE model evaluation. Through its platform design and evaluation methodology, it offers a valuable response to the limitations of existing evaluation frameworks. Validation at scale and long-term sustainability still need to be demonstrated, but its technical innovation and practical value give it the potential to become an important tool in this field.