Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
- Paper ID: 2510.25744
- Title: Completion ≠ Collaboration: Scaling Collaborative Effort with Agents
- Authors: Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag
- Institutions: MIT, CMU, University of Washington, Stanford University
- Classification: cs.CL cs.AI
- Paper Link: https://arxiv.org/abs/2510.25744
- Project Link: https://github.com/clinicalml/collaborative-effort-scaling
Current agent evaluation focuses primarily on one-shot task completion, failing to account for the iterative and collaborative nature inherent in many real-world problems, where human objectives are often underspecified and evolving. This paper proposes shifting from building and evaluating task-completion agents toward developing collaborative agents, assessed not only by final output quality but also by how they interact with and augment human effort throughout the problem-solving process. To support this transition, the authors introduce the Collaborative Effort Scaling framework, which captures how agent utility grows with increased user engagement. Through case studies and simulated evaluation, the research demonstrates that state-of-the-art agents perform poorly in multi-turn realistic scenarios, revealing missing elements in agent design: the ability to maintain engagement and support user understanding.
- Core Issue: Existing agents are primarily optimized for one-shot task completion, yet real-world complex tasks often require iterative human-AI collaboration
- Problem Significance: As LLM-based agents are increasingly applied to complex knowledge work, effective collaboration becomes a critical challenge
- Existing Limitations:
  - Assumes user requirements are static and fully specified
  - Overlooks how users build understanding and how their goals evolve
  - Lacks evaluation mechanisms for collaboration process quality
Through case studies across five domains (data analysis, travel planning, financial consulting, education, and mathematical discovery), the authors identify systematic issues with current task-completion agents in multi-turn interactions:
- Premature generation of difficult-to-digest complete results
- Inability to effectively integrate user feedback
- Lack of transparency in reasoning processes
- Poor performance when user needs evolve
- Theoretical Framework: Proposes the Collaborative Effort Scaling framework, evaluating human-AI collaboration quality across two dimensions: user effort and joint utility
- Evaluation Methodology: Designs a metric system for quantifying collaborative agent performance, including interaction sustainability and maximum availability
- Empirical Findings: Demonstrates through simulated experiments that current SOTA agents perform poorly in collaborative scenarios, highlighting the importance of collaborative design
- Design Insights: Provides concrete design guidance and diagnostic tools for building more effective collaborative agents
Models human-AI collaboration as a Partially Observable Markov Decision Process (POMDP):
- Action Sequence: $a = [a_1^{(l_1)}, a_2^{(l_2)}, \ldots, a_T^{(l_T)}]$, where $l_t \in \{H, A\}$ denotes whether step $t$ is taken by the human or the agent
- Context Window: $c = [c_1^{(l_1)}, c_2^{(l_2)}, \ldots, c_T^{(l_T)}]$
- Collaborative Rounds: The whole process is decomposed into rounds $a^{k} = a[i_k : j_k]$, delimited by human-agent handoffs
- User Effort: Cognitive and research work invested by users in the collaboration process
  - Base Metric: Number of human-led rounds $|a^{H}|$
  - Enhanced Metric: Context tokens processed $\sum c^{A}$
- Utility of Joint Actions: Quality of work completed jointly by the human-AI team
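The round decomposition and the base effort metric can be illustrated with a minimal sketch. It assumes that a round is a maximal run of consecutive actions by the same actor, so every human-agent handoff opens a new round; the summary does not spell out the exact boundary rule, and the function names `split_into_rounds` and `human_led_rounds` are hypothetical.

```python
from itertools import groupby


def split_into_rounds(actors):
    """Split a sequence of actor labels ('H' or 'A') into rounds.

    Assumption: a round is a maximal run of consecutive actions by the
    same actor, so each handoff starts a new round. Each round is
    returned as (actor, start, end) with `end` exclusive, mirroring the
    slice notation a[i_k : j_k].
    """
    rounds = []
    start = 0
    for actor, run in groupby(actors):
        length = len(list(run))
        rounds.append((actor, start, start + length))
        start += length
    return rounds


def human_led_rounds(actors):
    """Base effort metric: the number of rounds led by the human."""
    return sum(1 for actor, _, _ in split_into_rounds(actors) if actor == "H")


# Hypothetical trajectory: the human acts, the agent takes two steps,
# the human gives two pieces of feedback, and the agent responds once.
actors = ["H", "A", "A", "H", "H", "A"]
print(split_into_rounds(actors))
print(human_led_rounds(actors))
```

Counting rounds rather than individual actions treats several consecutive human messages as a single unit of effort, which is why a token-based metric is listed as a finer-grained alternative.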
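Overall Utility:
$$U = \frac{1}{N}\sum_{i=1}^{N} \max_{k} U_k^{(i)}$$
Improvement Gain:
$$G = \frac{1}{N}\sum_{i=1}^{N} \left( \max_{k} U_k^{(i)} - U_{k'_i}^{(i)} \right)$$
Availability Decline:
$$D@\tau = \frac{1}{N}\sum_{i=1}^{N} \left( U_{k_{i,\tau}}^{(i)} - U_{K_i}^{(i)} \right)$$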
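To make the three aggregate metrics concrete, here is a minimal sketch that computes them from per-round utility trajectories. The summary does not define how the reference rounds are chosen, so the sketch takes them as inputs: `initial_idx[i]` stands for the round $k'_i$ and `tau_idx[i]` for the round $k_{i,\tau}$. It also assumes that $K_i$ is the last recorded round of task $i$. The function name and the sample numbers are hypothetical.

```python
def collaboration_metrics(trajectories, initial_idx, tau_idx):
    """Compute (overall utility, improvement gain, availability decline).

    trajectories[i] -- per-round utilities U_k for task i
    initial_idx[i]  -- index of the reference round k'_i (supplied by the caller)
    tau_idx[i]      -- index of the round k_{i,tau} (supplied by the caller)
    Assumption: K_i is the final recorded round, i.e. trajectories[i][-1].
    """
    n = len(trajectories)
    # U: mean over tasks of the best utility reached in any round.
    overall = sum(max(u) for u in trajectories) / n
    # G: mean gap between the best utility and the utility at round k'_i.
    gain = sum(max(u) - u[k] for u, k in zip(trajectories, initial_idx)) / n
    # D@tau: mean gap between the utility at round k_{i,tau} and the final round.
    decline = sum(u[k] - u[-1] for u, k in zip(trajectories, tau_idx)) / n
    return overall, gain, decline


# Made-up utilities for two tasks, four rounds each.
trajectories = [
    [0.25, 0.5, 0.75, 0.5],
    [0.5, 0.5, 1.0, 1.0],
]
overall, gain, decline = collaboration_metrics(
    trajectories, initial_idx=[0, 0], tau_idx=[2, 2]
)
print(overall, gain, decline)  # 0.875 0.5 0.125
```

In this toy data the first task peaks at round three and then loses utility, which is the pattern the decline term is meant to surface, while the second task holds its peak to the end.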
- Interaction Sustainability: Agents should generate greater value as user effort increases
- Maximum Availability: Agents should encourage and maintain long-term interaction, preventing premature user abandonment
- From Outcome-Oriented to Process-Oriented: Focuses not only on final output quality but also on collaboration process effectiveness
- Scaling Law Inspiration: Borrows concepts from scaling laws in machine learning to study collaborative utility scaling properties
- Multi-Stage Modeling: Distinguishes between initial request phase and refinement phase, more precisely capturing collaborative dynamics
- Platform: Collaborative-Gym environment supporting asynchronous human-agent actions
- Tasks: Travel planning, in which a high-level description is developed into a detailed itinerary covering activities, accommodations, and transportation
- Test Models: GPT-4o, Claude 3.5 Sonnet, Claude 4.0 Sonnet, Llama-3.1 70B
- Agent Types:
  - Automated baseline agents
  - One-stage collaborative agents
  - Two-stage collaborative agents (with added planning steps)
- Performance Metrics: Arithmetic mean of common-sense pass rate and constraint satisfaction rate
- Simulated User: A prompted GPT-4o agent with additional access to the user's preferences and goals
- Interaction Limit: Maximum 30 rounds of interaction
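The performance metric in this setup is a plain arithmetic mean of two rates. A minimal sketch follows, assuming both rates are already expressed as fractions in [0, 1]; the function name `travel_plan_utility` is hypothetical and how each rate is measured is outside its scope.

```python
def travel_plan_utility(commonsense_pass_rate, constraint_satisfaction_rate):
    """Arithmetic mean of the two pass rates, each expected in [0, 1]."""
    for rate in (commonsense_pass_rate, constraint_satisfaction_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate must lie in [0, 1], got {rate}")
    return (commonsense_pass_rate + constraint_satisfaction_rate) / 2


# An itinerary that passes half of the common-sense checks and
# satisfies every stated constraint.
print(travel_plan_utility(0.5, 1.0))  # 0.75
```

Because the two rates are weighted equally, a plan cannot reach a high score by satisfying user constraints alone while failing common-sense checks, or the reverse.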
- All agents exhibit similar collaborative effort scaling trends: initial improvement followed by plateau around 5 rounds of interaction
- Claude series models perform best, effectively leveraging user effort to achieve performance improvements
Based on Table 1 results:
| Model | Strategy | Overall Utility | Improvement Gain (Relative) | Availability Decline (Relative) |
|---|---|---|---|---|
| Claude-4.0-sonnet | One-stage | 0.680 | 5.7% | -20.6% |
| Claude-4.0-sonnet | Two-stage | 0.681 | 5.2% | -34.9% |
| Claude-3.5-sonnet | One-stage | 0.450 | 13.6% | -29.7% |
| GPT-4o | One-stage | 0.507 | 4.9% | -20.8% |
- Claude-3.5-sonnet: Two-stage planning significantly improves performance, from 0.450 to 0.687
- Claude-4.0-sonnet: One-stage and two-stage strategies achieve similar final utility but with different efficiency
- GPT-4o and Llama-3.1-70b: Collaborative versions fail to exceed automated baselines
- All models except Claude-4.0-sonnet require users to invest more tokens for limited returns
- Claude-4.0-sonnet maintains strong performance across a wider range of effort ratios
- Model-dependent optimal agent-user effort ratios exist
- Joint performance declines when either party dominates interaction excessively
- Capability Determines Strategy: Weaker models require more structured interaction scaffolding
- Collaborative Design is Critical: Even powerful models show significant performance variations based on collaboration design
- Effort Balance Matters: Optimal human-AI effort allocation ratios exist and require adjustment based on model capability
- Early research focused on human-AI collaboration design principles for limited AI systems
- Modern LLM agents possess more complex interaction capabilities, requiring new collaborative frameworks
- Existing benchmarks primarily focus on task completion capabilities (e.g., SWE-Bench, WebArena, GAIA)
- Lack systematic evaluation of collaboration process quality
- Recent work introduces interactive evaluation but remains limited to narrowly-scoped step-by-step interactions
- This paper focuses on collaborative dynamics across extended interaction trajectories
- Paradigm Shift Necessary: Transition from task completion to collaborative capability evaluation is essential
- Current Agents Insufficient: SOTA agents perform poorly in collaborative scenarios because they lack the ability to maintain engagement and support user understanding
- Design Guidance: The Collaborative Effort Scaling framework provides effective tools for diagnosing and improving agent collaborative capabilities
- Limited Experimental Scope: Experiments conducted only in a single domain (travel planning), potentially missing broader collaborative dynamics
- Simulated Users: Use of simulated rather than real human participants may not fully reflect authentic interaction patterns
- Simplified Metrics: Use of simplified utility and effort proxy metrics; real collaboration is more complex
- Richer Simulation Environments: Construct scenarios where users possess private information or domain expertise
- Adaptive Collaboration Frameworks: Dynamically adjust collaborative strategies based on model capabilities
- Multimodal Collaboration: Extend to collaborative scenarios incorporating vision, speech, and other modalities
- Accurate Problem Identification: Precisely identifies core deficiencies in current agent evaluation
- Reasonable Framework Design: Collaborative Effort Scaling framework is conceptually clear and operationally practical
- Sufficient Empirical Research: Combines case studies and simulated experiments, providing multi-perspective validation
- High Practical Value: Provides concrete design guidance for agent developers
- Evaluation Limitations: Simulated environments and proxy metrics may not fully capture real collaboration complexity
- Limited Model Coverage: Relatively limited number of tested models; generalizability of conclusions requires verification
- Unknown Long-Term Effects: Lacks research on long-term collaborative relationships and learning effects
- Academic Contribution: Provides new theoretical framework and evaluation methods for human-AI collaboration research
- Practical Value: Offers important guidance for agent product development
- Research Direction: May catalyze more research focusing on collaboration quality rather than pure task completion
- Knowledge Work: Fields requiring iterative exploration such as data analysis, research, and consulting
- Education and Training: Learning scenarios requiring progressive understanding construction
- Creative Work: Tasks requiring joint human-AI creation and refinement
This paper cites extensive related work, including:
- Human-AI collaboration design principles (Amershi et al., 2019)
- Agent evaluation benchmarks (Jimenez et al., 2023; Zhou et al., 2023)
- Interactive evaluation methods (Lee et al., 2023; Shao et al., 2024)
- Scaling law research (Hoffmann et al., 2022; Kaplan et al., 2020)
Summary: This paper addresses an important and timely research question, providing a systematic framework for evaluating and improving agent collaborative capabilities. Despite certain limitations in experimental setup, its theoretical contributions and practical value make it significant work in the human-AI collaboration domain. As agent technology rapidly advances, this research direction emphasizing collaboration quality over pure task completion will become increasingly important.