2025-11-21T13:37:16.010816

Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

Yang, Zhang, Wang et al.
We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.

Basic Information

  • Paper ID: 2511.16202
  • Title: Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
  • Authors: Pei Yang (Gradient), Ke Zhang (Waseda University), Ji Wang (Columbia University), Xiao Chen (Hong Kong Polytechnic University), Yuxin Tang (Rice University & Gradient Network), Eric Yang, Lynn Ai, Bill Shi (Gradient)
  • Classification: cs.AI
  • Publication Date: November 20, 2025 (arXiv preprint, under review)
  • Paper Link: https://arxiv.org/abs/2511.16202

Abstract

This paper proposes CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve the robustness and interpretability of RLHF (Reinforcement Learning from Human Feedback). Traditional reward models struggle to optimize multiple, potentially conflicting preference dimensions at once (such as factuality, helpfulness, and safety) and offer little transparency into why a score was assigned. CRM addresses these issues by decomposing preference assessment into domain-specific agents, each producing partial signals, complemented by global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing step-wise correctness, multi-agent agreement, and repetition penalties, and produces a single training reward compatible with standard RL pipelines. The paper also introduces the RewardBench benchmark and training suite, providing a practical pathway for modular, interpretable reward modeling.

Research Background and Motivation

1. Core Problem

The alignment of Large Language Models (LLMs) typically relies on RLHF, where a learned reward model guides the policy toward preferred behaviors. However, traditional single-scalar reward models face several critical issues:

  • Multi-dimensional preference trade-offs: Human preferences are inherently multidimensional, encompassing factors such as factual accuracy, coherence, helpfulness, and safety. A single scalar reward cannot easily capture trade-offs between these sometimes competing criteria.
  • Insufficient interpretability: Traditional reward models provide limited insights, making it difficult to understand why a particular output received a high or low score.
  • Reward hacking risk: Opacity makes error diagnosis difficult and increases the risk of policies exploiting reward function vulnerabilities (producing high-scoring outputs that diverge from true intent).

2. Problem Significance

As LLMs are increasingly deployed in critical applications, ensuring that model behavior is reliable, safe, and interpretable becomes paramount. The reward model, as a core component of the alignment pipeline, directly affects the performance and trustworthiness of the final model.

3. Limitations of Existing Approaches

  • Ensemble methods: While some research explores ensemble-based reward models to mitigate overoptimization, they lack structured assessment decomposition.
  • Multi-objective formulations: Existing work decomposes feedback into interpretable dimensions and re-aggregates through learned mixtures, but lacks real-time multi-perspective feedback mechanisms.
  • Self-reflection methods: Approaches like Critique-out-Loud output scores and critiques to improve interpretability, but do not integrate expert agents into reward modeling.

4. Research Motivation

The core motivation is to redefine reward modeling from a single black-box oracle into an adaptive, interpretable, and scalable multi-agent evaluation ecosystem, achieving more transparent and robust reward shaping through coordinated distributed evaluators.

Core Contributions

  1. Novel Paradigm: Proposes an innovative multi-agent collaborative assessment paradigm extending RLHF, improving interpretability and robustness compared to single black-box reward models.
  2. Structured Collaborative Mechanism: Designs MARM (Multi-Agent Reward Model) with structured collaborative reward mechanisms, including expert evaluators and a centralized aggregator that fuses multi-dimensional interpretable signals into a single reward usable by standard policy gradient methods.
  3. RewardBench Benchmark: Releases a benchmark and training suite organized around multi-agent preferences, providing a common platform for research into modular, interpretable reward modeling.
  4. Significant Performance Gains: Achieves substantial improvements on complex reasoning tasks with higher accuracy and stability compared to single RM baselines while maintaining fluency and safety, demonstrating the effectiveness of multi-perspective reward shaping.

Methodology Details

Task Definition

Given a large-scale policy model πθ and a set of prompts x, the model generates structured outputs o = πθ(x) containing multi-step reasoning trajectories and final answers. The objective is to learn across a multi-dimensional evaluation space rather than optimizing a fixed scalar reward.

The formalized objective is:

max_θ E_{x~D}[F(αR_ranker(o) + βR_similarity(o) + Σ_{i=1}^K λ_i R_i(o))]

Where:

  • F(·) is the central aggregator converting heterogeneous signals into scalar rewards
  • {α, β, λ_i} are adaptive weights learned or adjusted during training
  • A = {a1, a2, ..., aK} is the set of agents, each agent ai outputting scores Ri(o) for specific evaluation dimensions
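
The objective above can be sketched as a small function. This is a minimal illustration, not the paper's implementation: the name `aggregate_reward`, the agent names, the weight values, and the choice of `tanh` as the aggregator F are all assumptions, since the summary does not specify F.

```python
import math
from typing import Callable, Mapping


def aggregate_reward(
    r_ranker: float,
    r_similarity: float,
    agent_scores: Mapping[str, float],
    alpha: float,
    beta: float,
    lambdas: Mapping[str, float],
    fuse: Callable[[float], float] = math.tanh,  # assumed stand-in for F
) -> float:
    """Compute F(alpha*R_ranker + beta*R_similarity + sum_i lambda_i*R_i)."""
    missing = set(agent_scores) - set(lambdas)
    if missing:
        raise KeyError(f"no weight for agents: {sorted(missing)}")
    weighted = alpha * r_ranker + beta * r_similarity
    weighted += sum(lambdas[name] * score for name, score in agent_scores.items())
    return fuse(weighted)


# Hypothetical scores and weights for two agents.
scores = {"quality_assessor": 0.8, "data_optimizer": 0.5}
weights = {"quality_assessor": 0.5, "data_optimizer": 0.2}
reward = aggregate_reward(0.6, 0.7, scores, alpha=0.2, beta=0.1, lambdas=weights)
```

Passing `fuse=lambda v: v` reduces the sketch to a plain weighted sum, which makes the contribution of each evaluator easy to inspect.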

Model Architecture

1. Collaborative Reward Modeling (CRM)

CRM reframes post-training as a distributed, feedback-driven optimization process in which a team of expert agents collaboratively assesses LLM outputs from complementary perspectives:

Four Core Agents:

  • Data Optimizer: Quantifies output efficiency and diversity, penalizing redundant reasoning trajectories while encouraging balanced exploration.
  • Quality Assessor: Provides fine-grained judgments, evaluating reasoning accuracy, factual consistency, and logical coherence of intermediate steps.
  • Data Synthesizer: Enhances supervision through synthetic perturbations and external knowledge integration, improving robustness and domain generalization.
  • Data Analyzer: Continuously monitors statistical trends in reward signals, enforcing stability and preventing collapse or pattern drift.

2. Reward Function Design

Step-level Rewards:

  • Outcome Reward: Verifies whether partial reasoning aligns with intermediate expectations.
  • Enhanced Data Reward: Leverages augmented or counterfactual samples generated by the Data Synthesizer for stronger supervision.

Model-level Rewards: Computes the cosine similarity between predicted and reference embeddings produced by the all-MiniLM-L6-v2 encoder:

R_sim = cos(h_pred, h_ref)
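
As a rough illustration of R_sim, the cosine can be computed directly from two embedding vectors. The sketch below uses toy 3-dimensional vectors instead of real all-MiniLM-L6-v2 embeddings, and the function name and the zero-vector convention are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(h_pred: Sequence[float], h_ref: Sequence[float]) -> float:
    """Return cos(h_pred, h_ref) for two equal-length embedding vectors."""
    if len(h_pred) != len(h_ref):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(p * r for p, r in zip(h_pred, h_ref))
    norm_pred = math.sqrt(sum(p * p for p in h_pred))
    norm_ref = math.sqrt(sum(r * r for r in h_ref))
    if norm_pred == 0.0 or norm_ref == 0.0:
        return 0.0  # assumption: a zero vector carries no similarity signal
    return dot / (norm_pred * norm_ref)


# Toy vectors standing in for sentence embeddings.
r_sim_parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # same direction
r_sim_orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated
```

Because cosine similarity ignores vector length, a prediction whose embedding points in the same direction as the reference scores close to 1 regardless of magnitude.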

Multi-dimensional Assessment Components:

  • Accuracy Reward (R_acc): Verifies mathematical equivalence through symbolic comparison (using latex2sympy2, math_verify).
  • Format Reward (R_fmt): Enforces adherence to the reasoning format defined by <think> and <answer> tags.
  • Reasoning Step Reward (R_step): Encourages organized, interpretable multi-step explanations.
  • Cosine Scaling Reward (R_cs): Modulates accuracy reward by completion length to prevent verbosity.
  • Repetition Penalty (R_rep): Penalizes n-gram redundancy and degenerative loops detected by the Data Analyzer.
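
Three of the components listed above lend themselves to a short sketch. The code below is illustrative only: the function names, the exact regular expression, the trigram default, and the cosine schedule with a floor of 0.5 are assumptions, since the summary does not give the paper's implementations.

```python
import math
import re

# Assumed layout: one <think> block followed by one <answer> block.
_FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)


def format_reward(completion: str) -> float:
    """R_fmt: 1.0 if the completion follows the tag layout, else 0.0."""
    return 1.0 if _FORMAT_RE.match(completion.strip()) else 0.0


def repetition_penalty(completion: str, n: int = 3) -> float:
    """R_rep: fraction of word n-grams that repeat an earlier n-gram."""
    words = completion.lower().split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def cosine_scaled_reward(r_acc: float, length: int, max_len: int,
                         floor: float = 0.5) -> float:
    """R_cs: shrink the accuracy reward along a cosine curve as length grows."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    progress = min(length, max_len) / max_len
    scale = floor + 0.5 * (1.0 - floor) * (1.0 + math.cos(math.pi * progress))
    return r_acc * scale


well_formed = format_reward("<think>2 + 2 = 4</think>\n<answer>4</answer>")
looping = repetition_penalty("go go go go go")
short_bonus = cosine_scaled_reward(1.0, length=0, max_len=100)
long_discount = cosine_scaled_reward(1.0, length=100, max_len=100)
```

In this sketch a correct answer keeps its full reward when it is very short and is discounted toward `floor` as it approaches `max_len`, which is one simple way to discourage verbosity.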

Collaborative Weight Mechanism:

R_collab = αR_acc + βR_sim + γR_fmt + δR_step - ηR_rep

Where coefficients (α, β, γ, δ, η) are empirically adjusted to balance factual correctness, reasoning clarity, and linguistic fluency.
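
A worked instance of the weighted sum may help. The coefficient values below are placeholders chosen for illustration, not values reported in the paper, and `CollabWeights` and `collaborative_reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollabWeights:
    # Placeholder coefficients; the paper tunes these empirically.
    alpha: float = 1.0   # accuracy
    beta: float = 0.5    # embedding similarity
    gamma: float = 0.25  # format
    delta: float = 0.25  # reasoning steps
    eta: float = 0.5     # repetition (subtracted)


def collaborative_reward(r_acc: float, r_sim: float, r_fmt: float,
                         r_step: float, r_rep: float,
                         w: CollabWeights = CollabWeights()) -> float:
    """R_collab = alpha*R_acc + beta*R_sim + gamma*R_fmt + delta*R_step - eta*R_rep."""
    return (w.alpha * r_acc + w.beta * r_sim + w.gamma * r_fmt
            + w.delta * r_step - w.eta * r_rep)


# Same answer quality, with and without repeated n-grams.
clean = collaborative_reward(1.0, 0.8, 1.0, 1.0, r_rep=0.0)
repetitive = collaborative_reward(1.0, 0.8, 1.0, 1.0, r_rep=0.6)
```

With these placeholder weights the repetitive completion scores 0.3 lower than the clean one, so the penalty term is the only thing separating two otherwise identical outputs.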

3. Reward Aggregation and Policy Update

Centralized Aggregation:

r_t = F(R_collab(o_t), R_enhanced(o_t))

Where F is a nonlinear fusion operator balancing reasoning format, accuracy, and repetition penalties.

Policy Optimization: Updates the policy model with an advantage-weighted policy-gradient loss, where the advantage Â_t is estimated with Generalized Advantage Estimation (GAE):

L_policy = -E_t[Â_t log π_θ(a_t|s_t)]

Value Model Optimization: Optimizes through regression on centralized rewards:

L_value = E_t[(V_φ(s_t) - r_t)²]

Where Â_t is the GAE advantage estimate and V_φ is the value model.
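
The two losses can be made concrete with a small pure-Python sketch of standard GAE followed by the policy and value objectives as written above. The function names, the toy trajectory, and the choice gamma = lambda = 1 are assumptions for illustration; a real pipeline would operate on batched tensors with automatic differentiation.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE; `values` holds V(s_0..s_T), the last entry being the bootstrap."""
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def policy_loss(advantages: Sequence[float], log_probs: Sequence[float]) -> float:
    """L_policy = -mean_t(A_t * log pi(a_t | s_t))."""
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(advantages)


def value_loss(values: Sequence[float], rewards: Sequence[float]) -> float:
    """L_value = mean_t((V(s_t) - r_t)^2), regressing on the aggregated reward."""
    return sum((v - r) ** 2 for v, r in zip(values, rewards)) / len(rewards)


# Toy two-step trajectory with aggregated rewards r_t.
rewards = [1.0, 0.0]
values = [0.5, 0.5, 0.0]  # V(s_0), V(s_1), bootstrap V(s_2)
adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
l_policy = policy_loss(adv, log_probs=[-0.1, -0.2])
l_value = value_loss(values[:-1], rewards)
```

With gamma = lambda = 1 each advantage reduces to the remaining return minus the value estimate, which is why the first step gets a positive advantage (0.5) and the second a negative one (-0.5).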

Technical Innovations

  1. Distributed Evaluation Architecture: First to systematically transform reward modeling into a multi-agent collaborative process, with each agent focusing on specific evaluation dimensions.
  2. Interpretability Enhancement: Each agent's score represents human-understandable assessments (e.g., factual accuracy), collectively forming a multi-dimensional picture of output quality.
  3. Modular Design: Allows new evaluators to be introduced as plugin agents, providing a scalable pathway toward self-regulating and interpretable reward alignment.
  4. No Additional Annotation Required: Multi-perspective reward shaping requires no additional human annotation beyond what is used to train the evaluators.
  5. Standard Compatibility: Produces a single training reward fully compatible with standard RL pipelines (e.g., GRPO, PPO).

Experimental Setup

Datasets

Primary Datasets:

  1. RewardBench: Benchmark organized around multi-agent preferences, including multiple evaluation dimensions:
    • Chat: Dialogue quality
    • Chat Hard: Difficult dialogue scenarios
    • Safety: Safety assessment
    • Reasoning: Reasoning capability
  2. GSM8K: Mathematical reasoning dataset
  3. Math: Mathematical problem-solving dataset
  4. AI-MO/NuminaMath-TIR:
    • Training set: 3,800 samples
    • Test set: 99 samples

Evaluation Metrics

  • Accuracy: Correctness rate across task categories
  • Reasoning Quality: Logical coherence and step completeness
  • Dialogue Quality: Fluency and helpfulness
  • Safety: Safety score of outputs

Baseline Methods

Baseline Model: Qwen2.5-0.5B-Instruct (approximately 494M parameters)

Experimental Configurations:

  • Two agents: Data Analyzer + Data Optimizer
  • Three agents: Data Analyzer + Data Optimizer + Quality Assessor
  • Four agents: Data Analyzer + Data Optimizer + Quality Assessor + Data Synthesizer

Variants:

  • MARM: Base collaborative model
  • MARM(rerank): Version with reranking
  • MARM(emb): Embedding-based version

Implementation Details

  • Optimization Framework: GRPO (Group Relative Policy Optimization)
  • Base Model: Qwen/Qwen2.5-0.5B-Instruct (494M parameters)
  • Prompt Format: Structured prompts with reasoning process within <think>...</think> tags and final answer within <answer>...</answer> tags
  • Embedding Model: all-MiniLM-L6-v2 for semantic similarity computation

Experimental Results

Main Results

Table 1: MARM Results on RewardBench, Math, and GSM8K

Two-Agent Configuration (Data Analyzer + Data Optimizer)

Method            Chat    Chat Hard  Safety  Reasoning  Math    GSM8K
Qwen2.5-0.5B-ins  0.193   0.561      0.561   0.598      0.139   0.08%
MARM              0.190   0.557      0.553   0.659      0.149   19.64%
MARM(rerank)      0.182   0.545      0.566   0.423      0.136   22.16%
MARM(emb)         0.198   0.561      0.536   0.567      0.131   22.33%

Key Findings:

  • GSM8K accuracy improved from 0.08% to 22.33%, approximately 279-fold improvement
  • Reasoning dimension improved from 0.598 to 0.659 (MARM base version)

Three-Agent Configuration (+ Quality Assessor)

Method            Chat    Chat Hard  Safety  Reasoning  Math    GSM8K
MARM(rerank)      0.190   0.567      0.538   0.398      0.143   22.87%
MARM(emb)         0.199   0.532      0.570   0.637      0.141   23.15%

Key Findings:

  • Addition of Quality Assessor further improves GSM8K to 23.15%
  • Reasoning improves for MARM(emb) (0.567 → 0.637) but declines for MARM(rerank) (0.423 → 0.398)

Four-Agent Configuration (+ Data Synthesizer)

Method            Chat    Chat Hard  Safety  Reasoning  Math    GSM8K
MARM(rerank)      0.182   0.568      0.527   0.610      0.192   29.87%
MARM(emb)         0.179   0.557      0.573   0.578      0.152   27.60%

Best Performance:

  • GSM8K accuracy reaches 29.87% (MARM(rerank)), approximately 373-fold improvement over the 0.08% baseline
  • Math dimension reaches 0.192, significantly outperforming other configurations
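
The fold figures quoted for GSM8K follow from dividing each accuracy by the 0.08% baseline. The short check below reproduces that arithmetic from the table values and also reports the absolute gain in percentage points, which is a useful companion number when the baseline is close to zero. The dictionary keys are illustrative labels.

```python
# GSM8K accuracies in percent, taken from the results tables.
baseline = 0.08
results = {
    "two_agent_emb": 22.33,
    "three_agent_emb": 23.15,
    "four_agent_rerank": 29.87,
}

# Relative improvement (fold) and absolute improvement (percentage points).
fold = {name: acc / baseline for name, acc in results.items()}
points = {name: acc - baseline for name, acc in results.items()}

for name in results:
    print(f"{name}: {fold[name]:.1f}x, +{points[name]:.2f} points")
```

The ratios come out at roughly 279, 289, and 373, while the absolute gains are about 22 to 30 percentage points.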

Ablation Studies

Impact of Agent Count:

  1. Two agents → Three agents:
    • Significant improvement in reasoning accuracy
    • RewardBench(rerank) improves from 0.639 to 0.689
    • Quality Assessor introduces fine-grained evaluation feedback, better capturing structural coherence and step-wise logical soundness
  2. Three agents → Four agents:
    • Further improvement on combined reasoning and factual synthesis tasks
    • Data Synthesizer enhances model generalization by mitigating local overfitting
    • Improves semantic completeness of intermediate reasoning chains

Impact of Aggregation Strategy:

  • Reranking Method: Consistently outperforms other variants on high-precision reasoning tasks; explicit preference modeling and pairwise ranking contribute more discriminative reward shaping
  • Embedding Method: Demonstrates better stability and scalability in complex multi-agent coordination

Case Analysis

The paper demonstrates model behavior through structured prompts:

  • Reasoning Process: Step-by-step reasoning displayed within <think> tags enables reward models to assess reasoning quality
  • Final Answer: Final result provided within <answer> tags facilitates correctness verification

This structured output enables individual agents to separately evaluate different aspects of the reasoning chain.
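
One way to picture this separation is a small parser that splits a completion into its reasoning and its answer, so that step-level and outcome-level evaluators each receive only the part they score. The names `ParsedOutput`, `parse_output`, and `split_steps`, and the one-step-per-line convention, are assumptions for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedOutput(NamedTuple):
    reasoning: Optional[str]  # text inside <think>...</think>, if present
    answer: Optional[str]     # text inside <answer>...</answer>, if present


def parse_output(completion: str) -> ParsedOutput:
    """Extract the first <think> and <answer> blocks from a completion."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return ParsedOutput(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


def split_steps(reasoning: str) -> List[str]:
    """Assumed convention: each non-empty line is one reasoning step."""
    return [line.strip() for line in reasoning.splitlines() if line.strip()]


completion = (
    "<think>\nTom has 3 apples.\nHe buys 2 more.\n3 + 2 = 5\n</think>\n"
    "<answer>5</answer>"
)
parsed = parse_output(completion)
steps = split_steps(parsed.reasoning or "")
```

Here `steps` would go to a step-level evaluator such as the Quality Assessor, while `parsed.answer` would go to the accuracy check.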

Experimental Findings

  1. Effectiveness of Multi-perspective Assessment: The collaborative framework achieves significant improvements in reasoning robustness and mathematical precision without compromising dialogue quality.
  2. Modular Advantages: Introduction of different agents brings incremental improvements, validating the value of decomposed assessment.
  3. Stability Maintenance: Performance remains relatively stable on general dialogue tasks (Chat, Chat Hard), indicating that the reward fusion mechanism effectively balances multi-dimensional objectives.
  4. Generalization Capability: Introduction of the Data Synthesizer significantly improves model performance on tasks requiring compositional reasoning.

Related Work

1. Reward Modeling and RLHF

  • Classical Methods: InstructGPT, GPT-4, etc., use scalar reward models but offer limited transparency
  • Ensemble Methods: Mitigate overoptimization through reward model ensembles
  • Multi-objective Methods: Decompose feedback into interpretable dimensions (helpfulness, honesty, verbosity)
  • Self-reflection Methods: Critique-out-Loud outputs scores and critiques to improve interpretability

2. Multi-Agent and Structured Assessment

  • AI Safety via Debate: Pioneering work introducing mechanisms where two models debate and a third party evaluates
  • RLAIF-style Settings: Agents simulate reviewers or judges from different perspectives
  • ChatEval: Aggregates multiple LLMs as a judge panel for debate and voting

CRM Distinctions:

  • Uses agents not only during evaluation but integrates them into reward modeling
  • Expert agents serve as real-time contributors to reward signals during training
  • Provides structure-aware multi-perspective feedback

3. Fine-grained Feedback Techniques

  • GRPO: Guided Reinforcement Preference Optimization
  • SPIN: Reinforcement Learning from Structured Feedback
  • RAFT: Reward Alignment with Feedback Trees

CRM complements these techniques, focusing on multi-agent collaborative reward decomposition.

Conclusions and Discussion

Main Conclusions

  1. Paradigm Shift: MARM successfully redefines reward modeling as a multi-agent assessment process rather than a single black-box oracle.
  2. Performance Validation: Comprehensive experiments on RewardBench, Math, and GSM8K demonstrate that multi-agent collaboration significantly enhances reasoning accuracy, mathematical precision, and overall stability without compromising dialogue quality.
  3. Modular Advantages: Introduction of roles such as Quality Assessor and Data Synthesizer further improves consistency and generalization, highlighting the benefits of domain-specific decomposition and coordinated feedback in reward modeling.
  4. Practical Value: Provides scalable and modular design supporting integration of new evaluators as plugin agents, compatible with existing RLHF pipelines.

Limitations

  1. Computational Overhead: Multi-agent evaluation requires more computational resources compared to single reward models, with each agent requiring independent assessment.
  2. Weight Tuning: Collaborative weight coefficients (α, β, γ, δ, η) require empirical adjustment, lacking automatic optimization mechanisms.
  3. Agent Design: The paper lacks detailed specification of how to train individual expert agents and ensure their assessment quality.
  4. Scale Validation: Experiments primarily conducted on smaller models (494M parameters); performance on large-scale models remains unknown.
  5. Dialogue Quality Trade-off: While claiming to maintain dialogue quality, table data shows slight performance decline in Chat and Chat Hard dimensions.

Future Directions

  1. Automatic Weight Learning: Develop adaptive mechanisms to automatically learn and adjust collaborative weights.
  2. Agent Training Methods: Systematize training procedures for expert agents and quality assurance mechanisms.
  3. Large-scale Validation: Verify framework effectiveness and scalability on larger-scale models.
  4. Dynamic Agent Selection: Dynamically select and combine relevant agents based on task type.
  5. Cross-domain Generalization: Extend to more domains and task types.

In-Depth Evaluation

Strengths

  1. Strong Innovation:
    • First to systematically transform reward modeling into a multi-agent collaborative process
    • Proposed distributed evaluation architecture is original
    • Advanced modular design philosophy
  2. Interpretability Breakthrough:
    • Each agent provides human-understandable evaluation dimensions
    • Significantly improves transparency compared to black-box reward models
    • Facilitates diagnosis and debugging of model behavior
  3. Comprehensive Experimental Validation:
    • Systematic evaluation across multiple benchmarks
    • Ablation studies with multiple agent configurations
    • Impressive improvements on GSM8K (roughly 279- to 373-fold over a 0.08% baseline)
  4. High Practical Value:
    • Compatible with standard RL pipelines
    • RewardBench benchmark promotes subsequent research
    • Modular design facilitates extension and customization
  5. Solid Theoretical Foundation:
    • Clear problem definition
    • Rigorous mathematical formalization
    • Well-motivated method design

Weaknesses

  1. Insufficient Method Details:
    • Specific training methods for expert agents not detailed
    • Weight coefficient tuning process lacks detailed description
    • Aggregation function F(·) implementation unclear
  2. Experimental Limitations:
    • Validation only on small models (494M parameters)
    • Lacks comparison with more SOTA methods
    • No statistical significance testing reported
    • Dialogue quality decline not deeply analyzed
  3. Missing Efficiency Analysis:
    • Training time and inference speed not reported
    • Computational overhead of multi-agent evaluation not quantified
    • Efficiency-performance trade-off analysis absent
  4. Reproducibility Issues:
    • Hyperparameter settings insufficiently detailed
    • Agent implementation details lacking
    • No declaration regarding code and model open-sourcing
  5. Insufficient Generalization Verification:
    • Primarily focused on mathematical reasoning tasks
    • Performance on other domains (code generation, creative writing) unknown
    • Cross-lingual capabilities not evaluated
  6. Lacking Theoretical Analysis:
    • No convergence analysis provided
    • Lacks theoretical explanation for why multi-agent outperforms single models
    • Relationship between agent count and performance lacks theoretical guidance

Impact

  1. Academic Contribution:
    • Provides new research direction for RLHF field
    • Multi-agent reward modeling may become new paradigm
    • RewardBench benchmark helps standardize evaluation
  2. Practical Value:
    • Improves interpretability of LLM alignment
    • Clear advantages on high-accuracy tasks like mathematical reasoning
    • Modular design facilitates industrial application
  3. Potential Influence:
    • May drive shift from black-box to white-box reward modeling
    • Provides tools for safe AI and trustworthy AI research
    • Inspires more multi-agent collaborative research
  4. Reproducibility:
    • Method description relatively clear
    • Missing implementation details may affect reproducibility
    • Anticipate authors open-sourcing code and models

Applicable Scenarios

Highly Applicable:

  1. Mathematical Reasoning Tasks: Experiments demonstrate significant effectiveness on GSM8K and similar benchmarks
  2. Multi-dimensional Assessment Requirements: Applications requiring simultaneous consideration of accuracy, safety, helpfulness, etc.
  3. High Interpretability Requirements: Domains like finance and healthcare requiring decision explanation
  4. Structured Output Tasks: Problem-solving requiring step-by-step reasoning

Use with Caution:

  1. Dialogue Generation: Experiments show slight dialogue quality decline; requires trade-off consideration
  2. Creative Tasks: Excessive structure may limit creativity
  3. Real-time Applications: Multi-agent evaluation may increase latency
  4. Resource-constrained Scenarios: Significant computational overhead

Requiring Verification:

  1. Large-scale Models: Performance on billion-parameter models unknown
  2. Cross-lingual Scenarios: Applicability to non-English tasks to be verified
  3. Long-form Generation: Effectiveness on long-form writing tasks unclear
  4. Other Modalities: Extensibility to image, audio, and multimodal tasks

References

Key Citations:

  1. RLHF Foundations:
    • Christiano et al. (2017) - Deep reinforcement learning from human preferences
    • Ouyang et al. (2022) - InstructGPT: Training language models to follow instructions with human feedback
  2. Reward Modeling:
    • Coste et al. (2023) - Reward model ensembles help mitigate overoptimization
    • Wang et al. (2024) - Interpretable preferences via multi-objective reward modeling
  3. Multi-agent Assessment:
    • Irving et al. (2018) - AI safety via debate
    • Chan et al. (2023) - ChatEval: Towards better LLM-based evaluators through multi-agent debate
  4. Fine-grained Feedback:
    • Zheng et al. (2024) - GRPO: Guided reinforcement preference optimization
    • Ankner et al. (2024) - Critique-out-loud reward models

Overall Assessment: This paper proposes an innovative and practical multi-agent collaborative reward modeling framework, making important contributions to improving interpretability and reasoning capability in RLHF. Despite limitations such as limited experimental scale and insufficient implementation details, its core ideas possess significant academic value and application prospects. We anticipate that the authors will provide more implementation details, expand experimental scope, and open-source relevant code and models in subsequent work to promote community development.