Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
Yang, Zhang, Wang et al.
We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.
Title: Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
Authors: Pei Yang (Gradient), Ke Zhang (Waseda University), Ji Wang (Columbia University), Xiao Chen (Hong Kong Polytechnic University), Yuxin Tang (Rice University & Gradient Network), Eric Yang, Lynn Ai, Bill Shi (Gradient)
Classification: cs.AI
Publication Date: November 20, 2025 (arXiv preprint, under review)
This paper proposes CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of expert evaluators to enhance the robustness and interpretability of RLHF (Reinforcement Learning from Human Feedback). Traditional reward models struggle to simultaneously optimize multiple, potentially conflicting preference dimensions (such as factuality, helpfulness, and safety) and offer limited transparency into the rationale behind a score. CRM addresses these issues by decomposing preference assessment into domain-specific agents, each producing partial signals, complemented by global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing step-wise correctness, multi-agent agreement, and repetition penalties, and produces a single training reward compatible with standard RL pipelines. The paper also introduces the RewardBench benchmark and training suite, providing a practical pathway toward modular, interpretable reward modeling.
The alignment of Large Language Models (LLMs) typically relies on RLHF, in which a learned reward model guides the policy toward preferred behaviors. However, traditional single-scalar reward models face several critical issues:
Multi-dimensional preference trade-offs: Human preferences are inherently multidimensional, encompassing factors such as factual accuracy, coherence, helpfulness, and safety. A single scalar reward cannot easily capture trade-offs between these sometimes competing criteria.
Insufficient interpretability: Traditional reward models provide limited insight into their scoring, making it difficult to understand why a particular output received a high or low score.
Reward hacking risk: Opacity makes error diagnosis difficult and increases the risk of policies exploiting reward function vulnerabilities (producing high-scoring outputs that diverge from true intent).
As LLMs are increasingly deployed in critical applications, ensuring that model behavior is reliable, safe, and interpretable becomes paramount. The reward model, as a core component of the alignment pipeline, directly affects the performance and trustworthiness of the final model.
Ensemble methods: Some research explores ensemble-based reward models to mitigate overoptimization, but these ensembles do not decompose the assessment into structured dimensions.
Multi-objective formulations: Existing work decomposes feedback into interpretable dimensions and re-aggregates them through learned mixtures, but lacks real-time multi-perspective feedback mechanisms.
Self-reflection methods: Approaches like Critique-out-Loud output both a score and a critique to improve interpretability, but do not integrate expert agents into reward modeling.
The core motivation is to recast reward modeling, moving from a single black-box oracle to an adaptive, interpretable, and scalable multi-agent evaluation ecosystem that achieves more transparent and robust reward shaping through coordinated, distributed evaluators.
Novel Paradigm: Proposes an innovative multi-agent collaborative assessment paradigm extending RLHF, improving interpretability and robustness compared to single black-box reward models.
Structured Collaborative Mechanism: Designs MARM (Multi-Agent Reward Model; the abstract refers to the same framework as CRM) with structured collaborative reward mechanisms, including expert evaluators and a centralized aggregator that fuses multi-dimensional interpretable signals into a single reward usable by standard policy gradient methods.
RewardBench Benchmark: Releases a benchmark and training suite organized around multi-agent preferences, providing a common platform for research into modular, interpretable reward modeling.
Significant Performance Gains: Achieves substantial improvements on complex reasoning tasks, with higher accuracy and stability than single-reward-model baselines while maintaining fluency and safety, demonstrating the effectiveness of multi-perspective reward shaping.
Given a large-scale policy model πθ and a prompt x drawn from a set of prompts, the model generates a structured output o = πθ(x) containing a multi-step reasoning trajectory and a final answer. The objective is to learn across a multi-dimensional evaluation space rather than to optimize a fixed scalar reward.
CRM recasts post-training as a distributed, feedback-driven optimization process, introducing a team of expert agents that collaboratively assess LLM outputs from complementary perspectives:
Four Core Agents:
Data Optimizer: Quantifies output efficiency and diversity, penalizing redundant reasoning trajectories while encouraging balanced exploration.
Quality Assessor: Provides fine-grained judgments, evaluating reasoning accuracy, factual consistency, and logical coherence of intermediate steps.
Data Synthesizer: Enhances supervision through synthetic perturbations and external knowledge integration, improving robustness and domain generalization.
Data Analyzer: Continuously monitors statistical trends in reward signals, enforcing stability and preventing collapse or pattern drift.
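The fusion step described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the `Evaluator` signature, the `Aggregator` class, the exact-match `repetition_penalty`, the max-minus-min `agreement` measure, and the default weights are all hypothetical. It only shows how per-agent partial scores, an agreement term, and a repetition penalty could be combined into one scalar reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluator signature: reasoning steps so far -> score in [0, 1].
Evaluator = Callable[[List[str]], float]


def repetition_penalty(steps: List[str]) -> float:
    """Fraction of steps that exactly repeat an earlier step (0.0 = none)."""
    if not steps:
        return 0.0
    seen = set()
    repeats = 0
    for step in steps:
        key = step.strip().lower()
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(steps)


def agreement(scores: List[float]) -> float:
    """One minus the spread of agent scores; 1.0 means all agents agree."""
    return 1.0 - (max(scores) - min(scores))


@dataclass
class Aggregator:
    """Illustrative centralized aggregator (weights are assumed, not from the paper)."""
    agents: Dict[str, Evaluator]
    weights: Dict[str, float]
    agreement_weight: float = 0.2
    repetition_weight: float = 0.5

    def reward(self, steps: List[str]) -> Tuple[float, Dict[str, float]]:
        if not self.agents:
            raise ValueError("at least one agent is required")
        # Each agent contributes an interpretable partial signal.
        scores = {name: fn(steps) for name, fn in self.agents.items()}
        total_weight = sum(self.weights[name] for name in scores)
        fused = sum(self.weights[n] * s for n, s in scores.items()) / total_weight
        # Reward agreement between agents, penalize repeated steps.
        total = (
            fused
            + self.agreement_weight * agreement(list(scores.values()))
            - self.repetition_weight * repetition_penalty(steps)
        )
        return total, scores
```

Returning the per-agent scores alongside the fused reward keeps the breakdown available for inspection, which is the interpretability benefit the framework emphasizes; a new evaluator could be added by registering another entry in `agents` and `weights`.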
Distributed Evaluation Architecture: Described as the first approach to systematically recast reward modeling as a multi-agent collaborative process, with each agent focusing on a specific evaluation dimension.
Interpretability Enhancement: Each agent's score represents a human-understandable assessment (e.g., factual accuracy); together, the scores form a multi-dimensional picture of output quality.
Modular Design: Allows new evaluators to be introduced as plugin agents, providing a scalable pathway toward self-regulating and interpretable reward alignment.
No Additional Annotation Required: Multi-perspective reward shaping requires no additional human annotation beyond what is used to train the evaluators.
Standard Compatibility: Produces a single training reward fully compatible with standard RL pipelines (e.g., GRPO, PPO).
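Because the aggregated reward is a single scalar per timestep, it can feed a standard advantage estimator. The sketch below shows generalized advantage estimation (GAE), which the abstract names as an example of the advantage-based update. The function name `gae`, the default `gamma` and `lam` values, and the use of advantage plus value as the value-model regression target are assumptions chosen for illustration; the summary does not specify these details.

```python
from typing import List, Tuple


def gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Generalized advantage estimation over one trajectory.

    rewards: aggregated reward at each timestep (length T).
    values:  value estimates for each state plus a bootstrap value (length T + 1).
    Returns (advantages, returns), each of length T.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    # A common (assumed) regression target for the value model.
    returns = [adv + val for adv, val in zip(advantages, values[:-1])]
    return advantages, returns
```

In this setup the `rewards` list would hold the aggregator's fused output for each step, so the policy update itself needs no knowledge of how many evaluators contributed.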
Effectiveness of Multi-perspective Assessment: The collaborative framework achieves significant improvements in reasoning robustness and mathematical precision without compromising dialogue quality.
Modular Advantages: Introducing additional agents yields incremental improvements, validating the value of decomposed assessment.
Stability Maintenance: Performance remains relatively stable on general dialogue tasks (Chat, Chat Hard), indicating that the reward fusion mechanism effectively balances multi-dimensional objectives.
Generalization Capability: Introduction of the Data Synthesizer significantly improves model performance on tasks requiring compositional reasoning.
Paradigm Shift: MARM successfully redefines reward modeling as a multi-agent assessment process rather than a single black-box oracle.
Performance Validation: Comprehensive experiments on RewardBench, Math, and GSM8K demonstrate that multi-agent collaboration significantly enhances reasoning accuracy, mathematical precision, and overall stability without compromising dialogue quality.
Modular Advantages: Introduction of roles such as Quality Assessor and Data Synthesizer further improves consistency and generalization, highlighting the benefits of domain-specific decomposition and coordinated feedback in reward modeling.
Practical Value: Provides scalable and modular design supporting integration of new evaluators as plugin agents, compatible with existing RLHF pipelines.
Computational Overhead: Multi-agent evaluation requires more computational resources than a single reward model, because each agent must evaluate outputs independently.
Christiano et al. (2017) - Deep reinforcement learning from human preferences
Ouyang et al. (2022) - InstructGPT: Training language models to follow instructions with human feedback
Reward Modeling:
Coste et al. (2023) - Reward model ensembles help mitigate overoptimization
Wang et al. (2024) - Interpretable preferences via multi-objective reward modeling
Multi-agent Assessment:
Irving et al. (2018) - AI safety via debate
Chan et al. (2023) - ChatEval: Towards better LLM-based evaluators through multi-agent debate
Fine-grained Feedback:
Zheng et al. (2024) - GRPO: Guided reinforcement preference optimization
Ankner et al. (2024) - Critique-out-loud reward models
Overall Assessment: This paper proposes an innovative and practical multi-agent collaborative reward modeling framework that makes important contributions to improving interpretability and reasoning capability in RLHF. Despite limitations such as a limited experimental scale and insufficient implementation details, its core ideas have significant academic value and promising applications. We hope that the authors will provide more implementation details, broaden the experimental scope, and open-source the relevant code and models in subsequent work to support the community.