Why is Your Language Model a Poor Implicit Reward Model?

Razin, Lin, Yao et al.

Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.

Basic Information

Paper ID: 2507.07981
Title: Why is Your Language Model a Poor Implicit Reward Model?
Authors: Noam Razin†, Yong Lin†, Jiarui Yao‡, Sanjeev Arora† (†Princeton University, ‡University of Illinois Urbana-Champaign)
Classification: cs.CL cs.AI cs.LG stat.ML
Publication Date/Venue: arXiv preprint (updated October 16, 2025)
Paper Link: https://arxiv.org/abs/2507.07981v2

Abstract

Reward models are critical components of language model post-training and inference pipelines. Recent research has shown that every language model defines an implicit reward model (IM-RM) without requiring any architectural modifications. However, compared to explicit reward models (EX-RMs), which apply a dedicated linear head to the language model's hidden representations, IM-RMs typically generalize worse, particularly out-of-distribution. This generalization gap is puzzling because EX-RMs and IM-RMs are nearly identical: they can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. The paper investigates the root cause of this gap and finds that IM-RMs rely more heavily on superficial token-level cues. As a result, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution.

Research Background and Motivation

Problem Definition

Reward models play a central role in the modern language model ecosystem, with widespread applications in reinforcement learning training, direct alignment algorithms, rejection sampling, data filtering, and inference-time scaling. Currently, two primary types of reward models exist:

Explicit Reward Models (EX-RM): Apply linear heads to language model hidden representations to compute rewards
Implicit Reward Models (IM-RM): Define rewards implicitly through the log-probabilities of language models

Research Motivation

Although EX-RMs and IM-RMs are nearly identical, prior research has observed that IM-RMs often generalize worse, particularly out-of-distribution. This is puzzling because both can be trained from the same language model using identical data and loss functions, and they differ only in how the reward is computed.

Significance

Understanding the implicit biases of different reward model types is important for:

Selecting appropriate reward model architectures
Improving reward model robustness
Optimizing language model post-training procedures

Core Contributions

Theoretical Analysis: Through learning dynamics analysis, reveals that IM-RM relies more on token-level cues while EX-RM generalizes primarily through hidden representations
Refutation of Intuitive Assumptions: Demonstrates that IM-RM's generalization problems do not stem from generation-verification gaps; learning verification does not require learning generation
Empirical Validation: Verifies through controlled experiments and real-world scenarios that IM-RM performs worse under token-level distribution shifts but may perform comparably or better under domain shifts
Theoretical Guarantees: Proves in simplified settings that IM-RM cannot generalize to unseen tokens while EX-RM can successfully generalize through well-structured hidden representations

Methodology Details

Task Definition

The paper studies the ranking accuracy of reward models on preference data. Given a triple (x, y+, y-) consisting of a prompt x, a preferred response y+, and a rejected response y-, a reward model ranks the pair correctly if r(x, y+) > r(x, y-).
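The ranking criterion above can be sketched in a few lines. This is an illustrative helper, not code from the paper; the function name is hypothetical, and counting ties as half-correct is an assumption made here so that a model assigning equal rewards scores 0.5.

```python
import numpy as np

def ranking_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs with r(x, y+) > r(x, y-).

    Ties are counted as half-correct (an assumption of this sketch).
    """
    chosen = np.asarray(chosen_rewards, dtype=float)
    rejected = np.asarray(rejected_rewards, dtype=float)
    wins = (chosen > rejected).astype(float)
    ties = (chosen == rejected).astype(float)
    return float(np.mean(wins + 0.5 * ties))
```

For example, with chosen rewards [1.0, 0.2, 0.5] and rejected rewards [0.0, 0.9, 0.5], the three pairs contribute 1, 0, and 0.5 respectively.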

Model Architecture

Explicit Reward Model (EX-RM)

r^EX_θ(x,y) = ⟨u, h_{x,y}⟩

where u is the linear head parameter and h_{x,y} is the hidden representation produced by the language model for the prompt-response pair (x,y).

Implicit Reward Model (IM-RM)

r^IM_θ(x,y) = β ln(π_θ(y|x)/π_ref(y|x))

where β is a fixed coefficient and π_ref is a reference distribution (typically the language model at initialization).
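To make the two definitions concrete, here is a minimal sketch of both reward computations. It assumes the hidden representation and the per-token log-probabilities of the response have already been extracted from the language model. The function names and argument layout are hypothetical, and the default β = 0.01 is the value listed under Implementation Details.

```python
import numpy as np

def ex_rm_reward(u, h):
    """EX-RM: inner product of the linear head u with the hidden representation h_{x,y}."""
    return float(np.dot(u, h))

def im_rm_reward(policy_token_logprobs, ref_token_logprobs, beta=0.01):
    """IM-RM: beta * ln(pi_theta(y|x) / pi_ref(y|x)).

    Each argument holds the log-probabilities of the response tokens under
    the respective model; summing them gives the sequence log-probability.
    """
    log_ratio = np.sum(policy_token_logprobs) - np.sum(ref_token_logprobs)
    return float(beta * log_ratio)
```

Both functions read from the same underlying language model. The only difference is whether the reward comes from a linear head over hidden representations or from token log-probabilities.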

Technical Innovations

1. Learning Dynamics Analysis

By analyzing how gradient updates affect reward assignment, the paper discovers:

EX-RM Dynamics:

Δr^EX_θ(x̄,ȳ) = ⟨h_{x̄,ȳ}, h_{x,y+} - h_{x,y-}⟩ · ηg(θ_EX)

IM-RM Dynamics:

Δr^IM_θ(x̄,ȳ) = (∑∑ ρ_{k,l}(y+)⟨h_{x̄,ȳ<k}, h_{x,y+<l}⟩ - ∑∑ ρ_{k,l}(y-)⟨h_{x̄,ȳ<k}, h_{x,y-<l}⟩) · ηg(θ_IM)β²

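Key Finding: The change in an EX-RM's reward for a pair (x̄,ȳ) depends on the responses only through their hidden representations, whereas the change in an IM-RM's reward also depends on the specific tokens in the responses, through coefficients ρ_{k,l} that reflect token overlap patterns.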
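The EX-RM formula can be checked numerically in a toy setting. The sketch below makes several assumptions that this summary does not spell out: hidden representations are fixed random vectors, only the linear head u is updated, η is the learning rate, and g is taken to be the positive factor -ℓ'(margin) of the Bradley-Terry loss ℓ(m) = -ln σ(m). All names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ex_rm_step(u, h_chosen, h_rejected, lr):
    """One gradient-descent step on -ln sigmoid(<u, h+ - h->), updating u only."""
    diff = h_chosen - h_rejected
    g = sigmoid(-np.dot(u, diff))  # g = -l'(margin) > 0 (assumed meaning of g)
    return u + lr * g * diff, g

rng = np.random.default_rng(0)
u = rng.normal(size=8)
h_chosen, h_rejected, h_bar = rng.normal(size=(3, 8))  # stand-ins for hidden representations
lr = 0.1

u_new, g = ex_rm_step(u, h_chosen, h_rejected, lr)
observed = np.dot(u_new - u, h_bar)                         # actual change in r(x̄, ȳ)
predicted = np.dot(h_bar, h_chosen - h_rejected) * lr * g   # formula from the text
```

Under these assumptions `observed` and `predicted` agree up to floating-point rounding, and the sign of the change is determined entirely by how h_bar aligns with h_chosen - h_rejected.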

2. Generalization Gap Theory

Theorem 2: In simplified settings (single-token responses), IM-RM cannot generalize to unseen tokens (accuracy remains at 0.5), while EX-RM can successfully generalize through maximum margin separators in hidden representation space.
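The intuition behind the IM-RM half of this statement can be reproduced in a toy single-token model. This is a sketch under stated assumptions, not the paper's proof or experimental setup: the hidden representation h of the prompt is fixed, only a hypothetical unembedding matrix U is trained, and π_θ(y|x) is the softmax of U h. In this setting the gradient of the reward margin touches only the rows of the two training tokens, so the margin between two unseen tokens never moves away from zero.

```python
import numpy as np

def im_rm_margin(U, U_ref, h, tok_a, tok_b, beta):
    """r(x, a) - r(x, b) for single-token responses a and b."""
    def log_softmax(logits):
        return logits - np.logaddexp.reduce(logits)
    lp, lp_ref = log_softmax(U @ h), log_softmax(U_ref @ h)
    return beta * ((lp[tok_a] - lp_ref[tok_a]) - (lp[tok_b] - lp_ref[tok_b]))

def train_im_rm(U_ref, pairs, beta, lr, steps):
    """Gradient descent on the Bradley-Terry loss, updating only U."""
    U = U_ref.copy()
    for _ in range(steps):
        for h, chosen, rejected in pairs:
            m = im_rm_margin(U, U_ref, h, chosen, rejected, beta)
            g = 1.0 / (1.0 + np.exp(m))       # -l'(m) for l(m) = -ln sigmoid(m)
            U[chosen] += lr * g * beta * h    # the margin gradient touches only
            U[rejected] -= lr * g * beta * h  # the chosen and rejected rows
    return U

rng = np.random.default_rng(0)
vocab, dim, beta = 6, 4, 1.0
U_ref = rng.normal(size=(vocab, dim))
h = rng.normal(size=dim)

# Train on a single preference: token 0 preferred over token 1.
U = train_im_rm(U_ref, [(h, 0, 1)], beta=beta, lr=0.5, steps=50)
seen_margin = im_rm_margin(U, U_ref, h, 0, 1, beta)    # becomes positive
unseen_margin = im_rm_margin(U, U_ref, h, 2, 3, beta)  # stays at zero up to rounding
```

A zero margin on every unseen pair means the model cannot rank them, which corresponds to chance-level accuracy. An EX-RM in the same setting would instead update a head over h and could transfer to unseen tokens whenever their hidden representations are well structured.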

Experimental Setup

Datasets

Controlled Experiments:
- Persona dataset: Agreement/disagreement tasks
- Hamiltonian cycle verification: Synthetic graph theory tasks

Real-world Scenarios:
- UltraFeedback: General dialogue data
- RewardMATH: Mathematical reasoning data
- RewardBench: Multi-domain evaluation benchmark

Evaluation Metrics

Accuracy: Ranking accuracy on preference data
Absolute Reward Margin: Normalized values of |r(x,y+) - r(x,y-)|

Comparison Methods

Explicit Reward Model (EX-RM)
Implicit Reward Model (IM-RM)
Explicit Generative Reward Model (EX-GRM)

Implementation Details

Language Models: Pythia, Gemma-2, Qwen-2.5, Llama-3 series (1B-8B parameters)
Optimizer: Adam
Learning Rate: 1e-6
β coefficient: 0.01 (for IM-RM)
Loss Function: Bradley-Terry log-likelihood loss
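For reference, the Bradley-Terry log-likelihood loss on a batch of reward pairs can be written as follows. This is a generic sketch of the standard loss, -ln σ(r(x, y+) - r(x, y-)) averaged over pairs. The function name is illustrative and nothing here is taken from the paper's code.

```python
import numpy as np

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -ln sigmoid(r(x, y+) - r(x, y-)) over a batch of preference pairs."""
    margins = (np.asarray(chosen_rewards, dtype=float)
               - np.asarray(rejected_rewards, dtype=float))
    # -ln sigmoid(m) = ln(1 + exp(-m)); logaddexp keeps this stable for large |m|.
    return float(np.mean(np.logaddexp(0.0, -margins)))
```

The same loss applies to both model types: for an EX-RM the rewards come from the linear head, and for an IM-RM they are the scaled log-probability ratios.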

Experimental Results

Main Results

1. Token-level Distribution Shift

UltraFeedback Training: EX-RM win rate 83.4% under token-level shift, IM-RM win rate 16.6%
RewardMATH Training: EX-RM win rate 100% under token-level shift, IM-RM win rate 0%

2. Domain Shift

UltraFeedback Training: Under domain shift, IM-RM win rate 66.7%, EX-RM win rate 33.3%
RewardMATH Training: Under domain shift, IM-RM win rate 33.4%, EX-RM win rate 66.6%

3. Controlled Experiment Results

On the Persona dataset paraphrasing task:

EX-RM achieves 100% accuracy on both original and paraphrased responses
IM-RM achieves 100% accuracy on original responses but only 2.2% accuracy on paraphrased responses

Ablation Studies

1. Testing the Generation-Verification Hypothesis

Hamiltonian cycle experiments show:

IM-RM training accuracy: 100%, test accuracy: 99.3%
IM-RM correct generations: 0 (unable to generate any correct Hamiltonian cycles)
Shows that learning verification does not require learning generation
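The asymmetry in this task is that checking a proposed Hamiltonian cycle is simple, while finding one is hard in general. A minimal verifier, assuming an undirected graph with vertices numbered 0 to n-1 and a cycle given as a list of vertices (a representation chosen here for illustration, not the paper's data format), might look like this:

```python
def is_hamiltonian_cycle(num_vertices, edges, cycle):
    """Check that `cycle` visits every vertex exactly once and closes along edges."""
    if num_vertices < 3 or len(cycle) != num_vertices:
        return False
    if set(cycle) != set(range(num_vertices)):
        return False
    edge_set = {frozenset(edge) for edge in edges}  # undirected edges
    return all(
        frozenset((cycle[i], cycle[(i + 1) % num_vertices])) in edge_set
        for i in range(num_vertices)
    )
```

The check runs in time linear in the number of vertices and edges, whereas producing a valid cycle requires search. That contrast is what makes the task a useful probe of whether a reward model must learn to generate in order to verify.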

2. Alternative Hypothesis Testing

Tested EX-RM variants based on all hidden representations
Tested IM-RM variants without reference distributions
Results show the generalization gap persists

Experimental Findings

Token Sensitivity: IM-RM is extremely sensitive to surface token changes, failing even when semantics remain identical
Hidden Representation Generalization: EX-RM successfully generalizes through semantically rich hidden representations
Reward Margins: EX-RM consistently produces higher absolute reward margins, facilitating reinforcement learning optimization
Domain Adaptability: IM-RM performs better in certain domain shift scenarios

Related Work

Reward Model Analysis

Existing research primarily focuses on sample complexity bounds and theoretical properties of reward models, but less attention is given to how different parameterizations affect generalization.

DPO vs RLHF

This research relates to comparisons between DPO (Direct Preference Optimization) and RLHF (Reinforcement Learning from Human Feedback), but with different emphasis: this paper focuses on reward model generalization capability rather than training algorithm comparisons.

Neural Network Learning Dynamics

Borrows methods from the implicit bias literature for analyzing gradient-based training trajectories, and applies them to the specific context of reward models.

Conclusions and Discussion

Main Conclusions

Root Cause: IM-RM's generalization problems stem from over-reliance on surface-level token cues rather than generation-verification gaps
Design Impact: Seemingly minor design choices (how to compute rewards) can significantly affect generalization behavior
Application Guidance: Prefer EX-RM in token-level distribution shift scenarios; consider IM-RM in domain shift scenarios

Limitations

Theoretical Assumptions: Theoretical analysis relies on simplified assumptions of fixed hidden representations and single-token responses
Evaluation Metrics: Primarily focuses on accuracy, not covering all dimensions of reward model effectiveness
Model Scope: Primarily studies three reward model types, not covering all possible variants

Future Directions

Theoretical Extensions: Relax restrictive assumptions in current theoretical analysis
Factor Exploration: Investigate other factors affecting generalization of different reward model types
Evaluation Expansion: Develop more comprehensive reward model evaluation standards
Novel Architectures: Explore implicit biases of other reward model types

In-Depth Evaluation

Strengths

Theoretical Depth: Provides rigorous mathematical analysis explaining generalization gaps from learning dynamics perspective
Comprehensive Experiments: Combines controlled experiments and real-world scenarios across multiple language models and datasets
Systematic Hypothesis Testing: Systematically tests and refutes intuitive but incorrect explanations
Practical Value: Provides clear guidance for reward model selection in practical applications

Weaknesses

Assumption Limitations: Simplified assumptions in theoretical analysis may limit generalizability of conclusions
Mechanism Understanding: Lacks in-depth analysis of mechanisms underlying IM-RM's better performance in domain shift scenarios
Scale Validation: Experiments primarily conducted on medium-scale models; conclusions on large-scale models require further verification

Impact

Theoretical Contribution: Provides important theoretical foundation for understanding behavior of different reward model types
Practical Guidance: Offers direct guidance for applications of RLHF and DPO techniques
Research Inspiration: Opens new directions for further research on implicit biases of reward models

Applicable Scenarios

High-Quality Requirements: Applications requiring stable performance under distribution shifts
Token-Sensitive Tasks: Scenarios involving paraphrasing, translation, and other token-level variations
Robustness-Critical Systems: Systems with strict robustness requirements for reward models

References

The paper cites extensive related work, including:

Ouyang et al. (2022): Training language models to follow instructions with human feedback
Rafailov et al. (2023): Direct preference optimization: Your language model is secretly a reward model
Lin et al. (2024): On the limited generalization capability of the implicit reward model induced by direct preference optimization
Lambert et al. (2025): RewardBench: Evaluating reward models for language modeling

Overall Assessment: This is a high-quality paper that combines rigorous theoretical analysis with comprehensive experiments to explain why different reward model types generalize differently. It has significant theoretical value and offers useful guidance for practice. The methodology is rigorous, the conclusions are convincing, and the work is an important contribution to reward model research.