Why is Your Language Model a Poor Implicit Reward Model?
Razin, Lin, Yao et al.
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
Reward models are critical components of language model post-training and inference pipelines. Recent research has shown that every language model defines an implicit reward model (IM-RM) without any architectural modification. However, compared with explicit reward models (EX-RMs), which apply a dedicated linear head to the language model's hidden representations, IM-RMs typically generalize worse, particularly out-of-distribution. This gap is puzzling because EX-RMs and IM-RMs are nearly identical: they can be trained with the same data, loss function, and language model, and differ only in how the reward is computed. The paper investigates the root cause of the gap and finds that IM-RMs rely more heavily on superficial token-level cues, so they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution.
Reward models play a central role in the modern language model ecosystem, with widespread applications in reinforcement learning training, direct alignment algorithms, rejection sampling, data filtering, and inference-time scaling. Currently, two primary types of reward models exist:
Explicit Reward Models (EX-RM): Apply linear heads to language model hidden representations to compute rewards
Implicit Reward Models (IM-RM): Define rewards implicitly through the log-probabilities of language models
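The two reward computations can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the DPO-style implicit reward, beta times the log-ratio between the trained model and a reference model, and it takes hidden vectors and per-position logits as plain Python lists. The names `ex_rm_reward`, `im_rm_reward`, `sequence_log_prob`, and the `beta` default are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def sequence_log_prob(step_logits, response_tokens):
    """Log-probability of a response: sum of per-token log-probabilities.

    step_logits[t] holds the (hypothetical) next-token logits at response
    position t, already conditioned on the prompt and the earlier tokens.
    """
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, response_tokens))

def ex_rm_reward(hidden, head):
    """EX-RM: inner product of a linear head with the hidden representation
    of the full prompt-response pair."""
    return sum(h * u for h, u in zip(hidden, head))

def im_rm_reward(policy_logits, ref_logits, response_tokens, beta=1.0):
    """IM-RM (assumed DPO-style form): beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (sequence_log_prob(policy_logits, response_tokens)
                   - sequence_log_prob(ref_logits, response_tokens))
```

In this sketch the EX-RM reward never looks at token identities directly, while the IM-RM reward is assembled token by token from the response. That is the only structural difference between the two.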
Although the two types are nearly identical, prior research has observed that IM-RMs often generalize worse, particularly out-of-distribution. This is puzzling because both can be trained from the same language model with identical data and loss functions, differing only in how the reward is computed.
Theoretical Analysis: An analysis of learning dynamics reveals that IM-RMs rely more on token-level cues, while EX-RMs generalize primarily through hidden representations
Refutation of Intuitive Assumptions: The paper shows that the generalization problems of IM-RMs do not stem from a generation-verification gap; learning to verify does not require learning to generate
Empirical Validation: Controlled experiments and real-world settings confirm that IM-RMs perform worse under token-level distribution shifts but may perform comparably or better under domain shifts
Theoretical Guarantees: In a simplified setting, the paper proves that IM-RMs cannot generalize to unseen tokens, whereas EX-RMs can generalize when the hidden representations are well-structured
The paper studies the ranking accuracy of reward models on preference data: given triples (x, y+, y-) consisting of a prompt x, a preferred response y+, and a rejected response y-, it evaluates whether the reward model ranks the two responses correctly, i.e., r(x, y+) > r(x, y-).
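The ranking criterion above can be written as a short evaluation routine. The sketch below is illustrative only: `ranking_accuracy` and `bradley_terry_loss` are hypothetical names, the strict inequality follows the definition in the text (how ties are scored is not specified here), and the Bradley-Terry log-sigmoid loss is an assumption about the shared training objective, chosen because it is the common choice for preference data.

```python
import math

def ranking_accuracy(reward_fn, triples):
    """Fraction of (x, y_plus, y_minus) triples with r(x, y+) > r(x, y-)."""
    correct = sum(1 for x, y_plus, y_minus in triples
                  if reward_fn(x, y_plus) > reward_fn(x, y_minus))
    return correct / len(triples)

def bradley_terry_loss(r_plus, r_minus):
    """-log sigmoid(r_plus - r_minus), computed in a numerically stable way.

    Assumed training loss; it applies to any reward model, explicit or implicit.
    """
    margin = r_plus - r_minus
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy usage with a deliberately superficial reward: response length.
toy_reward = lambda x, y: len(y)
toy_triples = [("q", "long answer", "no"), ("q", "a", "bbb")]
accuracy = ranking_accuracy(toy_reward, toy_triples)
```

The toy length-based reward ranks the first triple correctly and the second incorrectly, which is the kind of superficial-cue failure that ranking accuracy is meant to expose.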
Key Finding: During training, the change in an EX-RM's reward depends only on hidden representations, whereas the change in an IM-RM's reward depends on the specific tokens in the responses, with coefficients ρ_{k,l} reflecting token overlap patterns.
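The EX-RM half of this finding can be checked numerically in a toy case. The sketch below assumes that the hidden representations are fixed, that only the linear head is trained, and that the loss is the log-sigmoid (Bradley-Terry) loss. These are simplifications introduced here for illustration, and all function names are hypothetical. Under those assumptions, one gradient step on a training pair changes the reward of any other input by the step size times sigmoid(-margin) times the inner product of that input's hidden vector with h+ - h-, with no dependence on which tokens it contains.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_gradient_step(head, h_plus, h_minus, lr):
    """One gradient-descent step on -log sigmoid(<head, h_plus - h_minus>)."""
    margin = dot(head, h_plus) - dot(head, h_minus)
    coeff = -sigmoid(-margin)  # derivative of the loss w.r.t. the margin
    return [u - lr * coeff * (p - m)
            for u, p, m in zip(head, h_plus, h_minus)]

def predicted_reward_change(h_bar, head, h_plus, h_minus, lr):
    """Closed-form change in reward for an arbitrary hidden vector h_bar."""
    margin = dot(head, h_plus) - dot(head, h_minus)
    diff = [p - m for p, m in zip(h_plus, h_minus)]
    return lr * sigmoid(-margin) * dot(h_bar, diff)

# Toy check: the actual change after one step matches the closed form.
head, h_plus, h_minus, lr = [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.1
h_bar = [2.0, 1.0]
new_head = head_gradient_step(head, h_plus, h_minus, lr)
actual = dot(new_head, h_bar) - dot(head, h_bar)
predicted = predicted_reward_change(h_bar, head, h_plus, h_minus, lr)
```

An unseen input whose hidden vector aligns with h+ - h- has its reward raised, and one orthogonal to it is left unchanged. The corresponding IM-RM update is not reproduced here because it involves the token-dependent coefficients ρ_{k,l}, whose exact form is not given in this summary.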
Theorem 2: In a simplified setting with single-token responses, IM-RMs cannot generalize to unseen tokens (accuracy stays at 0.5), whereas EX-RMs can generalize via a maximum-margin separator in hidden representation space.
Existing research focuses primarily on sample complexity bounds and theoretical properties of reward models, and pays less attention to how different parameterizations affect generalization.
The work is related to comparisons between DPO (Direct Preference Optimization) and RLHF (Reinforcement Learning from Human Feedback), but its emphasis differs: it focuses on the generalization of reward models rather than on comparing training algorithms.
The paper cites extensive related work, including:
Ouyang et al. (2022): Training language models to follow instructions with human feedback
Rafailov et al. (2023): Direct preference optimization: Your language model is secretly a reward model
Lin et al. (2024): On the limited generalization capability of the implicit reward model induced by direct preference optimization
Lambert et al. (2025): RewardBench: Evaluating reward models for language modeling
Overall Assessment: This is a high-quality paper. Through rigorous theoretical analysis and comprehensive experiments, it identifies the fundamental cause of the generalization differences between reward model types. It has significant theoretical value, offers useful guidance for practice, and reaches convincing conclusions, making it an important contribution to reward model research.