2025-11-11T12:43:08.939159

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Wang, Jiang, He et al.

Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.

academic

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Basic Information

Paper ID: 2503.13551
Title: Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Hailei Gong, Shenyang Tong, Wenhan Yang, Zeyu Li, Yanan Zheng, Zifan He, Zewen Ye, Shengjie Ma, Jianping Zhang
Classification: cs.CL cs.AI
Publication Date/Venue: arXiv Preprint (October 2025)
Paper Link: https://arxiv.org/abs/2503.13551

Abstract

Recent research demonstrates that Large Language Models (LLMs) can acquire powerful reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a critical approach—Process Reward Models (PRMs)—suffers from reward hacking issues, making them unreliable in identifying optimal intermediate steps. Furthermore, annotating reasoning processes for reward modeling is costly, making large-scale collection of high-quality data challenging. To address these issues, this paper proposes a novel reward modeling approach—Hierarchical Reward Models (HRM)—which evaluates individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing the coherence of multi-step reasoning, particularly when erroneous steps are subsequently corrected through self-reflection. To further reduce the cost of generating training data, this paper introduces a lightweight yet effective data augmentation strategy—Hierarchical Node Compression (HNC)—which merges two consecutive reasoning steps in a tree structure into one. By applying HNC to reasoning trajectories generated by MCTS, we enhance the diversity and robustness of HRM training data with minimal computational overhead while introducing controlled noise. Experimental results on the PRM800K dataset demonstrate that HRM combined with HNC provides more stable and reliable evaluation compared to PRM. Furthermore, cross-domain evaluations on MATH500 and GSM8K datasets validate HRM's strong generalization capability and robustness across various reasoning tasks.

Research Background and Motivation

Problem Definition

This research addresses two critical issues in large language models for mathematical reasoning tasks:

Reward Hacking Problem: Existing Process Reward Models (PRMs) are susceptible to exploitation by models, which may achieve high scores by gaming the reward signal rather than genuinely improving reasoning, compromising reliability in complex tasks.
High Annotation Costs: PRMs require expensive large-scale manual annotation of reasoning steps, limiting their reliability and scalability.

Research Significance

Mathematical reasoning is an important task for evaluating LLM reasoning capabilities. While existing methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) improve performance, they still have critical limitations:

CoT models lack mechanisms to detect and correct intermediate reasoning errors
ToT methods cannot inherently validate each intermediate step or guarantee retrieval of optimal reasoning trajectories

Limitations of Existing Approaches

Outcome Reward Models (ORM): Suffer from delayed feedback and credit assignment problems, making it difficult to determine which reasoning steps contribute to the final answer
Process Reward Models (PRM): While providing finer-grained supervision, they are susceptible to reward hacking and have high annotation costs

Research Motivation

Based on the aforementioned issues, this paper proposes Hierarchical Reward Models (HRM) to mitigate PRM limitations by combining fine-grained (single-step) and coarse-grained (consecutive multi-step) hierarchical supervision signals during training, enabling HRM to capture both local and global coherence in reasoning.

Core Contributions

Proposes Hierarchical Reward Models (HRM): Leverages hierarchical supervision of training data at single-step and multi-step levels to promote coherence and self-correction capabilities in multi-step reasoning, with robustness validated on the PRM800K dataset.
Introduces Hierarchical Node Compression (HNC): A lightweight MCTS data augmentation method that significantly increases diversity and robustness of HRM training data with minimal computational cost.
Enhances Policy Model Performance: Further improves reasoning performance through fine-tuning on high-quality reasoning trajectories filtered from MCTS.
Validates Generalization Capability: Demonstrates HRM's superior reasoning consistency and generalization compared to PRM on GSM8K and MATH500 datasets.

Methodology Details

Task Definition

This paper focuses on mathematical reasoning tasks, aiming to evaluate and improve LLM performance in solving multi-step mathematical problems. The input is a mathematical problem, the output is step-by-step reasoning and a final answer, with the constraint that reasoning steps must be correct and coherent.

Model Architecture

Hierarchical Reward Models (HRM)

The core idea of HRM is to employ hierarchical supervision during training, evaluating individual and consecutive reasoning steps:

Training Data Construction:

PRM training data: $D_{PRM} = \{(s_i, R(s_i)) | 1 \leq i \leq N\}$
HRM training data: $D_{HRM} = D_{PRM} \cup \{(s_i + s_{i+1}, R(s_i + s_{i+1})) | 1 \leq i < N\}$

where $s_i$ represents the $i$ -th reasoning step, $R(\cdot)$ is the reward function, and $N$ is the total number of reasoning steps.

Hierarchical Supervision Objectives:

Capture fine-grained and coarse-grained consistency
Enable self-reflection and error correction

Inference Phase: Although merged reasoning steps are used during training, HRM still evaluates step-by-step during inference, assigning rewards based only on the current step $s_i$ , similar to PRM.

Hierarchical Node Compression (HNC)

HNC is a data augmentation method that increases training data diversity by merging consecutive nodes in MCTS tree structures:

Core Mechanism:

Randomly merge two consecutive nodes, each corresponding to a reasoning step
Remove direct connections between nodes
Redirect connection relationships

Noise Introduction: When a random node is removed, the weights of remaining child nodes are redistributed from $\frac{1}{N}$ to $\frac{1}{N-1}$ , with variance increasing from $\frac{\sigma^2}{N}$ to $\frac{\sigma^2}{N-1}$ , introducing controlled noise.

Technical Innovations

Hierarchical Supervision Design: Unlike PRM which only evaluates individual steps, HRM considers interactions between multiple steps, capable of identifying corrections to early errors in subsequent reasoning.
Self-Correction Capability: Traditional PRM penalizes erroneous single steps without considering potential corrections in subsequent reasoning, while HRM evaluates reasoning coherence across multiple steps.
Low-Cost Data Augmentation: HNC achieves data augmentation with minimal computational overhead (approximately 30 minutes CPU time), nearly negligible compared to MCTS's 2457 A100 GPU hours.

Experimental Setup

Datasets

PRM800K: Contains manually annotated reasoning trajectories, serving as the foundation for training ORM, PRM, and HRM
MATH500: High school and university-level mathematical problems, used to evaluate generalization capability
GSM8K: Elementary school mathematics word problems, containing 1000 test problems

Evaluation Metrics

Accuracy: Problem-solving accuracy under Best-of-N strategy
Stability: Performance consistency as N increases
Robustness: Consistent performance across different policy models and datasets

Comparison Methods

ORM (Outcome Reward Model): Evaluates based on entire reasoning chain
PRM (Process Reward Model): Step-by-step evaluation of reasoning process
HRM (Hierarchical Reward Model): The hierarchical reward model proposed in this paper

Implementation Details

Reward Models: Fine-tuned based on Qwen2.5-1.5B-Math
Policy Models: Qwen2.5-72B-Math-Instruct, DeepSeek-Math-7B, Qwen2.5-7B-Math-Instruct
MCTS Configuration: 5-6 child nodes per parent node, maximum tree depth of 7
Training Optimization: FlashAttention, DeepSpeed, and mixed-precision training

Experimental Results

Main Results

Best-of-N Performance on PRM800K Dataset:

N	2	4	8	16	24
ORM	0.622	0.677	0.655	0.655	0.633
PRM	0.700	0.644	0.611	0.588	0.577
HRM	0.722	0.711	0.744	0.800	0.800

Key Findings:

HRM maintains stable performance as N increases, with accuracy stabilizing at 80%
ORM and PRM exhibit significant fluctuations, with accuracy declining as N grows
HRM demonstrates superior stability and reliability

Cross-Domain Generalization Experiments

Results on GSM8K and MATH500 Datasets:

Dataset	Method	N=2	N=64	N=256	N=512
GSM8K	PRM	0.784	0.905	0.927	0.918
GSM8K	HRM	0.784	0.907	0.930	0.926
MATH500	PRM	0.468	0.656	0.686	0.688
MATH500	HRM	0.490	0.742	0.740	0.736

Important Observations:

On the complex MATH500 dataset, HRM significantly outperforms PRM
On the relatively simple GSM8K, the difference is smaller but HRM still maintains a slight advantage
HRM demonstrates stronger cross-domain robustness

Ablation Studies

Comparison Across Different Policy Models: HRM trained on automatically annotated data generated by MCTS demonstrates better stability than PRM across multiple policy models:

DeepSeek-Math-7B
Qwen2.5-72B-Math
Qwen2.5-7B-Math

Self-Training Experiments

Supervised fine-tuning with KL divergence regularization further improves policy model performance, validating the value of high-quality reasoning data.

RLHF Framework

This paper builds upon the Reinforcement Learning from Human Feedback (RLHF) framework, which uses reward models to distinguish between high-quality and low-quality responses and employs PPO to optimize LLMs.

Reward Model Classification

ORM: Assigns rewards based on overall output, suffering from delayed feedback and credit assignment problems
PRM: Evaluates intermediate reasoning steps, providing finer-grained supervision but susceptible to reward hacking

MCTS Application in Reasoning

MCTS has been proposed as a method for autonomously annotating reasoning trajectories, but computational costs grow exponentially with search tree depth and breadth.

Conclusions and Discussion

Main Conclusions

HRM effectively mitigates PRM's reward hacking problem, providing more stable and reliable evaluation through hierarchical supervision
HNC is an efficient data augmentation strategy, significantly improving training data quality with minimal cost
HRM demonstrates excellent generalization capability, consistently outperforming PRM across multiple mathematical reasoning datasets

Limitations

Merging Step Constraints: Currently only merges two consecutive steps; merging more steps leads to exponential growth in label combination complexity
Domain Limitations: Primarily focuses on mathematical reasoning; applicability to other structured reasoning domains requires further verification
Computational Constraints: MCTS configuration is limited by computational resources, potentially affecting diversity of generated data

Future Directions

Explore more complex hierarchical structure designs
Extend to other structured reasoning tasks
Incorporate more efficient search algorithms to reduce computational costs
Investigate more sophisticated labeling strategies for handling multi-step merging

In-Depth Evaluation

Strengths

Strong Innovation: HRM's hierarchical supervision design cleverly combines local accuracy with global coherence
Comprehensive Experiments: Thorough evaluation across multiple datasets and policy models
High Practical Value: HNC provides a low-cost data augmentation solution
Solid Theoretical Foundation: In-depth analysis of reward hacking problems with targeted solutions

Weaknesses

Method Complexity: Compared to PRM, HRM's training data construction and labeling strategy are more complex
Scalability: Currently supports only two-step merging, limiting method extensibility
Domain Specificity: Primarily validated on mathematical reasoning tasks; applicability to other domains insufficiently verified

Impact

Academic Contribution: Provides new hierarchical perspectives for reward model design
Practical Value: HNC method can be directly applied to existing MCTS pipelines
Reproducibility: Provides detailed experimental settings and hyperparameter configurations

Applicable Scenarios

Mathematical Reasoning Tasks: Particularly suitable for complex mathematical problems requiring multi-step reasoning
Reasoning Tasks Requiring Self-Correction: HRM can identify and reward error corrections in reasoning processes
Resource-Constrained Scenarios: HNC provides low-cost data augmentation solutions

References

The paper cites important works in the field, including:

Lightman et al. (2023) - Let's verify step by step (PRM800K dataset)
Cobbe et al. (2021) - Training verifiers to solve math word problems
Wei et al. (2022) - Chain-of-thought prompting
Ouyang et al. (2022) - Training language models to follow instructions with human feedback

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to critical PRM problems. HRM's hierarchical supervision design is theoretically sound with comprehensive experimental validation, and the HNC method demonstrates strong practical value. The paper excels in technical innovation, experimental design, and result analysis, providing valuable contributions to enhancing the reasoning capabilities of large language models.