2025-11-11T12:43:08.939159

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Wang, Jiang, He et al.
Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.
academic

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Basic Information

  • Paper ID: 2503.13551
  • Title: Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
  • Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Hailei Gong, Shenyang Tong, Wenhan Yang, Zeyu Li, Yanan Zheng, Zifan He, Zewen Ye, Shengjie Ma, Jianping Zhang
  • Classification: cs.CL cs.AI
  • Publication Date/Venue: arXiv Preprint (October 2025)
  • Paper Link: https://arxiv.org/abs/2503.13551

Abstract

Recent research demonstrates that Large Language Models (LLMs) can acquire powerful reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a critical approach—Process Reward Models (PRMs)—suffers from reward hacking issues, making them unreliable in identifying optimal intermediate steps. Furthermore, annotating reasoning processes for reward modeling is costly, making large-scale collection of high-quality data challenging. To address these issues, this paper proposes a novel reward modeling approach—Hierarchical Reward Models (HRM)—which evaluates individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing the coherence of multi-step reasoning, particularly when erroneous steps are subsequently corrected through self-reflection. To further reduce the cost of generating training data, this paper introduces a lightweight yet effective data augmentation strategy—Hierarchical Node Compression (HNC)—which merges two consecutive reasoning steps in a tree structure into one. By applying HNC to reasoning trajectories generated by MCTS, we enhance the diversity and robustness of HRM training data with minimal computational overhead while introducing controlled noise. Experimental results on the PRM800K dataset demonstrate that HRM combined with HNC provides more stable and reliable evaluation compared to PRM. Furthermore, cross-domain evaluations on MATH500 and GSM8K datasets validate HRM's strong generalization capability and robustness across various reasoning tasks.

Research Background and Motivation

Problem Definition

This research addresses two critical issues in large language models for mathematical reasoning tasks:

  1. Reward Hacking Problem: Existing Process Reward Models (PRMs) are susceptible to exploitation by models, which may achieve high scores by gaming the reward signal rather than genuinely improving reasoning, compromising reliability in complex tasks.
  2. High Annotation Costs: PRMs require expensive large-scale manual annotation of reasoning steps, limiting their reliability and scalability.

Research Significance

Mathematical reasoning is an important task for evaluating LLM reasoning capabilities. While existing methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) improve performance, they still have critical limitations:

  • CoT models lack mechanisms to detect and correct intermediate reasoning errors
  • ToT methods cannot inherently validate each intermediate step or guarantee retrieval of optimal reasoning trajectories

Limitations of Existing Approaches

  1. Outcome Reward Models (ORM): Suffer from delayed feedback and credit assignment problems, making it difficult to determine which reasoning steps contribute to the final answer
  2. Process Reward Models (PRM): While providing finer-grained supervision, they are susceptible to reward hacking and have high annotation costs

Research Motivation

Based on the aforementioned issues, this paper proposes Hierarchical Reward Models (HRM) to mitigate PRM limitations by combining fine-grained (single-step) and coarse-grained (consecutive multi-step) hierarchical supervision signals during training, enabling HRM to capture both local and global coherence in reasoning.

Core Contributions

  1. Proposes Hierarchical Reward Models (HRM): Leverages hierarchical supervision of training data at single-step and multi-step levels to promote coherence and self-correction capabilities in multi-step reasoning, with robustness validated on the PRM800K dataset.
  2. Introduces Hierarchical Node Compression (HNC): A lightweight MCTS data augmentation method that significantly increases diversity and robustness of HRM training data with minimal computational cost.
  3. Enhances Policy Model Performance: Further improves reasoning performance through fine-tuning on high-quality reasoning trajectories filtered from MCTS.
  4. Validates Generalization Capability: Demonstrates HRM's superior reasoning consistency and generalization compared to PRM on GSM8K and MATH500 datasets.

Methodology Details

Task Definition

This paper focuses on mathematical reasoning tasks, aiming to evaluate and improve LLM performance in solving multi-step mathematical problems. The input is a mathematical problem, the output is step-by-step reasoning and a final answer, with the constraint that reasoning steps must be correct and coherent.

Model Architecture

Hierarchical Reward Models (HRM)

The core idea of HRM is to employ hierarchical supervision during training, evaluating individual and consecutive reasoning steps:

Training Data Construction:

  • PRM training data: DPRM={(si,R(si))1iN}D_{PRM} = \{(s_i, R(s_i)) | 1 \leq i \leq N\}
  • HRM training data: DHRM=DPRM{(si+si+1,R(si+si+1))1i<N}D_{HRM} = D_{PRM} \cup \{(s_i + s_{i+1}, R(s_i + s_{i+1})) | 1 \leq i < N\}

where sis_i represents the ii-th reasoning step, R()R(\cdot) is the reward function, and NN is the total number of reasoning steps.

Hierarchical Supervision Objectives:

  1. Capture fine-grained and coarse-grained consistency
  2. Enable self-reflection and error correction

Inference Phase: Although merged reasoning steps are used during training, HRM still evaluates step-by-step during inference, assigning rewards based only on the current step sis_i, similar to PRM.

Hierarchical Node Compression (HNC)

HNC is a data augmentation method that increases training data diversity by merging consecutive nodes in MCTS tree structures:

Core Mechanism:

  1. Randomly merge two consecutive nodes, each corresponding to a reasoning step
  2. Remove direct connections between nodes
  3. Redirect connection relationships

Noise Introduction: When a random node is removed, the weights of remaining child nodes are redistributed from 1N\frac{1}{N} to 1N1\frac{1}{N-1}, with variance increasing from σ2N\frac{\sigma^2}{N} to σ2N1\frac{\sigma^2}{N-1}, introducing controlled noise.

Technical Innovations

  1. Hierarchical Supervision Design: Unlike PRM which only evaluates individual steps, HRM considers interactions between multiple steps, capable of identifying corrections to early errors in subsequent reasoning.
  2. Self-Correction Capability: Traditional PRM penalizes erroneous single steps without considering potential corrections in subsequent reasoning, while HRM evaluates reasoning coherence across multiple steps.
  3. Low-Cost Data Augmentation: HNC achieves data augmentation with minimal computational overhead (approximately 30 minutes CPU time), nearly negligible compared to MCTS's 2457 A100 GPU hours.

Experimental Setup

Datasets

  1. PRM800K: Contains manually annotated reasoning trajectories, serving as the foundation for training ORM, PRM, and HRM
  2. MATH500: High school and university-level mathematical problems, used to evaluate generalization capability
  3. GSM8K: Elementary school mathematics word problems, containing 1000 test problems

Evaluation Metrics

  • Accuracy: Problem-solving accuracy under Best-of-N strategy
  • Stability: Performance consistency as N increases
  • Robustness: Consistent performance across different policy models and datasets

Comparison Methods

  • ORM (Outcome Reward Model): Evaluates based on entire reasoning chain
  • PRM (Process Reward Model): Step-by-step evaluation of reasoning process
  • HRM (Hierarchical Reward Model): The hierarchical reward model proposed in this paper

Implementation Details

  • Reward Models: Fine-tuned based on Qwen2.5-1.5B-Math
  • Policy Models: Qwen2.5-72B-Math-Instruct, DeepSeek-Math-7B, Qwen2.5-7B-Math-Instruct
  • MCTS Configuration: 5-6 child nodes per parent node, maximum tree depth of 7
  • Training Optimization: FlashAttention, DeepSpeed, and mixed-precision training

Experimental Results

Main Results

Best-of-N Performance on PRM800K Dataset:

N2481624
ORM0.6220.6770.6550.6550.633
PRM0.7000.6440.6110.5880.577
HRM0.7220.7110.7440.8000.800

Key Findings:

  • HRM maintains stable performance as N increases, with accuracy stabilizing at 80%
  • ORM and PRM exhibit significant fluctuations, with accuracy declining as N grows
  • HRM demonstrates superior stability and reliability

Cross-Domain Generalization Experiments

Results on GSM8K and MATH500 Datasets:

DatasetMethodN=2N=64N=256N=512
GSM8KPRM0.7840.9050.9270.918
GSM8KHRM0.7840.9070.9300.926
MATH500PRM0.4680.6560.6860.688
MATH500HRM0.4900.7420.7400.736

Important Observations:

  • On the complex MATH500 dataset, HRM significantly outperforms PRM
  • On the relatively simple GSM8K, the difference is smaller but HRM still maintains a slight advantage
  • HRM demonstrates stronger cross-domain robustness

Ablation Studies

Comparison Across Different Policy Models: HRM trained on automatically annotated data generated by MCTS demonstrates better stability than PRM across multiple policy models:

  • DeepSeek-Math-7B
  • Qwen2.5-72B-Math
  • Qwen2.5-7B-Math

Self-Training Experiments

Supervised fine-tuning with KL divergence regularization further improves policy model performance, validating the value of high-quality reasoning data.

RLHF Framework

This paper builds upon the Reinforcement Learning from Human Feedback (RLHF) framework, which uses reward models to distinguish between high-quality and low-quality responses and employs PPO to optimize LLMs.

Reward Model Classification

  1. ORM: Assigns rewards based on overall output, suffering from delayed feedback and credit assignment problems
  2. PRM: Evaluates intermediate reasoning steps, providing finer-grained supervision but susceptible to reward hacking

MCTS Application in Reasoning

MCTS has been proposed as a method for autonomously annotating reasoning trajectories, but computational costs grow exponentially with search tree depth and breadth.

Conclusions and Discussion

Main Conclusions

  1. HRM effectively mitigates PRM's reward hacking problem, providing more stable and reliable evaluation through hierarchical supervision
  2. HNC is an efficient data augmentation strategy, significantly improving training data quality with minimal cost
  3. HRM demonstrates excellent generalization capability, consistently outperforming PRM across multiple mathematical reasoning datasets

Limitations

  1. Merging Step Constraints: Currently only merges two consecutive steps; merging more steps leads to exponential growth in label combination complexity
  2. Domain Limitations: Primarily focuses on mathematical reasoning; applicability to other structured reasoning domains requires further verification
  3. Computational Constraints: MCTS configuration is limited by computational resources, potentially affecting diversity of generated data

Future Directions

  1. Explore more complex hierarchical structure designs
  2. Extend to other structured reasoning tasks
  3. Incorporate more efficient search algorithms to reduce computational costs
  4. Investigate more sophisticated labeling strategies for handling multi-step merging

In-Depth Evaluation

Strengths

  1. Strong Innovation: HRM's hierarchical supervision design cleverly combines local accuracy with global coherence
  2. Comprehensive Experiments: Thorough evaluation across multiple datasets and policy models
  3. High Practical Value: HNC provides a low-cost data augmentation solution
  4. Solid Theoretical Foundation: In-depth analysis of reward hacking problems with targeted solutions

Weaknesses

  1. Method Complexity: Compared to PRM, HRM's training data construction and labeling strategy are more complex
  2. Scalability: Currently supports only two-step merging, limiting method extensibility
  3. Domain Specificity: Primarily validated on mathematical reasoning tasks; applicability to other domains insufficiently verified

Impact

  1. Academic Contribution: Provides new hierarchical perspectives for reward model design
  2. Practical Value: HNC method can be directly applied to existing MCTS pipelines
  3. Reproducibility: Provides detailed experimental settings and hyperparameter configurations

Applicable Scenarios

  1. Mathematical Reasoning Tasks: Particularly suitable for complex mathematical problems requiring multi-step reasoning
  2. Reasoning Tasks Requiring Self-Correction: HRM can identify and reward error corrections in reasoning processes
  3. Resource-Constrained Scenarios: HNC provides low-cost data augmentation solutions

References

The paper cites important works in the field, including:

  • Lightman et al. (2023) - Let's verify step by step (PRM800K dataset)
  • Cobbe et al. (2021) - Training verifiers to solve math word problems
  • Wei et al. (2022) - Chain-of-thought prompting
  • Ouyang et al. (2022) - Training language models to follow instructions with human feedback

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to critical PRM problems. HRM's hierarchical supervision design is theoretically sound with comprehensive experimental validation, and the HNC method demonstrates strong practical value. The paper excels in technical innovation, experimental design, and result analysis, providing valuable contributions to enhancing the reasoning capabilities of large language models.