2025-11-24T20:25:17.007327

ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze

Xuan, Niu, Pu et al.

Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ the reanalyze process to enhance sample efficiency from stale data, albeit at the expense of significant wall-clock time consumption. To address this issue, we propose a general approach named ReZero to boost tree search operations for MCTS-based algorithms. Specifically, drawing inspiration from the one-armed bandit model, we reanalyze training samples through a backward-view reuse technique which uses the value estimation of a certain child node to save the corresponding sub-tree search time. To further adapt to this design, we periodically reanalyze the entire buffer instead of frequently reanalyzing the mini-batch. The synergy of these two designs can significantly reduce the search cost and meanwhile guarantee or even improve performance, simplifying both data collecting and reanalyzing. Experiments conducted on Atari environments, DMControl suites and board games demonstrate that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero MCTS benchmark at https://github.com/opendilab/LightZero.

academic

ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze

Basic Information

Paper ID: 2404.16364
Title: ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze
Authors: Chunyu Xuan, Yazhe Niu, Yuan Pu, Shuai Hu, Yu Liu, Jing Yang
Category: cs.AI
Publication Date: December 31, 2024 (latest version on arXiv)
Paper Link: https://arxiv.org/abs/2404.16364

Abstract

Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success across various decision-making domains. These algorithms employ a reanalyze process to improve sample efficiency of stale data, but at the cost of significant clock time consumption. To address this issue, this paper proposes ReZero, a general-purpose method to accelerate tree search operations in MCTS algorithms. Specifically, inspired by the multi-armed bandit model, training samples are reanalyzed through backward-view reuse techniques, leveraging value estimates of specific child nodes to save search time for corresponding subtrees. To further adapt to this design, a strategy of periodically reanalyzing the entire buffer rather than frequently reanalyzing small batches is adopted. The synergistic effect of these two designs significantly reduces search costs while guaranteeing and even improving performance.

Research Background and Motivation

Problem Definition

The core problem faced by MCTS algorithms in reinforcement learning is excessive clock time overhead, manifested in two aspects:

Data Collection Phase: The agent must execute MCTS to select actions each time it receives a new state
Reanalysis Phase: To obtain higher-quality update targets, MCTS must be rerun with the latest model

Problem Significance

Although MCTS algorithms demonstrate excellent sample efficiency, time efficiency has become a bottleneck for further deployment
Tree search computation is difficult to parallelize using common vectorized environments, further amplifying speed disadvantages
Existing acceleration methods either require additional computational resources (e.g., SpeedyZero) or compress the search space through state abstraction (e.g., PTSAZero)

Research Motivation

This work aims to propose an acceleration strategy orthogonal to existing methods, requiring neither state space compression nor additional hardware overhead, but rather directly reducing the search space through value estimation.

Core Contributions

Proposed Backward-view Reanalysis Technique: Accelerates single tree search operations inspired by the multi-armed bandit model, with theoretical convergence guarantees
Designed Entire-buffer Reanalysis Framework: Further reduces MCTS invocation frequency and enhances parallelization capability
General-purpose Framework: Seamlessly integrates into various MCTS algorithms without requiring additional computational resources
Comprehensive Experimental Validation: Validates method effectiveness across Atari environments, DMControl suite, and board games

Method Details

Task Definition

This work investigates how to significantly reduce clock time overhead of MCTS algorithms while maintaining their sample efficiency. Input consists of trajectory data from MCTS algorithms, and output comprises accelerated search policies and value estimates.

Core Method Architecture

1. Backward-view Reanalysis

Theoretical Foundation: Inspired by the multi-armed bandit model, the root node of tree search is viewed as a bandit machine, with each child node as an arm. If the true state value of a child node can be known in advance, search time for that node can be saved.

Specific Implementation:

For adjacent timesteps S^t_l and S^{t+1}_l:
- When searching S^{t+1}_l, obtain root node value m^{t+1}_l
- When searching S^t_l, fix the value of S^{t+1}_l to m^{t+1}_l

Action Selection Strategy:

a_root = argmax_a I^t_l(a)

where I^t_l(a) = {
    UCBscore(S^t_l, a),  if a ≠ a^t_l
    r^t_l + γm^{t+1}_l,  if a = a^t_l
}

When selecting the action corresponding to S^{t+1}_l, the pre-stored value is used directly, avoiding subtree search.

2. Entire-buffer Reanalysis

Design Motivation: Backward-view reanalysis requires splitting batches into smaller sub-batches, potentially reducing parallelization advantages.

Solution:

Collection Phase Improvement: Directly sample actions from policy network output rather than MCTS selection
Periodic Reanalysis: After a fixed number of training iterations, reanalyze the entire buffer rather than reanalyzing small batches at each iteration

Advantages:

Similar to DQN's fixed target network mechanism, reducing policy target update frequency
Concentrates all MCTS invocations into the reanalysis process, fully leveraging large-batch parallelization advantages
Decouples reanalysis from training process, providing greater parallelization space

Theoretical Analysis

Theorem 1: For non-stationary bandits satisfying the assumptions in Equation (2), using sampled estimates rather than UCB values to evaluate specific arms ensures ET_i(n)/n → 0 as n → ∞.

This theorem proves the convergence of the backward-view reanalysis method with a lower regret upper bound, suggesting the algorithm may produce visit distributions more concentrated on optimal arms.

Experimental Setup

Datasets and Environments

Atari Environment: 26 representative games with high-dimensional visual inputs and discrete action spaces
DMControl Suite: Two continuous control tasks—ball_in_cup-catch and walker-stand
Board Games: Connect4 and Gomoku with specialized state spaces for strategy games

Evaluation Metrics

Time Efficiency: Clock time required to reach equivalent performance levels
Sample Efficiency: Environment interactions required to achieve successful policies
Search Acceleration: Time consumption and function call counts for single MCTS operations

Comparison Methods

MuZero: Original MuZero algorithm
EfficientZero: Improved MuZero variant
ReZero-M: MuZero integrated with ReZero
ReZero-E: EfficientZero integrated with ReZero

Implementation Details

Replay ratio: 0.25
Reanalysis frequency: 1
Batch size: 256 (Atari), 64 (DMControl)
MCTS simulation count: 50
Hardware: Single NVIDIA A100 GPU, 30 CPU cores, 120 GiB memory

Experimental Results

Main Results

Time Efficiency Improvements:

In most games, ReZero requires 2-4 times less clock time than baseline methods
Pong: ReZero-M 1.0±0.1 hours vs MuZero 4.0±0.5 hours
MsPacman: ReZero-M 1.4±0.2 hours vs MuZero 6.9±0.3 hours
Connect4: ReZero-M 5.5±0.6 hours vs MuZero 9.1±0.8 hours

Sample Efficiency Preservation: ReZero maintains comparable or even superior sample efficiency compared to baseline methods across all test environments.

Ablation Studies

1. Impact of Reanalysis Frequency

Tested reanalysis frequencies of {0, 1/3, 1, 2}:

Appropriate reanalysis frequency saves time overhead without significantly degrading performance
Frequency of 1 achieves optimal balance between time and sample efficiency

2. Backward-view Reanalysis Effectiveness

Detailed statistics show:

Average search time: ReZero-M 0.69±0.02ms vs MuZero 1.08±0.09ms
Tree search invocation count: ReZero-M 6089 vs MuZero 13284
Dynamics model invocation: ReZero-M 122 vs MuZero 256

Case Analysis

Toy Case Validation: Experiments in a 7×7 grid world intuitively demonstrate acceleration effects from skipping subtree searches. Positions farther from the terminal state require longer search times, and search time generally decreases after incorporating root node value assistance.

Experimental Findings

Backward-view reanalysis not only improves single search speed but also enhances sample efficiency
Entire-buffer reanalysis effectively reduces MCTS invocation frequency
The method demonstrates consistent acceleration effects across different types of decision-making environments

MCTS Algorithm Development

AlphaGo/AlphaZero: Combines MCTS with deep reinforcement learning, achieving breakthroughs in board games
MuZero: Extends to scenarios with unknown environment models, broadening application scope
Subsequent Improvements: EfficientZero improves sample efficiency; MuZero Unplugged extends to offline settings

MCTS Acceleration Research

SpeedyZero: Reduces time overhead through parallel system design, but requires more computational resources
PTSAZero: Compresses search space through state abstraction
KataGo: Uses naive information reuse techniques, but the backward-view method in this paper differs fundamentally

Conclusions and Discussion

Main Conclusions

ReZero successfully addresses the excessive clock time overhead problem of MCTS algorithms
The synergistic effect of backward-view reanalysis and entire-buffer reanalysis significantly improves time efficiency
The method is general-purpose and applicable to various MCTS algorithm variants
Achieves 2-4x time acceleration while maintaining sample efficiency

Limitations

Single-machine Setting Constraints: Current experiments are primarily conducted in single-machine environments; optimization space for distributed training remains unexplored
Environment Coverage: Experiments on continuous control environments are relatively limited, requiring more comprehensive benchmarking
Theoretical Analysis: While convergence proofs are provided, theoretical guarantees in actual complex environments require further investigation

Future Directions

Distributed Optimization: Apply ReZero to distributed reinforcement learning training
Offline Learning: Integration with MuZero Unplugged for offline dataset applications
Foundation Models: Construct decision foundation models combining RT-X and large-scale datasets
Weighted Sampling: Employ more rational approaches to prioritize reanalysis of specific samples in the buffer

In-depth Evaluation

Strengths

Strong Novelty: Backward-view reanalysis represents a novel approach to MCTS acceleration with solid theoretical foundations
High Practical Value: Significant time acceleration effects have important implications for practical MCTS applications
Good Generalizability: Framework design enables seamless integration into various MCTS algorithms
Comprehensive Experiments: Method effectiveness validated across multiple environment types with detailed ablation studies

Weaknesses

Theoretical Analysis Depth: While convergence proofs are provided, theoretical guarantees for complex environments require strengthening
Distributed Scenarios: Lacks validation and optimization in multi-machine, multi-GPU environments
Continuous Control: Experiments in continuous action spaces are relatively limited
Long-term Impact: Long-term effects on training stability and final performance require further analysis

Impact

Academic Contribution: Provides new research directions for MCTS acceleration, balancing theory and practice
Practical Value: Directly addresses critical deployment bottlenecks of MCTS algorithms
Reproducibility: Complete open-source implementation facilitates adoption and extension by the research community

Applicable Scenarios

Game AI: Board games, video games, and other real-time decision-making scenarios
Robot Control: Robot tasks requiring online planning
Autonomous Driving: Real-time path planning and decision-making
Financial Trading: Rapid decision-making in high-frequency trading

References

Schrittwieser, J., et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588, 604-609.
Silver, D., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
Ye, W., et al. (2021). Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34, 25476-25488.
Mei, Y., et al. (2023). Speedyzero: Mastering atari with limited data and time. ICLR 2023.

Overall Assessment: This is a high-quality research paper that proposes innovative and practical solutions to address practical deployment bottlenecks of MCTS algorithms. The method design is ingenious, theoretical foundations are solid, and experimental validation is comprehensive. The work has significant value in promoting the adoption of MCTS algorithms in practical applications.