ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze
Xuan, Niu, Pu et al.
Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ the reanalyze process to enhance sample efficiency from stale data, albeit at the expense of significant wall-clock time consumption. To address this issue, we propose a general approach named ReZero to boost tree search operations for MCTS-based algorithms. Specifically, drawing inspiration from the one-armed bandit model, we reanalyze training samples through a backward-view reuse technique which uses the value estimation of a certain child node to save the corresponding sub-tree search time. To further adapt to this design, we periodically reanalyze the entire buffer instead of frequently reanalyzing the mini-batch. The synergy of these two designs can significantly reduce the search cost and meanwhile guarantee or even improve performance, simplifying both data collecting and reanalyzing. Experiments conducted on Atari environments, DMControl suites and board games demonstrate that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero MCTS benchmark at https://github.com/opendilab/LightZero.
academic
ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze
Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success across various decision-making domains. These algorithms employ a reanalyze process to improve sample efficiency of stale data, but at the cost of significant clock time consumption. To address this issue, this paper proposes ReZero, a general-purpose method to accelerate tree search operations in MCTS algorithms. Specifically, inspired by the multi-armed bandit model, training samples are reanalyzed through backward-view reuse techniques, leveraging value estimates of specific child nodes to save search time for corresponding subtrees. To further adapt to this design, a strategy of periodically reanalyzing the entire buffer rather than frequently reanalyzing small batches is adopted. The synergistic effect of these two designs significantly reduces search costs while guaranteeing and even improving performance.
Although MCTS algorithms demonstrate excellent sample efficiency, time efficiency has become a bottleneck for further deployment
Tree search computation is difficult to parallelize using common vectorized environments, further amplifying speed disadvantages
Existing acceleration methods either require additional computational resources (e.g., SpeedyZero) or compress the search space through state abstraction (e.g., PTSAZero)
This work aims to propose an acceleration strategy orthogonal to existing methods, requiring neither state space compression nor additional hardware overhead, but rather directly reducing the search space through value estimation.
Proposed Backward-view Reanalysis Technique: Accelerates single tree search operations inspired by the multi-armed bandit model, with theoretical convergence guarantees
Designed Entire-buffer Reanalysis Framework: Further reduces MCTS invocation frequency and enhances parallelization capability
General-purpose Framework: Seamlessly integrates into various MCTS algorithms without requiring additional computational resources
Comprehensive Experimental Validation: Validates method effectiveness across Atari environments, DMControl suite, and board games
This work investigates how to significantly reduce clock time overhead of MCTS algorithms while maintaining their sample efficiency. Input consists of trajectory data from MCTS algorithms, and output comprises accelerated search policies and value estimates.
Theoretical Foundation: Inspired by the multi-armed bandit model, the root node of tree search is viewed as a bandit machine, with each child node as an arm. If the true state value of a child node can be known in advance, search time for that node can be saved.
Specific Implementation:
For adjacent timesteps S^t_l and S^{t+1}_l:
- When searching S^{t+1}_l, obtain root node value m^{t+1}_l
- When searching S^t_l, fix the value of S^{t+1}_l to m^{t+1}_l
Action Selection Strategy:
a_root = argmax_a I^t_l(a)
where I^t_l(a) = {
UCBscore(S^t_l, a), if a ≠ a^t_l
r^t_l + γm^{t+1}_l, if a = a^t_l
}
When selecting the action corresponding to S^{t+1}_l, the pre-stored value is used directly, avoiding subtree search.
Theorem 1: For non-stationary bandits satisfying the assumptions in Equation (2), using sampled estimates rather than UCB values to evaluate specific arms ensures ET_i(n)/n → 0 as n → ∞.
This theorem proves the convergence of the backward-view reanalysis method with a lower regret upper bound, suggesting the algorithm may produce visit distributions more concentrated on optimal arms.
In most games, ReZero requires 2-4 times less clock time than baseline methods
Pong: ReZero-M 1.0±0.1 hours vs MuZero 4.0±0.5 hours
MsPacman: ReZero-M 1.4±0.2 hours vs MuZero 6.9±0.3 hours
Connect4: ReZero-M 5.5±0.6 hours vs MuZero 9.1±0.8 hours
Sample Efficiency Preservation: ReZero maintains comparable or even superior sample efficiency compared to baseline methods across all test environments.
Toy Case Validation: Experiments in a 7×7 grid world intuitively demonstrate acceleration effects from skipping subtree searches. Positions farther from the terminal state require longer search times, and search time generally decreases after incorporating root node value assistance.
Single-machine Setting Constraints: Current experiments are primarily conducted in single-machine environments; optimization space for distributed training remains unexplored
Environment Coverage: Experiments on continuous control environments are relatively limited, requiring more comprehensive benchmarking
Theoretical Analysis: While convergence proofs are provided, theoretical guarantees in actual complex environments require further investigation
Schrittwieser, J., et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588, 604-609.
Silver, D., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
Ye, W., et al. (2021). Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34, 25476-25488.
Mei, Y., et al. (2023). Speedyzero: Mastering atari with limited data and time. ICLR 2023.
Overall Assessment: This is a high-quality research paper that proposes innovative and practical solutions to address practical deployment bottlenecks of MCTS algorithms. The method design is ingenious, theoretical foundations are solid, and experimental validation is comprehensive. The work has significant value in promoting the adoption of MCTS algorithms in practical applications.