Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ the reanalyze process to enhance sample efficiency from stale data, albeit at the expense of significant wall-clock time consumption. To address this issue, we propose a general approach named ReZero to boost tree search operations for MCTS-based algorithms. Specifically, drawing inspiration from the one-armed bandit model, we reanalyze training samples through a backward-view reuse technique which uses the value estimation of a certain child node to save the corresponding sub-tree search time. To further adapt to this design, we periodically reanalyze the entire buffer instead of frequently reanalyzing the mini-batch. The synergy of these two designs can significantly reduce the search cost and meanwhile guarantee or even improve performance, simplifying both data collecting and reanalyzing. Experiments conducted on Atari environments, DMControl suites and board games demonstrate that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero MCTS benchmark at https://github.com/opendilab/LightZero.
è«æID : 2404.16364ã¿ã€ãã« : ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyzeèè
: Chunyu Xuan, Yazhe Niu, Yuan Pu, Shuai Hu, Yu Liu, Jing Yangåé¡ : cs.AIçºè¡šæ¥æ : 2024幎12æ31æ¥ (arXivã®ææ°ç)è«æãªã³ã¯ : https://arxiv.org/abs/2404.16364 ã¢ã³ãã«ã«ãæšæ¢çŽ¢(MCTS)ããŒã¹ã®ã¢ã«ãŽãªãºã (MuZeroããã³ãã®æŽŸçã¢ã«ãŽãªãºã ãªã©)ã¯ãæ§ã
ãªæææ±ºå®é åã§åºãæåãåããŠããããããã®ã¢ã«ãŽãªãºã ã¯ååæ(reanalyze)ããã»ã¹ãæ¡çšããŠå€ãããŒã¿ã®ãµã³ãã«å¹çãåäžãããŠãããããã®ä»£åãšããŠèããèšç®æéã®æ¶è²»ãçããŠããããã®åé¡ã解決ãããããæ¬è«æã§ã¯MCTSã¢ã«ãŽãªãºã ã®æšæ¢çŽ¢æäœãå éããããã®æ±çšçãªææ³ã§ããReZeroãææ¡ãããå
·äœçã«ã¯ãåè
ãã³ãã£ããã¢ãã«ã«çæ³ãåŸãŠãåŸåèŠå³åå©çšæè¡ãéããŠèšç·Žãµã³ãã«ãååæããç¹å®ã®åããŒãã®äŸ¡å€æšå®ãå©çšããŠå¯Ÿå¿ããéšåæšã®æ¢çŽ¢æéãç¯çŽããããã®èšèšã«ããã«é©å¿ããããããå°ãããã®é »ç¹ãªååæã§ã¯ãªãããããã¡å
šäœã®åšæçãªååææŠç¥ãæ¡çšãããããã2ã€ã®èšèšã®çžä¹å¹æã«ãããæ¢çŽ¢ã³ã¹ãã倧å¹
ã«åæžãããäžæ¹ã§ãæ§èœãä¿èšŒãããããã«ã¯åäžããã
MCTSã¢ã«ãŽãªãºã ã匷ååŠç¿ã§çŽé¢ããäžæ žçãªåé¡ã¯ãèšç®æéã®ãªãŒããŒããããé倧ã§ããããšã§ãããäž»ã«2ã€ã®åŽé¢ã«çŸãã:
ããŒã¿åéæ®µé : ãšãŒãžã§ã³ããæ°ããç¶æ
ãåãåããã³ã«ãåäœéžæã®ããã«MCTSãå®è¡ããå¿
èŠãããååææ®µé : ããé«åè³ªãªæŽæ°ç®æšãåŸããããææ°ã®ã¢ãã«ã䜿çšããŠMCTSãåå®è¡ããå¿
èŠãããMCTSã¢ã«ãŽãªãºã ã¯ãµã³ãã«å¹çã®é¢ã§åªããæ§èœã瀺ããŠããããæéå¹çããããªãæ®åã®ç¶é žãšãªã£ãŠãã æšæ¢çŽ¢ã®èšç®ã¯äžè¬çãªãã¯ãã«åç°å¢ã§ã®äžŠååãå°é£ã§ãããé床ã®å£äœæ§ãããã«å¢å¹
ããŠãã æ¢åã®é«éåææ³ã¯ã远å ã®èšç®ãªãœãŒã¹ãå¿
èŠãšããã(SpeedyZeroãªã©)ãç¶æ
æœè±¡åã«ããæ¢çŽ¢ç©ºéã®å§çž®ãè¡ãã(PTSAZeroãªã©)ã®ããããã§ãã æ¬è«æã¯ãæ¢åã®ææ³ãšçŽäº€ããé«éåæŠç¥ãææ¡ããããšãç®æããŠãããç¶æ
空éã®å§çž®ã远å ã®ããŒããŠã§ã¢ãªãŒããŒããããå¿
èŠãšãããäŸ¡å€æšå®ãéããŠçŽæ¥æ¢çŽ¢ç©ºéãåæžããã
åŸåèŠå³ååææè¡ã®ææ¡ : åè
ãã³ãã£ããã¢ãã«ã«çæ³ãåŸãææ³ã«ããåäžã®æšæ¢çŽ¢ãå éããåææ§ã®çè«çä¿èšŒãæäŸããå
šãããã¡ååæãã¬ãŒã ã¯ãŒã¯ã®èšèš : MCTSåŒã³åºãåæ°ãããã«åæžãã䞊ååèœåã匷åããæ±çšçãã¬ãŒã ã¯ãŒã¯ : 远å ã®èšç®ãªãœãŒã¹ãªãã«ãæ§ã
ãªMCTSã¢ã«ãŽãªãºã ã«ã·ãŒã ã¬ã¹ã«çµ±åå¯èœã§ããå
æ¬çãªå®éšæ€èšŒ : Atariç°å¢ãDMControlå¥ä»¶ãããã³æ£é¡ã²ãŒã ã«ãããŠææ³ã®æå¹æ§ãæ€èšŒããæ¬è«æã¯ãMCTSã¢ã«ãŽãªãºã ã®ãµã³ãã«å¹çãç¶æããªããããã®èšç®æéã®ãªãŒããŒãããã倧å¹
ã«åæžããæ¹æ³ãç ç©¶ãããå
¥åã¯MCTSã¢ã«ãŽãªãºã ã®è»è·¡ããŒã¿ã§ãããåºåã¯å éãããæ¢çŽ¢æŠç¥ãšäŸ¡å€æšå®ã§ããã
çè«çåºç€ : åè
ãã³ãã£ããã¢ãã«ã«çæ³ãåŸãŠãæšæ¢çŽ¢ã®ã«ãŒãããŒãããã³ãã£ãããšããŠãååããŒãã1ã€ã®ã¢ãŒã ãšããŠæ±ããããåããŒãã®çã®ç¶æ
䟡å€ãäºåã«ç¥ãããšãã§ããã°ããã®æ¢çŽ¢ã«è²»ããæéãç¯çŽã§ããã
å
·äœçãªå®è£
:
飿¥ããæéã¹ããã S^t_l ãš S^{t+1}_l ã«ã€ããŠ:
- S^{t+1}_l ãæ¢çŽ¢ããéãã«ãŒãããŒãäŸ¡å€ m^{t+1}_l ãååŸãã
- S^t_l ãæ¢çŽ¢ããéãS^{t+1}_l ã®äŸ¡å€ã m^{t+1}_l ã«åºå®ãã
åäœéžææŠç¥ :
a_root = argmax_a I^t_l(a)
ããã§ I^t_l(a) = {
UCBscore(S^t_l, a), if a â a^t_l
r^t_l + γm^{t+1}_l, if a = a^t_l
}
S^{t+1}_l ã«å¯Ÿå¿ããåäœãéžæããå Žåãäºåã«ä¿åããã䟡å€ãçŽæ¥äœ¿çšããéšåæšæ¢çŽ¢ãåé¿ããã
èšèšåæ© : åŸåèŠå³ååæã¯ããããããå°ããªãµããããã«åå²ããå¿
èŠãããã䞊ååã®å©ç¹ãäœäžãããå¯èœæ§ãããã
解決ç :
åéæ®µéã®æ¹å : MCTSéžæã§ã¯ãªããããªã·ãŒãããã¯ãŒã¯åºåããçŽæ¥åäœããµã³ããªã³ã°ããåšæçãªååæ : åå埩ã§å°ããããååæããã®ã§ã¯ãªããåºå®ãããèšç·ŽååŸ©åæ°åŸã«ãããã¡å
šäœãååæããå©ç¹ :
DQNã®åºå®ç®æšãããã¯ãŒã¯ã¡ã«ããºã ã«é¡äŒŒããããªã·ãŒç®æšã®æŽæ°é »åºŠãåæžãã ãã¹ãŠã®MCTSåŒã³åºããååæããã»ã¹ã«éäžãããå€§èŠæš¡ããã䞊ååã®å©ç¹ãååã«æŽ»çšãã ååæãšèšç·Žããã»ã¹ãåé¢ãããã倧ããªäžŠååã®äœå°ãæäŸãã å®ç1 : åŒ(2)ã®ä»®å®ãæºããéå®åžžãã³ãã£ããã«ã€ããŠãUCBå€è©äŸ¡ã§ã¯ãªããµã³ããªã³ã°æšå®ã䜿çšããŠç¹å®ã®ã¢ãŒã ãè©äŸ¡ããããšã§ãET_i(n) /n â 0 (n â â)ãä¿èšŒãããã
ãã®å®çã¯åŸåèŠå³ååæææ³ã®åææ§ã蚌æããããäœãåŸæäžçãæããããšã瀺ããŠãããã¢ã«ãŽãªãºã ãæé©ãªã¢ãŒã ãžã®ã¢ã¯ã»ã¹ååžã«å¯ŸããŠããéäžããå¯èœæ§ãããããšã瀺åããŠããã
Atariç°å¢ : 髿¬¡å
èŠèŠå
¥åãšé¢æ£åäœç©ºéãæã€26åã®ä»£è¡šçãªã²ãŒã DMControlå¥ä»¶ : ball_in_cup-catchãšwalker-standã®2ã€ã®é£ç¶å¶åŸ¡ã¿ã¹ã¯æ£é¡ã²ãŒã : Connect4ãšGomokuãç¹æ®ãªç¶æ
空éãæã€æŠç¥ã²ãŒã æéå¹ç : åãæ§èœã¬ãã«ã«éããããã«å¿
èŠãªèšç®æéãµã³ãã«å¹ç : æåããæŠç¥ã«éããããã«å¿
èŠãªç°å¢çžäºäœçšã®åæ°æ¢çŽ¢å é : åäžã®MCTSã®æéæ¶è²»ãšé¢æ°åŒã³åºãåæ°MuZero : ãªãªãžãã«ã®MuZeroã¢ã«ãŽãªãºã EfficientZero : æ¹åãããMuZeroå€äœReZero-M : ReZeroãçµ±åããMuZeroReZero-E : ReZeroãçµ±åããEfficientZeroãªãã¬ã€æ¯ç: 0.25 ååæé »åºŠ: 1 ããããµã€ãº: 256(Atari)ã64(DMControl) MCTS ã·ãã¥ã¬ãŒã·ã§ã³åæ°: 50 ããŒããŠã§ã¢: åäžã®NVIDIA A100 GPUã30åã®CPUã³ã¢ã120 GiBã¡ã¢ãª æéå¹çã®åäž :
ã»ãšãã©ã®ã²ãŒã ã«ãããŠãReZeroã¯ããŒã¹ã©ã€ã³ææ³ããã2ïœ4åå°ãªãèšç®æéãå¿
èŠãšãã Pongã²ãŒã : ReZero-M 1.0±0.1æé vs MuZero 4.0±0.5æé MsPacmanã²ãŒã : ReZero-M 1.4±0.2æé vs MuZero 6.9±0.3æé Connect4ã²ãŒã : ReZero-M 5.5±0.6æé vs MuZero 9.1±0.8æé ãµã³ãã«å¹çã®ç¶æ : ãã¹ãŠã®ãã¹ãç°å¢ã«ãããŠãReZeroã¯ããŒã¹ã©ã€ã³ææ³ãšåçãŸãã¯ãã以äžã®ãµã³ãã«å¹çãç¶æããŠããã
ååæé »åºŠ{0, 1/3, 1, 2}ã®å¹æããã¹ããã:
é©åãªååæé »åºŠã«ãããæ§èœãèããäœäžãããããšãªãæéãªãŒããŒããããç¯çŽã§ãã é »åºŠ1ã®å Žåãæéå¹çãšãµã³ãã«å¹çã®éã§æé©ãªãã©ã³ã¹ãéæããã 詳现ãªçµ±èšã¯ä»¥äžã瀺ããŠãã:
å¹³åæ¢çŽ¢æé: ReZero-M 0.69±0.02ms vs MuZero 1.08±0.09ms æšæ¢çŽ¢åŒã³åºãåæ°: ReZero-M 6089å vs MuZero 13284å åçã¢ãã«åŒã³åºã: ReZero-M 122å vs MuZero 256å ç©å
·ã±ãŒã¹ã®æ€èšŒ : 7Ã7ã°ãªããã¯ãŒã«ãã§ã®å®éšã«ãããéšåæšæ¢çŽ¢ãã¹ãããããå é广ãçŽæçã«ç€ºãããçµç¹ããé ãäœçœ®ã»ã©æ¢çŽ¢æéãé·ããã«ãŒãããŒã䟡å€è£å©ã䜿çšããåŸãæ¢çŽ¢æéã¯å
šè¬çã«åæžãããã
åŸåèŠå³ååæã¯åäžæ¢çŽ¢ã®é床ãåäžãããã ãã§ãªãããµã³ãã«å¹çãæ¹åãã å
šãããã¡ååæã¯MCTSåŒã³åºãåæ°ã广çã«åæžãã ææ³ã¯ç°ãªãã¿ã€ãã®æææ±ºå®ç°å¢ã«ãããŠäžè²«ããå é广ã瀺ã AlphaGo/AlphaZero : MCTSãšæ·±å±€åŒ·ååŠç¿ãçµã¿åãããæ£é¡ã²ãŒã ã§é©æ°çãªææãéæMuZero : æªç¥ã®ç°å¢ã¢ãã«ã®ã·ããªãªã«æ¡åŒµããé©çšç¯å²ãããåºããåŸç¶ã®æ¹å : EfficientZeroã¯ãµã³ãã«å¹çãåäžãããMuZero Unpluggedã¯ãªãã©ã€ã³èšå®ã«æ¡åŒµSpeedyZero : 䞊åã·ã¹ãã èšèšã«ããæéãªãŒããŒããããåæžããããããå€ãã®èšç®ãªãœãŒã¹ãå¿
èŠPTSAZero : ç¶æ
æœè±¡åã«ããæ¢çŽ¢ç©ºéãå§çž®KataGo : çŽ æŽãªæ
å ±åå©çšæå·§ã䜿çšããããæ¬è«æã®åŸåèŠå³ææ³ã¯æ ¹æ¬çã«ç°ãªãReZeroã¯MCTSã¢ã«ãŽãªãºã ã®èšç®æéãªãŒããŒããããé倧ã§ããåé¡ãæåè£ã«è§£æ±ºãã åŸåèŠå³ååæãšå
šãããã¡ååæã®çžä¹å¹æã«ãããæéå¹çã倧å¹
ã«åäžãã ææ³ã¯æ±çšçã§ãããæ§ã
ãªMCTSã¢ã«ãŽãªãºã å€äœã«é©çšå¯èœã§ãã ãµã³ãã«å¹çãç¶æããªããã2ïœ4åã®æéå éãå®çŸãã åäžãã·ã³èšå®ã®å¶é : çŸåšã®å®éšã¯äž»ã«åäžãã·ã³ç°å¢ã§å®æœãããŠããã忣èšç·Žã®æé©åã®äœå°ãæ®ãããŠããç°å¢ã«ãã¬ããž : é£ç¶å¶åŸ¡ç°å¢ã®å®éšã¯æ¯èŒçéå®çã§ãããããå
æ¬çãªãã³ãããŒã¯ãã¹ããå¿
èŠã§ããçè«çåæ : åææ§ã®èšŒæã¯æäŸãããŠããããå®éã®è€éãªç°å¢ã«ãããçè«çä¿èšŒã¯ãããªãç ç©¶ãå¿
èŠã§ãã忣æé©å : ReZeroãåæ£åŒ·ååŠç¿èšç·Žã«é©çšãããªãã©ã€ã³åŠç¿ : MuZero Unpluggedãšçµã¿åããããªãã©ã€ã³ããŒã¿ã»ããã§ã®å¿çšåºç€ã¢ãã« : RT-Xãªã©ã®å€§èŠæš¡ããŒã¿ã»ãããšçµã¿åãããæææ±ºå®åºç€ã¢ãã«ãæ§ç¯ããå éãµã³ããªã³ã° : ãããã¡å
ã®éšåãµã³ãã«ãåªå
çã«ååæããããåççãªæ¹æ³ã䜿çšãã驿°æ§ãé«ã : åŸåèŠå³ååæã¯MCTSå éã®æ°èŠãªæèã§ãããçè«çåºç€ãå
ç¢ã§ããå®çšäŸ¡å€ãé«ã : èããæéå é广ã¯MCTSã¢ã«ãŽãªãºã ã®å®éã®å¿çšã«éèŠãªæçŸ©ãæã€æ±çšæ§ãåªããŠãã : ãã¬ãŒã ã¯ãŒã¯èšèšã«ãããæ§ã
ãªMCTSã¢ã«ãŽãªãºã ã«ã·ãŒã ã¬ã¹ã«çµ±åå¯èœã§ããå®éšãå
åã§ãã : è€æ°ã®ç°å¢ã¿ã€ãã§ææ³ã®æå¹æ§ãæ€èšŒãã詳现ãªã¢ãã¬ãŒã·ã§ã³å®éšãå«ãçè«çåæã®æ·±ã : åææ§ã®èšŒæã¯æäŸãããŠããããè€éãªç°å¢ã«ãããçè«çä¿èšŒã¯ããã«åŒ·åãå¿
èŠã§ãã忣ã·ããªãª : è€æ°ãã·ã³è€æ°GPUç°å¢ã§ã®æ€èšŒãšæé©åãäžè¶³ããŠããé£ç¶å¶åŸ¡ : é£ç¶åäœç©ºéã§ã®å®éšã¯æ¯èŒçéå®çã§ããé·æçåœ±é¿ : èšç·Žã®å®å®æ§ãšæçµçãªæ§èœãžã®é·æçãªåœ±é¿ã«ã€ããŠãããªãåæãå¿
èŠã§ããåŠè¡çè²¢ç® : MCTSå éã«æ°ããç ç©¶æ¹åãæäŸããçè«ãšå®è·µã®äž¡é¢ãéèŠããŠããå®çšçäŸ¡å€ : MCTSã¢ã«ãŽãªãºã ã®å±éã«ãããéèŠãªããã«ããã¯åé¡ãçŽæ¥è§£æ±ºããåçŸæ§ : å®å
šãªãªãŒãã³ãœãŒã¹å®è£
ãæäŸããç ç©¶ã³ãã¥ããã£ã䜿çšããã³æ¡åŒµããããã²ãŒã AI : æ£é¡ã²ãŒã ããããªã²ãŒã ãªã©ãªã¢ã«ã¿ã€ã æææ±ºå®ãå¿
èŠãªã·ããªãªããããå¶åŸ¡ : ãªã³ã©ã€ã³èšç»ãå¿
èŠãªããããã¿ã¹ã¯èªåé転 : ãªã¢ã«ã¿ã€ã ãã¹èšç»ãšæææ±ºå®éèååŒ : é«é »åºŠååŒã«ãããé«éæææ±ºå®Schrittwieser, J., et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature , 588, 604-609. Silver, D., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815 . Ye, W., et al. (2021). Mastering atari games with limited data. Advances in Neural Information Processing Systems , 34, 25476-25488. Mei, Y., et al. (2023). Speedyzero: Mastering atari with limited data and time. ICLR 2023 . ç·åè©äŸ¡ : ããã¯é«å質ãªç ç©¶è«æã§ãããMCTSã¢ã«ãŽãªãºã ã®å®éã®å±éããã«ããã¯ã«å¯ŸããŠé©æ°çã§å®çšçãªè§£æ±ºçãææ¡ããŠãããææ³èšèšã¯å·§åŠã§ãããçè«çåºç€ã¯å
ç¢ã§ãããå®éšæ€èšŒã¯å
åã§ãããMCTSã¢ã«ãŽãªãºã ã®å®éã®å¿çšã«ãããæ®åãæšé²ããäžã§éèŠãªäŸ¡å€ãæã€ã