Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.
è«æID : 2510.13476ã¿ã€ãã« : Towards Blackwell Optimality: Bellman Optimality Is All You Can Getèè
: Victor Boone (ã°ã«ããŒãã«ã»ã¢ã«ãã¹å€§åŠãInriaãCNRSãã°ã«ããŒãã«INPãLIG & IRITããã¥ãŒã«ãŒãºå€§åŠ)ãAdrienne Tuynman (ãªãŒã«å€§åŠãInriaãCNRSãã»ã³ãã©ã«ã»ãªãŒã«ãUMR 9189-CRIStAL)åé¡ : cs.LG (æ©æ¢°åŠç¿)çºè¡šæ¥ : 2025幎10æ15æ¥ (arXiv ãã¬ããªã³ã)è«æãªã³ã¯ : https://arxiv.org/abs/2510.13476v1 å¹³åå©åŸæé©æ§ã¯ãã«ã³ã決å®éçš(MDP)ã«ãããäžè¬çãªæ§èœææšã§ããããã®æ§è³ªã¯æ¬è³ªçã«æŒžè¿çã§ãã峿æå€±ã®æž¬å®ãçµã¿åãããããšã§ãåå·®æé©æ§ã®éå±€æ§é ãçãããã©ãã¯ãŠã§ã«æé©æ§ã«è³ããŸããæ¬è«æã¯ããã®ãããªæé©æ§éæ°ã®æ¿çãèå¥ããåé¡ãç ç©¶ããŠããŸãããã®ç®çã®ãããåéæ°ã«å¯ŸããŠãæ¶å€±ãã誀ã確çãæã€åŠç¿ã¢ã«ãŽãªãºã ãæ§ç¯ããŸãããããã«ãèå¥ã¢ã«ãŽãªãºã ãæéæéå
ã«åæ¢ã§ããMDPã®ã¯ã©ã¹ãç¹æ§åããŸããããã®ã¯ã©ã¹ã¯ãå¯äžã®ãã«ãã³æé©æ¿çãæã€MDPã«å¯Ÿå¿ããèæ
®ãããæé©æ§éæ°ã«äŸåããŸãããæåŸã«ãæã
ã®åŠç¿ã¢ã«ãŽãªãºã ãšçµã¿åãããå Žåãå¯èœãªéãæéæéå
ã«çºåããæ±ãããã忢èŠåãæäŸããŸãã
æ¬è«æã®ç ç©¶ã®äžæ žãšãªãåé¡ã¯ããã«ã³ã決å®éçšã«ãããé«éæé©æ¿çã®åŠç¿å¯è奿§ã®åé¡ã§ããå
·äœçã«ã¯ïŒ
åŸæ¥ã®å¹³åå©åŸæé©æ§ã®éç ïŒMDPã«ãããŠãåŸæ¥ã®å¹³åå©åŸæé©æ§ã¯é·æçãªæŒžè¿æ§èœã®ã¿ã«çŠç¹ãåœãŠãçæçãªå³æå ±é
¬ã®éèŠæ§ãç¡èŠããŠããŸããé«éæé©æ§ã®éå±€æ§é ïŒå©åŸæé©æ§ãããã€ã¢ã¹æé©æ§ãçµç±ããŠãã©ãã¯ãŠã§ã«æé©æ§ã«è³ããŸã§ãå®å
šãªæé©æ§éå±€æ§é ã圢æãããåéå±€ã¯ãã现ããæ§èœå·®ç°ãèæ
®ããŠããŸããåŠç¿ã®å°é£æ§ ïŒè«æã¯åçŽãªäŸ(å³1)ãéããŠãæãåçŽãªå Žåã§ãããé«éæé©æ¿çã®åŠç¿ãæ ¹æ¬çãªå°é£ã«çŽé¢ããŠããããšã瀺ããŠããŸããè«æã®äžæ žçãªåæ©ã¯ãéèŠãªèгå¯ã«ç±æ¥ããŠããŸãïŒåäžã®æªç¥ãã©ã¡ãŒã¿ã®ã¿ãæã€åçŽãªMDPã§ãããé«éæé©æ¿çã®åŠç¿ã¯äžå¯èœã§ããå¯èœæ§ããããŸã ããã®äžèŠççŸããçŸè±¡ã¯ãé«éæé©æ§åŠç¿ã®æ¬è³ªçãªå°é£æ§ãæããã«ããŠããŸãã
æšæºåŠç¿ææ³ã®å€±å¹ ïŒåŸæ¥ã®çµéšçæé©æ¿çéžæã¯é«éæé©æ§ã§ã¯é©çšäžå¯èœã«ãªããŸãçµ±èšæ€å®ã®éç ïŒãã©ã¡ãŒã¿ã®æ£ç¢ºãªå€(äŸãã°Îž=0)ãçµ±èšæ€å®ã§æ±ºå®ããããšã¯ã§ããŸããäžé£ç¶æ§ã®åé¡ ïŒãã©ã¡ãŒã¿ç©ºéã«ãããæé©æ¿çéåã®äžé£ç¶æ§ãåŠç¿ã®å°é£æ§ããããããŸãäžè²«æ§ã¢ã«ãŽãªãºã HOPEã®æ§ç¯ ïŒåæé©æ§éæ°nâ¥-1ã«å¯ŸããŠãæ¶å€±ãã誀ã確çãæã€HOPE(Higher Order Policy iteration Epsilonized)ã¢ã«ãŽãªãºã ãææ¡ããŸãããééåMDPã¯ã©ã¹ã®å®å
šãªç¹æ§å ïŒMDPãééåã§ãã(ããªãã¡ãèå¥ã¢ã«ãŽãªãºã ãæéæéå
ã«åæ¢ã§ãã)ããšã¯ãåœäžã€ã®ã¿åœè©²MDPãå¯äžã®ãã«ãã³æé©æ¿çãæã€ããšãšåå€ã§ããããšã蚌æããŸãããéåæ§åŽ©å£å®ç ïŒãã¹ãŠã®æé©æ§éæ°(å©åŸæé©æ§ãããã©ãã¯ãŠã§ã«æé©æ§ãŸã§)ã«ãããééåMDPã®ã¯ã©ã¹ãå®å
šã«åäžã§ããããšã蚌æããŸãããããã¯é©ãã¹ãçµæã§ããèšç®å¯èœãªåæ¢èŠåã®æäŸ ïŒHOPEã¢ã«ãŽãªãºã ãå¯èœãªå Žåã«æéæéå
ã«åæ¢ã§ããããã«ããæ±ãããã忢èŠåãæäŸããŸãããå
¥å ïŒæªç¥ã®éä¿¡MDP M = (Z, Μ, p)ãããã§Zã¯ç¶æ
-è¡å察空éãΜã¯å ±é
¬é¢æ°ãpã¯é·ç§»æ žã§ã
åºå ïŒnéæé©æ¿çÏ â Î *_n(M)
ç®æš ïŒæšå¥šæ¿çã®èª€ã確çã0ã«åæãããããªèå¥ã¢ã«ãŽãªãºã ãæ§ç¯ããããš
å
¥åïŒææã®æé©æ§éæ° n ⥠-1
åæåïŒt â 0
ã«ãŒãïŒ
1. çµéšçMDP MÌ_t ãæ§ç¯
2. ε â t^(-1/4) ãèšå®
3. â_s A_n(s) â HOPI_{n,ε}(MÌ_t) ãèšç®
4. â_s A_n(s) å
ã®ä»»æã®æ¿çãæšå¥š
5. è¡åãåäžã«ãµã³ããªã³ã°ããå ±é
¬ãšæ¬¡ç¶æ
ã芳å¯
6. t â t + 1
HOPIã¯é«éæ¿çå埩ã®ãã€ãã·ãã³åãçã§ãããäž»èŠãªé©æ°ã¯ä»¥äžã®éãã§ãïŒ
ãœããargmaxæäœ ïŒæ£ç¢ºãªãã«ãã³æ¹çšåŒå¶çŽ Î^Ï_m(s,a) = 0 ã Î^Ï_m(s,a) †ε ã«ç·©åäºæ®µéæ¿çæ¹å ïŒç¬¬äžæ®µéïŒæ¢ç¥ã®m-1éæé©è¡åã®äžããm+1éæ¹åãæ¢çŽ¢ ç¬¬äºæ®µéïŒããã«m+2éæé©æ§ã«çްåå æ®µéç粟床å¶åŸ¡ ïŒÎµ_t = t^(-1/4) ãéžæããŠã¢ã«ãŽãªãºã ã®äžè²«æ§ã確ä¿åŸæ¥ã®æ¿çå埩ã¯ãã«ãã³æ¹çšåŒã®æ£ç¢ºãªæºè¶³ãèŠæ±ããŸãããåŠç¿ç°å¢ã§ã¯ããã¯äžå¯èœã§ããHOPIã¯ç·©åãã©ã¡ãŒã¿Îµãå°å
¥ããããšã§ããã€ãºç°å¢äžã§ã¢ã«ãŽãªãºã ãæ£ããæ©èœããããšãå¯èœã«ããŸãã
åœé¡5 ïŒä»»æã®åäžéãã«ãã³æé©æ¿çÏãšç²ŸåºŠÎµ > 0ã«å¯ŸããŠãM' â M ãååšã㊠d_â(M,M') < ε ãã€Ïã¯M'ã®å¯äžã®å©åŸæé©æ¿çã§ãã
ãã®æè¡ã¯äºæ®µéã§å®çŸãããŸãïŒ
æåã«éæé©è¡åã«ããã«ãã£ã課ãããšã§Ïãå¯äžã®ãã«ãã³æé©æ¿çã«ãã ãã®åŸãéæŽå€æã«ãã£ãŠÏãå¯äžã®å©åŸæé©æ¿çã«ãã éŸå€é¢æ°ãå®çŸ©ããŸãïŒ
β(M) := min{d_min(Î*)/((1+4α)(2+sp(b*))), 1/α}
ããã§Î±ã¯ææªå°éæéã§ããããã®éŸå€ã¯è¿åååŸã®æ£ç¢ºãªæž¬å®å€ãæäŸããŸãã
ãã¹ãŠã®n ⥠-1ã«å¯ŸããŠãHOPE(n)ã¢ã«ãŽãªãºã ã¯Î *_nã«å¯ŸããŠäžè²«ããŠããŸãã
n ⥠-1ãšããŸããMDPMãÎ *_n(M)ã«å¯ŸããŠééåã§ããããšã¯ãåœäžã€ã®ã¿åœè©²MDPãå¯äžã®ãã«ãã³æé©æ¿çãæã€ããšãšåå€ã§ãã
Î _{-1}, Î 0, Î _1, ..., Î âã«å¯Ÿããéåã¢ãã«éåã¯å®å
šã«åäžã§ãã
Mãééåã§ããã°ãæ¿çÏ â Î *(M)ãååšããŠMã®è¿åå
ã§æé©æ§ãä¿æããŸãã蚌æã¯ã¢ã«ãŽãªãºã 決å®ã®é£ç¶æ§ã䜿çšããŠããŸãã
æç€ºçãªåæ¢èŠåãšä¿¡é Œåºéãæ§ç¯ããããšã§ãå¯äžã®ãã«ãã³æé©æ¿çãæã€MDPãå®éã«ééåã§ããããšã蚌æããŸãã
æ¬è«æã¯äž»ã«çè«çç ç©¶ã§ãããå³å¯ãªæ°åŠç蚌æãéããŠãã¹ãŠã®äž»èŠãªçµæãæ€èšŒããŠããŸããäž»èŠãªæ€èšŒã«ã¯ä»¥äžãå«ãŸããŸãïŒ
ã¢ã«ãŽãªãºã ã®æ£ç¢ºæ§ ïŒHOPIããã€ãºã®ãªãå Žåã«æ£ããæé©æ¿çéåãè¿ãããšã蚌æäžè²«æ§ä¿èšŒ ïŒHOPEã¢ã«ãŽãªãºã ã®èª€ã確çãå®éã«0ã«åæããããšã蚌æåæ¢èŠåã®æå¹æ§ ïŒææ¡ããã忢èŠåãå®éã«æéæéå
ã«çºåããããšã蚌ææéèšç®é ïŒãã«ãã³æé©æ¿çã®å¯äžæ§ãå€å®ããæéèšç®éã¯O(|Z| + |S|^4)ãµã³ãã«èšç®é ïŒè«æã¯æ£ç¢ºãªãµã³ãã«èšç®éçãæäŸããŠããŸããããééåã®å Žåã«ã¢ã«ãŽãªãºã ãå¿
ãåæããããšã蚌æããŠããŸãå€è
ãã³ãã£ããåé¡ã«ãããæé©è
èå¥åé¡ãšé¢é£ããŠããŸãããMDPèšå®ã«ãããç¶æ
äŸåæ§ã«ããåé¡ãããè€éã«ãªããŸãã
(ε,ÎŽ)-PACèšå®ã«ãããå©åŸæé©æ¿çã®èå¥ã«é¢ããæè¿ã®ç ç©¶ã§ããããããã®ç ç©¶ã¯äž»ã«è¿äŒŒæé©æ§ã§ã¯ãªãæ£ç¢ºãªæé©æ§ã«çŠç¹ãåœãŠãŠããŸãã
Grand-Clément ãš Petrik (2023)ãªã©ã«ããå²åŒå åã®å±¥æŽå®çŸ©ã«åºã¥ããã©ãã¯ãŠã§ã«æé©æ¿çã®èšç®è€éæ§ã®ç ç©¶ã
Boone ãš Gaujal (2023)ã¯æ±ºå®è«çé·ç§»MDPã«ããããã©ãã¯ãŠã§ã«æé©æ¿çã®åŠç¿å¯èœæ§ãç ç©¶ããæ¬è«æã¯ããã確ççãªå Žåã«äžè¬åããŠããŸãã
ãã«ãã³æé©æ§ã¯æ ¹æ¬çãªå¶é ïŒãã¹ãŠã®é«éæé©æ§ã®åŠç¿å¯èœæ§ã¯ãã«ãã³æé©æ§ã®å¯äžæ§ã«åž°çããŸãéåæ§ã®æ®éæ§ ïŒç°ãªãæé©æ§éæ°ã®éåMDPã®ã¯ã©ã¹ã¯å®å
šã«åäžã§ãã¢ã«ãŽãªãºã ã®ååšæ§ ïŒå°é£ã«çŽé¢ããŠããã«ãããããããäžè²«æ§ã¢ã«ãŽãªãºã ã¯å®éã«ååšããŸãPACèšå®ã®æ¬ èœ ïŒè«æã¯äž»ã«æ£ç¢ºãªæé©æ§ã«çŠç¹ãåœãŠãŠãããè¿äŒŒæé©æ§ã®åŠç¿ã«ã¯å¯Ÿå¿ããŠããŸãããµã³ãã«èšç®éç ïŒæ£ç¢ºãªãµã³ãã«èšç®éåæãæäŸãããŠããŸããå®çšçå¿çš ïŒçè«ççµæã¯å®éã®åé¡ãžã®å¿çšãå¶éããå¯èœæ§ããããŸãPACæ çµã¿ã®æ¡åŒµ ïŒè¿äŒŒæé©æ¿çã®åŠç¿ãç ç©¶ããããšãµã³ãã«èšç®éåæ ïŒæ£ç¢ºãªãµã³ãã«èšç®éçã確ç«ããããšã¢ã«ãŽãªãºã ã®æé©å ïŒHOPEã¢ã«ãŽãªãºã ã®å®çšçæ§èœãæ¹åããããšçè«çæ·±ã ïŒé«éæé©æ§åŠç¿åé¡ã®å®å
šãªçè«çç¹æ§åãæäŸçµæã®æå€æ§ ïŒéåæ§åŽ©å£å®çã¯é©ãã¹ããã€æ·±ãçµæã§ãæè¡ç驿° ïŒç Žç æè¡ãšãœããargmaxæ¿çååŸ©ã¯æè¡ç驿°æ§ã瀺ããŠããŸãæç¢ºãªèšè¿° ïŒè«æã®æ§é ã¯æç¢ºã§ãæ°åŠç蚌æã¯å³å¯ã§ãå®çšæ§ã®å¶é ïŒçè«ççµæã¯ãã»ãšãã©ã®å®çšçMDPãéåããŠããå¯èœæ§ãããããšã瀺åããŠããŸãèšç®è€éæ§ ïŒå€é
åŒæéã¢ã«ãŽãªãºã ãæäŸãããŠããŸããã宿°ã倧ããå¯èœæ§ããããŸãå®éšæ€èšŒã®æ¬ èœ ïŒçŽç²ãªçè«çç ç©¶ãšããŠãå®éšçæ€èšŒãæ¬ ããŠããŸãçè«çè²¢ç® ïŒåŒ·ååŠç¿çè«ã«éèŠãªè² ã®çµæããããããŸãæ¹æ³ç瀺å ïŒç Žç æè¡ã¯ä»ã®åé¡ã§ã®å¿çšã®å¯èœæ§ããããŸãç ç©¶æ¹å ïŒåŸç¶ç ç©¶ã«PACèšå®ã®éèŠæ§ãæã瀺ããŸãçè«ç ç©¶ ïŒåŒ·ååŠç¿çè«ç ç©¶ã«éèŠãªåèãæäŸã¢ã«ãŽãªãºã èšèš ïŒå®çšçãªæ¿çåŠç¿ã¢ã«ãŽãªãºã ã®èšèšã«çè«çæå°ãæäŸåé¡åæ ïŒMDPæ§é ãåŠç¿é£åºŠã«äžãã圱é¿ã®çè§£ãæ¯æŽè«æã¯åŒ·ååŠç¿ãšãã«ã³ã決å®éçšåéã®éèŠãªæç®ãåŒçšããŠããã以äžãå«ã¿ãŸãïŒ
Puterman (1994): ãã«ã³ã決å®éçšã®å€å
žçæç§æž Blackwell (1962): ãã©ãã¯ãŠã§ã«æé©æ§ã®åå§å®çŸ© Kaufmann et al. (2016): æé©è
èå¥ã®è€éæ§çè« Boone and Gaujal (2023): 決å®è«çMDPã«ããããã©ãã¯ãŠã§ã«æé©æ§ã®åŠç¿å¯èœæ§ æ¬è«æã¯å³å¯ãªçè«åæãéããŠã匷ååŠç¿ã«ãããåºæ¬çãã€æ·±ãçŸè±¡ãæããã«ããŠããŸãïŒé«éæé©æ§ã®åŠç¿é£åºŠã¯å®å
šã«ãã«ãã³æé©æ§ã®æ§é ã«ãã£ãŠæ±ºå®ããããšããããšã§ãããã®çµæã¯éèŠãªçè«çæçŸ©ãæã€ã ãã§ãªããå®çšçãªã¢ã«ãŽãªãºã èšèšã«ãéèŠãªæå°ãæäŸããŸãã