2025-11-19T05:19:13.941336

Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals

Sorstkins, Tariq, Bilal

This paper proposes a reversible learning framework to improve the robustness and efficiency of value-based reinforcement learning agents, addressing their vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition reversibility measure, Φ(s, a), and a selective state rollback operation. Φ is an online, per-state-action estimator that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure dynamically adjusts the penalty term during temporal-difference updates, integrating reversibility awareness directly into the value function. The selective rollback operator acts when an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold: the agent is penalized and returns to the preceding state rather than progressing. This interrupts suboptimal, high-risk trajectories and avoids catastrophic steps. By combining reversibility-aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking-v0 domain, the framework reduced catastrophic falls by over 99.8% and yielded a 55% increase in mean episode return. In the Taxi-v3 domain, it suppressed illegal actions by ≥99.9% and achieved a 65.7% improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision-making.

academic

Basic information

Paper ID: 2510.14503
Title: Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals
Authors: Andrejs Sorstkins¹, Omer Tariq², Muhammad Bilal¹
Category: cs.LG
Published: October 17, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.14503

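Summary

The paper proposes a reversible learning framework that aims to improve the robustness and efficiency of value-based reinforcement learning agents, addressing value overestimation and instability in partially irreversible environments. The framework comprises two complementary core mechanisms: an empirically driven transition reversibility measure Φ(s, a) and a selective state rollback operation. In the CliffWalking-v0 environment, the framework reduced catastrophic falls by more than 99.8% and raised mean episode return by 55%. In the Taxi-v3 environment, illegal actions were suppressed by ≥99.9% and cumulative reward improved by 65.7%, with reward variance markedly reduced in both environments.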

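Background and motivation

Core problems

Value overestimation: Q-function overestimation, which is pervasive in deep reinforcement learning, leads agents to favor statistically spurious or low-probability trajectories, causing oscillating policy updates and longer convergence times.
Safety in irreversible environments: in safety-critical applications (such as autonomous driving, robotic surgery, and medical treatment planning), irreversible errors can have catastrophic consequences.
Limitations of existing methods: conventional remedies for Q-value overestimation (such as Double Q-learning and Conservative Q-learning) typically come at the cost of higher computational cost and sample complexity.

Research motivation

Reversibility in the human cognitive architecture underpins deliberate decision-making and adaptive learning. Humans habitually weigh both the immediate reward of an action and the degree to which later steps can reverse or offset it. The paper embeds this ability to "undo" suboptimal decisions into the reinforcement learning framework, targeting a broad range of safety-critical applications.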

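Core contributions

Scalable model-free reversibility estimator: an online, per-state-action reversibility estimator Φ(s, a) that avoids training a classifier.
Explicit rollback operation: an explicit rollback operation integrated into tabular Q-learning and SARSA updates.
Principled coupling mechanism: Φ-based shaping and selective rollback are combined in a principled way, bounding downside risk without suppressing exploration.
Comprehensive evaluation: extensive evaluation, sensitivity analysis, and ablation experiments identify the components that matter for safety and performance.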

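Method

Task definition

In a Markov decision process (S, A, P, R, γ), the agent in state s ∈ S selects an action a ∈ A, receives a reward r, and transitions to s' ~ P(·|s, a). The goal is to learn the optimal action-value function Q*(s, a) while ensuring safety in partially irreversible environments.

Model architecture

1. Empirical reversibility estimator

The reversibility estimate is maintained through a FIFO structure:

For each observed transition (s_t, a_t) → s_{t+1}, a record (s0, a0, d) is pushed onto the FIFO list L, where (s0, a0) is the state-action pair the transition started from.
d = t + K is the deadline by which the agent must have returned to s0.
The reversibility table is updated with an exponential moving average (EMA):

Φ[s0, a0] ← (1 - α_φ) Φ[s0, a0] + α_φ · y

where y ∈ {0, 1} indicates whether the agent returned to the original state within K steps.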
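
The FIFO bookkeeping and EMA update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ReversibilityEstimator`, the initial value `phi_init`, the default `alpha_phi`, the `reset_episode` helper, and the choice to count an immediate self-loop as a return are all hypothetical.

```python
from collections import defaultdict, deque


class ReversibilityEstimator:
    """Online per-(state, action) reversibility table (illustrative sketch)."""

    def __init__(self, horizon_k, alpha_phi=0.1, phi_init=0.5):
        self.horizon_k = horizon_k      # K: steps allowed for returning to s0
        self.alpha_phi = alpha_phi      # EMA step size (assumed default)
        # phi_init is an assumption; the source does not state an initial value.
        self.phi = defaultdict(lambda: phi_init)
        self.pending = deque()          # FIFO of (s0, a0, deadline) records
        self.t = 0                      # time index of the current state

    def _ema_update(self, s0, a0, y):
        # Phi[s0, a0] <- (1 - alpha_phi) * Phi[s0, a0] + alpha_phi * y
        old = self.phi[(s0, a0)]
        self.phi[(s0, a0)] = (1.0 - self.alpha_phi) * old + self.alpha_phi * y

    def observe(self, s_t, a_t, s_next):
        """Record the transition (s_t, a_t) -> s_next and resolve old records."""
        # Push the new record with deadline d = t + K.
        self.pending.append((s_t, a_t, self.t + self.horizon_k))
        self.t += 1                     # s_next lives at time t + 1
        unresolved = deque()
        for s0, a0, deadline in self.pending:
            if s_next == s0:
                self._ema_update(s0, a0, 1.0)   # returned in time: y = 1
            elif self.t >= deadline:
                self._ema_update(s0, a0, 0.0)   # deadline passed: y = 0
            else:
                unresolved.append((s0, a0, deadline))
        self.pending = unresolved

    def reset_episode(self):
        # Assumption: unresolved records are dropped at an episode boundary.
        self.pending.clear()
        self.t = 0
```

Because each record is resolved either on the first return to s0 or at its deadline, every transition contributes exactly one EMA update, and the FIFO never holds more than K records.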

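2. TD learning with a penalty mechanism

The penalized reward is formed as:

r' = r - λ(1 - Φ[s_t, a_t])

The modified TD errors are:

Q-learning: δ = r' + γ max_a' Q(s_{t+1}, a') - Q(s_t, a_t)
SARSA: δ = r' + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)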
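
To make the shaping concrete, the sketch below computes the penalized reward and both TD errors for a tabular Q stored as a dictionary. The function names, the dictionary representation, and the learning rate `alpha` in `td_update` are illustrative assumptions; terminal-state handling is omitted because the summary does not describe it.

```python
from collections import defaultdict


def shaped_reward(r, phi_sa, lam):
    """r' = r - lambda * (1 - Phi[s_t, a_t])."""
    return r - lam * (1.0 - phi_sa)


def q_learning_delta(Q, s, a, r_shaped, s_next, actions, gamma):
    """delta = r' + gamma * max_a' Q(s_next, a') - Q(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    return r_shaped + gamma * best_next - Q[(s, a)]


def sarsa_delta(Q, s, a, r_shaped, s_next, a_next, gamma):
    """delta = r' + gamma * Q(s_next, a_next) - Q(s, a)."""
    return r_shaped + gamma * Q[(s_next, a_next)] - Q[(s, a)]


def td_update(Q, s, a, delta, alpha):
    """Standard tabular step: Q(s, a) <- Q(s, a) + alpha * delta."""
    Q[(s, a)] += alpha * delta


# Example: a fully reversible pair (Phi = 1) receives no penalty,
# while Phi = 0 receives the full penalty lambda.
Q = defaultdict(float)
r_prime = shaped_reward(r=-1.0, phi_sa=0.5, lam=2.0)
delta = q_learning_delta(Q, s=0, a=1, r_shaped=r_prime, s_next=1,
                         actions=range(4), gamma=0.9)
td_update(Q, 0, 1, delta, alpha=0.1)
```

Since Φ lies in [0, 1], the penalty λ(1 - Φ) is bounded between 0 and λ, so λ directly sets the maximum cost attached to a transition that has never been observed to reverse.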

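3. Rollback operation

A rollback is executed when the threshold condition is triggered:

s_next = {
  s_t,       if the threshold is violated
  s_{t+1},   otherwise
}

The threshold condition is defined as: target value ≤ T · Q(s_t, a_t)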
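
The rollback rule reduces to a single comparison. The sketch below assumes that "target value" means the TD target computed for the transition, which the summary does not spell out. The function names are hypothetical, and the extra penalty applied on rollback (mentioned in the abstract) is left out because its form is not given here.

```python
def rollback_triggered(target, q_sa, threshold):
    """Threshold condition from the text: target <= T * Q(s_t, a_t)."""
    return target <= threshold * q_sa


def next_state(s_t, s_next, target, q_sa, threshold):
    """Return s_t (undo the step) if the threshold is violated, else s_next."""
    if rollback_triggered(target, q_sa, threshold):
        return s_t
    return s_next


# Example with negative values, as in a step-penalty grid world:
# a target of -100 against Q(s_t, a_t) = -10 and T = 2 triggers rollback.
state_after = next_state("s_t", "cliff", target=-100.0, q_sa=-10.0, threshold=2.0)
```

Note that when Q-values are negative, T > 1 makes T · Q(s_t, a_t) more negative than Q(s_t, a_t), so larger T makes the rule more permissive. With positive Q-values the direction is reversed.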

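Technical innovations

Lightweight reversibility estimation: a FIFO-based empirical estimate replaces classifier-based precedence estimation, avoiding policy-specific overfitting.
Localized penalties: the per-state-action Φ produces localized penalties rather than a global threshold.
Explicit undo mechanism: an actionable recovery primitive that immediately undoes a harmful step when a high-risk transition is detected.
Adaptive time window: the parameter K controls the time horizon, capturing short-term or long-term reversibility without retraining.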

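Experimental setup

Datasets

Two classic tabular "toy-text" environments from Gymnasium v1.2.0 are used:

CliffWalking-v0: 4×12 grid, deterministic environment
- Observation space: 48 discrete states
- Action space: 4 discrete moves
- Cliff penalty: -100; regular step: -1
Taxi-v3: 5×5 grid, taxi pick-up and drop-off task
- Observation space: 500 states
- Action space: 6 actions
- Illegal action penalty: -10; successful drop-off: +20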
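
The state counts above follow from how the two environments flatten their grids into integer observations. The helpers below reproduce that indexing as documented for Gymnasium's toy-text environments. The function names are hypothetical, and this is an illustration of the encoding rather than a call into the library.

```python
def cliff_state(row, col, n_cols=12):
    """CliffWalking observation index: row * n_cols + col on the 4x12 grid."""
    return row * n_cols + col


def taxi_state(taxi_row, taxi_col, passenger_loc, destination):
    """Taxi observation index.

    25 taxi positions x 5 passenger locations (4 stops + in taxi)
    x 4 destinations = 500 states.
    """
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


start = cliff_state(3, 0)    # bottom-left start cell
goal = cliff_state(3, 11)    # bottom-right goal cell
```

A tabular Q or Φ for these tasks can therefore be stored as a 48×4 array for CliffWalking and a 500×6 array for Taxi.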