Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control

Ayabe, Kera, Kawamoto

Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.

Basic Information

Paper ID: 2510.13358
Title: Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
Authors: Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto (Chiba University)
Categories: cs.RO (Robotics), cs.AI (Artificial Intelligence)
Published: October 15, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.13358

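Abstract

Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study proposes an offline-to-online framework that first trains policies on clean data and then performs adversarial fine-tuning, injecting perturbations into the executed actions to induce compensatory behavior and improve robustness. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks show that the proposed method consistently outperforms offline-only baselines in robustness and converges faster than training from scratch.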
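The interplay between perturbation injection and the performance-aware curriculum can be illustrated with a minimal sketch. This is not the authors' implementation: the class `EmaCurriculum`, the function `perturb_action`, the threshold rule, and all parameter values (`target_return`, `alpha`, `step`) are hypothetical, and a uniformly random action stands in for the adversarial perturbation, whose exact form is not given in this summary.

```python
import random


class EmaCurriculum:
    """Hypothetical performance-aware schedule for the perturbation probability."""

    def __init__(self, target_return, alpha=0.1, step=0.05, p_init=0.0, p_max=1.0):
        self.target_return = target_return  # assumed reference for nominal performance
        self.alpha = alpha                  # assumed EMA smoothing factor
        self.step = step                    # assumed probability change per update
        self.p = p_init                     # current perturbation probability
        self.p_max = p_max
        self.ema = None                     # smoothed episode return

    def update(self, episode_return):
        # Exponential moving average of the episode returns seen so far.
        if self.ema is None:
            self.ema = episode_return
        else:
            self.ema = self.alpha * episode_return + (1 - self.alpha) * self.ema
        # Assumed rule: raise the probability while smoothed performance
        # stays at or above the reference, lower it otherwise.
        if self.ema >= self.target_return:
            self.p = min(self.p_max, self.p + self.step)
        else:
            self.p = max(0.0, self.p - self.step)
        return self.p


def perturb_action(action, p, rng, low=-1.0, high=1.0):
    """With probability p, replace the executed action by a random in-bounds action.

    The random replacement is a stand-in for an adversarial perturbation.
    """
    if rng.random() < p:
        return [rng.uniform(low, high) for _ in action]
    return list(action)
```

In a fine-tuning loop, `perturb_action` would be applied to the policy output before each environment step, and `update` would be called once per episode with that episode's return. A fixed linear schedule would instead increase `p` regardless of performance, which is the behavior the adaptive curriculum is meant to temper.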

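Research Background and Motivation

Core Problem

The core problem this study addresses is the brittleness of offline reinforcement learning policies under action-space perturbations. Specifically:

Limitations of offline RL: Although offline reinforcement learning avoids the risk and cost of online interaction, the policies it produces are brittle under action-space perturbations such as actuator faults and action noise.
The fundamental conflict between conservatism and robustness: The authors' key insight is that conservative offline RL methods are fundamentally incompatible with action-space robustness. Conservative methods constrain the policy to stay within the dataset's action distribution to prevent extrapolation error, but robustness to action perturbations requires learning from exactly the out-of-distribution samples that this constraint rules out.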