
Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control

Ayabe, Kera, Kawamoto

Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.



Basic Information

Paper ID: 2510.13358
Title: Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
Authors: Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto (Chiba University)
Categories: cs.RO (Robotics), cs.AI (Artificial Intelligence)
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13358

Abstract

Offline reinforcement learning enables sample-efficient policy acquisition without risky online interactions; however, policies trained on static datasets remain vulnerable to action space perturbations such as actuator failures. This study proposes an offline-to-online framework that first trains a policy on clean data, then performs adversarial fine-tuning by injecting perturbations into executed actions to induce compensatory behaviors and improve robustness. A performance-aware curriculum further balances robustness and stability throughout training by adjusting perturbation probability via exponential moving average signals. Experiments on continuous control locomotion tasks demonstrate that the proposed method consistently outperforms offline-only baselines in robustness and converges faster than training from scratch.

Research Background and Motivation

Core Problem

The core problem addressed by this research is the vulnerability of offline reinforcement learning policies to action space perturbations. Specifically:

Limitations of Offline RL: While offline reinforcement learning avoids the risks and costs of online interactions, the trained policies exhibit fragility when facing action space perturbations such as actuator failures and action noise.
Fundamental Conflict Between Conservatism and Robustness: The authors argue that conservative offline RL methods are fundamentally incompatible with action space robustness. Conservative methods constrain the policy to stay within the dataset's action distribution to prevent extrapolation errors, yet robustness to action perturbations requires learning from precisely the out-of-distribution samples that this constraint excludes.

Problem Significance

Safety-Critical Applications: In safety-critical domains such as healthcare, energy management, and robot control, policies must handle unexpected perturbations
Real-World Deployment Requirements: Actuator failures and action noise are inevitable in real robotic systems
Theory-Practice Gap: Existing offline RL methods primarily focus on state space perturbations, with insufficient research on action space perturbations

Limitations of Existing Methods

Offline RL Conservative Constraints: Methods such as TD3+BC constrain policies close to the dataset distribution through behavior cloning losses, limiting adaptability
Lack of Perturbation Data: Offline datasets typically lack perturbation-aware transitions, making it difficult to assess policy effectiveness under perturbations
State vs. Action Perturbations: Existing robustness research primarily addresses state perturbations (sensor noise), with limited work on action perturbations

Core Contributions

Proposes Adversarial Fine-tuning Method: Injects perturbations during online training to achieve targeted adaptation to action perturbations while maintaining sample efficiency from offline pretraining
Demonstrates Consistently Superior Robustness: Adversarial fine-tuning consistently outperforms the offline-only and fully online baselines under perturbed evaluation
Designs Adaptive Curriculum Strategy: Adjusts the perturbation probability based on policy performance, which prevents overfitting to adversarial conditions, keeps training stable, and addresses key limitations of fixed schedules
Provides Theoretical Insights: Formalizes the analysis of the fundamental incompatibility between conservative offline RL and action space robustness

Methodology Details

Task Definition

Objective: Find the optimal robust policy under action space perturbations: $\pi^* = \arg\max_\pi \min_{\tilde{a} \in U} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \tilde{a}_t)\right]$

where $\tilde{a}_t$ is the adversarially perturbed action at step $t$, with perturbed actions drawn from a predefined set $U$.

Model Architecture

1. Offline Pretraining Phase

Pretrains with the TD3+BC algorithm on a clean dataset $D$: $\pi = \arg\max_\pi \mathbb{E}_{(s_t,a_t)\sim D}[Q^\pi(s_t, \pi(s_t)) - \|\pi(s_t) - a_t\|^2]$

The second term keeps the policy close to the behavior policy that generated the data, which is what makes the method conservative.
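
As a minimal sketch of the objective above, the following pure-Python function averages the Q-value term minus the squared behavior-cloning penalty over a batch. The function name, the callable signatures, and the toy stand-ins are assumptions made for illustration and are not taken from the paper; the Q-term weighting coefficient used in the original TD3+BC algorithm is omitted to match the formula as written here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def td3_bc_actor_objective(
    batch: List[Tuple[Vector, Vector]],
    policy: Callable[[Vector], Vector],
    q_value: Callable[[Vector, Vector], float],
) -> float:
    """Batch average of Q(s, pi(s)) - ||pi(s) - a||^2, to be maximized."""
    total = 0.0
    for state, data_action in batch:
        policy_action = policy(state)
        # Squared distance between the policy action and the dataset action.
        bc_penalty = sum((p - a) ** 2 for p, a in zip(policy_action, data_action))
        total += q_value(state, policy_action) - bc_penalty
    return total / len(batch)


# Hypothetical stand-ins for a learned actor and critic.
toy_policy = lambda s: [0.5 * x for x in s]
toy_q = lambda s, a: 2.0
objective = td3_bc_actor_objective([([1.0], [1.0])], toy_policy, toy_q)
```

The penalty grows as the policy action drifts away from the dataset action, so maximizing this quantity pulls the policy toward actions that were actually observed.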

2. Adversarial Fine-tuning Phase

Perturbation Injection Mechanism: $\tilde{a}_t = a_t + \delta_a \odot a_t$ with probability $q$, and $\tilde{a}_t = a_t$ otherwise

where $\odot$ denotes element-wise multiplication and $\delta_a$ is a precomputed adversarial perturbation.
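
The injection rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name and the optional `rng` argument are assumptions, and any clipping of the perturbed action to the environment's action bounds is left out because the summary does not describe it.

```python
import random
from typing import List, Optional, Sequence


def inject_perturbation(
    action: Sequence[float],
    delta: Sequence[float],
    q: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """With probability q, return a + delta * a element-wise; otherwise return a unchanged."""
    rng = rng or random.Random()
    if rng.random() < q:
        # Multiplicative perturbation: each component is scaled by (1 + delta_k).
        return [a + d * a for a, d in zip(action, delta)]
    return list(action)


# q = 1.0 always perturbs, q = 0.0 never does.
perturbed = inject_perturbation([1.0, -2.0], [0.5, -0.25], q=1.0)
clean = inject_perturbation([1.0, -2.0], [0.5, -0.25], q=0.0)
```

Because the perturbation is multiplicative, a component with $\delta = -1$ zeroes the corresponding actuator command, which is one way to read the actuator-failure scenario mentioned earlier.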

Target Update: $y_t = \tilde{r}_t + \gamma \min_{i\in\{1,2\}} Q_{\theta^-_i}(\tilde{s}_{t+1}, \pi_{\phi^-}(\tilde{s}_{t+1}) + \varepsilon)$

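where $\tilde{s}_{t+1} \sim P(\cdot|s_t, \tilde{a}_t)$ and $\tilde{r}_t = r(s_t, \tilde{a}_t)$ are the next state and reward observed under the perturbed action, and $\varepsilon$ is the target policy smoothing noise of TD3 (distinct from the perturbation strength $\epsilon$ used in the experiments).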
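
The target computation can be illustrated with a short sketch. It assumes the target actor and the two target critics are available as plain callables and that the smoothing noise is passed in as a vector; these names and signatures are hypothetical, and terminal-state handling is omitted because the formula above does not include it.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def td3_target(
    reward: float,
    next_state: Vector,
    gamma: float,
    target_policy: Callable[[Vector], Vector],
    target_q1: Callable[[Vector, Vector], float],
    target_q2: Callable[[Vector, Vector], float],
    noise: Vector,
) -> float:
    """y = r + gamma * min(Q1, Q2) evaluated at the noisy target action."""
    next_action = [a + e for a, e in zip(target_policy(next_state), noise)]
    # Taking the smaller of the two critics counteracts value overestimation.
    q_min = min(target_q1(next_state, next_action), target_q2(next_state, next_action))
    return reward + gamma * q_min


# Hypothetical stand-ins: the reward and next state would come from the
# environment after the perturbed action was executed.
y = td3_target(
    reward=1.0,
    next_state=[0.0],
    gamma=0.5,
    target_policy=lambda s: [0.0],
    target_q1=lambda s, a: 4.0,
    target_q2=lambda s, a: 2.0,
    noise=[0.1],
)
```

The only difference from a standard TD3 target is where the inputs come from: the reward and next state passed in are those generated by the perturbed action.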

3. Curriculum Learning Mechanism

Linear Curriculum: $q \leftarrow \text{clip}(q + c, 0, 1)$ where $c$ is a fixed step size.

Adaptive Curriculum: $\Delta q = \eta(\bar{R}_n - \bar{R}_{n-1})$, with $\bar{R}_n = \beta R_n + (1-\beta)\bar{R}_{n-1}$

where $\bar{R}_n$ is the exponential moving average of the performance $R_n$, and $\eta$ and $\beta$ control the adaptation dynamics.
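
A minimal sketch of both schedules follows. Two details are assumptions rather than statements from the paper: the adaptive rule is applied as `q <- clip(q + Δq, 0, 1)`, mirroring the clipping of the linear rule, and the moving average is initialized from a caller-supplied value. The class and function names are hypothetical.

```python
def clip(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def linear_step(q: float, c: float) -> float:
    """Linear curriculum: q grows by a fixed step c regardless of performance."""
    return clip(q + c)


class AdaptiveCurriculum:
    """Adaptive curriculum driven by the change in an exponential moving average."""

    def __init__(self, q0: float, eta: float, beta: float, initial_ema: float):
        self.q = q0
        self.eta = eta
        self.beta = beta
        self.ema = initial_ema  # assumed initialization, not specified in the summary

    def update(self, performance: float) -> float:
        previous_ema = self.ema
        self.ema = self.beta * performance + (1.0 - self.beta) * previous_ema
        delta_q = self.eta * (self.ema - previous_ema)
        # Assumed update: apply delta_q with the same [0, 1] clipping as the linear rule.
        self.q = clip(self.q + delta_q)
        return self.q


curriculum = AdaptiveCurriculum(q0=0.5, eta=0.01, beta=0.9, initial_ema=50.0)
q_after_gain = curriculum.update(60.0)  # performance improved, so q rises
q_after_drop = curriculum.update(40.0)  # performance fell, so q falls
```

With these toy numbers, the first update raises the moving average from 50 to 59 and therefore raises $q$, while the second lowers the average to 41.9 and pulls $q$ back down. The linear rule has no such feedback path.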

Technical Innovations

Perturbation Precomputation: Uses differential evolution to pre-generate perturbation sets, avoiding expensive inner-loop minimization during fine-tuning
Performance-Aware Scheduling: Adaptive curriculum dynamically adjusts perturbation probability based on policy performance, increasing $q$ when performance improves to enhance robustness, and decreasing $q$ when performance declines to stabilize training
Balancing Mechanism: Exponential moving average filters short-term fluctuations, providing stable performance trend estimation
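
To illustrate the precomputation idea, here is a compact, self-contained differential evolution loop (the common rand/1/bin variant) that searches for a perturbation vector minimizing a return estimate within $[-\epsilon, \epsilon]$ per dimension. It is a sketch under stated assumptions, not the authors' procedure: the summary does not specify the DE variant, population size, or mutation settings, and `toy_return` is a hypothetical stand-in for rolling out the pretrained policy under a given perturbation.

```python
import random
from typing import Callable, List, Sequence, Tuple


def precompute_perturbation(
    estimate_return: Callable[[Sequence[float]], float],
    dim: int,
    epsilon: float,
    pop_size: int = 20,
    generations: int = 50,
    f: float = 0.8,
    cr: float = 0.9,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Return the perturbation with the lowest estimated return, and that return."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-epsilon, epsilon) for _ in range(dim)] for _ in range(pop_size)]
    scores = [estimate_return(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # guarantees at least one mutated component
            trial = []
            for k in range(dim):
                if k == forced or rng.random() < cr:
                    value = pop[a][k] + f * (pop[b][k] - pop[c][k])
                    value = max(-epsilon, min(epsilon, value))  # stay inside the budget
                else:
                    value = pop[i][k]
                trial.append(value)
            trial_score = estimate_return(trial)
            if trial_score <= scores[i]:  # the adversary keeps the more damaging candidate
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_return(delta: Sequence[float]) -> float:
    """Hypothetical return model: 10 with no perturbation, lower as delta grows."""
    return 10.0 - sum(w * d for w, d in zip((1.0, 2.0, 3.0), delta))


best_delta, best_return = precompute_perturbation(toy_return, dim=3, epsilon=0.3)
```

Running the search once before fine-tuning means the training loop only has to look up a stored perturbation instead of solving a minimization at every step.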

Experimental Setup

Datasets

Source: D4RL expert datasets
Environments: Hopper-v2, HalfCheetah-v2, Ant-v2 legged robot environments in OpenAI Gym
Physics Engine: MuJoCo physics simulation

Evaluation Metrics

Primary Metric: D4RL-normalized episode rewards
Evaluation Conditions: Normal (no perturbation), random perturbation, adversarial perturbation
Statistics: Average performance over 100 episodes, 5 independent runs
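
For reference, D4RL normalization rescales a raw episode return so that a random policy scores 0 and an expert policy scores 100. The sketch below shows the arithmetic; the function names are illustrative and the reference returns are passed in as arguments, so the numbers in the usage lines are hypothetical rather than the actual D4RL reference values.

```python
from statistics import mean
from typing import Sequence


def d4rl_normalized_score(raw_return: float, random_return: float, expert_return: float) -> float:
    """100 * (raw - random) / (expert - random)."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_normalized_score(
    returns: Sequence[float], random_return: float, expert_return: float
) -> float:
    """Average normalized score over a set of evaluation episodes."""
    return mean(d4rl_normalized_score(r, random_return, expert_return) for r in returns)


# Hypothetical reference returns: random policy = 0, expert policy = 100.
below_random = d4rl_normalized_score(-20.0, 0.0, 100.0)
average = mean_normalized_score([0.0, 100.0], 0.0, 100.0)
```

A policy that does worse than the random reference gets a negative normalized score, which is how negative entries can appear in results reported on this scale.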

Baseline Methods

Offline-only: TD3+BC trained only offline
Fully Online (Adversarial): Online adversarial training from scratch
Fine-tuned Variants: Fine-tuned policies under different perturbation conditions

Implementation Details

Pretraining: 5 million steps of TD3+BC
Fine-tuning: 1 million steps of TD3 (3 million steps for curriculum experiments)
Perturbation Strength: Hopper/HalfCheetah $\epsilon=0.3$, Ant $\epsilon=0.5$
Perturbation Probability: Hopper $q=0.5$, HalfCheetah/Ant $q=0.1$
Adaptive Parameters: $\beta=0.9$, $\eta$ adjusted per environment
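
The settings listed above could be collected into a small configuration object, as in the following sketch. The class and field names are hypothetical; the values are those reported in this summary, and $\eta$ is left unset because its per-environment values are not given here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FineTuneConfig:
    epsilon: float                                 # perturbation strength
    q: float                                       # perturbation probability
    beta: float = 0.9                              # EMA coefficient of the adaptive curriculum
    eta: Optional[float] = None                    # tuned per environment; not stated in this summary
    pretrain_steps: int = 5_000_000                # TD3+BC offline pretraining
    finetune_steps: int = 1_000_000                # TD3 online fine-tuning
    curriculum_finetune_steps: int = 3_000_000     # used for the curriculum experiments


CONFIGS: Dict[str, FineTuneConfig] = {
    "Hopper-v2": FineTuneConfig(epsilon=0.3, q=0.5),
    "HalfCheetah-v2": FineTuneConfig(epsilon=0.3, q=0.1),
    "Ant-v2": FineTuneConfig(epsilon=0.5, q=0.1),
}
```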

Experimental Results

Main Results

Key Findings from Table 1:

Ant-v2 Adversarial Condition: Adversarial fine-tuning 91.6 vs. offline -21.0 vs. fully online 24.0
Hopper-v2 Adversarial Condition: Adversarial fine-tuning 83.5 vs. offline 13.7 vs. fully online 57.0
Consistent Advantage: Adversarial fine-tuning significantly outperforms baselines in adversarial evaluation across all environments

Key Insights:

Performance is best when fine-tuning conditions match evaluation conditions
Offline policy performance degrades sharply under perturbations, with normalized scores even dropping below zero (e.g., -21.0 on Ant-v2 under adversarial perturbation)
Adversarial fine-tuning converges faster than training from scratch

Ablation Studies

Curriculum Strategy Comparison (Table 2):

1M Steps: Adaptive curriculum $q_{ada}$ consistently outperforms fixed $q_{fix}$ and linear $q_{lin}$ across all environments
3M Steps: Linear curriculum exhibits overfitting, with degraded normal performance (Hopper: 95.1→76.5)
Adaptive Advantage: $q_{ada}$ maintains or improves normal performance while preserving adversarial robustness

Case Analysis

Curriculum Trajectories (Figure 5):

Linear Strategy: $q$ increases at a fixed rate regardless of policy performance, leading to overfitting to adversarial conditions
Adaptive Strategy: Adjusts $q$ growth based on performance feedback, preventing excessive difficulty escalation

Experimental Findings

Convergence Speed: Adversarial fine-tuning achieves rapid convergence by leveraging offline pretraining
Robustness-Stability Trade-off: Adaptive curriculum successfully balances both objectives
Environment Specificity: Different environments require different hyperparameter adjustments

Related Work

Offline Reinforcement Learning

Conservative Methods: TD3+BC, CQL, IQL, etc., constrain policies to remain close to data distribution
Core Challenge: Q-value overestimation for out-of-distribution state-action pairs

Robust Reinforcement Learning

State Perturbations: Methods like RORL improve robustness by smoothing value distributions
Action Perturbations: Relatively limited research; existing work shows offline policies are particularly fragile

Offline-to-Online Reinforcement Learning

Representative Methods: AWAC, O2O, Policy Expansion, etc.
Main Challenge: Performance degradation during early fine-tuning phases

Conclusions and Discussion

Main Conclusions

Fundamental Incompatibility: Conservative offline RL and action space robustness exhibit structural conflict
Effective Solution: Adversarial fine-tuning successfully bridges offline efficiency and online adaptability
Curriculum Learning Value: Adaptive scheduling outperforms fixed strategies, preventing overfitting

Limitations

Lack of Theoretical Guarantees: Absence of theoretical analysis for curriculum adaptation
Environment Complexity: Experiments limited to relatively simple locomotion tasks
Perturbation Types: Primarily focuses on multiplicative perturbations; other types insufficiently explored

Future Directions

Theoretical Development: Establish theoretical guarantees for curriculum adaptation
Complex Environments: Explore interactions between state and action space perturbations
Perturbation Diversity: Investigate broader range of perturbation types and patterns

In-Depth Evaluation

Strengths

Deep Core Insights: Identifying the fundamental conflict between conservatism and robustness is an important contribution
Reasonable Method Design: Adversarial fine-tuning framework is logically clear and technically feasible
Comprehensive Experiments: Multi-environment, multi-baseline, multi-metric evaluation
High Practical Value: Addresses critical problems in real robot deployment

Weaknesses

Insufficient Theoretical Analysis: Lacks convergence and robustness guarantees
Environment Limitations: Tested only in MuJoCo simulation; lacks real robot validation
Hyperparameter Sensitivity: Adaptive curriculum requires environment-specific parameter tuning
Computational Overhead: Perturbation precomputation and performance evaluation increase computational cost

Impact

Academic Contribution: Provides new perspectives and methods for offline RL robustness research
Practical Value: Offers practical solutions for safety-critical robot applications
Reproducibility: Detailed method description and clear experimental setup

Applicable Scenarios

Robot Control: Autonomous systems requiring handling of actuator failures
Safety-Critical Applications: Medical robots, industrial automation, etc.
Resource-Constrained Environments: Scenarios requiring sample efficiency with robustness demands

References

The paper cites important works in reinforcement learning, including:

Offline RL: Fujimoto & Gu (TD3+BC), Kumar et al. (CQL)
Robust RL: Pinto et al. (Adversarial Training), Yang et al. (RORL)
Offline-to-Online: Nair et al. (AWAC), Lee et al. (O2O)

Overall Assessment: This is a high-quality research paper with significant contributions in theoretical insights, methodological innovation, and experimental validation. While there remains room for improvement in theoretical analysis and real-world environment validation, it opens important directions for offline reinforcement learning robustness research and demonstrates substantial academic and practical value.