Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.
- Paper ID: 2510.13358
- Title: Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
- Authors: Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto (Chiba University)
- Categories: cs.RO (Robotics), cs.AI (Artificial Intelligence)
- Publication Date: October 15, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.13358
Offline reinforcement learning enables sample-efficient policy acquisition without risky online interactions; however, policies trained on static datasets remain vulnerable to action space perturbations such as actuator failures. This study proposes an offline-to-online framework that first trains a policy on clean data, then performs adversarial fine-tuning by injecting perturbations into executed actions to induce compensatory behaviors and improve robustness. A performance-aware curriculum further balances robustness and stability throughout training by adjusting perturbation probability via exponential moving average signals. Experiments on continuous control locomotion tasks demonstrate that the proposed method consistently outperforms offline-only baselines in robustness and converges faster than training from scratch.
The core problem addressed by this research is the vulnerability of offline reinforcement learning policies to action space perturbations. Specifically:
- Limitations of Offline RL: While offline reinforcement learning avoids the risks and costs of online interactions, the trained policies exhibit fragility when facing action space perturbations such as actuator failures and action noise.
- Fundamental Conflict Between Conservatism and Robustness: The authors identify a critical insight: conservative offline RL methods are fundamentally incompatible with action space robustness. Conservative methods constrain policies to remain within the dataset action distribution to prevent extrapolation errors, yet robustness to action perturbations requires learning from precisely the out-of-distribution actions that conservative methods prohibit.
- Safety-Critical Applications: In safety-critical domains such as healthcare, energy management, and robot control, policies must handle unexpected perturbations
- Real-World Deployment Requirements: Actuator failures and action noise are inevitable in real robotic systems
- Theory-Practice Gap: Existing offline RL methods primarily focus on state space perturbations, with insufficient research on action space perturbations
- Offline RL Conservative Constraints: Methods such as TD3+BC constrain policies close to the dataset distribution through behavior cloning losses, limiting adaptability
- Lack of Perturbation Data: Offline datasets typically lack perturbation-aware transitions, making it difficult to assess policy effectiveness under perturbations
- State vs. Action Perturbations: Existing robustness research primarily addresses state perturbations (sensor noise), with limited work on action perturbations
- Proposes Adversarial Fine-tuning Method: Injects perturbations during online training to achieve targeted adaptation to action perturbations while maintaining sample efficiency from offline pretraining
- Demonstrates Consistently Superior Performance: Adversarial fine-tuning consistently outperforms offline-only and fully online baselines in robustness
- Designs Adaptive Curriculum Strategy: An adaptive curriculum that adjusts perturbation probability based on policy performance, preventing overfitting to adversarial conditions while maintaining training stability, addressing key limitations of fixed scheduling methods
- Provides Theoretical Insights: Formalizes the analysis of the fundamental incompatibility between conservative offline RL and action space robustness
Objective: Find the optimal robust policy under action space perturbations
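$$\pi^{*} = \arg\max_{\pi} \min_{\tilde{a} \in \mathcal{U}} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, \tilde{a}) \right]$$
where $\tilde{a}$ is an adversarially perturbed action from a predefined set $\mathcal{U}$.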
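Pretraining uses the TD3+BC algorithm on a clean dataset:
$$\pi = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[ Q^{\pi}(s_t, \pi(s_t)) - \lVert \pi(s_t) - a_t \rVert^{2} \right]$$
The second term penalizes deviation from the dataset actions, keeping the policy close to the behavior policy and thereby maintaining conservatism.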
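To make the trade-off in this objective concrete, the following is a minimal Python sketch of a batch estimate of the pretraining objective as it is written above. The toy critic, the batch, and the three candidate policies are hypothetical and chosen only for illustration; the Q-weighting coefficient used in the original TD3+BC paper is omitted here because the formula above does not show it.

```python
def td3bc_actor_objective(policy, critic, batch):
    """Batch estimate of E[Q(s, pi(s)) - ||pi(s) - a||^2]."""
    total = 0.0
    for state, data_action in batch:
        action = policy(state)
        # Behavior-cloning penalty: squared distance to the dataset action.
        bc_penalty = sum((p - a) ** 2 for p, a in zip(action, data_action))
        total += critic(state, action) - bc_penalty
    return total / len(batch)


# Hypothetical toy setup: the critic prefers action 2.0, the dataset holds 0.5.
def toy_critic(state, action):
    return -sum((a - 2.0) ** 2 for a in action)


batch = [([0.0], [0.5]), ([1.0], [0.5])]
clone = td3bc_actor_objective(lambda s: [0.5], toy_critic, batch)    # pure imitation
greedy = td3bc_actor_objective(lambda s: [2.0], toy_critic, batch)   # ignores the data
middle = td3bc_actor_objective(lambda s: [1.25], toy_critic, batch)  # compromise
print(clone, greedy, middle)
```

In this toy case the compromise policy scores highest, which illustrates how the penalty pulls the policy toward the data even when the critic would prefer an action far from it.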
Perturbation Injection Mechanism:
$$a'_t = a_t + \delta_a \odot a_t \quad \text{with probability } q$$
where $\odot$ denotes element-wise multiplication and $\delta_a$ is a precomputed adversarial perturbation.
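The injection rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name, the list-based action representation, and the optional `rng` argument are assumptions, and any clipping to the environment's action bounds is left out because the summary does not describe it.

```python
import random


def inject_perturbation(action, delta, q, rng=random):
    """Return a + delta * a (element-wise) with probability q, else the clean action."""
    if rng.random() < q:
        return [a + d * a for a, d in zip(action, delta)]
    return list(action)


# Hypothetical usage: estimate how often a perturbation is applied at q = 0.1.
rng = random.Random(0)
clean = [1.0, -2.0]
delta = [-0.3, 0.3]
trials = 10_000
hits = sum(inject_perturbation(clean, delta, 0.1, rng) != clean for _ in range(trials))
print(f"perturbed fraction: {hits / trials:.3f}")
```

Because the perturbation is multiplicative, an action component of zero is left unchanged, and a component of `delta` equal to -1 would zero out the corresponding actuator entirely.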
Target Update:
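$$y_t = \tilde{r}_t + \gamma \min_{i \in \{1, 2\}} Q_{\theta_i^{-}}\left( \tilde{s}_{t+1}, \pi_{\phi^{-}}(\tilde{s}_{t+1}) + \varepsilon \right)$$
where $\tilde{s}_{t+1} \sim P(\cdot \mid s_t, \tilde{a}_t)$ and $\tilde{r}_t = r(s_t, \tilde{a}_t)$.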
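As a rough illustration of this target, the sketch below computes the clipped double-Q value from a reward and next state that were observed after executing the perturbed action. The function name, the callable interfaces for the target policy and critics, and the default `gamma` are assumptions made for the example; terminal-state masking and noise clipping, which practical TD3 implementations include, are omitted because the formula above does not show them.

```python
def td3_target(reward, next_state, target_policy, target_critics,
               gamma=0.99, noise=None):
    """y = r + gamma * min_i Q_i(s', pi(s') + noise), using target networks."""
    next_action = target_policy(next_state)
    if noise is not None:
        # Smoothing noise added to the target action.
        next_action = [a + e for a, e in zip(next_action, noise)]
    # Taking the minimum over both critics curbs overestimation.
    q_min = min(critic(next_state, next_action) for critic in target_critics)
    return reward + gamma * q_min


# Hypothetical usage with constant stand-in critics.
y = td3_target(
    reward=0.5,
    next_state=[0.0],
    target_policy=lambda s: [0.1],
    target_critics=[lambda s, a: 1.0, lambda s, a: 2.0],
    gamma=0.9,
)
print(y)
```

The only difference from a standard TD3 update in this sketch is where the inputs come from: the transition passed in is the one produced by the perturbed action, so the critic learns values under the perturbed dynamics.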
Linear Curriculum:
$$q \leftarrow \mathrm{clip}(q + c,\, 0,\, 1)$$
where c is a fixed step size.
Adaptive Curriculum:
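$$\Delta q = \eta \left( \bar{R}_n - \bar{R}_{n-1} \right), \qquad \bar{R}_n = \beta R_n + (1 - \beta) \bar{R}_{n-1}$$
where $\bar{R}_n$ is the exponential moving average of performance, and $\eta$ and $\beta$ control the adaptation dynamics.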
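The two schedules can be compared side by side in a short Python sketch. The class and function names are hypothetical, the initial EMA value `r0` is an assumption, and clipping the adaptive probability to [0, 1] is also an assumption (made because q is a probability); the summary only states the clip explicitly for the linear rule.

```python
def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def linear_step(q, c):
    """Linear curriculum: q grows by a fixed step c regardless of performance."""
    return clip(q + c)


class AdaptiveCurriculum:
    """Adaptive curriculum: q follows the change in the EMA of episode returns."""

    def __init__(self, q0, eta, beta, r0):
        self.q = q0
        self.eta = eta
        self.beta = beta
        self.ema = r0  # assumed initial value of the moving average

    def update(self, episode_return):
        previous = self.ema
        self.ema = self.beta * episode_return + (1.0 - self.beta) * previous
        # q rises when the smoothed return improves and falls when it declines.
        self.q = clip(self.q + self.eta * (self.ema - previous))
        return self.q


# Hypothetical usage: one improving episode followed by a collapse.
curriculum = AdaptiveCurriculum(q0=0.5, eta=0.01, beta=0.9, r0=10.0)
q_after_gain = curriculum.update(20.0)
q_after_drop = curriculum.update(0.0)
print(q_after_gain, q_after_drop, linear_step(0.95, 0.1))
```

In this toy run the adaptive rule raises q after the improving episode and lowers it after the collapse, whereas the linear rule would have kept increasing q in both cases.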
- Perturbation Precomputation: Uses differential evolution to pre-generate perturbation sets, avoiding expensive inner-loop minimization during fine-tuning
- Performance-Aware Scheduling: Adaptive curriculum dynamically adjusts perturbation probability based on policy performance, increasing q when performance improves to enhance robustness, and decreasing q when performance declines to stabilize training
- Balancing Mechanism: Exponential moving average filters short-term fluctuations, providing stable performance trend estimation
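The perturbation precomputation step can be illustrated with a small, self-contained differential evolution loop (the common rand/1/bin variant). Everything here is a sketch under assumptions: the summary only says that differential evolution pre-generates the perturbations, so the search objective (the policy's return under a fixed perturbation), the hyperparameters, and the `toy_return` stand-in are all hypothetical.

```python
import random


def differential_evolution(objective, dim, bound, pop_size=20, generations=50,
                           mutation=0.5, crossover=0.7, seed=0):
    """Minimize objective over the box [-bound, bound]^dim with DE/rand/1/bin."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for k in range(dim):
                if k == forced or rng.random() < crossover:
                    value = a[k] + mutation * (b[k] - c[k])
                    value = max(-bound, min(bound, value))  # stay inside the box
                else:
                    value = pop[i][k]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


# Hypothetical stand-in for "episode return of the pretrained policy under delta".
def toy_return(delta):
    return 100.0 + 40.0 * delta[0] - 20.0 * delta[1]


best_delta, worst_return = differential_evolution(toy_return, dim=2, bound=0.3)
print(best_delta, worst_return)
```

In a real pipeline the objective would roll out the pretrained policy with the candidate perturbation applied, which is expensive; running the search once before fine-tuning is what avoids repeating that inner minimization at every training step. SciPy's `scipy.optimize.differential_evolution` is a ready-made alternative to the hand-written loop.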
- Source: D4RL expert datasets
- Environments: Hopper-v2, HalfCheetah-v2, Ant-v2 legged robot environments in OpenAI Gym
- Physics Engine: MuJoCo physics simulation
- Primary Metric: D4RL-normalized episode rewards
- Evaluation Conditions: Normal (no perturbation), random perturbation, adversarial perturbation
- Statistics: Average performance over 100 episodes, 5 independent runs
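For readers unfamiliar with the primary metric, D4RL normalization rescales a raw episode return so that a random policy scores 0 and an expert policy scores 100. The sketch below shows the formula; the reference returns in the usage example are made-up placeholders, not the actual D4RL reference values for any environment.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """100 * (raw - random) / (expert - random): 0 is random-level, 100 is expert-level."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 4000.0

print(d4rl_normalized_score(2000.0, RANDOM_RETURN, EXPERT_RETURN))
print(d4rl_normalized_score(-500.0, RANDOM_RETURN, EXPERT_RETURN))
```

A negative normalized score therefore means the policy performed worse than the random reference, which is how a perturbed offline policy can end up below zero.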
- Offline-only: TD3+BC trained only offline
- Fully Online (Adversarial): Online adversarial training from scratch
- Fine-tuned Variants: Fine-tuned policies under different perturbation conditions
- Pretraining: 5 million steps of TD3+BC
- Fine-tuning: 1 million steps of TD3 (3 million steps for curriculum experiments)
- Perturbation Strength: Hopper/HalfCheetah ϵ=0.3, Ant ϵ=0.5
- Perturbation Probability: Hopper q=0.5, HalfCheetah/Ant q=0.1
- Adaptive Parameters: β=0.9, η adjusted per environment
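The settings listed above can be collected into a small configuration object, sketched here in Python. The class and field names are hypothetical; the numeric values are the ones stated in this summary, and `eta` is left unset because its per-environment values are not given here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    env: str
    epsilon: float                               # perturbation strength
    q: float                                     # perturbation probability
    pretrain_steps: int = 5_000_000              # TD3+BC pretraining
    finetune_steps: int = 1_000_000              # TD3 fine-tuning
    curriculum_finetune_steps: int = 3_000_000   # used for curriculum experiments
    beta: float = 0.9                            # EMA coefficient
    eta: Optional[float] = None                  # tuned per environment; not listed here


CONFIGS = {
    "Hopper-v2": FineTuneConfig("Hopper-v2", epsilon=0.3, q=0.5),
    "HalfCheetah-v2": FineTuneConfig("HalfCheetah-v2", epsilon=0.3, q=0.1),
    "Ant-v2": FineTuneConfig("Ant-v2", epsilon=0.5, q=0.1),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.epsilon, cfg.q)
```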
Key Findings from Table 1:
- Ant-v2 Adversarial Condition: Adversarial fine-tuning 91.6 vs. offline -21.0 vs. fully online 24.0
- Hopper-v2 Adversarial Condition: Adversarial fine-tuning 83.5 vs. offline 13.7 vs. fully online 57.0
- Consistent Advantage: Adversarial fine-tuning significantly outperforms baselines in adversarial evaluation across all environments
Key Insights:
- Performance is best when fine-tuning conditions match evaluation conditions
- Offline policy performance degrades sharply under perturbations, with normalized scores even turning negative (e.g., -21.0 on Ant-v2 under adversarial perturbation)
- Adversarial fine-tuning converges faster than training from scratch
Curriculum Strategy Comparison (Table 2):
- 1M Steps: Adaptive curriculum $q_{\mathrm{ada}}$ consistently outperforms fixed $q_{\mathrm{fix}}$ and linear $q_{\mathrm{lin}}$ across all environments
- 3M Steps: Linear curriculum exhibits overfitting, with degraded normal performance (Hopper: 95.1→76.5)
- Adaptive Advantage: $q_{\mathrm{ada}}$ maintains or improves normal performance while preserving adversarial robustness
Curriculum Trajectories (Figure 5):
- Linear Strategy: q increases monotonically regardless of policy performance, leading to overfitting to adversarial conditions
- Adaptive Strategy: Adjusts q growth based on performance feedback, preventing excessive difficulty escalation
- Convergence Speed: Adversarial fine-tuning achieves rapid convergence by leveraging offline pretraining
- Robustness-Stability Trade-off: Adaptive curriculum successfully balances both objectives
- Environment Specificity: Different environments require different hyperparameter adjustments
- Conservative Methods: TD3+BC, CQL, IQL, etc., constrain policies to remain close to data distribution
- Core Challenge: Q-value overestimation for out-of-distribution state-action pairs
- State Perturbations: Methods like RORL improve robustness by smoothing value distributions
- Action Perturbations: Relatively limited research; existing work shows offline policies are particularly fragile
- Representative Methods: AWAC, O2O, Policy Expansion, etc.
- Main Challenge: Performance degradation during early fine-tuning phases
- Fundamental Incompatibility: Conservative offline RL and action space robustness exhibit structural conflict
- Effective Solution: Adversarial fine-tuning successfully bridges offline efficiency and online adaptability
- Curriculum Learning Value: Adaptive scheduling outperforms fixed strategies, preventing overfitting
- Lack of Theoretical Guarantees: Absence of theoretical analysis for curriculum adaptation
- Environment Complexity: Experiments limited to relatively simple locomotion tasks
- Perturbation Types: Primarily focuses on multiplicative perturbations; other types insufficiently explored
- Theoretical Development: Establish theoretical guarantees for curriculum adaptation
- Complex Environments: Explore interactions between state and action space perturbations
- Perturbation Diversity: Investigate broader range of perturbation types and patterns
- Deep Core Insight: Identifying the fundamental conflict between conservatism and robustness is an important contribution
- Reasonable Method Design: Adversarial fine-tuning framework is logically clear and technically feasible
- Comprehensive Experiments: Multi-environment, multi-baseline, multi-metric evaluation
- High Practical Value: Addresses critical problems in real robot deployment
- Insufficient Theoretical Analysis: Lacks convergence and robustness guarantees
- Environment Limitations: Tested only in MuJoCo simulation; lacks real robot validation
- Hyperparameter Sensitivity: Adaptive curriculum requires environment-specific parameter tuning
- Computational Overhead: Perturbation precomputation and performance evaluation increase computational cost
- Academic Contribution: Provides new perspectives and methods for offline RL robustness research
- Practical Value: Offers practical solutions for safety-critical robot applications
- Reproducibility: Detailed method description and clear experimental setup
- Robot Control: Autonomous systems requiring handling of actuator failures
- Safety-Critical Applications: Medical robots, industrial automation, etc.
- Resource-Constrained Environments: Scenarios requiring sample efficiency with robustness demands
The paper cites important works in reinforcement learning, including:
- Offline RL: Fujimoto & Gu (TD3+BC), Kumar et al. (CQL)
- Robust RL: Pinto et al. (Adversarial Training), Yang et al. (RORL)
- Offline-to-Online: Nair et al. (AWAC), Lee et al. (O2O)
Overall Assessment: This is a high-quality research paper with significant contributions in theoretical insights, methodological innovation, and experimental validation. While there remains room for improvement in theoretical analysis and real-world environment validation, it opens important directions for offline reinforcement learning robustness research and demonstrates substantial academic and practical value.