Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach

Tayar, de Oliveira, Tommaselli et al.
Autonomous UAV inspection of confined industrial infrastructure, such as ventilation ducts, demands robust navigation policies where collisions are unacceptable. While Deep Reinforcement Learning (DRL) offers a powerful paradigm for developing such policies, it presents a critical trade-off between on-policy and off-policy algorithms. Off-policy methods promise high sample efficiency, a vital trait for minimizing costly and unsafe real-world fine-tuning. In contrast, on-policy methods often exhibit greater training stability, which is essential for reliable convergence in hazard-dense environments. This paper directly investigates this trade-off by comparing a leading on-policy algorithm, Proximal Policy Optimization (PPO), against an off-policy counterpart, Soft Actor-Critic (SAC), for precision flight in procedurally generated ducts within a high-fidelity simulator. Our results show that PPO consistently learned a stable, collision-free policy that completed the entire course. In contrast, SAC failed to find a complete solution, converging to a suboptimal policy that navigated only the initial segments before failure. This work provides evidence that for high-precision, safety-critical navigation tasks, the reliable convergence of a well-established on-policy method can be more decisive than the nominal sample efficiency of an off-policy algorithm.

Basic Information

  • Paper ID: 2508.16807
  • Title: Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach
  • Authors: Marco S. Tayar, Lucas K. de Oliveira, Felipe Andrade G. Tommaselli, Juliano D. Negri, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker (University of São Paulo)
  • Classification: cs.RO cs.AI cs.LG cs.SY eess.SY
  • Publication Date: October 11, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2508.16807

Abstract

This paper investigates autonomous UAV inspection in confined industrial infrastructure such as ventilation ducts, tasks requiring robust navigation strategies that preclude collisions. While deep reinforcement learning (DRL) provides a powerful paradigm for developing such strategies, critical trade-offs exist between on-policy and off-policy algorithms. Off-policy methods promise high sample efficiency, crucial for minimizing expensive and unsafe real-world fine-tuning. Conversely, on-policy methods typically exhibit superior training stability, essential for reliable convergence in high-hazard-density environments. This paper directly investigates this trade-off by comparing the precise flight performance of the leading on-policy algorithm PPO against the off-policy algorithm SAC in procedurally generated ducts within a high-fidelity simulator. Results demonstrate that PPO consistently learns stable, collision-free policies and completes entire flight paths, while SAC fails to find complete solutions, converging to suboptimal policies capable of navigating only initial segments.

Research Background and Motivation

Problem Definition

Manual inspection of industrial infrastructure such as pipelines and ventilation ducts is complex, expensive, and time-consuming, yet critical to maintaining operational integrity. Unmanned aerial vehicles (UAVs) are a significant advance for industrial inspection: they enable automated, safe data collection in environments that are inaccessible or unsafe for humans.

Challenge Analysis

Navigating UAVs in confined spaces such as ducts presents unique challenges:

  1. Complex Aerodynamic Effects: Proximity to walls creates complex aerodynamic phenomena, increasing collision risk
  2. Limitations of Classical Methods: Traditional motion planning approaches lack adaptability and struggle with unmodeled aerodynamic phenomena (such as ground effect in narrow ducts)
  3. Safety-Critical Nature: In these environments, collisions are unacceptable, requiring highly reliable control strategies

Research Motivation

Deep reinforcement learning provides a powerful paradigm for addressing these challenges, but algorithm selection is critical. The core question is: for tasks requiring high precision and safety, is the stability of on-policy methods more important than the sample efficiency of off-policy algorithms?

Core Contributions

  1. Direct Comparative Analysis: Direct comparison of mature on-policy and off-policy algorithms on autonomous UAV navigation tasks in confined industrial ducts
  2. Empirical Evidence: Provides empirical evidence that, for high-hazard-density, high-precision tasks, the training stability of on-policy methods can be more decisive than the nominal sample efficiency of off-policy methods
  3. Simulation Workflow Validation: Validates a simulation workflow using procedurally generated environments and high-fidelity physics engines as a testing platform for developing and benchmarking industrial UAV control strategies

Methodology

Task Definition

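The paper models goal-oriented UAV control as a Markov Decision Process (MDP), M = (S, A, T, R, γ), with state space S, action space A, transition function T, reward function R, and discount factor γ.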

State Space:

s_t = [p_rel, p̂^B_rel, q, v^B_lin, v^B_ang, a_{t-1}] ∈ R²⁰

where:

  • p_rel ∈ R³: position vector from the UAV to the next waypoint
  • p̂^B_rel ∈ R³: unit-normalized direction to the next waypoint, expressed in body coordinates
  • q ∈ R⁴: unit quaternion (world to body)
  • v^B_lin, v^B_ang ∈ R³: linear and angular velocities in body coordinates
  • a_{t-1} ∈ R⁴: motor command vector from the previous timestep

The component dimensions sum to 3 + 3 + 4 + 3 + 3 + 4 = 20.
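
To make the 20-dimensional layout concrete, the following is a minimal sketch of how such an observation could be assembled. The function names, the (w, x, y, z) quaternion ordering, and the choice to apply q directly as the world-to-body rotation are assumptions for illustration; the source does not state the paper's conventions.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z) (assumed ordering)."""
    w, u = q[0], q[1:]
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))

def build_observation(p_uav, p_waypoint, q, v_lin_body, v_ang_body, prev_action):
    """Concatenate the six state components into one flat 20-element list."""
    # World-frame vector from the UAV to the next waypoint.
    p_rel = tuple(wp - p for wp, p in zip(p_waypoint, p_uav))
    # Same vector in the body frame (assumes q maps world to body), then normalized.
    p_rel_body = rotate(q, p_rel)
    norm = math.sqrt(sum(c * c for c in p_rel_body)) or 1.0
    p_hat_body = tuple(c / norm for c in p_rel_body)
    obs = [*p_rel, *p_hat_body, *q, *v_lin_body, *v_ang_body, *prev_action]
    assert len(obs) == 20  # 3 + 3 + 4 + 3 + 3 + 4
    return obs
```

With an identity quaternion and a waypoint at (3, 0, 4) relative to the UAV, the normalized direction entries come out as (0.6, 0.0, 0.8).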

Action Space: Continuous actions a_t ∈ [-1, 1]⁴, each component parameterizing one rotor command:

ω_i = (1 + 0.8 a_{t,i}) ω_hover, i = 1, ..., 4

where ω_hover = 14.47 krpm is the calibrated hover speed.
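
As a worked example of this mapping, the sketch below converts a normalized action into rotor speeds. The constant and function names are illustrative, and clipping out-of-range actions is an assumption; the source only states the action range.

```python
OMEGA_HOVER_KRPM = 14.47  # calibrated hover speed from the text
ACTION_GAIN = 0.8         # scaling factor from the text

def rotor_speeds(action):
    """Map a 4-element action in [-1, 1] to rotor speeds in krpm."""
    # Clipping is an assumption made here for safety, not a documented step.
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    return [(1.0 + ACTION_GAIN * a) * OMEGA_HOVER_KRPM for a in clipped]
```

Under this mapping a zero action holds hover speed, and the reachable range is 0.2 ω_hover to 1.8 ω_hover, that is roughly 2.89 to 26.05 krpm.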

Simulation Environment Design

Genesis Physics Engine: Employs the Genesis high-fidelity physics engine for GPU-accelerated parallel rigid-body simulation.

Procedural Duct Generation:

  • Different ducts are generated for each episode, ensuring the policy learns to navigate diverse and challenging scenarios
  • Ducts are constructed by connecting Ns straight duct segments end-to-end
  • Angular deviation between adjacent segments is controlled using Rodrigues' rotation formula, which rotates a vector v about a unit axis k by an angle θ:
v' = v cos θ + (k × v) sin θ + k(k · v)(1 - cos θ)
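
The procedure can be sketched as follows: each new segment heading is obtained by rotating the previous heading with Rodrigues' formula about a random perpendicular axis. This is only an illustration under stated assumptions; the segment length, the 20° bend limit, the uniform angle sampling, and all function names are hypothetical and not taken from the paper.

```python
import math
import random

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def rodrigues(v, k, theta):
    """Rotate v about unit axis k by theta using Rodrigues' rotation formula."""
    k_cross_v = cross(k, v)
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    c, s = math.cos(theta), math.sin(theta)
    return tuple(v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
                 for i in range(3))

def random_perpendicular_axis(direction, rng):
    """Unit axis perpendicular to `direction`, resampled if degenerate."""
    while True:
        axis = cross(direction, tuple(rng.gauss(0.0, 1.0) for _ in range(3)))
        if math.sqrt(sum(c * c for c in axis)) > 1e-9:
            return unit(axis)

def generate_centerline(n_segments, seg_len=1.0,
                        max_angle=math.radians(20.0), seed=0):
    """Return n_segments + 1 centerline points of a piecewise-straight duct."""
    rng = random.Random(seed)
    direction = (1.0, 0.0, 0.0)
    points = [(0.0, 0.0, 0.0)]
    for _ in range(n_segments):
        # Extend the duct by one straight segment along the current heading.
        points.append(tuple(p + seg_len * d
                            for p, d in zip(points[-1], direction)))
        # Bend the heading for the next segment by a bounded random angle.
        axis = random_perpendicular_axis(direction, rng)
        theta = rng.uniform(-max_angle, max_angle)
        direction = unit(rodrigues(direction, axis, theta))
    return points
```

Because the rotation axis is perpendicular to the heading, the angle between adjacent segments equals |θ| and therefore never exceeds `max_angle`.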

UAV Model: Uses a simulation model of the Bitcraze Crazyflie 2 (92×92×29 mm nano-quadrotor).

Learning Algorithm Comparison

Both algorithms are implemented with the skrl framework and share an identical network architecture so that the comparison is fair:

  • Network Structure: actor-critic with two hidden layers (256, 128 units, ELU activation)
  • PPO Configuration: rollout horizon 256, 4096 parallel environments, adaptive KL target 0.01, γ=0.99, λ=0.95, ε=0.2
  • SAC Configuration: twin critics, replay buffer 10⁶, batch size 512, τ=0.005, γ=0.99, automatic entropy adjustment
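
To show where the listed hyperparameters enter, here is a minimal sketch of the textbook update rules they parameterize: the PPO clipped surrogate (ε), generalized advantage estimation (γ, λ), and the SAC target-network soft update (τ). These are generic formulations written for illustration, not skrl's implementation, and the function names are hypothetical.

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout without terminations.

    `values` holds one more entry than `rewards`: the bootstrap value
    of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging of target-critic parameters, as used in SAC."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

With ε = 0.2, a probability ratio of 1.5 on a positive advantage is capped at 1.2, which limits how far a single PPO update can move the policy. With τ = 0.005, SAC's target critics move only 0.5% of the way toward the online critics per update.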

Reward Function Design

Employs a modular reward function: R_t = Σ_k w_k r_k, where each term r_k is scaled by its weight w_k.

Three Main Categories:

  1. Guidance Rewards:
    • Progress: rewards motion toward next waypoint
    • Centerline Deviation: penalizes deviation from duct centerline
    • Velocity Tracking: encourages target forward velocity
  2. Stability Rewards:
    • Orientation Alignment: rewards yaw/level attitude
    • Angular Velocity Damping: penalizes rotational velocity
    • Action Smoothness: penalizes abrupt motor command changes
  3. Event Rewards:
    • Waypoint Pass: sparse reward for passing waypoints
    • Duct Finish: large terminal reward for completing duct
    • Crash Penalty: large penalty for collision/violation
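
The weighted sum over these terms can be sketched as below. The term names mirror the list above, but every weight value is a hypothetical placeholder; the source does not report the tuned weights.

```python
# Hypothetical weights for illustration only; the tuned values are not given.
REWARD_WEIGHTS = {
    "progress": 1.0,
    "centerline_deviation": -0.5,
    "velocity_tracking": 0.2,
    "orientation_alignment": 0.1,
    "angular_velocity": -0.05,
    "action_smoothness": -0.05,
    "waypoint_pass": 10.0,
    "duct_finish": 100.0,
    "crash": -100.0,
}

def total_reward(terms, weights=REWARD_WEIGHTS):
    """Compute R_t = sum_k w_k * r_k over the terms active at this step.

    `terms` maps term names to raw values r_k; an unknown name raises
    KeyError so that a misspelled term cannot be silently ignored.
    """
    return sum(weights[name] * value for name, value in terms.items())
```

Keeping the weights in one mapping makes each term independently tunable, which is what a weight sweep needs.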

Experimental Setup

Experimental Environment

  • Platform: Genesis physics engine
  • Duct Configuration: procedurally generated, R_d = 0.5 m, 7 waypoints
  • Training Configuration: PPO and SAC each trained for 500 checkpoints

Evaluation Metrics

  • Average Reward: mean reward per episode
  • Waypoints Passed: number of waypoints successfully navigated
  • Collisions per Episode: collision frequency
  • Average/Maximum Deviation: mean and maximum centerline deviation
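
A minimal sketch of how these metrics could be aggregated from per-episode logs follows. The `EpisodeLog` structure and the choice to average the deviation per episode before averaging across episodes are assumptions; the source does not specify the aggregation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class EpisodeLog:
    """Hypothetical per-episode record used only for this illustration."""
    total_reward: float
    waypoints_passed: int
    collisions: int
    deviations: List[float]  # centerline deviation per timestep, in metres

def summarize(episodes):
    """Aggregate the evaluation metrics over a list of EpisodeLog records."""
    return {
        "avg_reward": mean(e.total_reward for e in episodes),
        "avg_waypoints_passed": mean(e.waypoints_passed for e in episodes),
        "collisions_per_episode": mean(e.collisions for e in episodes),
        # Assumption: mean of per-episode means, so long episodes do not dominate.
        "avg_deviation_m": mean(mean(e.deviations) for e in episodes),
        "max_deviation_m": max(max(e.deviations) for e in episodes),
    }
```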

Hyperparameter Optimization

Uses Weights & Biases sweep tool to optimize reward weights, with expanded weight ranges for primary guidance terms in SAC to accommodate its replay buffer characteristics.
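
A sweep of this kind might be configured as in the sketch below, which builds a Weights & Biases sweep dictionary with a wider range for the guidance weights in the SAC case. The search method, metric name, parameter names, and all numeric ranges are hypothetical; the source only states that SAC received expanded ranges for the primary guidance terms.

```python
def make_sweep_config(guidance_max):
    """Build a W&B sweep dictionary over the guidance reward weights.

    All names and ranges are placeholders chosen for illustration.
    """
    guidance_range = {"min": 0.1, "max": guidance_max}
    return {
        "method": "bayes",  # assumed; the source does not name the search method
        "metric": {"name": "avg_reward", "goal": "maximize"},
        "parameters": {
            "w_progress": dict(guidance_range),
            "w_centerline": dict(guidance_range),
            "w_velocity": dict(guidance_range),
        },
    }

ppo_sweep = make_sweep_config(guidance_max=5.0)
sac_sweep = make_sweep_config(guidance_max=20.0)  # expanded guidance range
```

A dictionary of this shape is what `wandb.sweep` accepts as its sweep configuration. The sketch stops short of launching a sweep so that it runs without a W&B account.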

Experimental Results

PPO Training Results

Checkpoint            50     75     100    150    200    300    400    500
Average Reward        1.3k   2.7k   4.5k   6.4k   7.2k   9.9k   10.2k  9.6k
Waypoints Passed      1/7    2/7    4/7    5/7    6/7    7/7    7/7    7/7
Collisions/Episode    1.00   0.70   0.30   0.00   0.00   0.00   0.00   0.00
Avg Deviation (m)     0.123  0.113  0.084  0.065  0.094  0.064  0.063  0.094

Key Findings:

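  • Reaches full course completion (7/7 waypoints) with zero collisions by checkpoint 300; collisions per episode already fall to 0.00 at checkpoint 150
  • Average centerline deviation decreases from 0.1128 m to 0.0636 m (values that match checkpoints 75 and 300 in the table)
  • Peak average reward (10.2k) is reached at checkpoint 400, with a slight drop to 9.6k at checkpoint 500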

SAC Training Results

Checkpoint            50     75     100    150    200    300
Average Reward        2.0k   3.0k   3.6k   4.1k   5.4k   4.4k
Waypoints Passed      0/7    1/7    2/7    3/7    3/7    3/7
Collisions/Episode    1.00   1.00   1.00   1.00   1.00   1.00

Key Findings:

  • Flight completion rate remains 0% throughout training
  • Averages 1.00 collision per episode at every checkpoint, indicating that a terminal crash is the standard outcome
  • Converges to a local optimum, passing at most 3 of 7 waypoints before crashing

Performance Comparison Analysis

PPO Success Factors:

  • On-policy updates provide consistent learning signals
  • Capable of escaping local optima and solving end-to-end tasks
  • Demonstrates classical learning pattern: first mastering primary objectives, then optimizing trajectories

SAC Failure Factors:

  • Replay buffer becomes saturated with experiences from initial simple segments
  • Biased toward refining trajectory beginning, neglecting later challenges
  • Sample efficiency becomes counterproductive in this context
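
The replay-buffer argument can be illustrated with a toy model. Suppose the agent spends equal time in every segment it reaches and survives each segment with a fixed probability. The expected share of stored transitions per segment then decays geometrically. Both assumptions and the 0.5 survival value are hypothetical simplifications for illustration, not measurements or analysis from the paper.

```python
def expected_segment_shares(n_segments=7, survival=0.5):
    """Expected share of replay transitions contributed by each duct segment.

    Toy assumptions: equal time per reached segment, and an independent
    per-segment survival probability `survival`.
    """
    # Probability of reaching segment i (0-indexed) is survival ** i.
    reach = [survival ** i for i in range(n_segments)]
    total = sum(reach)
    return [r / total for r in reach]

shares = expected_segment_shares()
early_share = sum(shares[:3])  # share held by the first three segments
```

With these placeholder numbers the first three segments account for about 88% of the stored transitions, so uniformly sampled minibatches would rarely contain experience from the later, harder segments.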

Related Work

DRL Applications in Robotics

  • DRL learns complex control policies through trial-and-error interaction, suitable for robotic tasks difficult to model precisely
  • Has achieved breakthroughs in dynamic motion skill generation for legged robots

Importance of High-Fidelity Simulation

  • Simulation becomes essential for DRL research due to high costs and safety risks of real-world interaction
  • Techniques such as domain randomization are critical for sim-to-real transfer

Autonomous UAV Navigation

  • DRL has demonstrated superhuman performance in high-speed dynamic tasks such as drone racing
  • Navigation in confined environments presents greater challenges than open-space navigation, requiring more stable and reliable learning algorithms

Conclusions and Discussion

Main Conclusions

  1. Stability Over Efficiency: For high-precision, safety-critical navigation tasks, the reliable convergence of an on-policy method can be more decisive than the nominal sample efficiency of an off-policy method
  2. Criticality of Algorithm Selection: PPO successfully learns robust collision-free policies while SAC converges to suboptimal solutions
  3. Limitations of Replay Buffers: SAC's replay buffer may lead to exploration bias in complex sequential tasks

Limitations

  1. Limited Algorithm Scope: Compares only PPO and SAC algorithms
  2. Reward Engineering Dependency: Performance heavily depends on carefully designed reward functions
  3. Sim-to-Real Gap: Validation on actual physical systems remains absent

Future Directions

  1. Sim-to-Real Transfer: Transfer successful PPO policies to physical UAV testing platforms
  2. Domain Randomization: Combine domain randomization and curriculum learning to enhance policy robustness
  3. Hybrid Algorithms: Investigate advanced algorithms that unify on-policy stability and off-policy data efficiency

In-Depth Evaluation

Strengths

  1. Problem Specificity: Addresses practical safety-critical problems in industrial inspection
  2. Rigorous Experimental Design: Ensures fair comparison using unified framework; procedurally generated environments enhance generalization
  3. Clear and Compelling Conclusions: Provides clear guidance principles for algorithm selection
  4. High Engineering Value: Offers valuable technical pathways for practical industrial applications

Weaknesses

  1. Narrow Algorithm Coverage: Compares only two algorithms, lacking comprehensive algorithm evaluation
  2. Insufficient Theoretical Analysis: Failure analysis primarily based on empirical observations, lacking theoretical support
  3. Absence of Real-World Validation: All experiments conducted in simulation environments without real-world verification
  4. Reward Design Sensitivity: Different reward weights for different algorithms may affect generalizability of conclusions

Impact

  1. Academic Contribution: Provides empirical guidance for DRL algorithm selection in safety-critical tasks
  2. Industrial Value: Offers technical reference for industrial inspection UAV development
  3. Methodological Value: Validates effectiveness of procedurally generated environments in DRL training

Applicable Scenarios

  • High-precision, safety-critical UAV navigation tasks
  • Robot control in confined spaces
  • Reinforcement learning applications where reliable convergence matters more than sample efficiency

References

The paper cites 26 relevant references covering DRL foundational theory, UAV navigation, simulation techniques, and other domains, providing solid theoretical foundation. Key references include original PPO and SAC papers, breakthrough work in drone racing, and important research on sim-to-real transfer.