Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach
Tayar, de Oliveira, Tommaselli et al.
Autonomous UAV inspection of confined industrial infrastructure, such as ventilation ducts, demands robust navigation policies where collisions are unacceptable. While Deep Reinforcement Learning (DRL) offers a powerful paradigm for developing such policies, it presents a critical trade-off between on-policy and off-policy algorithms. Off-policy methods promise high sample efficiency, a vital trait for minimizing costly and unsafe real-world fine-tuning. In contrast, on-policy methods often exhibit greater training stability, which is essential for reliable convergence in hazard-dense environments. This paper directly investigates this trade-off by comparing a leading on-policy algorithm, Proximal Policy Optimization (PPO), against an off-policy counterpart, Soft Actor-Critic (SAC), for precision flight in procedurally generated ducts within a high-fidelity simulator. Our results show that PPO consistently learned a stable, collision-free policy that completed the entire course. In contrast, SAC failed to find a complete solution, converging to a suboptimal policy that navigated only the initial segments before failure. This work provides evidence that for high-precision, safety-critical navigation tasks, the reliable convergence of a well-established on-policy method can be more decisive than the nominal sample efficiency of an off-policy algorithm.
Title: Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach
Authors: Marco S. Tayar, Lucas K. de Oliveira, Felipe Andrade G. Tommaselli, Juliano D. Negri, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker (University of São Paulo)
The paper studies autonomous UAV inspection of confined industrial infrastructure such as ventilation ducts, a task that demands robust navigation policies because collisions are unacceptable. Deep reinforcement learning (DRL) is a powerful paradigm for developing such policies, but it forces a trade-off between on-policy and off-policy algorithms. Off-policy methods promise high sample efficiency, which matters for minimizing expensive and unsafe real-world fine-tuning. On-policy methods typically train more stably, which matters for reliable convergence in hazard-dense environments. The authors examine this trade-off by comparing a leading on-policy algorithm, Proximal Policy Optimization (PPO), with an off-policy algorithm, Soft Actor-Critic (SAC), on precision flight through procedurally generated ducts in a high-fidelity simulator. PPO consistently learns a stable, collision-free policy that completes the entire course, while SAC fails to find a complete solution and converges to a suboptimal policy that navigates only the initial segments.
Manual inspection of industrial infrastructure such as pipelines and ventilation ducts is complex, expensive, and time-consuming, yet critical to maintaining operational integrity. Unmanned aerial vehicles (UAVs) are a significant advance for industrial inspection because they enable automated, safe data collection in environments that are inaccessible or unsafe for humans.
Limitations of Classical Methods: Traditional motion planning approaches lack adaptability and struggle with unmodeled aerodynamic phenomena (such as ground effect in narrow ducts)
Safety-Critical Nature: In these environments, collisions are unacceptable, requiring highly reliable control strategies
Deep reinforcement learning provides a powerful paradigm for addressing these challenges, but algorithm selection is critical. The core question is: for tasks requiring high precision and safety, is the stability of on-policy methods more important than the sample efficiency of off-policy algorithms?
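For reference, the two objectives behind this question can be written as below. They are the standard forms from the original PPO and SAC papers, not notation taken from this paper. Here $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ an advantage estimate, $\epsilon$ the clipping range, $R$ the reward, $\alpha$ the entropy temperature, and $\mathcal{H}$ the policy entropy.

```latex
% PPO (on-policy): clipped surrogate objective, estimated from freshly collected rollouts
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% SAC (off-policy): maximum-entropy objective, optimized from replayed transitions
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[
    R(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  \right]
```

The clipping in PPO bounds how far each update can move the policy, which is one common explanation for its training stability. SAC gains sample efficiency by reusing stored transitions many times.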
Direct Comparative Analysis: A head-to-head comparison of a mature on-policy algorithm and a mature off-policy algorithm on autonomous UAV navigation in confined industrial ducts
Empirical Evidence: Evidence that, for high-precision tasks in hazard-dense environments, the training stability of an on-policy method can matter more than the sample efficiency of an off-policy method
Simulation Workflow Validation: Validates a simulation workflow using procedurally generated environments and high-fidelity physics engines as a testing platform for developing and benchmarking industrial UAV control strategies
Reward weights are tuned with the Weights & Biases sweep tool. For SAC, the ranges of the primary guidance terms are widened to accommodate its replay buffer characteristics.
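A sweep of this kind might be configured roughly as follows. This is a minimal sketch, not the authors' setup: the reward-term names (`w_progress`, `w_centering`, `w_collision`), the numeric ranges, the metric name `episode_return`, and the Bayesian search method are all assumptions made for illustration. Only the general shape (a wider range for SAC's guidance terms) follows the description above.

```python
# Hypothetical reward-term names and ranges; the paper does not list them.
PPO_RANGES = {
    "w_progress": (0.5, 2.0),    # reward for advancing along the duct
    "w_centering": (0.1, 1.0),   # reward for staying near the centerline
    "w_collision": (1.0, 10.0),  # penalty magnitude for hitting a wall
}

# Assumed widening of the primary guidance terms for SAC.
SAC_RANGES = {**PPO_RANGES, "w_progress": (0.5, 5.0), "w_centering": (0.1, 3.0)}


def make_sweep_config(ranges, method="bayes"):
    """Build a W&B sweep configuration dict from (min, max) ranges."""
    return {
        "method": method,
        "metric": {"name": "episode_return", "goal": "maximize"},
        "parameters": {
            name: {"distribution": "uniform", "min": low, "max": high}
            for name, (low, high) in ranges.items()
        },
    }


def launch_sweep(ranges, train_fn, project, count=20):
    """Register the sweep and run `count` trials of `train_fn`."""
    import wandb  # imported lazily so the config can be built without wandb

    sweep_id = wandb.sweep(sweep=make_sweep_config(ranges), project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)


ppo_config = make_sweep_config(PPO_RANGES)
sac_config = make_sweep_config(SAC_RANGES)
```

In this sketch, `train_fn` would be a user-supplied function that reads the sampled weights from `wandb.config`, trains one agent, and logs `episode_return`.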
Stability Over Efficiency: For high-precision, safety-critical navigation tasks, the reliable convergence of an on-policy method can be more decisive than the nominal sample efficiency of an off-policy method
Criticality of Algorithm Selection: PPO successfully learns robust collision-free policies while SAC converges to suboptimal solutions
Limitations of Replay Buffers: SAC's replay buffer may lead to exploration bias in complex sequential tasks
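One way to picture the replay-buffer concern is the toy simulation below. It is not the authors' experiment, and every number in it is an assumption: an agent with a fixed chance of clearing each duct segment fills a bounded FIFO buffer, and the code reports how the stored transitions are distributed over segments. Because episodes end at the first collision, early segments dominate the buffer, so uniform sampling from it rarely revisits the later part of the course.

```python
import random
from collections import Counter, deque


def fill_replay_buffer(num_episodes, num_segments, pass_prob, capacity, seed=0):
    """Simulate episodes that end at the first collision and store one
    transition per visited segment, tagged with the segment index."""
    rng = random.Random(seed)
    buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
    for _ in range(num_episodes):
        for segment in range(num_segments):
            buffer.append(segment)
            if rng.random() > pass_prob:
                break  # collision: the episode terminates in this segment
    return buffer


NUM_SEGMENTS = 10
buffer = fill_replay_buffer(
    num_episodes=2000, num_segments=NUM_SEGMENTS, pass_prob=0.7, capacity=5000
)
counts = Counter(buffer)

for segment in range(NUM_SEGMENTS):
    share = counts.get(segment, 0) / len(buffer)
    print(f"segment {segment}: {share:.1%} of stored transitions")
```

The policy here never improves, so the sketch only illustrates the data imbalance itself, not how SAC or PPO would respond to it during training.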
The paper cites 26 references covering foundational DRL theory, UAV navigation, simulation techniques, and related topics. Key references include the original PPO and SAC papers, breakthrough work in drone racing, and research on sim-to-real transfer.