Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios

Sinigaglia, Turcato, Carli et al.
Deep Reinforcement Learning is gaining increasing attention thanks to its capability to learn complex policies in high-dimensional settings. Recent advancements utilize a dual-network architecture to learn optimal policies through the Q-learning algorithm. However, this approach has notable drawbacks, such as an overestimation bias that can disrupt the learning process and degrade the performance of the resulting policy. To address this, novel algorithms have been developed that mitigate overestimation bias by employing multiple Q-functions. Edge scenarios, which prioritize privacy, have recently gained prominence. In these settings, limited computational resources pose a significant challenge for complex Machine Learning approaches, making the efficiency of algorithms crucial for their performance. In this work, we introduce a novel Reinforcement Learning algorithm tailored for edge scenarios, called Edge Delayed Deep Deterministic Policy Gradient (EdgeD3). EdgeD3 enhances the Deep Deterministic Policy Gradient (DDPG) algorithm, achieving significantly improved performance with $25\%$ less Graphics Processing Unit (GPU) time while maintaining the same memory usage. Additionally, EdgeD3 consistently matches or surpasses the performance of state-of-the-art methods across various benchmarks, all while using $30\%$ fewer computational resources and requiring $30\%$ less memory.

Basic Information

  • Paper ID: 2412.06390
  • Title: Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios
  • Authors: Alberto Sinigaglia, Niccolò Turcato, Ruggero Carli, Gian Antonio Susto
  • Classification: cs.LG cs.AI
  • Published Journal: IEEE Transactions on Automation Science and Engineering
  • Paper Link: https://arxiv.org/abs/2412.06390

Abstract

Deep reinforcement learning (DRL) has gained significant attention for its ability to learn complex policies in high-dimensional settings. Modern DRL algorithms typically rely on a dual-network architecture trained with Q-learning, which suffers from overestimation bias; more recent methods mitigate that bias by employing multiple Q-functions. In edge computing scenarios, where privacy concerns favor on-device learning and hardware is constrained, algorithmic efficiency becomes critical. This paper proposes Edge Delayed Deep Deterministic Policy Gradient (EdgeD3), a reinforcement learning algorithm designed for edge environments. EdgeD3 uses 25% less GPU time than DDPG at the same memory footprint, and matches or surpasses state-of-the-art methods across multiple benchmarks while using 30% fewer computational resources and 30% less memory.

Research Background and Motivation

Problem Definition

  1. Overestimation Bias Problem: Traditional Q-learning algorithms suffer from overestimation bias, which disrupts the learning process and degrades policy performance
  2. Edge Computing Resource Constraints: Edge devices have limited computational and memory resources, making existing multi-Q-network methods (e.g., TD3, SAC) computationally prohibitive
  3. Privacy Protection Requirements: Edge scenarios require on-device learning to avoid cloud transmission and protect data privacy

Research Significance

  • Edge computing has widespread applications in autonomous driving, smart manufacturing, and intelligent healthcare
  • Existing algorithms rely on multiple Q-networks: two in TD3 and SAC, and up to 10 in recent ensemble methods (TQC, REDQ), whose memory and computational overhead can reach 10 times that of baseline algorithms
  • Edge devices require efficient learning under limited resource constraints

Limitations of Existing Methods

  • TD3/SAC: Dual Q-network mechanisms use 29-31% more memory than EdgeD3, and EdgeD3 needs over 30% less GPU time than either
  • Recent Algorithms (TQC, REDQ, etc.): Use 5-10 Q-networks with even greater computational overhead, unsuitable for edge scenarios
  • CDQ (Clipped Double Q-learning) Mechanism: Lacks fine-grained control over the bias-variance tradeoff

Core Contributions

  1. Novel Expectile Loss Function: Proposes an expectile-based loss function that controls overestimation bias using only a single Q-network
  2. EdgeD3 Algorithm: Combines the Expectile loss, delayed policy updates, and target smoothing into a single efficient algorithm
  3. Theoretical Analysis: Proves monotonicity and asymptotic convergence of the Expectile loss
  4. Comprehensive Experimental Validation: Verifies algorithm effectiveness on MuJoCo simulation environments and real robot navigation tasks
  5. Resource Efficiency Improvements: Reduces GPU time by 25% compared to DDPG and computational/memory usage by 30% compared to state-of-the-art methods

Methodology Details

Task Definition

The paper studies continuous-control Markov Decision Processes (MDPs), defined as five-tuples (S, A, P, R, γ):

  • S: Continuous state space
  • A: Continuous action space
  • P: State transition probability density function
  • R: Reward function r: S×A×S → ℝ
  • γ: Discount factor

The objective is to learn a deterministic policy μ_φ(s_t) that maximizes the expected discounted cumulative reward.
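As a small illustration of the quantity being maximized, the sketch below computes the discounted return of a single finite episode. The function name and the toy reward list are illustrative choices, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy episode: three unit rewards with gamma = 0.5 gives 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The policy objective is the expectation of this quantity over trajectories generated by the policy and the transition density P.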

Core Technical Innovations

1. Expectile Loss Function

An asymmetric version of traditional MSE loss:

L_{α,β}(f_θ(x), y) = 1/Z {
    α(y - f_θ(x))² if f_θ(x) < y
    β(y - f_θ(x))² otherwise
}

where Z = max(α,β) is a normalization constant.

Key Properties:

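  • α = β: Reduces to standard MSE
  • α < β: Overestimates (f_θ(x) ≥ y) are penalized more heavily, so the fit tends toward underestimation, counteracting Q-learning overestimation
  • α > β: Underestimates (f_θ(x) < y) are penalized more heavily, so the fit tends toward overestimation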
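The piecewise definition above translates almost line for line into code. The following is a minimal scalar sketch; the function name is illustrative, and the defaults α = 1, β = 2 are the initial values this summary reports as a good starting point.

```python
def expectile_loss(prediction, target, alpha=1.0, beta=2.0):
    """Asymmetric squared error L_{alpha,beta}(prediction, target)."""
    z = max(alpha, beta)  # normalization constant Z
    # alpha weights underestimates (prediction < target), beta everything else.
    weight = alpha if prediction < target else beta
    return weight * (target - prediction) ** 2 / z


under = expectile_loss(0.0, 1.0)  # prediction one unit too low
over = expectile_loss(2.0, 1.0)   # prediction one unit too high
```

With α < β, the same absolute error costs more when the prediction is too high than when it is too low, which is what pushes the critic toward lower estimates.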

2. Theoretical Guarantees

Theorem 1 (Expectile Monotonicity): The expectile is monotonically non-decreasing in its level τ: if t₁ and t₂ denote the τ₁- and τ₂-expectiles of the same distribution, then τ₁ ≤ τ₂ ⟹ t₁ ≤ t₂
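The monotonicity property can be checked numerically on a small sample. The sketch below computes a τ-expectile as the constant that minimizes the asymmetric squared error, using iteratively reweighted means. It assumes the level is related to the loss weights by τ = α/(α+β), which is the standard expectile parameterization and may differ from the paper's; the sample values are arbitrary.

```python
def expectile(samples, tau, iterations=100):
    """tau-expectile of `samples` via iteratively reweighted means."""
    estimate = sum(samples) / len(samples)  # the mean is the 0.5-expectile
    for _ in range(iterations):
        # Samples above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if y > estimate else 1.0 - tau for y in samples]
        estimate = sum(w * y for w, y in zip(weights, samples)) / sum(weights)
    return estimate


samples = [0.0, 1.0, 2.0, 5.0, 10.0]
taus = [0.1, 1.0 / 3.0, 0.5, 0.9]  # 1/3 corresponds to alpha = 1, beta = 2
values = [expectile(samples, tau) for tau in taus]
```

On this sample the expectile at τ = 1/3 sits below the mean and the one at τ = 0.9 above it, in line with the theorem.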

Corollary 1.1 (Asymptotic Convergence): By closing the gap between α and β with a decay function λ(t), the loss approaches the symmetric case α = β (standard MSE), so the algorithm can be guaranteed to converge to an unbiased estimate:

min(α_{t+1}, β_{t+1}) ← min(α_t, β_t) + |α_t - β_t| · λ(t)
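A minimal sketch of this update rule follows. The source does not specify λ(t), so the schedule below, λ(t) = 1/(t+2), is a hypothetical choice with values in [0, 1]; the function names are illustrative as well.

```python
def anneal_weights(alpha, beta, lam):
    """Apply min(alpha, beta) <- min(alpha, beta) + |alpha - beta| * lam."""
    gap = abs(alpha - beta)
    if alpha <= beta:
        return alpha + gap * lam, beta
    return alpha, beta + gap * lam


def lam_schedule(t):
    """Hypothetical schedule lambda(t); not taken from the paper."""
    return 1.0 / (t + 2)


alpha, beta = 1.0, 2.0
history = []  # gap |alpha - beta| after each update
for t in range(50):
    alpha, beta = anneal_weights(alpha, beta, lam_schedule(t))
    history.append(abs(alpha - beta))
```

Each update multiplies the gap by (1 − λ(t)), so with this particular schedule the gap after T updates is 1/(T+1) of its starting value and the loss drifts toward the symmetric MSE.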

3. EdgeD3 Algorithm Architecture

EdgeDDPG Base Version:

  • Critic Update: Replaces MSE with Expectile loss
  • Actor Update: Standard deterministic policy gradient

EdgeD3 Complete Version:

  • Delayed Policy Update: Updates actor network every k steps
  • Target Smoothing: Injects noise into target estimation
  • Expectile Loss: Controls estimation bias

Key Update Formula:

y = E_{ε~p(ε)}[r + γQ_{θ'}(s', μ_{φ'}(s') + ε)]
∇L(θ) = ∇_θ N^{-1} Σ L_{α,β}(Q_θ(s,a), y)
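The target computation and the delayed actor update can be sketched with scalar stand-ins for the networks. Everything named here is an assumption for illustration: the Gaussian noise model, the defaults `noise_std=0.2` and `delay=2` (values commonly used for TD3, not confirmed for EdgeD3), and the toy actor and critic. Terminal-state handling is omitted.

```python
import random


def smoothed_target(reward, next_state, gamma, target_actor, target_critic,
                    noise_std=0.2, n_samples=1, rng=random):
    """Monte Carlo estimate of y = E_eps[r + gamma * Q'(s', mu'(s') + eps)]."""
    total = 0.0
    for _ in range(n_samples):
        # Perturb the target action with zero-mean Gaussian noise.
        action = target_actor(next_state) + rng.gauss(0.0, noise_std)
        total += reward + gamma * target_critic(next_state, action)
    return total / n_samples


def should_update_actor(step, delay=2):
    """Delayed policy update: refresh the actor every `delay` critic steps."""
    return step % delay == 0


def toy_actor(state):
    """Hypothetical scalar target policy mu'(s)."""
    return 0.5 * state


def toy_critic(state, action):
    """Hypothetical scalar target critic Q'(s, a)."""
    return state + 2.0 * action


# With the noise switched off, the target is r + gamma * Q'(s', mu'(s')).
y = smoothed_target(1.0, 2.0, 0.99, toy_actor, toy_critic, noise_std=0.0)
```

In a full training loop the critic would then be fitted to `y` with the Expectile loss on every step, while the actor and target networks would be refreshed only when `should_update_actor` returns true.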

Optimization Landscape Smoothing

Employs target noise injection instead of gradient penalty:

  • Traditional Method: L(θ) = MSE + ξ||∇_a Q(s,a)||² (computationally expensive)
  • Proposed Method: Injects noise into targets, equivalent to gradient penalty but computationally efficient
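
One common way to see why noise can stand in for a gradient penalty is a first-order Taylor argument. The sketch below is not taken from the paper. It assumes zero-mean isotropic action noise ε with covariance σ²I and a critic that is locally linear in the action, and it perturbs the critic input, whereas EdgeD3 perturbs the target action, so it should be read as an analogy rather than a proof.

```latex
Q(s, a + \epsilon) \approx Q(s, a) + \epsilon^\top \nabla_a Q(s, a),
\qquad \mathbb{E}[\epsilon] = 0, \quad \mathbb{E}[\epsilon \epsilon^\top] = \sigma^2 I

\mathbb{E}_\epsilon\!\left[ \big( Q(s, a + \epsilon) - y \big)^2 \right]
\approx \big( Q(s, a) - y \big)^2 + \sigma^2 \, \lVert \nabla_a Q(s, a) \rVert^2
```

The cross term vanishes because the noise has zero mean, leaving the squared error plus a penalty of the same form as ξ||∇_a Q(s,a)||² with ξ = σ², obtained without computing any extra gradients.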

Experimental Setup

Simulation Environments

  • Benchmark: MuJoCo physics simulation environment suite
  • Tasks: Ant, Reacher, Hopper, Walker2d, Humanoid, HalfCheetah, Swimmer
  • Evaluation: 10 evaluation episodes every 5000 training steps, repeated over 10 random seeds

Real Robot Experiments

  • Platform: Custom TurtleBot + Raspberry Pi 3B + 2D laser scanner
  • Tasks: Corridor navigation, unstructured environment navigation
  • State: 16-dimensional laser scan + linear velocity + angular velocity
  • Action: 2-dimensional continuous control (linear velocity, angular velocity)
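
The state layout above amounts to an 18-entry vector (16 laser readings plus two velocities). The sketch below shows one way to assemble it; the function name, the ordering of the entries, and the absence of any normalization are assumptions, since the summary does not specify them.

```python
SCAN_BEAMS = 16  # laser readings per state, as listed above


def build_state(scan, linear_velocity, angular_velocity):
    """Concatenate the laser scan and both velocities into one state vector."""
    if len(scan) != SCAN_BEAMS:
        raise ValueError(f"expected {SCAN_BEAMS} laser readings, got {len(scan)}")
    return [float(x) for x in scan] + [float(linear_velocity), float(angular_velocity)]


state = build_state([1.0] * 16, linear_velocity=0.2, angular_velocity=-0.1)
```

The policy would map this vector to the 2-dimensional action (linear velocity, angular velocity).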

Comparison Methods

  • DDPG: Baseline Deep Deterministic Policy Gradient
  • TD3: Twin Delayed DDPG
  • SAC: Soft Actor-Critic
  • PPO: Proximal Policy Optimization

Evaluation Metrics

  • Performance: Cumulative reward
  • Resource Usage: GPU time, memory consumption
  • Training Efficiency: Performance under same time budget

Experimental Results

Resource Usage Comparison

Memory Usage (relative to EdgeD3):

  • DDPG: -1.2%
  • TD3: +29.3%
  • SAC: +31.1%

GPU Time Comparison (percentages in parentheses give EdgeD3's reduction relative to that method):

  • EdgeD3: 214.0±7.1ms
  • DDPG: 285.5±7.4ms (-25.0%)
  • TD3: 308.2±2.7ms (-30.5%)
  • SAC (Delayed): 320.9±3.6ms (-33.3%)
  • SAC (Original): 492.9±2.9ms (-56.8%)
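
The percentages can be recomputed from the listed means as (EdgeD3 − other) / other. The sketch below does this with the rounded values shown above; the variable and function names are illustrative.

```python
EDGED3_MS = 214.0
OTHERS_MS = {
    "DDPG": 285.5,
    "TD3": 308.2,
    "SAC (Delayed)": 320.9,
    "SAC (Original)": 492.9,
}
REPORTED = {
    "DDPG": -25.0,
    "TD3": -30.5,
    "SAC (Delayed)": -33.3,
    "SAC (Original)": -56.8,
}


def relative_change(edge_ms, other_ms):
    """Percent change in GPU time when moving from `other_ms` to `edge_ms`."""
    return 100.0 * (edge_ms - other_ms) / other_ms


reductions = {
    name: round(relative_change(EDGED3_MS, ms), 1)
    for name, ms in OTHERS_MS.items()
}
```

Recomputing from the rounded means reproduces the DDPG and delayed-SAC figures exactly and the other two to within a few tenths of a percentage point, a gap that may come from rounding in the listed means.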

Performance Comparison

Best Performance in Simulation Environments (same time budget):

Environment  | EdgeD3   | DDPG     | SAC      | TD3
Ant-v3       | 4350.04  | 990.55   | 2739.81  | 4208.10
Hopper-v3    | 3388.44  | 2222.85  | 3148.89  | 2786.22
Walker2d-v3  | 3788.07  | 1601.16  | 2974.40  | 3580.83
HalfCheetah  | 10645.8  | 10309.0  | 8937.3   | 9677.5

EdgeD3 achieves the best performance in 5 of the 7 tasks and ranks in the top two on the remaining ones.

Real Robot Results

  • Corridor Navigation: EdgeD3 demonstrates superior performance from training onset
  • Unstructured Navigation: EdgeD3 surpasses other methods after 30 minutes
  • Update Frequency: EdgeD3 (8Hz) > TD3 (5.9Hz) > DDPG (5.8Hz) > SAC (3.3Hz)

Ablation Study

Tests impact of different α, β combinations:

  • Swimmer: α>β (tendency toward overestimation) performs better
  • Ant: α<β (tendency toward underestimation) performs better
  • Demonstrates the flexibility advantage of Expectile loss over fixed CDQ mechanisms

Related Work

Estimation Bias Mitigation

  • Double Q-learning: Uses two independent estimators
  • Ensemble Methods: TQC (5 networks), REDQ (10 networks), RAC (10 networks)
  • This Work's Contribution: Single-network solution with computational efficiency

Edge Computing RL

  • Model Compression: Quantization, pruning, and other techniques
  • Algorithm Optimization: This work addresses edge RL efficiency at the algorithmic level rather than through model compression

Continuous Control

  • Actor-Critic Methods: DDPG, TD3, SAC, etc.
  • Policy Gradient: Direct optimization of policy parameters

Conclusions and Discussion

Main Conclusions

  1. Efficiency Improvement: EdgeD3 reduces computational and memory usage by 30% compared to state-of-the-art methods
  2. Performance Maintenance: Achieves or surpasses state-of-the-art performance on most tasks
  3. Practical Viability: Validates feasibility of edge deployment on real robots
  4. Theoretical Foundation: Provides complete theoretical analysis and convergence guarantees

Limitations

  1. Complex Tasks: Room for improvement on highly complex tasks like Humanoid
  2. Hyperparameters: While α=1, β=2 are good initial values, task-specific tuning remains necessary
  3. Environment Dependency: Different environments may require different α, β settings

Future Directions

  1. Adaptive Hyperparameters: Online adjustment of α, β parameters
  2. Alternative Loss Functions: Exploration of quantile loss, imbalanced Huber loss, etc.
  3. Model Compression Integration: Combination with quantization, pruning, and other techniques

In-Depth Evaluation

Strengths

  1. Strong Innovation: Uses an expectile-based loss to control overestimation bias with a single Q-network, avoiding the cost of multi-critic methods
  2. High Practical Value: Directly addresses resource constraints in edge computing
  3. Complete Theory: Provides theoretical guarantees on monotonicity and convergence
  4. Comprehensive Experiments: Dual validation through simulation and real robot experiments
  5. Clear Presentation: Detailed algorithm description with strong reproducibility

Weaknesses

  1. Limited Scope: Primarily focuses on continuous control; applicability to discrete action spaces unclear
  2. Hyperparameter Sensitivity: Different tasks require α, β adjustment; lacks automated methods
  3. Incomplete Comparisons: No direct comparison with recent ensemble methods such as TQC or REDQ; baselines are limited to DDPG, TD3, SAC, and PPO

Impact

  1. Academic Contribution: Opens new direction for edge RL with theory-practice balance
  2. Industrial Application: Directly applicable to resource-constrained real-world deployments
  3. Reproducibility: Provides complete algorithms and hyperparameter specifications

Applicable Scenarios

  1. Edge Devices: Mobile robots, drones, IoT devices
  2. Real-time Control: Control tasks requiring low-latency responses
  3. Privacy Protection: Scenarios where data cannot be transmitted to cloud
  4. Resource-Constrained Environments: Settings with strict CPU, memory, and energy limitations

References

The paper cites 56 important references from reinforcement learning, continuous control, and edge computing domains, providing a solid theoretical foundation spanning from fundamental theory to practical applications.


Overall Assessment: This is a high-quality research paper with outstanding contributions in theoretical innovation, experimental validation, and practical value. The EdgeD3 algorithm elegantly addresses the RL efficiency problem in edge computing scenarios, demonstrating significant academic value and application potential.