2025-11-21T07:37:22.920666

Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios

Sinigaglia, Turcato, Carli et al.

Deep Reinforcement Learning is gaining increasing attention thanks to its capability to learn complex policies in high-dimensional settings. Recent advancements utilize a dual-network architecture to learn optimal policies through the Q-learning algorithm. However, this approach has notable drawbacks, such as an overestimation bias that can disrupt the learning process and degrade the performance of the resulting policy. To address this, novel algorithms have been developed that mitigate overestimation bias by employing multiple Q-functions. Edge scenarios, which prioritize privacy, have recently gained prominence. In these settings, limited computational resources pose a significant challenge for complex Machine Learning approaches, making the efficiency of algorithms crucial for their performance. In this work, we introduce a novel Reinforcement Learning algorithm tailored for edge scenarios, called Edge Delayed Deep Deterministic Policy Gradient (EdgeD3). EdgeD3 enhances the Deep Deterministic Policy Gradient (DDPG) algorithm, achieving significantly improved performance with $25\%$ less Graphics Processing Unit (GPU) time while maintaining the same memory usage. Additionally, EdgeD3 consistently matches or surpasses the performance of state-of-the-art methods across various benchmarks, all while using $30\%$ fewer computational resources and requiring $30\%$ less memory.

Edge Delayed Deep Deterministic Policy Gradient: Efficient Continuous Control for Edge Scenarios

Basic Information

Paper ID: 2412.06390
Title: Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios
Authors: Alberto Sinigaglia, Niccolò Turcato, Ruggero Carli, Gian Antonio Susto
Classification: cs.LG cs.AI
Published Journal: IEEE Transactions on Automation Science and Engineering
Paper Link: https://arxiv.org/abs/2412.06390

Abstract

Deep reinforcement learning (DRL) has gained significant attention for its ability to learn complex policies in high-dimensional settings. Modern DRL algorithms typically rely on a dual-network architecture to learn optimal policies through Q-learning, but this approach suffers from overestimation bias, which recent methods mitigate by employing multiple Q-functions. In edge computing scenarios, which prioritize privacy, limited computational resources make algorithmic efficiency crucial. This paper proposes Edge Delayed Deep Deterministic Policy Gradient (EdgeD3), a reinforcement learning algorithm designed for edge scenarios. EdgeD3 improves significantly on DDPG while using 25% less GPU time at the same memory usage, and it matches or surpasses state-of-the-art methods across various benchmarks while using 30% fewer computational resources and 30% less memory.

Research Background and Motivation

Problem Definition

Overestimation Bias Problem: Traditional Q-learning algorithms suffer from overestimation bias, which disrupts the learning process and degrades policy performance
Edge Computing Resource Constraints: Edge devices have limited computational and memory resources, making existing multi-Q-network methods (e.g., TD3, SAC) computationally prohibitive
Privacy Protection Requirements: Edge scenarios require on-device learning to avoid cloud transmission and protect data privacy

Research Significance

Edge computing has widespread applications in autonomous driving, smart manufacturing, and intelligent healthcare
Existing algorithms rely on multiple Q-networks: TD3 and SAC use two, while recent ensemble methods (TQC, REDQ, etc.) employ up to 10, with memory and computational overhead growing accordingly relative to single-critic baselines
Edge devices require efficient learning under limited resource constraints

Limitations of Existing Methods

TD3/SAC: Dual Q-network mechanisms require 29-31% more memory than EdgeD3, and EdgeD3 needs over 30% less GPU time than either
Recent Algorithms (TQC, REDQ, etc.): Use 5-10 Q-networks with even greater computational overhead, unsuitable for edge scenarios
CDQ Mechanism: Lacks fine-grained control over bias-variance tradeoffs

Core Contributions

Novel Expectile Loss Function: Proposes an expectile-based loss function that controls overestimation bias using only a single Q-network
EdgeD3 Algorithm: Combines Expectile loss, delayed policy updates, and target smoothing techniques for an efficient algorithm
Theoretical Analysis: Proves monotonicity and asymptotic convergence of the Expectile loss
Comprehensive Experimental Validation: Verifies algorithm effectiveness on Mujoco simulation environments and real robot navigation tasks
Resource Efficiency Improvements: Reduces GPU time by 25% compared to DDPG and computational/memory usage by 30% compared to state-of-the-art methods

Methodology Details

Task Definition

Studies continuous control Markov Decision Processes (MDPs) defined as five-tuples (S, A, P, R, γ):

S: Continuous state space
A: Continuous action space
P: State transition probability density function
R: Reward function r: S×A×S → ℝ
γ: Discount factor

The objective is to learn a policy μ_φ(a_t|s_t) that maximizes the expected discounted cumulative reward.
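The quantity being maximized can be made concrete with a small sketch. The function below computes the discounted return of a single finite episode; the name `discounted_return` and the list-of-rewards interface are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode.

    `rewards` is a list of per-step rewards and `gamma` the discount factor.
    Both names are hypothetical, chosen only for this illustration.
    """
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For instance, three rewards of 1 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.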

Core Technical Innovations

1. Expectile Loss Function

An asymmetric version of traditional MSE loss:

L_{α,β}(f_θ(x), y) = 1/Z {
    α(y - f_θ(x))² if f_θ(x) < y
    β(y - f_θ(x))² otherwise
}

where Z = max(α,β) is a normalization constant.

Key Properties:

α = β: Degenerates to standard MSE
α < β: Tends toward underestimation, counteracting Q-learning overestimation
α > β: Tends toward overestimation
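A minimal Python sketch of the loss defined above may help make the asymmetry concrete. It follows the piecewise definition and the normalization Z = max(α, β) as stated; the function name, the plain-list interface, and averaging over the batch are assumptions made for illustration.

```python
def expectile_loss(preds, targets, alpha=1.0, beta=2.0):
    """Mean asymmetric squared error, normalised by Z = max(alpha, beta).

    Errors where the prediction is below the target are weighted by alpha,
    all other errors by beta. Defaults alpha=1, beta=2 are the initial
    values mentioned in the summary; the batch mean is an assumption.
    """
    z = max(alpha, beta)
    total = 0.0
    for f, y in zip(preds, targets):
        weight = alpha if f < y else beta
        total += weight * (y - f) ** 2
    return total / (z * len(preds))
```

With α = 1 and β = 2, an overestimate of size 1 costs twice as much as an underestimate of the same size (1.0 versus 0.5), which is what pushes the critic toward lower values.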

2. Theoretical Guarantees

Theorem 1 (Expectile Monotonicity): The expectile is monotonically non-decreasing in its level τ, i.e., τ₁ ≤ τ₂ ⟹ t₁ ≤ t₂, where t₁ and t₂ denote the τ₁- and τ₂-expectiles of the same distribution
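The monotonicity property can be checked numerically on a small sample. The sketch below computes a sample expectile by iteratively reweighted means, a standard way to minimize an asymmetric squared loss; it is not taken from the paper. Under the usual expectile convention, a constant prediction minimizing the loss with weights α and β corresponds to level τ = α/(α+β), so α = 1, β = 2 would correspond to τ = 1/3; this mapping is a derivation added here, not a statement from the source.

```python
def sample_expectile(values, tau, iters=100):
    """tau-expectile of a sample via iteratively reweighted means.

    Requires 0 < tau < 1. Points above the current estimate get weight tau,
    the rest get weight 1 - tau; the estimate is the weighted mean.
    """
    t = sum(values) / len(values)  # start from the mean (tau = 0.5 case)
    for _ in range(iters):
        weights = [tau if v > t else 1.0 - tau for v in values]
        new_t = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_t - t) < 1e-12:
            break
        t = new_t
    return t
```

Evaluating this on a fixed sample for increasing τ yields a non-decreasing sequence, with τ = 0.5 recovering the sample mean.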

Corollary 1.1 (Asymptotic Convergence): With a decay function λ(t), the smaller of the two weights is gradually raised toward the larger one, so the loss approaches the symmetric MSE and the estimate converges to an unbiased one:

min(α_{t+1}, β_{t+1}) ← min(α_t, β_t) + |α_t - β_t| · λ(t)
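One step of this update rule can be sketched as follows. The function name and the assumption that λ(t) takes values in [0, 1] are illustrative; the summary does not specify the form of λ(t).

```python
def anneal_weights(alpha, beta, lam):
    """Apply min(a, b) <- min(a, b) + |a - b| * lam once.

    `lam` stands for lambda(t) evaluated at the current step and is assumed
    to lie in [0, 1]. The larger weight is left unchanged.
    """
    gap = abs(alpha - beta)
    if alpha <= beta:
        return alpha + gap * lam, beta
    return alpha, beta + gap * lam
```

Each call shrinks the gap |α - β| by a factor (1 - λ), so with a constant λ = 0.1 (a hypothetical schedule) the gap falls below 1% of its initial value after about 50 steps, at which point the loss is close to plain MSE.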

3. EdgeD3 Algorithm Architecture

EdgeDDPG Base Version:

Critic Update: Replaces MSE with Expectile loss
Actor Update: Standard deterministic policy gradient

EdgeD3 Complete Version:

Delayed Policy Update: Updates actor network every k steps
Target Smoothing: Injects noise into target estimation
Expectile Loss: Controls estimation bias

# Key Update Formula
y = E_{ε~p(x)}[r + γQ_{θ'}(s', ε + μ_{φ'}(s'))]
∇L(θ) = ∇_θ N^{-1} Σ L_{α,β}(Q_θ(s,a), y)
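The target computation and the delayed actor update can be sketched in plain Python. This is only an illustration under stated assumptions: the noise distribution is taken to be zero-mean Gaussian, the expectation is approximated by Monte Carlo sampling, and no noise clipping or terminal-state masking is applied. The names `q_target` and `mu_target` (callables standing in for Q_{θ'} and μ_{φ'}), and the defaults `sigma=0.2`, `k=2`, `gamma=0.99`, are hypothetical.

```python
import random


def smoothed_target(reward, next_state, q_target, mu_target,
                    gamma=0.99, sigma=0.2, n_samples=1, rng=random):
    """Monte Carlo estimate of y = E_eps[r + gamma * Q'(s', mu'(s') + eps)].

    q_target(state, action) -> float and mu_target(state) -> list of floats
    are placeholder callables for the target critic and target actor.
    """
    base_action = mu_target(next_state)
    total = 0.0
    for _ in range(n_samples):
        # Assumed Gaussian smoothing noise, added per action dimension.
        noisy_action = [a + rng.gauss(0.0, sigma) for a in base_action]
        total += reward + gamma * q_target(next_state, noisy_action)
    return total / n_samples


def should_update_actor(step, k=2):
    """Delayed policy update: refresh the actor only every k critic steps."""
    return step % k == 0
```

In a training loop, the critic would be fitted to `smoothed_target(...)` with the expectile loss at every step, while the actor (and target networks) would be updated only when `should_update_actor(step)` is true.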

Optimization Landscape Smoothing

Employs target noise injection instead of gradient penalty:

Traditional Method: L(θ) = MSE + ξ||∇_a Q(s,a)||² (computationally expensive)
Proposed Method: Injects noise into targets, equivalent to gradient penalty but computationally efficient

Experimental Setup

Simulation Environments

Dataset: Mujoco physics simulation environment suite
Tasks: Ant, Reacher, Hopper, Walker2d, Humanoid, HalfCheetah, Swimmer
Evaluation: 10 episodes evaluated every 5000 steps, 10 random seeds

Real Robot Experiments

Platform: Custom TurtleBot + Raspberry Pi 3B + 2D laser scanner
Tasks: Corridor navigation, unstructured environment navigation
State: 16-dimensional laser scan + linear velocity + angular velocity
Action: 2-dimensional continuous control (linear velocity, angular velocity)

Comparison Methods

DDPG: Baseline Deep Deterministic Policy Gradient
TD3: Twin Delayed DDPG
SAC: Soft Actor-Critic
PPO: Proximal Policy Optimization

Evaluation Metrics

Performance: Cumulative reward
Resource Usage: GPU time, memory consumption
Training Efficiency: Performance under same time budget

Experimental Results

Resource Usage Comparison

Memory Usage (relative to EdgeD3):

DDPG: -1.2%
TD3: +29.3%
SAC: +31.1%

GPU Time Comparison (percentages in parentheses give EdgeD3's reduction relative to that method):

EdgeD3: 214.0±7.1ms
DDPG: 285.5±7.4ms (-25.0%)
TD3: 308.2±2.7ms (-30.5%)
SAC (Delayed): 320.9±3.6ms (-33.3%)
SAC (Original): 492.9±2.9ms (-56.8%)
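The percentages can be recomputed from the mean timings as 1 - (EdgeD3 time / baseline time). The short script below does this using only the mean values listed above; the variable names are illustrative.

```python
# Mean GPU times in milliseconds, copied from the list above.
gpu_ms = {
    "DDPG": 285.5,
    "TD3": 308.2,
    "SAC (Delayed)": 320.9,
    "SAC (Original)": 492.9,
}
edged3_ms = 214.0


def reduction_pct(baseline_ms, ours_ms=edged3_ms):
    """Percentage of GPU time saved by EdgeD3 relative to a baseline."""
    return 100.0 * (1.0 - ours_ms / baseline_ms)


reductions = {name: round(reduction_pct(ms), 1) for name, ms in gpu_ms.items()}
```

The recomputed values agree with the listed percentages to within about 0.2 percentage points; the small differences for TD3 and SAC (Original) are presumably due to rounding of the reported means.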

Performance Comparison

Best Performance in Simulation Environments (same time budget):

Environment	EdgeD3	DDPG	SAC	TD3
Ant-v3	4350.04	990.55	2739.81	4208.10
Hopper-v3	3388.44	2222.85	3148.89	2786.22
Walker2d-v3	3788.07	1601.16	2974.40	3580.83
HalfCheetah	10645.8	10309.0	8937.3	9677.5

EdgeD3 achieves best performance in 5 out of 7 tasks and ranks in the top two for remaining tasks.

Real Robot Results

Corridor Navigation: EdgeD3 demonstrates superior performance from training onset
Unstructured Navigation: EdgeD3 surpasses other methods after 30 minutes
Update Frequency: EdgeD3 (8Hz) > TD3 (5.9Hz) > DDPG (5.8Hz) > SAC (3.3Hz)

Ablation Study

Tests impact of different α, β combinations:

Swimmer: α>β (tendency toward overestimation) performs better
Ant: α<β (tendency toward underestimation) performs better
Demonstrates the flexibility advantage of Expectile loss over fixed CDQ mechanisms

Related Work

Estimation Bias Mitigation

Double Q-learning: Uses two independent estimators
Ensemble Methods: TQC (5 networks), REDQ (10 networks), RAC (10 networks)
This Work's Contribution: Single-network solution with computational efficiency

Edge Computing RL

Model Compression: Quantization, pruning, and other techniques
Algorithm Optimization: This work addresses edge RL efficiency at the algorithmic level rather than through model compression

Continuous Control

Actor-Critic Methods: DDPG, TD3, SAC, etc.
Policy Gradient: Direct optimization of policy parameters

Conclusions and Discussion

Main Conclusions

Efficiency Improvement: EdgeD3 reduces computational and memory usage by 30% compared to state-of-the-art methods
Performance Maintenance: Achieves or surpasses state-of-the-art performance on most tasks
Practical Viability: Validates feasibility of edge deployment on real robots
Theoretical Foundation: Provides complete theoretical analysis and convergence guarantees

Limitations

Complex Tasks: Room for improvement on highly complex tasks like Humanoid
Hyperparameters: While α=1, β=2 are good initial values, task-specific tuning remains necessary
Environment Dependency: Different environments may require different α, β settings

Future Directions

Adaptive Hyperparameters: Online adjustment of α, β parameters
Alternative Loss Functions: Exploration of quantile loss, imbalanced Huber loss, etc.
Model Compression Integration: Combination with quantization, pruning, and other techniques

In-Depth Evaluation

Strengths

Strong Innovation: Uses an expectile-based loss to control overestimation bias with only a single Q-network
High Practical Value: Directly addresses resource constraints in edge computing
Complete Theory: Provides theoretical guarantees on monotonicity and convergence
Comprehensive Experiments: Dual validation through simulation and real robot experiments
Clear Presentation: Detailed algorithm description with strong reproducibility

Weaknesses

Limited Scope: Primarily focuses on continuous control; applicability to discrete action spaces unclear
Hyperparameter Sensitivity: Different tasks require α, β adjustment; lacks automated methods
Incomplete Comparisons: Missing experimental comparisons with recent ensemble methods (e.g., TQC, REDQ)

Impact

Academic Contribution: Opens new direction for edge RL with theory-practice balance
Industrial Application: Directly applicable to resource-constrained real-world deployments
Reproducibility: Provides complete algorithms and hyperparameter specifications

Applicable Scenarios

Edge Devices: Mobile robots, drones, IoT devices
Real-time Control: Control tasks requiring low-latency responses
Privacy Protection: Scenarios where data cannot be transmitted to cloud
Resource-Constrained Environments: Settings with strict CPU, memory, and energy limitations

References

The paper cites 56 important references from reinforcement learning, continuous control, and edge computing domains, providing a solid theoretical foundation spanning from fundamental theory to practical applications.

Overall Assessment: This is a high-quality research paper with outstanding contributions in theoretical innovation, experimental validation, and practical value. The EdgeD3 algorithm elegantly addresses the RL efficiency problem in edge computing scenarios, demonstrating significant academic value and application potential.