2025-11-16T22:04:13.069952

An Introduction to Zero-Order Optimization Techniques for Robotics

Jordana, Zhang, Amigo et al.
Zero-order optimization techniques are becoming increasingly popular in robotics due to their ability to handle non-differentiable functions and escape local minima. These advantages make them particularly useful for trajectory optimization and policy optimization. In this work, we propose a mathematical tutorial on random search. It offers a simple and unifying perspective for understanding a wide range of algorithms commonly used in robotics. Leveraging this viewpoint, we classify many trajectory optimization methods under a common framework and derive novel competitive RL algorithms.
academic

An Introduction to Zero-Order Optimization Techniques for Robotics

Basic Information

  • Paper ID: 2506.22087
  • Title: An Introduction to Zero-Order Optimization Techniques for Robotics
  • Authors: Armand Jordana, Jianghan Zhang, Joseph Amigo, Ludovic Righetti (New York University)
  • Classification: cs.RO (Robotics)
  • Publication Date: arXiv preprint, latest version October 10, 2025
  • Paper Link: https://arxiv.org/abs/2506.22087

Abstract

Zero-order optimization techniques are gaining increasing popularity in robotics because they can handle non-differentiable functions and escape local minima. These advantages make them particularly useful in trajectory optimization and policy optimization. This paper presents a mathematical tutorial on stochastic search, providing a simple unified perspective for understanding widely-used algorithms in robotics. Leveraging this perspective, the authors classify many trajectory optimization methods under a unified framework and derive novel and competitive reinforcement learning algorithms.

Research Background and Motivation

Core Problem

The core problem addressed in this paper is how to achieve a unified understanding of zero-order optimization algorithms widely used in robotics, including various methods in trajectory optimization (TO) and reinforcement learning (RL).

Problem Significance

  1. Practical Necessity: Robot systems frequently encounter non-differentiable objective functions, particularly in contact-rich problems (e.g., locomotion, manipulation)
  2. Computational Advances: Development of parallel computing and GPU hardware has made sampling-intensive zero-order methods feasible on complex robotic systems
  3. Theoretical Fragmentation: While existing algorithms have strong theoretical foundations, they lack unified understanding in the robotics community

Limitations of Existing Approaches

  1. Algorithm Isolation: Algorithms such as MPPI, CMA-ES, and REINFORCE appear unrelated, lacking a unified framework
  2. Scattered Theory: These algorithms are distributed across optimization, statistics, machine learning, and control domains
  3. Limited Application Guidance: Lack of guidance for designing new algorithms from a unified perspective

Research Motivation

By establishing a unified perspective through stochastic search and Gaussian smoothing, connecting zero-order methods in both trajectory optimization and policy optimization, the work aims to deepen theoretical understanding while guiding new algorithm design.

Core Contributions

  1. Unified Theoretical Framework: Provides a unified perspective for understanding zero-order algorithms in TO and RL based on stochastic search
  2. Algorithm Reinterpretation: Unifies classical algorithms (MPPI, CMA, REINFORCE) under a Gaussian smoothing framework
  3. Novel Algorithm Derivation: Derives new competitive RL algorithms (e.g., RS-DDPG, LSE-DDPG) based on the unified framework
  4. Theoretical Insights: Explains the theoretical mechanism by which stochastic algorithms escape local minima
  5. Experimental Validation: Verifies the framework's effectiveness and new algorithms' competitiveness on multiple robotic tasks

Methodology Details

Problem Formulation

This paper focuses on solving the following general optimization problem: minxRnf(x)\min_{x \in \mathbb{R}^n} f(x)

This formulation encompasses a broad range of problems in robotics:

  • Trajectory Optimization: Optimization in trajectory space (finite-dimensional)
  • Policy Optimization: Optimization in policy parameter space (infinite-dimensional functions)

Core Theoretical Framework

1. Stochastic Search Fundamentals

Pure Random Search (Algorithm 1):

Input: x₀ ∈ Rⁿ
while stopping condition not met:
    Randomly sample x̃ in Rⁿ
    if f(x̃) < f(x):
        x ← x̃
Output: x

Greedy Local Search (Algorithm 2):

Input: x₀ ∈ Rⁿ, Σ
while stopping condition not met:
    Sample d ~ N(0,Σ)
    if f(x+d) < f(x):
        x ← x+d

2. Gaussian Smoothing Gradient Approximation

Core Idea: Rather than directly approximating the gradient of the original function f, study a smoothed surrogate function: fμ(x)=E[f(x+μϵ)]f_μ(x) = \mathbb{E}[f(x + μϵ)] where ϵN(0,Σ)ϵ \sim \mathcal{N}(0,Σ)

Key Derivation: The gradient of the surrogate function can be estimated through function evaluations: fμ(x)=E[f(x+μϵ)f(x)μΣ1ϵ]\nabla f_μ(x) = \mathbb{E}\left[\frac{f(x+μϵ) - f(x)}{μ}Σ^{-1}ϵ\right]

This provides a gradient estimate: g=f(x+μϵ)f(x)μΣ1ϵg = \frac{f(x+μϵ) - f(x)}{μ}Σ^{-1}ϵ

3. Log-Sum-Exp Transformation

Theoretical Foundation of MPPI: Consider the continuous log-sum-exp transformation function: fμ,λ(x)=λlog(E[exp(1λf(x+μϵ))])f_{μ,λ}(x) = -λ \log\left(\mathbb{E}\left[\exp\left(-\frac{1}{λ}f(x+μϵ)\right)\right]\right)

Its gradient is: fμ,λ(x)=λE[exp(1λf(x+μϵ))Σ1ϵ]μE[exp(1λf(x+μϵ))]\nabla f_{μ,λ}(x) = \frac{-λ\mathbb{E}[\exp(-\frac{1}{λ}f(x+μϵ))Σ^{-1}ϵ]}{μ\mathbb{E}[\exp(-\frac{1}{λ}f(x+μϵ))]}

This directly corresponds to the MPPI update rule: xk=1Kwkxkx \leftarrow \sum_{k=1}^K w_k x_k where the weights are: wk=exp(1λ(f(xk)ρ))jexp(1λ(f(xj)ρ))w_k = \frac{\exp(-\frac{1}{λ}(f(x_k) - ρ))}{\sum_j \exp(-\frac{1}{λ}(f(x_j) - ρ))}

Technical Innovations

1. Establishment of Unified Perspective

  • Unifies seemingly different algorithms (MPPI, CMA, REINFORCE) under a Gaussian smoothing framework
  • Reveals log-sum-exp transformation as a generalization of Gaussian smoothing

2. Natural Gradient Interpretation

Proves that MPPI executes natural gradient steps: xxαF1gx \leftarrow x - αF^{-1}g where F is the Fisher information matrix, equal to the inverse of the covariance matrix for Gaussian distributions

3. CMA Derivation

Rederives CMA from the perspective of optimizing Gaussian distribution parameters: minθ=(x,Σ)EzN(x,Σ)[f(z)]\min_{θ=(x,Σ)} \mathbb{E}_{z\sim\mathcal{N}(x,Σ)}[f(z)]

Using natural gradient yields the update rules:

Σ ← (1-α∑wₖ)Σ + α∑wₖ(xₖ-x)(xₖ-x)ᵀ
x ← (1-α∑wₖ)x + α∑wₖxₖ

4. Theoretical Explanation of Global Convergence

Explains how stochasticity helps escape local minima through Langevin dynamics: xk+1=xkαkgk+γkϵkx_{k+1} = x_k - α_k g_k + γ_k ϵ_k

Experimental Setup

Trajectory Optimization Experiments

Benchmarks: Four benchmark problems based on Hydra

  • Cartpole: Classic inverted pendulum control
  • DoubleCartPole: Double inverted pendulum system
  • PushT: Pushing task
  • Humanoid: Humanoid robot control

Comparison Algorithms:

  • Predictive Sampling
  • Randomized Smoothing
  • MPPI
  • MPPI-CMA (proposed)

Experimental Configuration:

  • 2048 samples per iteration
  • MPPI temperature parameter λ = 0.1
  • Average over 6 random seeds
  • Control bounds enforced through penalty terms in cost function

Reinforcement Learning Experiments

Environments: 7 MuJoCo continuous control environments

Comparison Algorithms:

  • DDPG vs RS-DDPG vs LSE-DDPG
  • TD3 vs RS-TD3 vs LSE-TD3

Experimental Configuration:

  • Implementation based on CleanRL
  • 10 samples per update
  • Sampling noise standard deviation 0.1
  • Average over 5 runs

Evaluation Metrics

  • TO: Cost reduction curves during optimization
  • RL: Normalized scores and episode rewards

Experimental Results

Trajectory Optimization Results

  1. MPPI-CMA Performs Best: Consistently outperforms MPPI on all test problems
  2. Predictive Sampling Unexpectedly Effective: Despite simplicity, shows good performance
  3. Randomized Smoothing Sensitive: Highly sensitive to step size selection with high variance
  4. Value of Covariance Adaptation: Demonstrates importance of adaptive covariance matrices

Reinforcement Learning Results

  1. Significant DDPG Improvements: RS-DDPG and LSE-DDPG significantly outperform original DDPG
  2. Limited TD3 Improvements: TD3 is already a strong algorithm with limited improvement space
  3. Universal Smoothing Benefits: Demonstrates universal value of Q-function gradient smoothing

Key Findings

  1. Log-Sum-Exp Advantages: Better handles multimodal functions compared to standard Gaussian smoothing
  2. Temperature Parameter Importance: Appropriate temperature parameter λ is critical for performance
  3. Parallelization Friendly: All methods parallelize well

Trajectory Optimization Domain

  • Classical Methods: Gradient descent, Newton's method and other deterministic methods prone to local minima
  • Sampling Methods: Predictive Sampling, MPPI and other zero-order methods
  • Theoretical Connections: 13 first demonstrates similarity between MPPI and CMA-ES, 14 understands MPPI as approximate gradient method

Reinforcement Learning Domain

  • Parameter Space Search: 16,17 explore stochastic search in policy parameter space
  • Policy Gradient Connections: 18,19 establish connections between policy gradients and stochastic search
  • Evolution Strategies: 20,21 provide comprehensive surveys connecting RL and ES techniques

Positioning of This Work

This paper provides the first broad perspective connecting gradient-free methods in both TO and RL, filling the gap of a unified theoretical framework.

Conclusions and Discussion

Main Conclusions

  1. Unified Framework Effective: Stochastic search perspective successfully unifies multiple zero-order algorithms in TO and RL
  2. Theory Guides Practice: Unified understanding enables design of new competitive algorithms
  3. Value of Stochasticity: Theoretically explains the mechanism by which stochastic algorithms escape local minima
  4. Practical Validation: Verifies framework and new algorithms' effectiveness on multiple robotic tasks

Limitations

  1. Asymptotic Convergence: Global convergence guarantees are only asymptotic, with limited practical significance
  2. Curse of Dimensionality: Sampling methods still suffer from the curse of dimensionality
  3. Hyperparameter Sensitivity: Temperature parameter, step size, etc. require careful tuning
  4. Constraint Handling: Current framework primarily addresses unconstrained optimization

Future Directions

  1. Constrained Optimization: Extend to constrained zero-order optimization
  2. Global Solution Search: Develop more effective global solution search methods
  3. Adaptive Parameters: Automatically tune temperature, step size, and other hyperparameters
  4. Theory Refinement: Provide stronger theoretical guarantees for stochastic smoothing

In-Depth Evaluation

Strengths

  1. Outstanding Theoretical Contribution: Provides the first unified theoretical framework for zero-order optimization in robotics
  2. Mathematical Rigor: Derivations are rigorous and theoretical analysis is thorough
  3. Practical Guidance Value: Theoretical insights directly guide new algorithm design
  4. Comprehensive Experiments: Covers multiple benchmarks in both TO and RL domains
  5. Writing Clarity: Complex theory is expressed clearly and is easy to understand

Weaknesses

  1. Limited Novelty: Primarily reinterprets existing algorithms; original algorithm contributions are relatively limited
  2. Experimental Scale: RL experiments only tested on MuJoCo environments, lacking more complex robotic tasks
  3. Theory Gap: Global convergence theory for stochastic smoothing is not as complete as SPSA
  4. Practical Limitations: Some theoretical results (e.g., asymptotic convergence) have limited practical value

Impact

  1. Academic Value: Provides important theoretical unification for the robotics optimization domain
  2. Educational Significance: As a tutorial paper, has good educational value for students and researchers
  3. Method Inspiration: Unified framework may inspire more new algorithm designs
  4. Cross-Domain Connection: Promotes communication and collaboration between TO and RL communities

Applicable Scenarios

  1. Non-smooth Optimization: Robotic control problems involving contact and collision
  2. High-Dimensional Optimization: Neural network policy parameter optimization
  3. Parallel Computing: Scenarios with abundant parallel computing resources
  4. Exploratory Research: Complex optimization problems requiring escape from local minima

References

The paper cites 51 related references, primarily including:

  • Optimization Theory: 1 Conn et al. on derivative-free optimization, 12 Nesterov on stochastic smoothing
  • Robotics Applications: 2,3 recent sampling MPC applications, 4,5 RL successes in robotics
  • Classical Algorithms: 8 CMA-ES, 10 MPPI, 11 REINFORCE
  • Theoretical Foundations: 22 Spall's SPSA, 27 MCMC methods

Through a unified perspective of stochastic search, this paper successfully connects seemingly different optimization methods in robotics, providing not only important theoretical insights but also guiding new algorithm design. While somewhat limited in algorithmic originality, its theoretical unification value and educational significance make it an important contribution to the field.