CCDP: Composition of Conditional Diffusion Policies with Guided Sampling

Razmjoo, Calinon, Gienger et al.
Imitation Learning offers a promising approach to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful demonstrations, our method can infer recovery actions without the need for additional exploratory behavior or a high-level controller. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem, which may require long-horizon history to manage failures, into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines.

Basic Information

  • Paper ID: 2503.15386
  • Title: CCDP: Composition of Conditional Diffusion Policies with Guided Sampling
  • Authors: Amirreza Razmjoo (Honda Research Institute Europe & Idiap Research Institute & EPFL), Sylvain Calinon (Idiap Research Institute & EPFL), Michael Gienger (Honda Research Institute Europe), Fan Zhang (Honda Research Institute Europe)
  • Classification: cs.RO (Robotics), cs.AI (Artificial Intelligence)
  • Publication Date: October 10, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2503.15386

Abstract

Imitation learning provides a promising approach for learning directly from data without explicit models, simulation, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling steps until a successful action is obtained may be inefficient. This paper proposes an enhanced sampling strategy that avoids previously unsuccessful actions by improving the sampling distribution. By leveraging only successful demonstration data, the method can infer recovery actions without requiring additional exploration behaviors or high-level controllers. Furthermore, utilizing the concept of diffusion model decomposition, the primary problem that may require long-term history to manage failures is decomposed into multiple smaller, more manageable subproblems, enabling the system to adapt to variable failure counts. The method produces a low-level controller that dynamically adjusts its sampling space to improve efficiency when previous samples prove insufficient.

Research Background and Motivation

Problem Definition

The core problem this research addresses is: How can robots effectively recover when actions sampled from a learned policy distribution fail?

Problem Significance

  1. Practical Application Requirements: In real-world environments, robots frequently encounter partially constrained or uncertain situations, such as searching for a bedside lamp switch or pushing a door whose opening direction is unknown
  2. Efficiency Issues: Traditional methods simply resample from the same distribution, ignoring information about known failure regions, leading to inefficiency
  3. Practical Limitations: Existing failure recovery methods typically require additional resources (simulation environments, high-level reasoning models, expert guidance) that may be unavailable in practical applications

Limitations of Existing Methods

  1. Two-Level Planning Approaches:
    • High-level planners select action primitives, low-level controllers execute them
    • Subject to suboptimal results and combinatorial explosion problems
    • Decision-making becomes computationally expensive as options increase
  2. Robust Policy Learning:
    • Methods similar to robust reinforcement learning
    • Can only handle certain types of failures (e.g., environmental parameter changes)
    • For broader failure types (e.g., button searching), a single robust policy may not exist
  3. History-Aware Policies:
    • Require failure data for training, increasing data collection complexity
    • Require long-term memory, resulting in high computational complexity

Core Contributions

  1. Proposed a Decomposed Diffusion Policy Framework: Enhanced modularity and controllability of diffusion policies and analyzed the impact of each module
  2. Designed a Recovery Strategy Based on Negative Guidance: Unlike traditional methods, uses failure cases as negative guidance to steer the policy away from failure regions
  3. Achieved Failure Recovery Without Data Annotation: Uses only successful demonstration data, identifying recovery actions through offline analysis
  4. Validated Method Effectiveness: Comprehensive comparison with state-of-the-art baselines on multiple tasks

Method Details

Task Definition

Given a dataset of M successful demonstrations $\mathcal{D} = \{(a_t, x_t, h^H_t)_i\}_{i=1}^M$, the goal is to learn a diffusion policy that models the conditional distribution $p_\pi^{\mathcal{D}}(a_t | x_t, h^H_t)$, where:

  • $a_t \in \mathbb{R}^{d_u}$: action at time t
  • $x_t \in \mathbb{R}^{d_s}$: state
  • $h^H_t = [a_{t-H:t-1}^T, x_{t-H:t-1}^T]^T$: history of H previous actions and states

When an action fails, the system needs to condition on the failure feature set: $a_t \sim p_\pi(a_t | x_t, h^H_t, z^f_{1:N})$

where $z^f_i = z(a^f_i, x^f_i)$ extracts key features from the i-th failure.
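
As a rough illustration of the conditioning inputs defined above, the sketch below assembles the history vector $h^H_t$ and a list of failure features $z^f_{1:N}$ with NumPy. The function names `build_history` and `failed_action_feature`, the array layout (one row per time step), and the toy numbers are assumptions made for this example. The feature function uses the failed action itself, which is one of the feature choices this summary lists under Failure Key Features.

```python
import numpy as np

def build_history(actions, states, t, H):
    """Stack the H actions and H states before time t into one vector h_t^H.

    actions: array of shape (T, d_u); states: array of shape (T, d_s).
    Returns a vector of length H * (d_u + d_s): past actions first, then past states.
    """
    if H == 0:
        return np.zeros(0)  # no history conditioning
    if t < H:
        raise ValueError("need at least H past steps to build the history")
    past_actions = actions[t - H:t].reshape(-1)
    past_states = states[t - H:t].reshape(-1)
    return np.concatenate([past_actions, past_states])

def failed_action_feature(a_fail, x_fail):
    """Hypothetical feature function z(a^f, x^f): return the failed action itself."""
    return np.asarray(a_fail, dtype=float)

# Toy trajectory: 5 steps, d_u = 2, d_s = 3 (made-up numbers).
actions = np.arange(10, dtype=float).reshape(5, 2)
states = -np.arange(15, dtype=float).reshape(5, 3)

h = build_history(actions, states, t=4, H=2)

# One recorded failure (a^f, x^f); the list grows by one entry per failed attempt.
failures = [(actions[4], states[4])]
z_f = [failed_action_feature(a, x) for a, x in failures]
print(h.shape, len(z_f))
```

Keeping the failure features in a plain list reflects that N is not fixed in advance: each failed attempt appends one more feature.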

Model Architecture

Diffusion Model Decomposition

The conditional distribution is decomposed into a product of multiple simpler subproblems:

$$p_\pi(a_t | x_t, h^H_t, z^f_{1:N}) \propto p_a(a_t) \cdot \frac{p_s(a_t | x_t)}{p_a(a_t)} \cdot \frac{p_h(a_t | h^H_t)}{p_a(a_t)} \cdot \prod_{i=1}^N \frac{p_z(a_t | z^f_i)}{p_a(a_t)}$$
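
One way to motivate a factorization of this kind is Bayes' rule, under the assumption (made here for illustration; this summary does not state it) that the state, the history, and the failure features are conditionally independent given the action. Writing $c$ for any single condition among $x_t$, $h^H_t$, $z^f_1, \dots, z^f_N$:

$$\begin{aligned}
p(a_t \mid x_t, h^H_t, z^f_{1:N}) &\propto p_a(a_t)\, p(x_t \mid a_t)\, p(h^H_t \mid a_t) \prod_{i=1}^{N} p(z^f_i \mid a_t), \\
p(c \mid a_t) &= \frac{p(a_t \mid c)\, p(c)}{p_a(a_t)} \propto \frac{p(a_t \mid c)}{p_a(a_t)} \quad \text{(as a function of } a_t\text{)}.
\end{aligned}$$

Substituting the second line into the first yields one ratio per condition on top of the prior $p_a(a_t)$. Because each condition enters as its own factor, recording another failure only adds one more factor, which is what allows the composition to handle a variable number of failures.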

The corresponding denoising term decomposition is:

$$\hat{\varepsilon}(a^k_t, k) = \varepsilon_a(a_t, k) + w_s(\varepsilon_s(a_t, x_t, k) - \varepsilon_a(a_t, k)) + w_h(\varepsilon_h(a_t, h^H_t, k) - \varepsilon_a(a_t, k)) + \sum_{i=1}^N w^i_z(\varepsilon_z(a_t, z^f_i, k) - \varepsilon_a(a_t, k))$$
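
The weighted sum above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `compose_noise`, the default weights of 1.0, and the toy arrays standing in for network outputs are all assumptions.

```python
import numpy as np

def compose_noise(eps_a, eps_s, eps_h, eps_z_list, w_s=1.0, w_h=1.0, w_z=None):
    """Combine per-module noise predictions into one estimate.

    eps_a: unconditional prediction; eps_s, eps_h: state- and history-conditioned
    predictions; eps_z_list: one prediction per recorded failure (may be empty).
    The weights are hypothetical defaults, not values from the paper.
    """
    if w_z is None:
        w_z = [1.0] * len(eps_z_list)
    if len(w_z) != len(eps_z_list):
        raise ValueError("need exactly one weight per failure term")
    eps_hat = eps_a + w_s * (eps_s - eps_a) + w_h * (eps_h - eps_a)
    for w_i, eps_z in zip(w_z, eps_z_list):
        eps_hat = eps_hat + w_i * (eps_z - eps_a)  # one extra term per failure
    return eps_hat

# Toy 2-D predictions standing in for network outputs (made-up numbers).
eps_a = np.zeros(2)
eps_s = np.array([1.0, 0.0])
eps_h = np.array([0.0, 1.0])
eps_z = np.array([2.0, 2.0])

no_failure = compose_noise(eps_a, eps_s, eps_h, [], w_s=1.0, w_h=0.5)
one_failure = compose_noise(eps_a, eps_s, eps_h, [eps_z], w_s=1.0, w_h=0.5, w_z=[0.25])
print(no_failure, one_failure)
```

With an empty failure list the function reduces to the state and history terms alone, so the same code path serves the first attempt and every retry.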

Module Functions

  1. $\varepsilon_a(a_t, k)$: Encourages sampling actions similar to demonstrations
  2. $\varepsilon_s(a_t, x_t, k)$: Guides actions to match current state
  3. $\varepsilon_h(a_t, h^H_t, k)$: Promotes temporal continuity
  4. $\varepsilon_z(a_t, z^f_i, k)$: Negative guidance, steering away from failure regions
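
To give a feel for how these terms interact during sampling, the toy below runs a one-dimensional reverse diffusion in which every module is a Gaussian with a closed-form noise prediction, so no networks are needed. Everything specific here is an assumption for illustration: the Gaussian means and standard deviations, the linear $\bar{\alpha}$ schedule, the deterministic DDIM-style update, and the omission of the history term. In the paper's setting the predictions come from learned models.

```python
import numpy as np

def gaussian_eps(a_k, alpha_bar, mean, std):
    """Exact noise prediction when the clean actions follow N(mean, std^2)."""
    var_k = alpha_bar * std**2 + (1.0 - alpha_bar)  # variance of the noised actions
    return np.sqrt(1.0 - alpha_bar) * (a_k - np.sqrt(alpha_bar) * mean) / var_k

def sample(w_s, w_z, n=512, steps=50, seed=0):
    """Deterministic DDIM-style sampling with composed noise predictions."""
    rng = np.random.default_rng(seed)
    alpha_bar = np.linspace(1.0, 1e-4, steps + 1)  # index 0 is clean, index `steps` is near pure noise
    a = rng.standard_normal(n)
    for k in range(steps, 0, -1):
        ab, ab_prev = alpha_bar[k], alpha_bar[k - 1]
        eps_a = gaussian_eps(a, ab, mean=0.0, std=1.0)  # all demonstrated actions
        eps_s = gaussian_eps(a, ab, mean=1.0, std=0.3)  # actions for the current state
        eps_z = gaussian_eps(a, ab, mean=2.0, std=1.0)  # made-up recovery model favouring larger actions
        eps = eps_a + w_s * (eps_s - eps_a) + w_z * (eps_z - eps_a)
        a0 = (a - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)  # predicted clean action
        a = np.sqrt(ab_prev) * a0 + np.sqrt(1.0 - ab_prev) * eps
    return a

baseline = sample(w_s=1.0, w_z=0.0)  # first attempt: no failure recorded yet
guided = sample(w_s=1.0, w_z=1.0)    # retry: the recovery term is switched on
print(baseline.mean(), guided.mean())
```

Both runs start from the same noise, so the difference between them comes only from the added failure term: the state-conditioned mode near 1.0 is kept, but every sample is shifted toward the region the recovery model prefers.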

Recovery Model Design

Recovery Action Definition

Define the recovery action set:

$$\begin{cases} \|z(a,x) - z(a^f, x^f)\|_2 > \delta_z \\ \|x - x^f\|_2 < \delta_x \end{cases}$$

That is, a pair $(a, x)$ qualifies as a recovery action for the failure $(a^f, x^f)$ when both conditions hold, where $\delta_z$ defines sufficient dissimilarity in the failure feature space, and $\delta_x$ defines similarity in the state space.

#### Data Synthesis Strategy

To address the sparsity of recovery data, data synthesis is performed:

$$\mathcal{D}_s(x_s) = \{(a, x_s) | a \sim \bar{p}_{\mathcal{D}}(a|x), x \in x_s + \xi_x, \xi_x \sim \mathcal{N}(0, \sigma^2 I)\}$$

The corresponding noise estimator:

$$\bar{\varepsilon}(a, x, k) = \varepsilon_a(a, k) + w_s(\varepsilon_s(a, x, k) - \varepsilon_a(a, k))$$

#### Failure Key Features

Three practical failure feature extraction methods are proposed:

1. **Direct Use of Failed Action**: $z(a^f, x^f) = a^f$
2. **Use of Final State**: $z(a^f, x^f) = x^f_T$
3. **Action Primitives**: $z(a^f, x^f) = m$ (discrete label)

## Experimental Setup

### Experimental Tasks

The paper designs five different types of tasks to validate method effectiveness:

1. **Door Opening (DO)**: Opening a door with unknown direction (upward, sliding, pulling)
2. **Button Pressing (BP)**: Pressing a button at an unknown location within a predefined area
3. **Object Manipulation (OM)**: Selecting manipulation strategy based on object weight (single-hand, dual-hand, pushing)
4. **Object Packing (OP)**: Placing objects in designated baskets, selecting the nearest available basket when full
5. **Bartender (BT)**: Filling multiple cups, prioritizing the nearest cup

### Evaluation Metrics

1. **Task Success Rate**: Percentage of completed tasks
2. **Implicit Goal Achievement Rate**: Percentage conforming to implicit preferences in demonstration data

### Comparison Methods

1. **DP (Diffusion Policy)**: Standard diffusion policy baseline
2. **DP***: Enhanced diffusion policy using rejection sampling and region segmentation

### Experimental Configuration

- History length H: 0-2
- Prediction length L: 1-8
- Application steps p: 1-8
- Batch size: 32-1024
- Training epochs: 100
- Denoising steps: 100

## Experimental Results

### Main Results

Task success rate:

| Task | CCDP | DP | DP* |
|------|------|----|-----|
| Door Opening | 99% | 76% | 100% |
| Button Pressing | 96% | 73% | 86% |
| Object Manipulation | 70% | 40% | 72% |
| Object Packing | 94% | 10% | 100% |
| Bartender | 100% | 27% | 100% |

### Implicit Goal Achievement Rate

| Task | CCDP | DP | DP* |
|------|------|----|-----|
| Object Manipulation | 66% | 88% | 38% |
| Object Packing | 73% | 62% | 48% |
| Bartender | 97% | 100% | 12% |

### Key Findings

1. **CCDP significantly outperforms DP in task success rate** on all five tasks and stays within 6 percentage points of DP*, exceeding it on Button Pressing (96% vs. 86%)
2. **CCDP preserves implicit objectives from demonstration data better than DP*** on all three tasks (for example, 97% vs. 12% on Bartender), although plain DP scores higher than CCDP on Object Manipulation and Bartender
3. **Negative guidance strategy is more flexible than positive constraints**, allowing the system to leverage broader contextual information

### Method Comparison Analysis

- **CCDP vs DP**: CCDP significantly improves success rate by considering historical failure information
- **CCDP vs DP***:
  - DP* requires pre-classification, CCDP requires no annotation
  - DP* uses positive enforcement (restricting sampling regions), CCDP uses negative guidance (avoiding failure regions)
  - CCDP's negative guidance strategy provides greater flexibility

## Related Work

### Imitation Learning

- **Traditional Methods**: ProMP, TP-GMM and other probabilistic motion primitives
- **Modern Methods**: Implicit Behavior Cloning, diffusion policies, flow matching policies
- **Limitations**: No guarantee of single-sample success, repeated sampling is inefficient

### Guided Policy Inference

- **Parametric Conditioning Methods**: Update policy parameters based on system features
- **Hierarchical Methods**: Use high-level decision variables to control low-level policies
- **Rejection Sampling**: Discard failed samples and generate new ones

### Multi-Model Composition

- **Product of Experts (PoE)**: Decompose complex problems into simple subproblems
- **Energy-Based Models**: Applications in high-dimensional complex distributions
- **Constrained Model Composition**: Successful applications in task and motion planning

## Conclusions and Discussion

### Main Conclusions

1. **Decomposition Strategy is Effective**: Complex failure recovery problems can be decomposed into multiple manageable subproblems
2. **Negative Guidance Outperforms Positive Constraints**: Provides greater exploration flexibility
3. **No Additional Data Required**: Failure recovery can be achieved using only successful demonstrations
4. **Modular Design**: Supports variable numbers of failure cases

### Limitations

1. **Hand-Crafted Failure Features**: Currently requires manual definition of failure key features, lacking automatic extraction mechanisms
2. **Weight Tuning Issues**: Optimal tuning strategies for combination weights remain insufficiently studied
3. **Static Failure Assumption**: Assumes failure causes remain temporally static
4. **NOT Operation Instability**: The NOT-operation methods that were attempted exhibit stability issues

### Future Directions

1. **Automatic Feature Extraction**: Develop automatic failure feature extraction methods based on latent spaces
2. **Weight Optimization**: Research adaptive tuning strategies for combination weights
3. **Offline Exploration Mechanisms**: Integrate offline exploration mechanisms to extract more effective recovery data
4. **Dynamic Failure Handling**: Extend to scenarios with time-varying failure causes

## In-Depth Evaluation

### Strengths

1. **Strong Innovation**: First to propose a diffusion policy composition method based on negative guidance
2. **High Practical Value**: Requires no additional annotation or simulation environment, using only successful demonstration data
3. **Solid Theoretical Foundation**: Builds on probability theory and diffusion models
4. **Comprehensive Experiments**: Validates method effectiveness across multiple different task types
5. **Modular Design**: Decomposition strategy improves method interpretability and controllability

### Weaknesses

1. **Failure Detection Dependency**: Requires external failure detection system, increasing system complexity
2. **Feature Engineering**: Failure key features require manual design, limiting method generality
3. **Static Assumptions**: The assumption of static failure causes may not hold in certain dynamic environments
4. **Computational Overhead**: Multi-model composition may increase computational complexity during inference
5. **Hyperparameter Sensitivity**: Weight parameter selection significantly impacts performance

### Impact

1. **Academic Contribution**: Provides new theoretical framework and practical methods for robot failure recovery
2. **Practical Applications**: Broad application prospects in service robotics, industrial automation, and related fields
3. **Method Inspiration**: Negative guidance concepts can be generalized to other generative models and control problems
4. **Reproducibility**: Provides implementation details and hyperparameter settings

### Applicable Scenarios

1. **Partially Constrained Environments**: Suitable for robot tasks where environmental parameters are partially unknown
2. **Interactive Tasks**: Tasks requiring strategy adjustment based on feedback
3. **Multi-Modal Tasks**: Tasks with multiple valid solutions
4. **Safety-Critical Applications**: Safety-sensitive scenarios requiring avoidance of repeated failures

## References

The paper cites 35 related references covering important works in imitation learning, diffusion models, robot control, and other domains, providing solid theoretical foundation and technical support for this research.

---

**Overall Assessment**: This is a high-quality robotics learning paper that proposes an innovative failure recovery strategy, demonstrating excellence in both theoretical contributions and practical application value. The method design is ingenious, experimental validation is comprehensive, and it makes important contributions to the field of intelligent robot control.