2025-11-11T09:10:09.674062

CCDP: Composition of Conditional Diffusion Policies with Guided Sampling

Razmjoo, Calinon, Gienger et al.

Imitation Learning offers a promising approach to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful demonstrations, our method can infer recovery actions without the need for additional exploratory behavior or a high-level controller. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem, which may require long-horizon history to manage failures, into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines.

academic

CCDP: Composition of Conditional Diffusion Policies with Guided Sampling

Basic Information

Paper ID: 2503.15386
Title: CCDP: Composition of Conditional Diffusion Policies with Guided Sampling
Authors: Amirreza Razmjoo (Honda Research Institute Europe & Idiap Research Institute & EPFL), Sylvain Calinon (Idiap Research Institute & EPFL), Michael Gienger (Honda Research Institute Europe), Fan Zhang (Honda Research Institute Europe)
Categories: cs.RO (Robotics), cs.AI (Artificial Intelligence)
Published: October 10, 2025 (arXiv v2)
Paper link: https://arxiv.org/abs/2503.15386

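Abstract

Imitation learning is a promising way to learn directly from data without explicit models, simulations, or detailed task definitions. At inference time, actions are sampled from the learned distribution and executed on the robot. Sampled actions can fail for many reasons, however, and simply repeating the sampling step until one succeeds can be inefficient. The paper proposes an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. Using only data from successful demonstrations, the method infers recovery actions without additional exploratory behavior or a high-level controller. It also uses diffusion model decomposition to split the primary problem, which may require a long-horizon history to manage failures, into several smaller, more manageable sub-problems, so the system can adapt to a variable number of failures. The result is a low-level controller that dynamically adjusts its sampling space to improve efficiency when earlier samples fall short.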

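Research Background and Motivation

Problem Definition

The core question this work addresses is: when an action sampled from the learned policy distribution fails on the robot, how can the system recover efficiently?

Why the Problem Matters

Practical need: in real environments, robots often face partially constrained or uncertain situations, such as groping for a bedside lamp switch or opening a door whose opening direction is unknown.
Efficiency: conventional approaches simply resample from the same distribution and ignore what is already known about the failed regions, which wastes attempts.
Practicality: existing failure-recovery methods usually need extra resources (a simulation environment, a high-level reasoning model, or expert guidance) that may be unavailable in real deployments.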

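Limitations of Existing Methods

Two-level planning methods:
- A high-level planner selects action primitives and a low-level controller executes them
- Prone to suboptimal results and combinatorial explosion
- Decision-making becomes computationally expensive as the number of options grows
Robust policy learning:
- Similar in spirit to robust reinforcement learning
- Handles only some failure types (e.g., changes in environment parameters)
- For broader failure types (e.g., button searching), a single robust policy may not exist
History-aware policies:
- Require failure data for training, which complicates data collection
- Require a long-horizon history, which is computationally costly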

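Core Contributions

A decomposed diffusion policy framework: improves the modularity and controllability of diffusion policies and analyzes the effect of each module.
A recovery strategy based on negative guidance: unlike conventional methods, failure cases serve as negative guidance that steers the policy away from failed regions.
Failure recovery without data annotation: only successful demonstrations are used, and recovery actions are identified through offline analysis.
Validation: comprehensive comparison against state-of-the-art baselines on multiple tasks.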

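Method Details

Task Definition

Given a dataset of M successful demonstrations $\mathcal{D} = \{(a_t, x_t, h^H_t)_i\}_{i=1}^M$, the goal is to learn a diffusion policy that models the conditional distribution $p_\pi^{\mathcal{D}}(a_t | x_t, h^H_t)$, where:

$a_t \in \mathbb{R}^{d_u}$: the action at time t
$x_t \in \mathbb{R}^{d_s}$: the state
$h^H_t = [a_{t-H:t-1}^T, x_{t-H:t-1}^T]^T$: the history of the previous H actions and states

When actions fail, the policy must additionally be conditioned on the set of failure features: $a_t \sim p_\pi(a_t | x_t, h^H_t, z^f_{1:N})$

where $z^f_i = z(a^f_i, x^f_i)$ extracts the key features of the i-th failure.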
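To make the history term $h^H_t$ concrete, the following is a minimal sketch of stacking the previous H actions and states into one flat vector. The function name `build_history`, the list-based trajectory layout, and the choice to raise an error when fewer than H past steps exist are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def build_history(actions: Sequence[Sequence[float]],
                  states: Sequence[Sequence[float]],
                  t: int, H: int) -> List[float]:
    """Stack a_{t-H:t-1} followed by x_{t-H:t-1} into one flat vector h_t^H."""
    if t < H:
        # Assumption: no padding scheme is defined here, so short histories are rejected.
        raise ValueError("need at least H past steps before time t")
    past_actions = [v for a in actions[t - H:t] for v in a]
    past_states = [v for x in states[t - H:t] for v in x]
    return past_actions + past_states


# Toy trajectory with d_u = 1 and d_s = 2 (hypothetical values).
actions = [[0.0], [0.1], [0.2], [0.3]]
states = [[1.0, 1.0], [1.1, 1.0], [1.2, 1.0], [1.3, 1.0]]
h = build_history(actions, states, t=3, H=2)
print(h)  # [0.1, 0.2, 1.1, 1.0, 1.2, 1.0]
```

The resulting vector has length H * (d_u + d_s), here 2 * (1 + 2) = 6.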

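Model Architecture

Diffusion Model Decomposition

The conditional distribution is decomposed into a product of simpler sub-problems, each expressed as a ratio to the unconditional action prior $p_a(a_t)$:

$p_\pi(a_t | x_t, h^H_t, z^f_{1:N}) \propto p_a(a_t) \cdot \frac{p_s(a_t | x_t)}{p_a(a_t)} \cdot \frac{p_h(a_t | h^H_t)}{p_a(a_t)} \cdot \prod_{i=1}^N \frac{p_z(a_t | z^f_i)}{p_a(a_t)}$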
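One way to see where a product of ratios of this kind comes from is a short Bayes-rule argument. It assumes that $x_t$, $h^H_t$, and each $z^f_i$ are conditionally independent given $a_t$, an assumption that is common in compositional diffusion models but is not stated explicitly in this summary:

$$
\begin{aligned}
p_\pi(a_t \mid x_t, h^H_t, z^f_{1:N})
&\propto p_a(a_t)\, p(x_t \mid a_t)\, p(h^H_t \mid a_t) \prod_{i=1}^N p(z^f_i \mid a_t) \\
&\propto p_a(a_t)\, \frac{p_s(a_t \mid x_t)}{p_a(a_t)}\, \frac{p_h(a_t \mid h^H_t)}{p_a(a_t)} \prod_{i=1}^N \frac{p_z(a_t \mid z^f_i)}{p_a(a_t)}
\end{aligned}
$$

The second line applies $p(c \mid a_t) = p(a_t \mid c)\, p(c) / p_a(a_t)$ to each condition $c$ and drops $p(c)$, which does not depend on $a_t$. Taking the gradient of the logarithm turns the product into a sum of score differences around the unconditional score, which is the shape a weighted sum of noise predictions takes.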

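The corresponding noise-prediction term, evaluated on the noisy action $a^k_t$ at denoising step $k$, decomposes as: $\hat{\varepsilon}(a^k_t, k) = \varepsilon_a(a^k_t, k) + w_s(\varepsilon_s(a^k_t, x_t, k) - \varepsilon_a(a^k_t, k)) + w_h(\varepsilon_h(a^k_t, h^H_t, k) - \varepsilon_a(a^k_t, k)) + \sum_{i=1}^N w^i_z(\varepsilon_z(a^k_t, z^f_i, k) - \varepsilon_a(a^k_t, k))$, where $w_s$, $w_h$, and $w^i_z$ weight the state, history, and failure terms.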
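The weighted combination can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the name `compose_noise` is hypothetical, each noise model is a plain callable with its conditioning input ($x_t$, $h^H_t$, or $z^f_i$) already bound inside it, and the constant-output lambdas stand in for trained networks. The negative failure weight in the example is also an assumption, chosen to illustrate steering away from a failure.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Each noise model maps (noisy action a_k, denoising step k) to a noise estimate.
# Assumption: conditioning inputs are captured inside the callable.
NoiseFn = Callable[[Vector, int], Vector]


def compose_noise(a_k: Vector, k: int, eps_a: NoiseFn, eps_s: NoiseFn,
                  eps_h: NoiseFn, eps_z: Sequence[NoiseFn],
                  w_s: float, w_h: float, w_z: Sequence[float]) -> Vector:
    """Return eps_a + sum over models of w * (eps_model - eps_a)."""
    if len(eps_z) != len(w_z):
        raise ValueError("one weight per failure model is required")
    base = eps_a(a_k, k)  # unconditional noise estimate
    combined = list(base)
    terms = [(eps_s, w_s), (eps_h, w_h)] + list(zip(eps_z, w_z))
    for model, weight in terms:
        estimate = model(a_k, k)
        for j in range(len(combined)):
            combined[j] += weight * (estimate[j] - base[j])
    return combined


# Toy stand-ins for trained networks (hypothetical constant outputs).
eps_hat = compose_noise(
    a_k=[0.0, 0.0], k=10,
    eps_a=lambda a, k: [0.0, 0.0],
    eps_s=lambda a, k: [1.0, 0.0],
    eps_h=lambda a, k: [0.0, 2.0],
    eps_z=[lambda a, k: [1.0, 1.0]],
    w_s=1.0, w_h=0.5, w_z=[-0.5],
)
print(eps_hat)  # [0.5, 0.5]
```

Because the failure models enter through a list, the same function handles any number N of recorded failures, including none, in which case only the state and history terms adjust the unconditional estimate.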