Imitation learning provides a promising approach for learning directly from data without explicit models, simulation, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling steps until a successful action is obtained may be inefficient. This paper proposes an enhanced sampling strategy that avoids previously unsuccessful actions by improving the sampling distribution. By leveraging only successful demonstration data, the method can infer recovery actions without requiring additional exploration behaviors or high-level controllers. Furthermore, utilizing the concept of diffusion model decomposition, the primary problem that may require long-term history to manage failures is decomposed into multiple smaller, more manageable subproblems, enabling the system to adapt to variable failure counts. The method produces a low-level controller that dynamically adjusts its sampling space to improve efficiency when previous samples prove insufficient.
The core problem this research addresses is: How can robots effectively recover when actions sampled from a learned policy distribution fail?
Given a dataset $\mathcal{D}$ of $M$ successful demonstrations, the goal is to learn a diffusion policy that models the conditional distribution of actions $a$ given the state $x$.
When an action fails, the system needs to condition on the failure feature set:
where $z$ extracts key features from the $i$-th failure.
The conditional distribution is decomposed into a product of multiple simpler subproblems:
The corresponding denoising term decomposition is:
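One way to picture a product of conditional distributions in noise-prediction space is as a base estimate plus one correction term per recorded failure. The sketch below assumes a classifier-free-guidance-style sum, which is a common way to compose diffusion models; it is an assumption made for illustration and not necessarily the exact expression used in the paper. The function name `compose_noise` and the per-failure weights are hypothetical.

```python
from typing import List, Sequence


def compose_noise(
    eps_base: Sequence[float],
    eps_failures: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Add one weighted correction term per failure to a base noise estimate.

    eps_base:     noise predicted by the state-conditioned model
    eps_failures: noise predicted when also conditioning on the i-th failure
    weights:      one (hypothetical) guidance weight per failure
    """
    if len(eps_failures) != len(weights):
        raise ValueError("need exactly one weight per failure term")

    composed = list(eps_base)
    for eps_i, w_i in zip(eps_failures, weights):
        if len(eps_i) != len(eps_base):
            raise ValueError("all noise estimates must have the same dimension")
        # Each failure shifts the estimate by its weighted difference
        # from the base prediction.
        for d in range(len(composed)):
            composed[d] += w_i * (eps_i[d] - eps_base[d])
    return composed
```

Because each failure contributes its own term, the same function handles zero, one, or many failures without retraining, which is the property the decomposition is meant to provide.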
Define the recovery action set, relative to a failed pair $(a^f, x^f)$, as the actions $a$ with associated state $x$ that satisfy the following conditions:

$$\begin{cases} \|z(a,x) - z(a^f, x^f)\|_2 > \delta_z \\ \|x - x^f\|_2 < \delta_x \end{cases}$$

where $\delta_z$ defines sufficient dissimilarity in the failure feature space, and $\delta_x$ defines similarity in the state space.

#### Data Synthesis Strategy

To address the sparsity of recovery data, data synthesis is performed:

$$\mathcal{D}_s(x_s) = \{(a, x_s) | a \sim \bar{p}_{\mathcal{D}}(a|x), x \in x_s + \xi_x, \xi_x \sim \mathcal{N}(0, \sigma^2 I)\}$$

The corresponding noise estimator:

$$\bar{\varepsilon}(a, x, k) = \varepsilon_a(a, k) + w_s(\varepsilon_s(a, x, k) - \varepsilon_a(a, k))$$

#### Failure Key Features

Three practical failure feature extraction methods are proposed:

1. **Direct Use of Failed Action**: $z(a^f, x^f) = a^f$
2. **Use of Final State**: $z(a^f, x^f) = x^f_T$
3. **Action Primitives**: $z(a^f, x^f) = m$ (discrete label)

## Experimental Setup

### Experimental Tasks

The paper designs five different types of tasks to validate method effectiveness:

1. **Door Opening (DO)**: Opening a door with unknown direction (upward, sliding, pulling)
2. **Button Pressing (BP)**: Pressing a button at an unknown location within a predefined area
3. **Object Manipulation (OM)**: Selecting manipulation strategy based on object weight (single-hand, dual-hand, pushing)
4. **Object Packing (OP)**: Placing objects in designated baskets, selecting the nearest available basket when full
5. **Bartender (BT)**: Filling multiple cups, prioritizing the nearest cup

### Evaluation Metrics

1. **Task Success Rate**: Percentage of completed tasks
2. **Implicit Goal Achievement Rate**: Percentage conforming to implicit preferences in demonstration data

### Comparison Methods

1. **DP (Diffusion Policy)**: Standard diffusion policy baseline
2. **DP***: Enhanced diffusion policy using rejection sampling and region segmentation

### Experimental Configuration

- History length H: 0-2
- Prediction length L: 1-8
- Application steps p: 1-8
- Batch size: 32-1024
- Training epochs: 100
- Denoising steps: 100

## Experimental Results

### Main Results

| Task | CCDP | DP | DP* |
|------|------|----|-----|
| Door Opening | 99% | 76% | 100% |
| Button Pressing | 96% | 73% | 86% |
| Object Manipulation | 70% | 40% | 72% |
| Object Packing | 94% | 10% | 100% |
| Bartender | 100% | 27% | 100% |

### Implicit Goal Achievement Rate

| Task | CCDP | DP | DP* |
|------|------|----|-----|
| Object Manipulation | 66% | 88% | 38% |
| Object Packing | 73% | 62% | 48% |
| Bartender | 97% | 100% | 12% |

### Key Findings

1. **CCDP significantly outperforms DP in task success rate**, approaching or exceeding DP* on most tasks
2. **CCDP better preserves implicit objectives from demonstration data**, while DP* performs poorly in this regard
3. **Negative guidance strategy is more flexible than positive constraints**, allowing the system to leverage broader contextual information

### Method Comparison Analysis

- **CCDP vs DP**: CCDP significantly improves success rate by considering historical failure information
- **CCDP vs DP***:
  - DP* requires pre-classification, CCDP requires no annotation
  - DP* uses positive enforcement (restricting sampling regions), CCDP uses negative guidance (avoiding failure regions)
  - CCDP's negative guidance strategy provides greater flexibility

## Related Work

### Imitation Learning

- **Traditional Methods**: ProMP, TP-GMM and other probabilistic motion primitives
- **Modern Methods**: Implicit Behavior Cloning, diffusion policies, flow matching policies
- **Limitations**: No guarantee of single-sample success, repeated sampling is inefficient

### Guided Policy Inference

- **Parametric Conditioning Methods**: Update policy parameters based on system features
- **Hierarchical Methods**: Use high-level decision variables to control low-level policies
- **Rejection Sampling**: Discard failed samples and generate new ones

### Multi-Model Composition

- **Product of Experts (PoE)**: Decompose complex problems into simple subproblems
- **Energy Models**: Applications in high-dimensional complex distributions
- **Constrained Model Composition**: Successful applications in task and motion planning

## Conclusions and Discussion

### Main Conclusions

1. **Decomposition Strategy is Effective**: Complex failure recovery problems can be decomposed into multiple manageable subproblems
2. **Negative Guidance Outperforms Positive Constraints**: Provides greater exploration flexibility
3. **No Additional Data Required**: Failure recovery can be achieved using only successful demonstrations
4. **Modular Design**: Supports variable numbers of failure cases

### Limitations

1. **Hand-Crafted Failure Features**: Currently requires manual definition of failure key features, lacking automatic extraction mechanisms
2. **Weight Tuning Issues**: Optimal tuning strategies for combination weights remain insufficiently studied
3. **Static Failure Assumption**: Assumes failure causes remain temporally static
4. **NOT Operation Instability**: Attempted NOT operation methods exhibit stability issues

### Future Directions

1. **Automatic Feature Extraction**: Develop automatic failure feature extraction methods based on latent spaces
2. **Weight Optimization**: Research adaptive tuning strategies for combination weights
3. **Offline Exploration Mechanisms**: Integrate offline exploration mechanisms to extract more effective recovery data
4. **Dynamic Failure Handling**: Extend to scenarios with time-varying failure causes

## In-Depth Evaluation

### Strengths

1. **Strong Innovation**: First to propose diffusion policy composition method based on negative guidance
2. **High Practical Value**: Requires no additional annotation or simulation environment, using only successful demonstration data
3. **Solid Theoretical Foundation**: Based on rigorous mathematical foundations in probability theory and diffusion models
4. **Comprehensive Experiments**: Validates method effectiveness across multiple different task types
5. **Modular Design**: Decomposition strategy improves method interpretability and controllability

### Weaknesses

1. **Failure Detection Dependency**: Requires external failure detection system, increasing system complexity
2. **Feature Engineering**: Failure key features require manual design, limiting method generality
3. **Static Assumptions**: The assumption of static failure causes may not hold in certain dynamic environments
4. **Computational Overhead**: Multi-model composition may increase computational complexity during inference
5. **Hyperparameter Sensitivity**: Weight parameter selection significantly impacts performance

### Impact

1. **Academic Contribution**: Provides new theoretical framework and practical methods for robot failure recovery
2. **Practical Applications**: Broad application prospects in service robotics, industrial automation, and related fields
3. **Method Inspiration**: Negative guidance concepts can be generalized to other generative models and control problems
4. **Reproducibility**: Provides detailed implementation details and hyperparameter settings

### Applicable Scenarios

1. **Partially Constrained Environments**: Suitable for robot tasks where environmental parameters are partially unknown
2. **Interactive Tasks**: Tasks requiring strategy adjustment based on feedback
3. **Multi-Modal Tasks**: Tasks with multiple valid solutions
4. **Safety-Critical Applications**: Safety-sensitive scenarios requiring avoidance of repeated failures

## References

The paper cites 35 related references covering important works in imitation learning, diffusion models, robot control, and other domains, providing solid theoretical foundation and technical support for this research.

---

**Overall Assessment**: This is a high-quality robotics learning paper that proposes an innovative failure recovery strategy, demonstrating excellence in both theoretical contributions and practical application value. The method design is ingenious, experimental validation is comprehensive, and it makes important contributions to the field of intelligent robot control.
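For readers who want a concrete picture of two pieces of the method summarized earlier, the following is a minimal sketch of the recovery action set filter (the two thresholds $\delta_z$ and $\delta_x$) and of the blended noise estimator $\bar{\varepsilon} = \varepsilon_a + w_s(\varepsilon_s - \varepsilon_a)$. It assumes actions, states, and features are plain lists of floats and uses the failed action itself as the failure feature; all function names are hypothetical and not taken from the paper's code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]  # (action, state)


def l2_distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def action_as_feature(action: Vector, state: Vector) -> Vector:
    """Simplest feature choice from the summary: z(a, x) = a."""
    return action


def recovery_actions(
    demos: Sequence[Pair],
    failed_action: Vector,
    failed_state: Vector,
    feature: Callable[[Vector, Vector], Vector],
    delta_z: float,
    delta_x: float,
) -> List[Pair]:
    """Keep demonstration pairs that differ from the failure in feature
    space (distance > delta_z) but are close in state space (< delta_x)."""
    z_fail = feature(failed_action, failed_state)
    selected = []
    for action, state in demos:
        far_in_feature = l2_distance(feature(action, state), z_fail) > delta_z
        near_in_state = l2_distance(state, failed_state) < delta_x
        if far_in_feature and near_in_state:
            selected.append((action, state))
    return selected


def blend_noise(eps_a: Vector, eps_s: Vector, w_s: float) -> List[float]:
    """Blended estimator: eps_a + w_s * (eps_s - eps_a), element-wise."""
    return [a + w_s * (s - a) for a, s in zip(eps_a, eps_s)]


# Toy data: only the second pair is both different enough in action
# and close enough in state to count as a recovery candidate.
demos = [([0.0], [0.0]), ([1.0], [0.1]), ([1.0], [5.0])]
candidates = recovery_actions(demos, [0.0], [0.0], action_as_feature, 0.5, 1.0)
```

Swapping `action_as_feature` for a function that returns a final state or a discrete primitive label would correspond to the other two feature choices listed in the summary, although a Euclidean distance may not be the right comparison for discrete labels.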