Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling

Schoonbeek, Hung, Lehman et al.

Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at https://timschoonbeek.github.io/stormpsr .

Basic Information

Paper ID: 2510.12385
Title: Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling
Authors: Tim J. Schoonbeek, Shao-Hsuan Hung, Dan Lehman, Hans Onvlee, Jacek Kustra, Peter H.N. de With, Fons van der Sommen
Category: cs.CV (Computer Vision)
Published: October 14, 2025 (arXiv preprint)
Journal: Computer Vision and Image Understanding (accepted)
Paper link: https://arxiv.org/abs/2510.12385

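Summary

Procedure step recognition (PSR) aims to identify all correctly completed steps, and their order, in videos of procedural tasks. Existing state-of-the-art models rely solely on detecting assembly object states in individual video frames and ignore temporal features, which limits robustness and accuracy, especially when objects are partially occluded. To overcome these limitations, the paper proposes STORM-PSR (Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition), a dual-stream PSR framework that uses both spatial and temporal features. The assembly state detection stream works well when the object is unoccluded, while the spatio-temporal stream captures spatial and temporal features and can recognize step completions even under partial occlusion. Evaluated on the MECCANO and IndustReal datasets, the method reduces the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared with existing methods.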

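Research Background and Motivation

Problem Definition

Procedure step recognition (PSR) is an important computer vision task in industrial assistance scenarios. It aims to identify the correctly completed procedure steps in a video, together with the time at which each step is completed. This matters for industrial automation, quality control, and operator assistance systems.

Limitations of Existing Methods

Reliance on unobstructed views: Existing methods are mainly based on assembly state detection (ASD), which requires the object to be fully visible and unoccluded
Neglect of temporal information: Only single-frame spatial information is used, so the temporal continuity of video is not exploited
Egocentric viewpoint challenges: In egocentric videos, hands and tools frequently occlude key objects, which delays recognition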

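Research Motivation

In industrial settings, timely and accurate step recognition is essential for applications such as:

Real-time quality monitoring
Operator guidance and error prevention
Automated assembly verification

The considerable delay of existing methods under occlusion limits their practical use.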

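Core Contributions

STORM-PSR framework: Proposes the first dual-stream spatio-temporal model that directly optimizes the PSR task, rather than inferring step completions from assembly states
Novel training strategies:
- Key-frame sampling (KFS): weakly supervised pre-training of the spatial encoder
- Key-clip-aware sampling (KCAS): a novel sampling strategy for the temporal encoder
Dataset contribution: Provides PSR and ASD annotations for the MECCANO dataset and establishes performance baselines
Significant performance gains: Substantially reduces recognition delay on both datasets while maintaining or improving the other performance metrics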

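Method Details

Task Definition

Given a video input $X_t = (x_1, x_2, \cdots, x_t)$ and a set of procedure actions $P = \{p_0, \cdots, p_N\}$, the goal of PSR is to predict the set of steps completed up to time $t$:

$\hat{Y}_t = \{(\hat{a}_{\sigma(0)}, \hat{t}_{\sigma(0)}), \cdots, (\hat{a}_{\sigma(m)}, \hat{t}_{\sigma(m)})\}$

where $\hat{a}_{\sigma(i)}$ denotes a predicted action completion and $\hat{t}_{\sigma(i)}$ its predicted completion time.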

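Model Architecture

Dual-Stream Framework Design

STORM-PSR uses a dual-stream architecture:

Assembly state detection stream (S): Handles unoccluded frames and detects complete assembly states with YOLOv8-M
Spatio-temporal stream (T): Handles occluded cases and predicts step completions directly

The final prediction is an equal-weight fusion of the two streams: $\hat{y}_k = 0.5 \cdot \hat{y}_{S,k} + 0.5 \cdot \hat{y}_{T,k}$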
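
The equal-weight fusion can be sketched in a few lines of Python. This is a minimal illustration, assuming each stream outputs one completion score per step in [0, 1]; the function name `fuse_streams`, the example scores, and the 0.5 decision threshold are hypothetical and not taken from the paper.

```python
from typing import List


def fuse_streams(y_s: List[float], y_t: List[float], w_s: float = 0.5) -> List[float]:
    """Weighted average of per-step scores from the two streams.

    y_s: scores from the assembly state detection stream (S)
    y_t: scores from the spatio-temporal stream (T)
    w_s: weight of stream S; 0.5 gives the equal-weight fusion in the text
    """
    if len(y_s) != len(y_t):
        raise ValueError("both streams must score the same number of steps")
    return [w_s * s + (1.0 - w_s) * t for s, t in zip(y_s, y_t)]


# Hypothetical scores for three steps (illustrative values only).
fused = fuse_streams([0.9, 0.2, 0.0], [0.7, 0.6, 0.1])

# Assumed decision rule: a step counts as completed when its fused score exceeds 0.5.
completed = [k for k, score in enumerate(fused) if score > 0.5]
```

With equal weights, a step on which the two streams disagree (here the second step, 0.2 versus 0.6) ends up near the decision boundary, so neither stream can mark a step as completed on its own unless its score is very high.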

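Spatio-Temporal Stream Architecture

Spatial encoder: A pre-trained ViT-S model that extracts frame-level spatial features
Temporal encoder: A Transformer architecture that learns temporal dependencies
Classification head: An MLP that performs multi-label classification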

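Key Technical Innovations

1. Key-Frame Sampling (KFS)

A weakly supervised pre-training strategy that exploits the sparse step-completion annotations:

Samples frames around the step-completion timestamps
Learns robust spatial representations with a supervised contrastive loss
Can incorporate synthetic data to augment training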
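
The first point of KFS, sampling frames around completion timestamps, might look roughly like the sketch below. The window radius, the number of frames per step, and the function name `sample_key_frames` are hypothetical. The sketch also assumes that frames drawn around the same completion share that step's ID as a weak label, which could then feed a supervised contrastive loss; the exact labeling scheme is not described in this summary.

```python
import random
from typing import Dict, List, Tuple


def sample_key_frames(
    completions: Dict[int, int],
    num_frames: int,
    radius: int = 15,
    per_step: int = 4,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Return (frame_index, step_id) pairs drawn near each completion frame.

    completions: maps step_id -> frame index at which the step was completed
    num_frames:  total number of frames in the video
    radius:      half-width of the sampling window (hypothetical value)
    per_step:    frames drawn per completed step (hypothetical value)
    """
    rng = random.Random(seed)
    samples = []
    for step_id, t in sorted(completions.items()):
        # Clamp the window to the valid frame range of the video.
        lo = max(0, t - radius)
        hi = min(num_frames - 1, t + radius)
        for _ in range(per_step):
            samples.append((rng.randint(lo, hi), step_id))
    return samples


# Hypothetical video with 500 frames and two annotated step completions.
pairs = sample_key_frames({0: 120, 1: 410}, num_frames=500)
```

Only the sparse completion timestamps are needed as supervision here, which is what makes the pre-training weakly supervised.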

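2. Key-Clip-Aware Sampling (KCAS)

A sampling strategy based on a bimodal distribution: $p_i(x) = \sum_{t_j \in T} [g(x | t_j - \delta, \sigma) + g(x | t_j + \delta, \sigma)]$

Oversamples clips just before and after step completions
Undersamples ambiguous moments and background clips
Provides more positive samples and hard negative samples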
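
The bimodal density can be turned into clip sampling weights as sketched below. The sketch assumes that $g$ is a Gaussian density, that $T$ is the set of step-completion positions, and that $x$ indexes clip positions along the video; the values of `delta` and `sigma` and the function names are hypothetical.

```python
import math
import random
from typing import List


def gaussian(x: float, mu: float, sigma: float) -> float:
    """Gaussian probability density g(x | mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def kcas_weights(num_clips: int, completions: List[float], delta: float, sigma: float) -> List[float]:
    """Normalized sampling weight for each clip position 0..num_clips-1."""
    raw = [
        sum(gaussian(x, t - delta, sigma) + gaussian(x, t + delta, sigma) for t in completions)
        for x in range(num_clips)
    ]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no completion timestamps, or all weights underflowed")
    return [w / total for w in raw]


# Hypothetical setup: 100 clip positions, one completion at position 50.
weights = kcas_weights(num_clips=100, completions=[50.0], delta=5.0, sigma=2.0)

# Draw 8 clip positions according to the bimodal weights.
rng = random.Random(0)
sampled = rng.choices(range(100), weights=weights, k=8)
```

Under these assumptions, the weight peaks at positions 45 and 55, dips at the completion moment itself (position 50), and is close to zero far from any completion, which matches the oversampling and undersampling behavior listed above.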

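Experimental Setup

Datasets

IndustReal: 26.9K annotated frames, with synthetic data support
MECCANO: 13.6K newly annotated frames, with more challenging occlusion scenarios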

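Evaluation Metrics

Procedure order similarity (POS): Accuracy of the step order, based on edit distance
F1 score: Harmonic mean of precision and recall
Average delay (τ): Time difference between the actual completion of a step and its recognition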
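
To make the three metrics concrete, the sketch below computes simplified versions of them for one video. It is an illustration under assumptions, not the benchmark's evaluation code: the order similarity uses a plain Levenshtein distance normalized by the ground-truth length, and the delay is averaged over steps present in both prediction and ground truth. The exact POS normalization and delay conventions used in the paper may differ.

```python
from typing import Dict, List


def edit_distance(a: List[int], b: List[int]) -> int:
    """Plain Levenshtein distance between two step sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def order_similarity(pred: List[int], true: List[int]) -> float:
    """Assumed normalization: 1 - distance / len(true), clipped at 0."""
    return max(0.0, 1.0 - edit_distance(pred, true) / len(true))


def f1_score(pred: List[int], true: List[int]) -> float:
    """Harmonic mean of precision and recall over recognized step IDs."""
    tp = len(set(pred) & set(true))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(set(pred)), tp / len(set(true))
    return 2 * precision * recall / (precision + recall)


def average_delay(pred_times: Dict[int, float], true_times: Dict[int, float]) -> float:
    """Mean of (predicted - actual) completion time over steps found in both."""
    common = [s for s in pred_times if s in true_times]
    if not common:
        raise ValueError("no correctly recognized steps")
    return sum(pred_times[s] - true_times[s] for s in common) / len(common)


# Hypothetical example: four true steps, three predicted, two of them swapped.
pos = order_similarity([0, 2, 1], [0, 1, 2, 3])
f1 = f1_score([0, 2, 1], [0, 1, 2, 3])
tau = average_delay({0: 12.0, 1: 30.0, 2: 25.0}, {0: 10.0, 1: 22.0, 2: 24.0, 3: 40.0})
```

In this example the order similarity is 0.5 (edit distance 2 over 4 true steps), the F1 score is about 0.857, and the average delay is about 3.67 time units.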

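Implementation Details

Spatial encoder: ViT-S pre-trained on ImageNet-21K
Temporal encoder: 6 self-attention layers with 8 attention heads
Optimizer: SGD with a learning rate of 10^-3 and a cosine annealing schedule
Input resolution: 224×224 pixels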
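
For readers unfamiliar with cosine annealing, the sketch below shows the standard schedule in plain Python, starting from the stated learning rate of 10^-3. The total number of steps and the minimum learning rate of 0 are assumptions for illustration; the summary does not state them.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Standard cosine annealing from base_lr down to min_lr over total_steps."""
    # Clamp the step so the schedule stays flat outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 100-step run.
schedule = [cosine_annealed_lr(s, total_steps=100) for s in (0, 50, 100)]
```

The rate starts at the base value, reaches half of it at the midpoint, and decays smoothly to the minimum at the end of training.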

Experimental Results

Main Results

Method	IndustReal POS↑	IndustReal F1↑	IndustReal τ↓	MECCANO POS↑	MECCANO F1↑	MECCANO τ↓
IndustReal baseline	0.797	0.891	21.0	0.354	0.545	99.8
Spatio-temporal stream only	0.497	0.506	14.2	0.206	0.247	120.3
STORM-PSR	0.812	0.901	15.5	0.377	0.497	88.6
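
As a quick check, the relative delay reductions quoted in the abstract can be recomputed from the τ values in the table. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# Average delay values (tau) copied from the results table:
# IndustReal baseline versus STORM-PSR on each dataset.
meccano = relative_reduction(99.8, 88.6)
industreal = relative_reduction(21.0, 15.5)
```

With the rounded table values this gives roughly 11.2% on MECCANO and 26.2% on IndustReal, in line with the reported 11.2% and 26.1%. The small difference on IndustReal presumably comes from rounding in the table.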