Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding

Yang, Jiang, Zhou et al.

Action understanding, encompassing action detection and anticipation, plays a crucial role in numerous practical applications. However, untrimmed videos are often characterized by substantial redundant information and noise. Moreover, in modeling action understanding, the influence of the agent's intention on the action is often overlooked. Motivated by these issues, we propose a novel framework called the State-Specific Model (SSM), designed to unify and enhance both action detection and anticipation tasks. In the proposed framework, the Critical State-Based Memory Compression module compresses frame sequences into critical states, reducing information redundancy. The Action Pattern Learning module constructs a state-transition graph with multi-dimensional edges to model action dynamics in complex scenarios, on the basis of which potential future cues can be generated to represent intention. Furthermore, our Cross-Temporal Interaction module models the mutual influence between intentions and past as well as current information through cross-temporal interactions, thereby refining present and future features and ultimately realizing simultaneous action detection and anticipation. Extensive experiments on multiple benchmark datasets -- including EPIC-Kitchens-100, THUMOS'14, TVSeries, and the introduced Parkinson's Disease Mouse Behaviour (PDMB) dataset -- demonstrate the superior performance of our proposed framework compared to other state-of-the-art approaches. These results highlight the importance of action dynamics learning and cross-temporal interactions, laying a foundation for future action understanding research.

Basic Information

Paper ID: 2510.10682
Title: Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding
Authors: Xinyu Yang, Zheheng Jiang, Feixiang Zhou, Yihang Zhu, Na Lv, Nan Xing, Huiyu Zhou
Category: cs.CV (Computer Vision)
Published: October 12, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.10682

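Summary

Action understanding, which encompasses action detection and anticipation, plays a key role in many practical applications. However, untrimmed videos typically contain a large amount of redundant information and noise. In addition, the influence of the agent's intention on the action is often overlooked when modeling action understanding. To address these issues, the paper proposes a new framework, the State-Specific Model (SSM), that aims to unify and enhance action detection and anticipation. The framework consists of a Critical State-Based Memory Compression module, an Action Pattern Learning module, and a Cross-Temporal Interaction module. It models action dynamics with a state-transition graph, generates potential future cues that represent intention, and performs action detection and anticipation simultaneously through cross-temporal interaction.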

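Research Background and Motivation

Core Problems

Information redundancy: untrimmed videos contain many background frames and much noise, and this redundant information interferes with the model's learning of key action patterns
Missing intention modeling: existing methods focus mainly on how historical information affects current and future actions, and overlook the guiding role that the agent's intention plays in carrying out an action
Task separation: action detection and anticipation are usually handled separately, so the complementarity between the two is not fully exploited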

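Research Significance

Online action understanding is essential for applications such as intelligent surveillance, human-computer interaction, and autonomous driving. Accurate action detection and anticipation allow a system to better understand and respond to human behavior.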

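Limitations of Existing Methods

Memory-based methods: approaches such as LSTR and GateHub rely on processing the full sequence, which makes them susceptible to noise in long videos
Single-task design: most methods focus on a single task and do not exploit the mutually reinforcing relationship between detection and anticipation
Lack of intention modeling: the role of intention as a driver of action is overlooked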

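Core Contributions

SSM framework: a novel framework that unifies action detection and anticipation and strengthens action understanding by modeling action dynamics and cross-temporal interaction
Critical State-Based Memory Compression (CSMC) module: introduces a Temporal Weighted Attention mechanism that compresses the raw sequence into critical states, reducing information redundancy
Action Pattern Learning (APL) module: builds a multi-dimensional state-transition graph to model action dynamics in complex scenarios and generates potential future cues that represent intention
Cross-Temporal Interaction (CTI) module: models the mutual influence between intention and past/current information, improving detection and anticipation at the same time
Comprehensive experimental validation: the effectiveness and generalization ability of the method are verified on multiple benchmark datasets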

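Method Details

Task Definition

Given a video feature sequence $F = \{f_i\}_{0}^{L-1} \in \mathbb{R}^{L \times D}$ , which comprises the memory sequence $F_m = \{f\}_{-1}^{-L_m}$ and the current frame $F_{current} = \{f\}_0$ , the goal is to perform both of the following:

Online action detection: recognize the action class at the current time step
Action anticipation: predict the action classes at future time steps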
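
The online setting described above can be illustrated with a minimal sketch: at every step the model sees only the current frame and a bounded memory of earlier frames. The function name `stream_windows` and the parameter `memory_len` (standing in for $L_m$) are hypothetical and chosen here for illustration; they are not taken from the paper.

```python
from collections import deque


def stream_windows(features, memory_len):
    """Yield (memory, current) pairs for a stream of per-frame features.

    `memory` holds at most `memory_len` frames that precede `current`,
    oldest first. Frames from the future are never visible.
    """
    memory = deque(maxlen=memory_len)
    for current in features:
        yield list(memory), current
        memory.append(current)  # the current frame becomes part of the memory


# Placeholder strings stand in for D-dimensional feature vectors.
frames = ["f0", "f1", "f2", "f3", "f4"]
windows = list(stream_windows(frames, memory_len=3))
```

With `memory_len=3`, the first window has an empty memory, and the last window pairs the frame `f4` with the three frames immediately before it.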

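Model Architecture

1. Critical State-Based Memory Compression (CSMC) module

Key-frame extraction:

Video frames are clustered using ProPos representation learning and a Gaussian mixture model (GMM)
Probability density: $p(f(x_i)) = \sum_{k=1}^K \pi_k \mathcal{N}(f(x_i) | \mu_k, \Sigma_k)$
Posterior probability: $p(k|f(x_i)) = \frac{\pi_k \mathcal{N}(f(x_i)|\mu_k,\Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(f(x_i)|\mu_j,\Sigma_j)}$
The frame closest to each cluster center is selected as the key frame: $x_k^c = \arg\min_{x_i} \|f(x_i) - \mu_k\|_2$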
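
The posterior and key-frame formulas above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the embeddings $f(x_i)$ are assumed to be precomputed, the covariances $\Sigma_k$ are assumed diagonal, and the mixture parameters are set by hand rather than fitted with EM. All function and variable names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def posteriors(x, weights, means, variances):
    """p(k | x): weighted component densities normalised over all components."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def key_frames(embeddings, means):
    """For each component mean, the index of the nearest embedding (L2 norm)."""
    return [
        min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], mu))
        for mu in means
    ]


# Toy 2-D embeddings forming two well-separated groups (assumed data).
embeddings = [(0.0, 0.1), (0.2, -0.1), (5.0, 5.1), (4.8, 5.3), (5.1, 4.9)]
weights = [0.4, 0.6]                    # pi_k, hand-set
means = [(0.0, 0.0), (5.0, 5.0)]        # mu_k, hand-set
variances = [(1.0, 1.0), (1.0, 1.0)]    # diagonal of Sigma_k, hand-set

post = posteriors(embeddings[0], weights, means, variances)
selected = key_frames(embeddings, means)
```

Here `post` gives the responsibility of each component for the first frame, and `selected` lists one key-frame index per component.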

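Temporal Weighted Attention (TWA) mechanism:

Key frames serve as queries (Q), and the frames of the original sequence serve as keys (K) and values (V)
Temporal weight function: $g(\Delta t_{i,j}) = \exp(-\frac{\Delta t_{i,j}^2}{2\delta^2})$
Attention weights: $a_{i,j} = \sigma(\frac{Q_i \cdot K_j^T}{\sqrt{d_k}} \cdot g(\Delta t_{i,j}))$
Critical state representation: $S_i = \sum_{j=1}^L a_{i,j}V_j$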
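
The TWA formulas above can be sketched as follows. Several points are assumptions made for this illustration and are not confirmed by the source: $\sigma$ is taken to be a softmax over the frame index $j$, $\Delta t_{i,j}$ is taken to be the difference in frame index between key frame $i$ and frame $j$, and the raw features are used directly as queries, keys, and values without learned projections. The names `twa`, `temporal_weight`, and `query_times` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_weight(dt, delta):
    """Gaussian decay g(dt) = exp(-dt^2 / (2 * delta^2))."""
    return math.exp(-(dt ** 2) / (2 * delta ** 2))


def twa(queries, query_times, keys, values, delta):
    """Compress a frame sequence into one state per key-frame query.

    queries[i] is the feature of key frame i and query_times[i] is its
    index in the original sequence; keys[j] and values[j] belong to frame j.
    """
    d_k = len(keys[0])
    states = []
    for q, t_q in zip(queries, query_times):
        scores = []
        for j, k in enumerate(keys):
            dot = sum(a * b for a, b in zip(q, k))
            # Scaled dot product, multiplied by the temporal weight (assumed order).
            scores.append(dot / math.sqrt(d_k) * temporal_weight(t_q - j, delta))
        attn = softmax(scores)  # assumption: sigma is a softmax over j
        dim = len(values[0])
        states.append([sum(a * v[c] for a, v in zip(attn, values))
                       for c in range(dim)])
    return states


# Three toy 2-D frames; frame 1 is used as the single key frame.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = twa([frames[1]], [1], frames, frames, delta=1.0)
```

Under the softmax assumption, the temporal weight scales each score toward zero for distant frames instead of masking them, so far-away frames still receive some attention.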