Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Jacob, Shao, Kasneci

Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

Basic Information

Paper ID: 2510.14460
Title: Structured Universal Adversarial Attacks on Object Detection for Video Sequences
Authors: Sven Jacob (BAuA & TUM), Weijia Shao (BAuA), Gjergji Kasneci (TUM)
Category: cs.CV (Computer Vision)
Publication Date: October 16, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14460v1

Abstract

Video object detection plays a crucial role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. This paper proposes a minimum-distortion universal adversarial attack for video object detection that uses nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, it employs an adaptive optimistic exponentiated gradient method, which improves scalability and convergence. Experimental results show that the proposed attack outperforms low-rank projected gradient descent and Frank-Wolfe attacks while maintaining high imperceptibility.

Research Background and Motivation

Problem Definition

This research addresses the vulnerability of video object detection systems to adversarial attacks, particularly in safety-critical application scenarios.

Importance Analysis

Safety-Critical Nature: Video object detection is widely deployed in safety-critical domains such as autonomous driving, industrial safety monitoring, and real-time surveillance
Real-World Threats: Adversarial attacks can cause detection system failures, leading to serious safety incidents
Universal Challenge: Universal adversarial perturbations (UAP) pose stronger threats as they can transfer across frames without requiring further access to the target model

Limitations of Existing Methods

Norm Constraint Limitations: Existing methods primarily focus on ℓ₂ and ℓ∞ norm-constrained perturbations
Visual Perceptibility: ℓ₁ attacks produce visible artifacts on moving objects in videos, reducing imperceptibility
Missing Temporal Consistency: Frame-by-frame processing ignores the temporal coherence of video data

Research Motivation

Building on robust principal component analysis and structured adversarial perturbation methods, this work proposes a strategy that makes objects disappear from the detector's output through structured yet inconspicuous background modifications.

Core Contributions

Novel Attack Formulation: Proposes a minimum-distortion universal attack formulation based on nuclear norm regularization that promotes structured perturbations with orthogonal spatial patterns across video frames
Efficient Optimization Algorithm: Adapts the adaptive optimistic exponentiated gradient (AO-Exp) method for scalable optimization under nuclear norm regularization
Comprehensive Experimental Evaluation: Conducts thorough evaluation on public video datasets and state-of-the-art video object detection models
Performance Advantages: Demonstrates superior attack success rates and computational efficiency compared to existing nuclear norm attack methods

Methodology Details

Task Definition

Given a video frame sequence $\{x_b \mid 1 \leq b \leq B\}$, the objective is to find a universal adversarial perturbation $\delta$ that, when applied to all frames, causes the object detector $f$ to fail while keeping the perturbation minimal and structured.
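To make the "one perturbation for all frames" idea concrete, here is a minimal sketch, assuming frames are stored as a NumPy array of shape (B, H, W, C) with pixel values in [0, 1]. The function name `apply_universal_perturbation` and the clipping to the valid pixel range are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def apply_universal_perturbation(frames, delta, lo=0.0, hi=1.0):
    """Add the same perturbation delta (H, W, C) to every frame in frames (B, H, W, C).

    Clipping to [lo, hi] is an assumption made here to keep pixels valid.
    """
    frames = np.asarray(frames, dtype=float)
    delta = np.asarray(delta, dtype=float)
    if frames.shape[1:] != delta.shape:
        raise ValueError("delta must match the per-frame shape (H, W, C)")
    # Broadcasting over the leading frame axis applies delta to all B frames.
    return np.clip(frames + delta[None, ...], lo, hi)
```

Because `delta` has no frame index, the same spatial pattern is added to each frame, which is what distinguishes a universal perturbation from a per-frame attack.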

Model Architecture

Loss Function Design

The loss is decomposed into foreground and background cross-entropy terms, $L = L_{fg} + L_{bg}$, to which a confidence term is added.

The individual terms are:

Foreground Loss: $L_{fg} = \frac{1}{|F|}\sum_{i \in F} CE(p_i, y_i)$
Background Loss: $L_{bg} = \frac{1}{|B|}\sum_{i \in B} CE(p_i, y_i)$
Confidence Loss: $L_{conf} = \sum_{i \in [S]} \xi_i \cdot \mathbf{1}(\xi_i > \tau)$

Total loss is: $L_{total} = \alpha L_{fg} + \gamma L_{conf} + \beta L_{bg}$
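The weighted combination above could be computed as in the following sketch. It assumes that per-proposal class probabilities, labels, and confidence scores are already available as plain Python lists; the function names, the threshold default `tau=0.5`, and the unit weights are hypothetical and not taken from the paper.

```python
import math

def cross_entropy(probs, label):
    """CE(p, y) for one proposal: negative log-probability of the labelled class."""
    return -math.log(max(probs[label], 1e-12))

def mean_ce(prob_rows, labels):
    """Average cross-entropy over a set of proposals (0.0 for an empty set)."""
    if not prob_rows:
        return 0.0
    return sum(cross_entropy(p, y) for p, y in zip(prob_rows, labels)) / len(prob_rows)

def confidence_loss(scores, tau):
    """L_conf: sum of the confidence scores that exceed the threshold tau."""
    return sum(s for s in scores if s > tau)

def total_loss(fg, bg, scores, tau=0.5, alpha=1.0, beta=1.0, gamma=1.0):
    """L_total = alpha * L_fg + gamma * L_conf + beta * L_bg.

    fg and bg are (prob_rows, labels) pairs for foreground and background proposals.
    """
    l_fg = mean_ce(*fg)
    l_bg = mean_ce(*bg)
    return alpha * l_fg + gamma * confidence_loss(scores, tau) + beta * l_bg
```

The indicator $\mathbf{1}(\xi_i > \tau)$ appears here as the `if s > tau` filter, so scores at or below the threshold contribute nothing.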

Regularization Design

A nuclear norm and a squared Frobenius norm are combined and applied to each channel $c$ of the perturbation: $R(\delta) = \sum_{c=1}^{C}\left(\lambda_1 \|\delta_c\|_* + \frac{\lambda_2}{2} \|\delta_c\|_F^2\right)$

Optimization Objective

The complete optimization problem for universal attacks: $\min_{\delta \in \mathbb{R}^{H \times W \times C}} -\frac{1}{B}\sum_{b=1}^{B} L_{total}(f(x_b + \delta), f(x_b)) + \sum_{c=1}^{C}(\lambda_1||\delta_c||_* + \frac{\lambda_2}{2}||\delta_c||_F^2)$
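As a rough illustration of the regularization term in this objective, the sketch below evaluates the per-channel nuclear norm and squared Frobenius norm with NumPy. The helper names `regularizer` and `objective` are hypothetical, and the attack loss is passed in as a precomputed number rather than evaluated through a detector.

```python
import numpy as np

def regularizer(delta, lam1, lam2):
    """Sum over channels of lam1 * ||delta_c||_* + (lam2 / 2) * ||delta_c||_F^2.

    delta is expected to have shape (H, W, C).
    """
    total = 0.0
    for c in range(delta.shape[2]):
        channel = delta[:, :, c]
        singular_values = np.linalg.svd(channel, compute_uv=False)
        nuclear = singular_values.sum()      # nuclear norm: sum of singular values
        frob_sq = np.sum(channel ** 2)       # squared Frobenius norm
        total += lam1 * nuclear + 0.5 * lam2 * frob_sq
    return float(total)

def objective(mean_attack_loss, delta, lam1, lam2):
    """Negated mean attack loss over the B frames plus the regularizer."""
    return -mean_attack_loss + regularizer(delta, lam1, lam2)
```

The nuclear norm term penalizes the sum of singular values and so favors low-rank channels, while the squared Frobenius term limits the overall energy of the perturbation.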

AO-Exp Algorithm

Core Concept

Employs the adaptive optimistic exponentiated gradient (AO-Exp) method, representing each channel of the perturbation through its singular value decomposition (SVD): $\delta_c^t = U_{c,t} \text{diag}(z_c^t) V_{c,t}^T$

Algorithm Steps

Optimistic Update: $\eta_c^t \leftarrow \eta_c^{t-1} + \frac{t^2}{||\nabla G(\delta_c^t) - \nabla G(\delta_c^{t-1})||_\infty^2}$
Singular Value Update: $z_{c,i}^{t+1} = \frac{\eta_c^t}{\lambda_2} W_0\left(\frac{\lambda_2}{\eta_c^t} \exp\left(\frac{\lambda_2 + \max\{\theta_{c,i}^t - \lambda_1, 0\}}{\eta_c^t}\right)\right) - 1$, where $W_0$ is the principal branch of the Lambert W function
Perturbation Reconstruction: $\delta_c^{t+1} = \frac{2}{t(t+1)} \sum_{s=1}^{t} s \cdot U_{c,t} \text{diag}(z_{c,1:k}^{s}) V_{c,t}^T$
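The reconstruction step can be read as a weighted average of the singular value iterates, with weight $s$ on iterate $s$, followed by a rebuild with the current singular vectors. The sketch below shows only that step and assumes the iterates are already available; it does not implement the step-size or Lambert W updates, and the function names are hypothetical.

```python
import numpy as np

def weighted_average_singular_values(z_history):
    """Return 2 / (t (t + 1)) * sum_{s=1}^{t} s * z_s for a list of t iterates."""
    t = len(z_history)
    weights = np.arange(1, t + 1, dtype=float)    # s = 1, ..., t
    stacked = np.stack(z_history)                  # shape (t, k)
    return (2.0 / (t * (t + 1))) * (weights[:, None] * stacked).sum(axis=0)

def reconstruct(U, z, Vt):
    """Rebuild one channel of the perturbation as U diag(z) V^T.

    U has shape (H, k), z has shape (k,), Vt has shape (k, W).
    """
    return (U * z) @ Vt   # scaling the columns of U by z equals U @ diag(z)
```

Since the weights $2s / (t(t+1))$ sum to one, later iterates dominate the average without discarding earlier ones. Because the formula uses the same $U_{c,t}$ and $V_{c,t}$ for every $s$, averaging the singular values first and reconstructing once gives the same result as summing the reconstructed matrices.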

Technical Innovations

Structured Background Perturbation: Nuclear norm regularization promotes low-rank structure concentrated in background regions
Temporal Consistency: Universal perturbations ensure temporal consistency across frames
Efficient Optimization: AO-Exp method achieves fast convergence under nuclear norm constraints
Low-Rank Adaptation: Further information compression through top-k singular value selection
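The top-k selection mentioned in the last item can be illustrated with a plain truncated SVD. This is a generic sketch of rank truncation, not the paper's low-rank adaptation procedure, and the function name `truncate_rank` is hypothetical.

```python
import numpy as np

def truncate_rank(channel, k):
    """Keep only the k largest singular values of a 2-D perturbation channel."""
    U, s, Vt = np.linalg.svd(channel, full_matrices=False)
    # np.linalg.svd returns singular values in descending order,
    # so the first k entries are the largest ones.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

Dropping the smaller singular values lowers both the rank and the nuclear norm of the channel, which matches the compression effect described above.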

Experimental Setup

Datasets

PETS 2009 S2L1: 7 scenes, 768×576 resolution, average 795 frames/scene
EPFL-RLC: 3 scenes, 1920×1080 resolution, average 5000 frames/scene
CW4C: 15 scenes, 1920×880 resolution, average 7200 frames/scene

Evaluation Metrics

IoU Accumulation (IoUacc): Evaluates attack impact on entire sequence
Adversarial Bounding Box Ratio (advBR): Ratio of adversarial to clean bounding boxes
Mean Absolute Perturbation (MAP): Measures perceptibility
Nuclear Norm $\|\delta\|_*$: Evaluates perturbation structure
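Three of these metrics are simple enough to sketch directly. The code below assumes that advBR is computed from box counts accumulated over a sequence and that the nuclear norm is summed over channels; both are assumptions, since the summary does not specify the aggregation. IoUacc is omitted because its exact definition is not given here, and all function names are hypothetical.

```python
import numpy as np

def adv_box_ratio(num_adv_boxes, num_clean_boxes):
    """advBR: boxes detected on perturbed frames divided by boxes on clean frames."""
    if num_clean_boxes == 0:
        raise ValueError("clean sequence has no detections")
    return num_adv_boxes / num_clean_boxes

def mean_absolute_perturbation(delta):
    """MAP: mean absolute value of the perturbation entries."""
    return float(np.mean(np.abs(delta)))

def nuclear_norm(delta):
    """Nuclear norm of an (H, W, C) perturbation, summed over channels (assumed)."""
    return float(sum(
        np.linalg.svd(delta[:, :, c], compute_uv=False).sum()
        for c in range(delta.shape[2])
    ))
```

For an object-vanishing attack, an advBR close to 0 means few boxes survive the perturbation, whereas a value near 1 means the detector output is largely unchanged.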

Comparison Methods

LoRa-PGD: Low-rank projected gradient descent attack
FW-Nucl: Frank-Wolfe nuclear norm group attack
AO-Exp Variants: Including low-rank adaptation version

Implementation Details

Iterations: 100 (AO-Exp and LoRa-PGD), 30 (FW-Nucl)
Regularization parameters: λ₁ and λ₂ adjusted per dataset
Target model: Mask R-CNN

Experimental Results

Main Results

Dataset	Method	IoUacc (↓)	advBR (↓)	MAP (↓)	$\|\delta\|_*$ (↓)
PETS2009	FW-Nucl	4.77±1.09	1.04±0.25	1.2±0.3	36.5±5.84
PETS2009	LoRa-PGD-100	1.22±0.91	0.63±0.42	4.0±0.3	60.3±10.3
PETS2009	AO-Exp	0.29±0.27	0.06±0.04	2.9±0.1	41.3±16.6
EPFL-RLC	FW-Nucl	4.83±0.96	0.86±0.14	5.4±2.0	37.54±1.53
EPFL-RLC	LoRa-PGD-100	0.20±0.06	0.37±0.11	14.0±3.0	43.5±4.3
EPFL-RLC	AO-Exp	0.9±0.37	0.22±0.07	6.0±4.0	27.52±15.8

Key Findings

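Attack Effectiveness: AO-Exp achieves the lowest advBR on both datasets in the table (0.06 on PETS2009, 0.22 on EPFL-RLC) and the lowest IoUacc on PETS2009 (0.29); on EPFL-RLC, LoRa-PGD-100 reaches a lower IoUacc (0.20 vs. 0.9)
Imperceptibility: AO-Exp's MAP is lower than LoRa-PGD-100's on both datasets (2.9 vs. 4.0 and 6.0 vs. 14.0) but higher than FW-Nucl's (1.2 and 5.4)
Structured Degree: AO-Exp has a lower nuclear norm than LoRa-PGD-100 on both datasets and the lowest on EPFL-RLC (27.52); on PETS2009, FW-Nucl is lower (36.5 vs. 41.3)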

Ablation Studies

Singular Value Count Impact: Analysis of different k values on advBR across camera viewpoints in EPFL dataset
Low-Rank Adaptation Effect: AO-Exp (LoRa) variant significantly reduces nuclear norm while maintaining comparable performance

Visual Analysis

ℓ₁ attacks produce flickering noise following moving objects
Nuclear norm attacks generate more structured spatially coherent perturbations concentrated in background regions

Current State of Adversarial Attack Research

Image Classification Attacks: Relatively mature research with abundant methods
Object Detection Attacks: Relatively scarce, particularly in video scenarios
Universal Adversarial Perturbations: Input-agnostic, uniformly applied across inputs

Low-Rank Structure Research

Manifold Hypothesis: High-dimensional data tends to lie near low-dimensional manifolds
Dimensionality Reduction Methods: PCA, UMAP, autoencoders, etc.
Adversarial Applications: Nuclear norm regularization applications in adversarial attacks

Advantages of This Work

Temporal Consistency: Considers temporal characteristics of video data
Structured Design: Leverages nuclear norm to promote background structured perturbations
Efficient Optimization: AO-Exp method improves computational efficiency

Conclusions and Discussion

Main Conclusions

Proposes a novel structured universal adversarial attack method for video object detection
Nuclear norm regularization effectively promotes structured perturbations in background regions
AO-Exp algorithm outperforms existing methods in both effectiveness and efficiency
Method consistently suppresses bounding boxes across multiple datasets

Limitations

Static Camera Assumption: Current method assumes static camera settings, limiting applicability to dynamic camera scenarios
Hyperparameter Sensitivity: Attack performance is sensitive to choices of nuclear norm weight and Frobenius regularization parameters
Computational Complexity: Each iteration requires a singular value decomposition (SVD), which increases computational cost

Future Directions

Dynamic Camera Extension: Extend to dynamic camera settings
Object Tracking Applications: Extend method to object tracking tasks
Adaptive Hyperparameters: Develop adaptive or learned hyperparameter strategies
Defense Mechanisms: Explore countermeasures and defenses against structured temporal-consistent adversarial attacks

In-Depth Evaluation

Strengths

Methodological Innovation: First systematic application of nuclear norm regularization to video object detection adversarial attacks
Solid Theoretical Foundation: Based on robust PCA and structured perturbation with solid theoretical grounding
Comprehensive Experiments: Thorough evaluation across multiple datasets
High Practical Value: Addresses important problems in safety-critical applications
Open-Source Contribution: Code and data publicly available for reproducibility

Weaknesses

Application Scenario Limitations: Only applicable to static camera scenarios
Insufficient Defense Consideration: Lacks evaluation against existing defense methods
Physical World Verification: Absence of validation experiments in real physical environments
Computational Cost Analysis: Insufficient analysis of the computational overhead of the per-iteration SVD

Impact

Academic Contribution: Provides new perspectives for video adversarial attack research
Security Awareness: Raises awareness of video detection system vulnerabilities
Methodological Inspiration: Nuclear norm regularization may inspire other structured attack research

Applicable Scenarios

Security Assessment: Robustness evaluation of industrial safety monitoring systems
Research Tool: Benchmark method for adversarial robustness research
Defense Development: Provides attack samples for developing targeted defense methods

References

The paper cites 41 relevant references covering multiple domains including adversarial attacks, object detection, and video analysis, providing solid theoretical foundation and comparison baselines.

Overall Assessment: This is a high-quality paper with significant contributions to the field of adversarial attacks on video object detection. The method demonstrates strong innovation, comprehensive experimental evaluation, and important practical significance for safety-critical applications. Despite some limitations, it provides valuable insights and future research directions for the field.