Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Jacob, Shao, Kasneci
Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

Basic Information

  • Paper ID: 2510.14460
  • Title: Structured Universal Adversarial Attacks on Object Detection for Video Sequences
  • Authors: Sven Jacob (BAuA & TUM), Weijia Shao (BAuA), Gjergji Kasneci (TUM)
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14460v1

Abstract

Video object detection plays a crucial role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. This paper proposes a minimum-distortion universal adversarial attack for video object detection that leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, an adaptive optimistic exponentiated gradient method is employed, improving scalability and convergence. Experimental results show that the proposed attack outperforms low-rank projected gradient descent and Frank-Wolfe attacks in effectiveness while remaining highly imperceptible.

Research Background and Motivation

Problem Definition

This research addresses the vulnerability of video object detection systems to adversarial attacks, particularly in safety-critical application scenarios.

Importance Analysis

  1. Safety-Critical Nature: Video object detection is widely deployed in safety-critical domains such as autonomous driving, industrial safety monitoring, and real-time surveillance
  2. Real-World Threats: Adversarial attacks can cause detection system failures, leading to serious safety incidents
  3. Universal Challenge: Universal adversarial perturbations (UAP) pose stronger threats as they can transfer across frames without requiring further access to the target model

Limitations of Existing Methods

  1. Norm Constraint Limitations: Existing methods primarily focus on ℓ₂ and ℓ∞ norm-constrained perturbations
  2. Visual Perceptibility: ℓ₁ attacks produce visible artifacts on moving objects in videos, reducing imperceptibility
  3. Missing Temporal Consistency: Frame-by-frame processing ignores the temporal coherence of video data

Research Motivation

Drawing on robust principal component analysis and structured adversarial perturbation methods, this work proposes a strategy for object-vanishing attacks (suppressing detections) through structured yet unsuspicious background modifications.

Core Contributions

  1. Novel Attack Formulation: Proposes a minimum-distortion universal attack formulation based on nuclear norm regularization that promotes structured perturbations with orthogonal spatial patterns across video frames
  2. Efficient Optimization Algorithm: Adapts the adaptive optimistic exponentiated gradient (AO-Exp) method for scalable optimization under nuclear norm regularization
  3. Comprehensive Experimental Evaluation: Conducts thorough evaluation on public video datasets and state-of-the-art video object detection models
  4. Performance Advantages: Demonstrates superior attack success rates and computational efficiency compared to existing nuclear norm attack methods

Methodology Details

Task Definition

Given a video frame sequence $\{x_b \mid 1 \leq b \leq B\}$, the objective is to find a universal adversarial perturbation $\delta$ that, when applied to all frames, causes the object detector $f$ to fail while keeping the perturbation minimal and structured.
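
As a minimal sketch of what "universal" means here, the same perturbation is added to every frame. The nested-list frame representation, the helper name `apply_universal_perturbation`, and the clipping range [0, 255] are illustrative assumptions, not details taken from the paper.

```python
def apply_universal_perturbation(frames, delta, lo=0.0, hi=255.0):
    """Add one shared perturbation to every frame.

    frames: list of 2D lists (toy single-channel frames, an assumption).
    delta:  2D list with the same shape as each frame.
    The clipping range [lo, hi] is an assumption for illustration.
    """
    perturbed = []
    for frame in frames:
        perturbed.append([
            [min(hi, max(lo, pixel + d)) for pixel, d in zip(row, d_row)]
            for row, d_row in zip(frame, delta)
        ])
    return perturbed


# Two toy 2x2 frames share a single perturbation.
frames = [
    [[10.0, 20.0], [30.0, 40.0]],
    [[250.0, 0.0], [5.0, 100.0]],
]
delta = [[8.0, -8.0], [0.0, 2.0]]
adv_frames = apply_universal_perturbation(frames, delta)
```

Because `delta` does not depend on the frame index, it is computed once and reused across the whole sequence.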

Model Architecture

Loss Function Design

The loss function is decomposed into foreground and background losses: $L = L_{fg} + L_{bg}$

Where:

  • Foreground Loss: $L_{fg} = \frac{1}{|F|}\sum_{i \in F} CE(p_i, y_i)$
  • Background Loss: $L_{bg} = \frac{1}{|B|}\sum_{i \in B} CE(p_i, y_i)$
  • Confidence Loss: $L_{conf} = \sum_{i \in [S]} \xi_i \cdot \mathbf{1}(\xi_i > \tau)$

Total loss is: $L_{total} = \alpha L_{fg} + \gamma L_{conf} + \beta L_{bg}$
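
The weighted combination can be sketched in plain Python. This is an illustrative reading, not the authors' implementation: it assumes $p_i$ is a vector of class probabilities, $y_i$ is an integer class label, and $\xi_i$ is a detection confidence score. All function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """CE(p, y) = -log p[y]; the small floor avoids log(0)."""
    return -math.log(max(probs[label], 1e-12))


def mean_cross_entropy(predictions, labels, indices):
    """Average cross-entropy over an index set (F or B in the text)."""
    if not indices:
        return 0.0
    total = sum(cross_entropy(predictions[i], labels[i]) for i in indices)
    return total / len(indices)


def confidence_loss(scores, tau):
    """Sum of the confidence scores that exceed the threshold tau."""
    return sum(s for s in scores if s > tau)


def total_loss(predictions, labels, fg_idx, bg_idx, scores, tau,
               alpha, beta, gamma):
    """L_total = alpha * L_fg + gamma * L_conf + beta * L_bg."""
    l_fg = mean_cross_entropy(predictions, labels, fg_idx)
    l_bg = mean_cross_entropy(predictions, labels, bg_idx)
    l_conf = confidence_loss(scores, tau)
    return alpha * l_fg + gamma * l_conf + beta * l_bg


# Toy example: three predictions, two foreground and one background.
predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 0]
scores = [0.95, 0.4, 0.7]
loss = total_loss(predictions, labels, fg_idx=[0, 1], bg_idx=[2],
                  scores=scores, tau=0.5, alpha=1.0, beta=1.0, gamma=1.0)
```

With `tau=0.5`, only the scores 0.95 and 0.7 contribute to the confidence term.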

Regularization Design

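A combination of the nuclear norm and the Frobenius norm is employed: $R(\delta) = \lambda_1 \|\delta\|_* + \lambda_2 \|\delta\|_F$. In the optimization objective, this regularizer is applied per channel, with the Frobenius term squared and scaled by $\frac{1}{2}$.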

Optimization Objective

The complete optimization problem for universal attacks: $\min_{\delta \in \mathbb{R}^{H \times W \times C}} -\frac{1}{B}\sum_{b=1}^{B} L_{total}(f(x_b + \delta), f(x_b)) + \sum_{c=1}^{C}\left(\lambda_1\|\delta_c\|_* + \frac{\lambda_2}{2}\|\delta_c\|_F^2\right)$
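
To see why the regularizer in this objective favors low-rank perturbations, it helps to express both norms through the singular values $\sigma_{c,1} \geq \sigma_{c,2} \geq \dots \geq 0$ of each channel $\delta_c$. The symbol $\sigma_{c,i}$ is introduced here for illustration and does not come from the source; the identities themselves are standard.

```latex
\|\delta_c\|_* = \sum_i \sigma_{c,i},
\qquad
\|\delta_c\|_F^2 = \sum_i \sigma_{c,i}^2,
\qquad\text{so}\qquad
\lambda_1 \|\delta_c\|_* + \frac{\lambda_2}{2}\|\delta_c\|_F^2
= \sum_i \left( \lambda_1 \sigma_{c,i} + \frac{\lambda_2}{2}\,\sigma_{c,i}^2 \right).
```

Read this way, the penalty acts like an elastic-net term on the singular values: the linear part pushes small singular values toward zero (low rank), while the quadratic part limits the overall magnitude.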

AO-Exp Algorithm

Core Concept

Employs the adaptive optimistic exponentiated gradient (AO-Exp) method, maintaining the decision variable of each channel $c$ through its singular value decomposition (SVD): $\delta_c^t = U_{c,t}\,\text{diag}(z_c^t)\,V_{c,t}^T$

Algorithm Steps

  1. Optimistic Update: $\eta_c^t \leftarrow \eta_c^{t-1} + \frac{t^2}{\|\nabla G(\delta_c^t) - \nabla G(\delta_c^{t-1})\|_\infty^2}$
  2. Singular Value Update: $z_{c,i}^{t+1} = \frac{\eta_c^t}{\lambda_2} W_0\left(\frac{\lambda_2}{\eta_c^t} \exp\left(\frac{\lambda_2 + \max\{\theta_{c,i}^t - \lambda_1, 0\}}{\eta_c^t}\right)\right) - 1$, where $W_0$ is the principal branch of the Lambert W function
  3. Perturbation Reconstruction: $\delta_c^{t+1} = \frac{2}{t(t+1)} \sum_{s=1}^{t} s \cdot U_{c,t}\,\text{diag}(z_{s,1:k}^c)\,V_{c,t}^T$
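
Steps 2 and 3 can be sketched in plain Python as a minimal illustration, not the authors' implementation. The helper names are hypothetical, $\theta$ is treated as a given input, and the Newton solver stands in for a library routine such as `scipy.special.lambertw`. Step 3 is shown only on the singular values, since $U_{c,t}$ and $V_{c,t}$ are common factors in the sum.

```python
import math


def lambert_w0(x, tol=1e-12, max_iter=100):
    """Principal branch W_0(x) for x >= 0, via Newton on w * exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # upper bound on W_0(x) for x >= 0, a safe start
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def update_singular_value(theta, eta, lam1, lam2):
    """Step 2: closed-form singular value update using W_0."""
    a = lam2 / eta
    arg = a * math.exp((lam2 + max(theta - lam1, 0.0)) / eta)
    return lambert_w0(arg) / a - 1.0


def weighted_average(z_history):
    """Step 3 on singular values: weights s = 1..t, scaled by 2 / (t (t + 1))."""
    t = len(z_history)
    scale = 2.0 / (t * (t + 1))
    k = len(z_history[0])
    return [
        scale * sum((s + 1) * z[i] for s, z in enumerate(z_history))
        for i in range(k)
    ]


# Toy values (assumptions): one entry below lambda_1 and one above it.
thetas = [0.5, 3.0]
z_next = [update_singular_value(th, eta=2.0, lam1=1.0, lam2=0.1) for th in thetas]
z_avg = weighted_average([[0.0], [3.0]])
```

The update behaves like a soft threshold: when $\theta_{c,i}^t \leq \lambda_1$, the argument of $W_0$ has the form $a e^{a}$, so $W_0$ returns $a$ and the updated singular value is zero. This is how $\lambda_1$ induces low rank.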

Technical Innovations

  1. Structured Background Perturbation: Nuclear norm regularization promotes low-rank structure concentrated in background regions
  2. Temporal Consistency: Universal perturbations ensure temporal consistency across frames
  3. Efficient Optimization: AO-Exp method achieves fast convergence under nuclear norm constraints
  4. Low-Rank Adaptation: Further information compression through top-k singular value selection
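
Top-k singular value selection can be read as a truncated SVD of each channel. The block below gives the standard rank-$k$ truncation and its approximation error (Eckart-Young) as a sketch of the general idea. Whether the paper's low-rank adaptation applies it in exactly this form is an assumption, and the symbols $\sigma_{c,i}$, $u_{c,i}$, $v_{c,i}$ and $r$ are introduced here for illustration.

```latex
\delta_c = \sum_{i=1}^{r} \sigma_{c,i}\, u_{c,i} v_{c,i}^T
\quad\longrightarrow\quad
\delta_c^{(k)} = \sum_{i=1}^{k} \sigma_{c,i}\, u_{c,i} v_{c,i}^T,
\qquad
\|\delta_c - \delta_c^{(k)}\|_F^2 = \sum_{i=k+1}^{r} \sigma_{c,i}^2,
\qquad
\|\delta_c^{(k)}\|_* = \sum_{i=1}^{k} \sigma_{c,i} \leq \|\delta_c\|_* .
```

Discarding the smallest singular values therefore lowers the nuclear norm while changing the perturbation as little as possible in Frobenius norm.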

Experimental Setup

Datasets

  1. PETS 2009 S2L1: 7 scenes, 768×576 resolution, average 795 frames/scene
  2. EPFL-RLC: 3 scenes, 1920×1080 resolution, average 5000 frames/scene
  3. CW4C: 15 scenes, 1920×880 resolution, average 7200 frames/scene

Evaluation Metrics

  1. IoU Accumulation (IoUacc): Evaluates attack impact on entire sequence
  2. Adversarial Bounding Box Ratio (advBR): Ratio of adversarial to clean bounding boxes
  3. Mean Absolute Perturbation (MAP): Measures perceptibility
  4. Nuclear Norm $\|\delta\|_*$: Evaluates perturbation structure
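
Two of these metrics follow directly from their one-line descriptions and can be sketched as below. The function names, the single-channel nested-list perturbation, and the per-sequence box counts are illustrative assumptions; the paper's exact aggregation may differ.

```python
def adv_bounding_box_ratio(num_adv_boxes, num_clean_boxes):
    """advBR: boxes detected under attack divided by boxes detected on clean frames."""
    if num_clean_boxes == 0:
        raise ValueError("no clean detections to compare against")
    return num_adv_boxes / num_clean_boxes


def mean_absolute_perturbation(delta):
    """MAP: mean of |delta| over all entries of a 2D perturbation."""
    values = [abs(v) for row in delta for v in row]
    return sum(values) / len(values)


# Toy numbers (assumptions): 3 boxes survive out of 50 clean detections.
ratio = adv_bounding_box_ratio(3, 50)
map_value = mean_absolute_perturbation([[1.0, -3.0], [0.0, 4.0]])
```

A lower advBR means more detections were suppressed, and a lower MAP means a less visible perturbation.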

Comparison Methods

  1. LoRa-PGD: Low-rank projected gradient descent attack
  2. FW-Nucl: Frank-Wolfe nuclear norm group attack
  3. AO-Exp Variants: Including low-rank adaptation version

Implementation Details

  • Iterations: 100 (AO-Exp and LoRa-PGD), 30 (FW-Nucl)
  • Regularization parameters: λ₁ and λ₂ adjusted per dataset
  • Target model: Mask R-CNN

Experimental Results

Main Results

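| Dataset  | Method       | IoUacc (↓) | advBR (↓) | MAP (↓)  | Nuclear norm (↓) |
|----------|--------------|------------|-----------|----------|------------------|
| PETS2009 | FW-Nucl      | 4.77±1.09  | 1.04±0.25 | 1.2±0.3  | 36.5±5.84        |
| PETS2009 | LoRa-PGD-100 | 1.22±0.91  | 0.63±0.42 | 4.0±0.3  | 60.3±10.3        |
| PETS2009 | AO-Exp       | 0.29±0.27  | 0.06±0.04 | 2.9±0.1  | 41.3±16.6        |
| EPFL-RLC | FW-Nucl      | 4.83±0.96  | 0.86±0.14 | 5.4±2.0  | 37.54±1.53       |
| EPFL-RLC | LoRa-PGD-100 | 0.20±0.06  | 0.37±0.11 | 14.0±3.0 | 43.5±4.3         |
| EPFL-RLC | AO-Exp       | 0.9±0.37   | 0.22±0.07 | 6.0±4.0  | 27.52±15.8       |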

Key Findings

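  1. Attack Effectiveness: AO-Exp achieves the lowest advBR on both reported datasets (0.06 on PETS2009, 0.22 on EPFL-RLC) and the lowest IoUacc on PETS2009 (0.29); on EPFL-RLC, LoRa-PGD-100 reports a lower IoUacc (0.20 vs. 0.9)
  2. Imperceptibility: AO-Exp's MAP is below that of LoRa-PGD-100 on both datasets, although FW-Nucl attains the lowest MAP
  3. Structured Degree: AO-Exp yields the lowest nuclear norm on EPFL-RLC (27.52); on PETS2009, FW-Nucl is lower (36.5 vs. 41.3)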

Ablation Studies

  1. Singular Value Count Impact: Analysis of different k values on advBR across camera viewpoints in EPFL dataset
  2. Low-Rank Adaptation Effect: AO-Exp (LoRa) variant significantly reduces nuclear norm while maintaining comparable performance

Visual Analysis

  • ℓ₁ attacks produce flickering noise following moving objects
  • Nuclear norm attacks generate more structured spatially coherent perturbations concentrated in background regions

Current State of Adversarial Attack Research

  1. Image Classification Attacks: Relatively mature research with abundant methods
  2. Object Detection Attacks: Relatively scarce, particularly in video scenarios
  3. Universal Adversarial Perturbations: Input-agnostic, uniformly applied across inputs

Low-Rank Structure Research

  1. Manifold Hypothesis: High-dimensional data tends to lie near low-dimensional manifolds
  2. Dimensionality Reduction Methods: PCA, UMAP, autoencoders, etc.
  3. Adversarial Applications: Nuclear norm regularization applications in adversarial attacks

Advantages of This Work

  1. Temporal Consistency: Considers temporal characteristics of video data
  2. Structured Design: Leverages nuclear norm to promote background structured perturbations
  3. Efficient Optimization: AO-Exp method improves computational efficiency

Conclusions and Discussion

Main Conclusions

  1. Proposes a novel structured universal adversarial attack method for video object detection
  2. Nuclear norm regularization effectively promotes structured perturbations in background regions
  3. AO-Exp algorithm outperforms existing methods in both effectiveness and efficiency
  4. Method consistently suppresses bounding boxes across multiple datasets

Limitations

  1. Static Camera Assumption: Current method assumes static camera settings, limiting applicability to dynamic camera scenarios
  2. Hyperparameter Sensitivity: Attack performance is sensitive to choices of nuclear norm weight and Frobenius regularization parameters
  3. Computational Complexity: Each iteration requires a singular value decomposition (SVD), increasing computational cost

Future Directions

  1. Dynamic Camera Extension: Extend to dynamic camera settings
  2. Object Tracking Applications: Extend method to object tracking tasks
  3. Adaptive Hyperparameters: Develop adaptive or learned hyperparameter strategies
  4. Defense Mechanisms: Explore countermeasures and defenses against structured temporal-consistent adversarial attacks

In-Depth Evaluation

Strengths

  1. Methodological Innovation: First systematic application of nuclear norm regularization to video object detection adversarial attacks
  2. Solid Theoretical Foundation: Based on robust PCA and structured perturbation with solid theoretical grounding
  3. Comprehensive Experiments: Thorough evaluation across multiple datasets
  4. High Practical Value: Addresses important problems in safety-critical applications
  5. Open-Source Contribution: Code and data publicly available for reproducibility

Weaknesses

  1. Application Scenario Limitations: Only applicable to static camera scenarios
  2. Insufficient Defense Consideration: Lacks evaluation against existing defense methods
  3. Physical World Verification: Absence of validation experiments in real physical environments
  4. Computational Cost Analysis: Insufficient analysis of the computational overhead of the per-iteration SVD

Impact

  1. Academic Contribution: Provides new perspectives for video adversarial attack research
  2. Security Awareness: Raises awareness of video detection system vulnerabilities
  3. Methodological Inspiration: Nuclear norm regularization may inspire other structured attack research

Applicable Scenarios

  1. Security Assessment: Robustness evaluation of industrial safety monitoring systems
  2. Research Tool: Benchmark method for adversarial robustness research
  3. Defense Development: Provides attack samples for developing targeted defense methods

References

The paper cites 41 relevant references covering multiple domains including adversarial attacks, object detection, and video analysis, providing solid theoretical foundation and comparison baselines.


Overall Assessment: This is a high-quality paper with significant contributions to the field of adversarial attacks on video object detection. The method demonstrates strong innovation, comprehensive experimental evaluation, and important practical significance for safety-critical applications. Despite some limitations, it provides valuable insights and future research directions for the field.