Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.
- Paper ID: 2510.14460
- Title: Structured Universal Adversarial Attacks on Object Detection for Video Sequences
- Authors: Sven Jacob (BAuA & TUM), Weijia Shao (BAuA), Gjergji Kasneci (TUM)
- Category: cs.CV (Computer Vision)
- Publication Date: October 16, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.14460v1
Video object detection plays a crucial role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. This paper proposes a minimally distorted universal adversarial attack for video object detection that uses nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, it employs an adaptive optimistic exponentiated gradient (AO-Exp) method, which improves scalability and convergence. Experimental results show that the proposed attack outperforms low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while remaining hard to perceive.
This research addresses the vulnerability of video object detection systems to adversarial attacks, particularly in safety-critical application scenarios.
- Safety-Critical Nature: Video object detection is widely deployed in safety-critical domains such as autonomous driving, industrial safety monitoring, and real-time surveillance
- Real-World Threats: Adversarial attacks can cause detection system failures, leading to serious safety incidents
- Universal Challenge: Universal adversarial perturbations (UAP) pose stronger threats as they can transfer across frames without requiring further access to the target model
- Norm Constraint Limitations: Existing methods primarily focus on ℓ₂ and ℓ∞ norm-constrained perturbations
- Visual Perceptibility: ℓ₁ attacks produce visible artifacts on moving objects in videos, reducing imperceptibility
- Missing Temporal Consistency: Frame-by-frame processing ignores the temporal coherence of video data
Building on robust principal component analysis and structured adversarial perturbation methods, this work proposes a strategy that makes objects disappear from the detector's output through structured yet unsuspicious modifications of the background.
- Novel Attack Formulation: Proposes a minimum-distortion universal attack formulation based on nuclear norm regularization that promotes structured perturbations with orthogonal spatial patterns across video frames
- Efficient Optimization Algorithm: Adapts the adaptive optimistic exponentiated gradient (AO-Exp) method for scalable optimization under nuclear norm regularization
- Comprehensive Experimental Evaluation: Conducts thorough evaluation on public video datasets and state-of-the-art video object detection models
- Performance Advantages: Demonstrates superior attack success rates and computational efficiency compared to existing nuclear norm attack methods
Given a video frame sequence $\{x_b \mid 1 \le b \le B\}$, the objective is to find a single universal adversarial perturbation $\delta$ that, when added to every frame, causes the object detector $f$ to fail while the perturbation stays minimal and structured.
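To make "one perturbation for all frames" concrete, the following minimal NumPy sketch adds the same $\delta$ to every frame of a batch. The function name `apply_universal_perturbation`, the `(B, H, W, C)` layout, and the clipping to a `[0, 1]` pixel range are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np

def apply_universal_perturbation(frames, delta, lo=0.0, hi=1.0):
    """Add the same perturbation to every frame, then clip to the pixel range.

    frames: array of shape (B, H, W, C); delta: array of shape (H, W, C).
    The clipping range [lo, hi] is an assumption made for this sketch.
    """
    frames = np.asarray(frames, dtype=np.float64)
    delta = np.asarray(delta, dtype=np.float64)
    # Broadcasting over the leading axis applies delta to all B frames.
    return np.clip(frames + delta[None, ...], lo, hi)

# Toy data: 4 random "frames" and one small random perturbation.
rng = np.random.default_rng(0)
frames = rng.uniform(0.0, 1.0, size=(4, 6, 8, 3))
delta = 0.05 * rng.standard_normal((6, 8, 3))
adv_frames = apply_universal_perturbation(frames, delta)
```

Because $\delta$ does not depend on the frame index $b$, it is computed once and then reused for the whole sequence.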
The loss function is decomposed into foreground and background losses:
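$$\mathcal{L} = \mathcal{L}_{\mathrm{fg}} + \mathcal{L}_{\mathrm{bg}}$$
Where:
- Foreground Loss: $\mathcal{L}_{\mathrm{fg}} = \frac{1}{|\mathcal{F}|} \sum_{i \in \mathcal{F}} \mathrm{CE}(p_i, y_i)$
- Background Loss: $\mathcal{L}_{\mathrm{bg}} = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathrm{CE}(p_i, y_i)$
- Confidence Loss: $\mathcal{L}_{\mathrm{conf}} = \sum_{i \in [S]} \xi_i \cdot \mathbb{1}(\xi_i > \tau)$
Total loss is:
$$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{fg}} + \gamma \mathcal{L}_{\mathrm{conf}} + \beta \mathcal{L}_{\mathrm{bg}}$$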
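The weighted combination above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: predictions are taken to be per-proposal class probability lists, the foreground and background sets are given as index lists, and all function names and numeric values are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one predicted class distribution against an integer label."""
    return -math.log(max(probs[label], 1e-12))

def mean_ce(predictions, labels, indices):
    """Average cross-entropy over a set of proposal indices (0.0 for an empty set)."""
    if not indices:
        return 0.0
    return sum(cross_entropy(predictions[i], labels[i]) for i in indices) / len(indices)

def confidence_loss(scores, tau):
    """Sum of the confidence scores that exceed the threshold tau."""
    return sum(s for s in scores if s > tau)

def total_loss(predictions, labels, fg_idx, bg_idx, scores, tau, alpha, beta, gamma):
    """alpha * L_fg + gamma * L_conf + beta * L_bg."""
    l_fg = mean_ce(predictions, labels, fg_idx)
    l_bg = mean_ce(predictions, labels, bg_idx)
    l_conf = confidence_loss(scores, tau)
    return alpha * l_fg + gamma * l_conf + beta * l_bg

# Toy example: three proposals, two classes (all values are made up).
predictions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 0]
fg_idx, bg_idx = [1], [0, 2]
scores = [0.3, 0.95, 0.6]
loss = total_loss(predictions, labels, fg_idx, bg_idx, scores,
                  tau=0.5, alpha=1.0, beta=1.0, gamma=0.1)
```

Only the two scores above `tau` (0.95 and 0.6) contribute to the confidence term in this toy example.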
A combination of Frobenius norm and nuclear norm is employed:
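$$R(\delta) = \lambda_1 \|\delta\|_* + \lambda_2 \|\delta\|_F$$
The complete optimization problem for universal attacks applies the regularizer to each channel $\delta_c$, with the Frobenius term in squared form:
$$\min_{\delta \in \mathbb{R}^{H \times W \times C}} \; -\frac{1}{B} \sum_{b=1}^{B} \mathcal{L}_{\mathrm{total}}\big(f(x_b + \delta), f(x_b)\big) + \sum_{c=1}^{C} \Big(\lambda_1 \|\delta_c\|_* + \frac{\lambda_2}{2} \|\delta_c\|_F^2\Big)$$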
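The per-channel regularization term of this objective can be evaluated from the singular values of each channel. The NumPy sketch below is illustrative only: the function name `channel_regularizer` and the `(H, W, C)` layout are assumptions, and the code computes the penalty, not the attack itself.

```python
import numpy as np

def channel_regularizer(delta, lam1, lam2):
    """Sum over channels of lam1 * ||delta_c||_* + (lam2 / 2) * ||delta_c||_F^2."""
    total = 0.0
    for c in range(delta.shape[2]):
        singular_values = np.linalg.svd(delta[:, :, c], compute_uv=False)
        nuclear = singular_values.sum()            # nuclear norm: sum of singular values
        frob_sq = np.sum(singular_values ** 2)     # squared Frobenius norm
        total += lam1 * nuclear + 0.5 * lam2 * frob_sq
    return float(total)

# Rank-one toy perturbation: each channel has a single singular value of 5.
u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0, 0.0])
delta = np.stack([np.outer(u, v)] * 3, axis=2)     # shape (2, 3, 3)
value = channel_regularizer(delta, lam1=1.0, lam2=2.0)
```

For this toy input each channel contributes $1 \cdot 5 + \tfrac{2}{2} \cdot 25 = 30$, so the three channels sum to 90.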
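The adaptive optimistic exponentiated gradient (AO-Exp) method is employed. It maintains the decision variable of each channel $c$ through its singular value decomposition (SVD):
$$\delta_c^t = U_{c,t} \operatorname{diag}(z_c^t) V_{c,t}^\top$$
- Optimistic Update:
$$\eta_c^t \leftarrow \eta_c^{t-1} + \|\nabla G(\delta_c^t) - \nabla G(\delta_c^{t-1})\|_\infty^2 \, t^2$$
- Singular Value Update, where $W_0$ is the principal branch of the Lambert W function:
$$z_{c,i}^{t+1} = \frac{\eta_c^t}{\lambda_2} W_0\!\left(\frac{\lambda_2}{\eta_c^t} \exp\!\left(\frac{\lambda_2}{\eta_c^t} + \max\{\theta_{c,i}^t - \lambda_1, 0\}\right)\right) - 1$$
- Perturbation Reconstruction, a weighted average of the iterates restricted to the top $k_c$ singular values:
$$\delta_c^{t+1} = \frac{2}{t(t+1)} \sum_{s=1}^{t} s \cdot U_{c,t} \operatorname{diag}\big(z_{c,1:k_c}^{s}\big) V_{c,t}^\top$$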
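The singular value update and the weighted averaging in the steps above can be sketched in plain Python. This is a hedged illustration rather than the authors' code: `lambert_w0` is a small Newton solver written for this example (SciPy's `scipy.special.lambertw` would be a library alternative), the inputs `theta`, `eta`, `lam1`, and `lam2` are toy scalars, and all function names are hypothetical.

```python
import math

def lambert_w0(x, tol=1e-12, max_iter=100):
    """Principal branch W_0(x) for x >= 0, via Newton's method on w * exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting point at or above the root, since W_0(x) <= log(1 + x)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def singular_value_update(theta, eta, lam1, lam2):
    """z = (eta / lam2) * W_0((lam2 / eta) * exp(lam2 / eta + max(theta - lam1, 0))) - 1."""
    a = lam2 / eta
    arg = a * math.exp(a + max(theta - lam1, 0.0))
    return lambert_w0(arg) / a - 1.0

def weighted_average(iterates):
    """Average the iterates z^1..z^t with weights 2 s / (t (t + 1)), s = 1..t."""
    t = len(iterates)
    scale = 2.0 / (t * (t + 1))
    dim = len(iterates[0])
    return [scale * sum((s + 1) * z[i] for s, z in enumerate(iterates))
            for i in range(dim)]

# Toy values: one coordinate below the threshold lam1 and one above it.
z_small = singular_value_update(theta=0.5, eta=2.0, lam1=1.0, lam2=0.5)
z_large = singular_value_update(theta=3.0, eta=2.0, lam1=1.0, lam2=0.5)
z_avg = weighted_average([[0.0], [3.0]])
```

When $\theta \le \lambda_1$ the argument reduces to $a e^{a}$ with $a = \lambda_2 / \eta$, and since $W_0(a e^{a}) = a$ the update returns zero. This is how the nuclear norm weight $\lambda_1$ acts as a threshold that removes small singular values. Because the reconstruction formula uses the same $U_{c,t}$ and $V_{c,t}$ for every term of the sum, averaging the singular value vectors, as `weighted_average` does, is sufficient in this sketch.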
- Structured Background Perturbation: Nuclear norm regularization promotes low-rank structure concentrated in background regions
- Temporal Consistency: Universal perturbations ensure temporal consistency across frames
- Efficient Optimization: AO-Exp method achieves fast convergence under nuclear norm constraints
- Low-Rank Adaptation: Further information compression through top-k singular value selection
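Top-k singular value selection, as named in the last item above, can be illustrated with a truncated SVD of one channel. The NumPy sketch below uses an assumed function name (`truncate_rank`) and random toy data. It shows only the truncation step, not how the paper chooses $k$.

```python
import numpy as np

def truncate_rank(delta_c, k):
    """Keep only the k largest singular values of one perturbation channel."""
    u, s, vt = np.linalg.svd(delta_c, full_matrices=False)
    # Scale the first k left singular vectors by their singular values, then recombine.
    return (u[:, :k] * s[:k]) @ vt[:k, :]

rng = np.random.default_rng(0)
delta_c = rng.standard_normal((12, 16))
low_rank = truncate_rank(delta_c, k=3)

nuc_before = np.linalg.norm(delta_c, ord="nuc")
nuc_after = np.linalg.norm(low_rank, ord="nuc")
```

Dropping singular values can only lower the nuclear norm, so `nuc_after` is never larger than `nuc_before`.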
- PETS 2009 S2L1: 7 scenes, 768×576 resolution, average 795 frames/scene
- EPFL-RLC: 3 scenes, 1920×1080 resolution, average 5000 frames/scene
- CW4C: 15 scenes, 1920×880 resolution, average 7200 frames/scene
- IoU Accumulation (IoUacc): Evaluates attack impact on entire sequence
- Adversarial Bounding Box Ratio (advBR): Ratio of adversarial to clean bounding boxes
- Mean Absolute Perturbation (MAP): Measures perceptibility
- Nuclear Norm ∣∣δ∣∣∗: Evaluates perturbation structure
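Two of the metrics listed above are simple enough to sketch directly. The exact aggregation used in the paper is not stated here, so this code assumes that advBR is computed from box counts summed over a sequence and that MAP is the mean absolute value of all perturbation entries. The function names are hypothetical.

```python
import numpy as np

def adv_box_ratio(clean_counts, adv_counts):
    """Boxes detected on perturbed frames divided by boxes detected on clean frames."""
    clean_total = sum(clean_counts)
    if clean_total == 0:
        raise ValueError("no clean detections to compare against")
    return sum(adv_counts) / clean_total

def mean_absolute_perturbation(delta):
    """Mean of |delta| over all pixels and channels."""
    return float(np.mean(np.abs(delta)))

# Toy per-frame box counts and a tiny perturbation (all values are made up).
ratio = adv_box_ratio(clean_counts=[5, 4, 6], adv_counts=[1, 0, 2])
map_value = mean_absolute_perturbation(np.array([[1.0, -3.0], [0.0, 2.0]]))
```

A ratio near 0 means almost all boxes were suppressed, which is why lower advBR values indicate a stronger attack.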
- LoRa-PGD: Low-rank projected gradient descent attack
- FW-Nucl: Frank-Wolfe nuclear norm group attack
- AO-Exp Variants: Including low-rank adaptation version
- Iterations: 100 (AO-Exp and LoRa-PGD), 30 (FW-Nucl)
- Regularization parameters: λ₁ and λ₂ adjusted per dataset
- Target model: Mask R-CNN
| Dataset | Method | IoUacc (↓) | advBR (↓) | MAP (↓) | ‖δ‖∗ (↓) |
|---|---|---|---|---|---|
| PETS2009 | FW-Nucl | 4.77±1.09 | 1.04±0.25 | 1.2±0.3 | 36.5±5.84 |
| PETS2009 | LoRa-PGD-100 | 1.22±0.91 | 0.63±0.42 | 4.0±0.3 | 60.3±10.3 |
| PETS2009 | AO-Exp | 0.29±0.27 | 0.06±0.04 | 2.9±0.1 | 41.3±16.6 |
| EPFL-RLC | FW-Nucl | 4.83±0.96 | 0.86±0.14 | 5.4±2.0 | 37.54±1.53 |
| EPFL-RLC | LoRa-PGD-100 | 0.20±0.06 | 0.37±0.11 | 14.0±3.0 | 43.5±4.3 |
| EPFL-RLC | AO-Exp | 0.9±0.37 | 0.22±0.07 | 6.0±4.0 | 27.52±15.8 |
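- Attack Effectiveness: In the reported results, AO-Exp achieves the lowest advBR on both PETS2009 and EPFL-RLC and the lowest IoUacc on PETS2009; on EPFL-RLC, LoRa-PGD-100 reaches a lower IoUacc (0.20 vs. 0.9) but at more than twice the MAP (14.0 vs. 6.0)
- Imperceptibility: AO-Exp's MAP is lower than that of LoRa-PGD-100 on both datasets and higher than that of FW-Nucl, whose attack is far less effective (advBR of 1.04 and 0.86)
- Structured Degree: AO-Exp yields the lowest mean nuclear norm on EPFL-RLC (27.52) and the second lowest on PETS2009 (41.3 vs. 36.5 for FW-Nucl), although with a large standard deviation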
- Singular Value Count Impact: Analysis of different k values on advBR across camera viewpoints in EPFL dataset
- Low-Rank Adaptation Effect: AO-Exp (LoRa) variant significantly reduces nuclear norm while maintaining comparable performance
- ℓ₁ attacks produce flickering noise following moving objects
- Nuclear norm attacks generate more structured spatially coherent perturbations concentrated in background regions
- Image Classification Attacks: Relatively mature research with abundant methods
- Object Detection Attacks: Relatively scarce, particularly in video scenarios
- Universal Adversarial Perturbations: Input-agnostic, uniformly applied across inputs
- Manifold Hypothesis: High-dimensional data tends to lie near low-dimensional manifolds
- Dimensionality Reduction Methods: PCA, UMAP, autoencoders, etc.
- Adversarial Applications: Nuclear norm regularization applications in adversarial attacks
- Temporal Consistency: Considers temporal characteristics of video data
- Structured Design: Leverages nuclear norm to promote background structured perturbations
- Efficient Optimization: AO-Exp method improves computational efficiency
- Proposes a novel structured universal adversarial attack method for video object detection
- Nuclear norm regularization effectively promotes structured perturbations in background regions
- AO-Exp algorithm outperforms existing methods in both effectiveness and efficiency
- Method consistently suppresses bounding boxes across multiple datasets
- Static Camera Assumption: Current method assumes static camera settings, limiting applicability to dynamic camera scenarios
- Hyperparameter Sensitivity: Attack performance is sensitive to choices of nuclear norm weight and Frobenius regularization parameters
- Computational Complexity: Each iteration requires a singular value decomposition (SVD), which increases computational cost
- Dynamic Camera Extension: Extend to dynamic camera settings
- Object Tracking Applications: Extend method to object tracking tasks
- Adaptive Hyperparameters: Develop adaptive or learned hyperparameter strategies
- Defense Mechanisms: Explore countermeasures and defenses against structured temporal-consistent adversarial attacks
- Methodological Innovation: First systematic application of nuclear norm regularization to adversarial attacks on video object detection
- Solid Theoretical Foundation: Based on robust PCA and structured perturbation with solid theoretical grounding
- Comprehensive Experiments: Thorough evaluation across multiple datasets
- High Practical Value: Addresses important problems in safety-critical applications
- Open-Source Contribution: Code and data publicly available for reproducibility
- Application Scenario Limitations: Only applicable to static camera scenarios
- Insufficient Defense Consideration: Lacks evaluation against existing defense methods
- Physical World Verification: Absence of validation experiments in real physical environments
- Computational Cost Analysis: Insufficient analysis of the computational overhead of the per-iteration SVD
- Academic Contribution: Provides new perspectives for video adversarial attack research
- Security Awareness: Raises awareness of video detection system vulnerabilities
- Methodological Inspiration: Nuclear norm regularization may inspire other structured attack research
- Security Assessment: Robustness evaluation of industrial safety monitoring systems
- Research Tool: Benchmark method for adversarial robustness research
- Defense Development: Provides attack samples for developing targeted defense methods
The paper cites 41 relevant references covering multiple domains including adversarial attacks, object detection, and video analysis, providing solid theoretical foundation and comparison baselines.
Overall Assessment: This is a high-quality paper with significant contributions to the field of adversarial attacks on video object detection. The method demonstrates strong innovation, comprehensive experimental evaluation, and important practical significance for safety-critical applications. Despite some limitations, it provides valuable insights and future research directions for the field.