Open Vocabulary Multi-Label Video Classification

Gupta, Rizve, Unnikrishnan et al.
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.

Basic Information

  • Paper ID: 2407.09073
  • Title: Open Vocabulary Multi-Label Video Classification
  • Authors: Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi
  • Category: cs.CV
  • Publication Date: 13 Oct 2025 (arXiv:2407.09073v2 [cs.CV])
  • Paper Link: https://arxiv.org/abs/2407.09073

Abstract

Pre-trained Vision-Language Models (VLMs) have enabled significant progress in open vocabulary computer vision tasks, such as image classification, object detection, and image segmentation. Recent work has focused on extending VLMs to open vocabulary single-label action classification in videos. However, previous approaches fall short of holistic video understanding, which requires simultaneously recognizing multiple actions and entities (e.g., objects) in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt pre-trained VLMs (such as CLIP) to address this task. We leverage Large Language Models (LLMs) to provide semantic guidance to VLMs regarding class labels, improving their open vocabulary performance through two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt LLMs to generate soft attributes for CLIP's text encoder, enabling it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's visual encoder to effectively model the spatiotemporal dynamics of video concepts, and propose a novel regularized fine-tuning technique that ensures robust open vocabulary classification performance in the video domain.

Research Background and Motivation

Problem Definition

Traditional video classification methods have the following limitations:

  1. Vocabulary Constraints: Classical methods require prior knowledge of all possible classes, with models trained only on labeled datasets
  2. High Annotation Cost: Manual annotation is labor-intensive, resulting in video datasets typically limited to specific domains (e.g., particular sports or simple activities)
  3. Single Concept Recognition: Existing open vocabulary methods primarily focus on single-label classification and cannot simultaneously recognize multiple concepts in videos

Research Motivation

With the widespread adoption of video applications, there is a need to develop video models capable of recognizing a broad range of concepts. The core motivation of this paper is:

  1. Leveraging the pre-training advantages of VLMs on large-scale image-text pairs
  2. Combining the rich world knowledge of LLMs to enhance semantic understanding
  3. Enabling simultaneous recognition of multiple video concepts (actions, objects, scenes, etc.) in an open vocabulary setting

Technical Challenges

  1. Similarity Score Issues in Multi-Label Settings: VLM similarity scores have different ranges for different concept types (e.g., actions vs. objects)
  2. Temporal Modeling: Image-language pre-trained models lack the ability to model temporal dynamics in videos
  3. Open Vocabulary Performance Preservation: Fine-tuning on video data is prone to overfitting and loss of generalization ability

Core Contributions

  1. End-to-End Trainable Label Encoder: Proposes a method for learning to prompt LLMs to generate soft attributes for VLM text encoders, enabling open vocabulary multi-label video classification
  2. Temporal-Enhanced Visual Encoder: Integrates temporal modeling capabilities into pre-trained VLM image encoders while maintaining strong open vocabulary performance
  3. New Benchmark Datasets: Defines open vocabulary multi-label video classification benchmarks on 5 datasets with comparisons against 6 strong baselines
  4. Significant Performance Improvements: Substantially outperforms baseline methods on multiple benchmark datasets

Methodology Details

Task Definition

  • Input: A video sequence and a set of class labels from an open vocabulary
  • Output: The probability that each label is present in the video
  • Constraint: At inference time, the model must handle novel classes unseen during training

Model Architecture

Overall Framework

The model comprises three main stages:

  1. Training Stage: Simultaneously train the label encoder and video encoder on closed-set training labels
  2. Classifier Vocabulary Expansion Stage: Compute embeddings for novel class labels and save to a label embedding database
  3. Inference Stage: Compute video features and match against the label embedding database
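
The three stages can be illustrated with a minimal Python sketch. The class `LabelEmbeddingDatabase`, the `encode_label` callback, and the toy three-dimensional vectors are hypothetical stand-ins for the trained label and video encoders; only the sigmoid-over-temperature scoring follows the p(ℓ,v) definition in the Training Objective section.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class LabelEmbeddingDatabase:
    """Caches one unit-norm embedding per class label (hypothetical helper)."""

    def __init__(self, encode_label):
        # encode_label stands in for the trained label encoder.
        self.encode_label = encode_label
        self.embeddings = {}

    def add_labels(self, labels):
        # Vocabulary expansion: embed each label once and store it.
        for label in labels:
            self.embeddings[label] = normalize(self.encode_label(label))

    def score(self, video_embedding, temperature=0.1):
        # Inference: one independent sigmoid probability per stored label.
        v = normalize(video_embedding)
        return {
            label: 1.0 / (1.0 + math.exp(-dot(emb, v) / temperature))
            for label, emb in self.embeddings.items()
        }


# Toy label vectors; a real system would call the trained label encoder.
toy_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "running": [0.0, 1.0, 0.0],
    "piano": [0.0, 0.0, 1.0],
}

db = LabelEmbeddingDatabase(toy_vectors.__getitem__)
db.add_labels(["dog", "running"])   # labels available at training time
db.add_labels(["piano"])            # novel label added later, no retraining
probs = db.score([0.9, 0.8, -0.5])  # stand-in for a video encoder output
```

Because each label gets its own sigmoid rather than a shared softmax, several labels can be predicted as present in the same video, and new labels only require an extra `add_labels` call.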

1. LLM Semantic-Enhanced Label Embedding

Fixed LLM Prompting Method:

  • Design prompt templates that ask the LLM for attributes useful for visually distinguishing each class
  • Parse the LLM output into an attribute list and pass each attribute, together with the class name, as a prompt to CLIP's text encoder
  • Mean-pool the resulting text embeddings into a single attribute-enhanced label embedding
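
A rough sketch of the fixed prompting baseline follows. The prompt wording in `build_prompts`, the attribute strings, and `toy_encoder` are all assumptions made for illustration; a real pipeline would use an actual LLM response and CLIP's text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_prompts(class_name, attributes):
    # Assumed prompt wording; the exact template is not given in this summary.
    return [f"{class_name}, which {attr}" for attr in attributes]


def attribute_enhanced_embedding(class_name, attributes, encode_text):
    """Normalize each prompt embedding, then mean-pool over the attributes."""
    prompts = build_prompts(class_name, attributes)
    embeddings = [normalize(encode_text(p)) for p in prompts]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def toy_encoder(text):
    # Placeholder for CLIP's text encoder: three crude character statistics.
    return [
        float(len(text)),
        float(text.count("a") + 1),
        float(text.count(" ") + 1),
    ]


# Attributes as they might be parsed from an LLM response (made up here).
attributes = ["has four legs", "has fur", "wags its tail"]
prompts = build_prompts("dog", attributes)
label_embedding = attribute_enhanced_embedding("dog", attributes, toy_encoder)
```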

End-to-End Learnable LLM Prompting: To address the non-trainability of fixed prompting, the following architecture is proposed:

  • Learnable Prefix: N d-dimensional learnable vectors as prefixes for LLM prompts
  • Prompt Transformer: Maps LLM output semantic space to CLIP input semantic space
  • Soft Attribute Generation: Run K·L decoding iterations for each prefix, producing K subsequences of L tokens each, which serve as soft attributes

Mathematical Representation:

Input sequence: I ∈ R^(M×d)
Prefix P_i concatenated with prompt template: [P_i; I] ∈ R^((1+M)×d)
Final label embedding: ft(ℓ) = MeanPool(Normalize(CLIP_text([soft_prompt; tokenize(ℓ)])))
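
The shape bookkeeping behind the soft attributes can be sketched as follows, reading the description as K·L decoding steps per prefix that are split into K chunks of L tokens. `toy_decode_step` is a hypothetical placeholder for one LLM decoding iteration, and the prompt transformer and CLIP text encoder are left out; this is an illustration of the tensor shapes, not the authors' implementation.

```python
import random


def generate_soft_attributes(prefixes, template, decode_step, K, L):
    """For each prefix, decode K*L vectors and split them into K chunks of L."""
    soft_attributes = []
    for prefix in prefixes:
        sequence = [prefix] + template  # [P_i; I], shape (1 + M) x d
        generated = []
        for _ in range(K * L):
            # Each step conditions on the prompt plus everything decoded so far.
            generated.append(decode_step(sequence + generated))
        for k in range(K):
            soft_attributes.append(generated[k * L:(k + 1) * L])
    return soft_attributes


def toy_decode_step(sequence):
    # Placeholder for one LLM decoding iteration: average of the inputs so far.
    d = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(d)]


rng = random.Random(0)
N, M, d, K, L = 2, 3, 4, 2, 5
# Prefixes would be learnable parameters in practice; random values here.
prefixes = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
# Embedded prompt template I with M tokens of dimension d.
template = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]

soft_attributes = generate_soft_attributes(prefixes, template, toy_decode_step, K, L)
```

With N prefixes this yields N·K soft attributes, each an L×d block that would then be mapped into CLIP's input space and combined with the tokenized class name.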

2. Regularized Parallel Temporal Modeling

Temporal Modeling Branch:

  • Add parallel temporal modeling branches to the last T layers of CLIP's visual encoder
  • Freeze CLIP visual branches, train only newly added temporal layers
  • Each temporal block contains:
    • Spatial attention layer initialized from CLIP weights
    • Temporal attention layer with random initialization

Weight Regularization Strategy: To preserve zero-shot performance, the fine-tuned spatial attention weights θ_ft are interpolated with the frozen CLIP weights θ_frozen using a randomly sampled coefficient α:

θ = αθ_ft + (1-α)θ_frozen, where α ~ U(0, λ)
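
The interpolation can be written in a few lines of Python. This is a sketch on flat lists of toy values; `interpolate_weights` is a hypothetical helper, and how often α is resampled during training is not specified in this summary.

```python
import random


def interpolate_weights(theta_ft, theta_frozen, lam, rng=random):
    """Return alpha * theta_ft + (1 - alpha) * theta_frozen, alpha ~ U(0, lam)."""
    alpha = rng.uniform(0.0, lam)
    mixed = [
        alpha * ft + (1.0 - alpha) * fr
        for ft, fr in zip(theta_ft, theta_frozen)
    ]
    return mixed, alpha


theta_frozen = [0.10, -0.20, 0.30]  # original CLIP weights (toy values)
theta_ft = [0.50, 0.20, -0.10]      # fine-tuned copy of the same weights (toy values)

mixed, alpha = interpolate_weights(theta_ft, theta_frozen, lam=0.5, rng=random.Random(0))
unchanged, _ = interpolate_weights(theta_ft, theta_frozen, lam=0.0)
```

Setting λ = 0 always returns the frozen weights, while larger λ lets the effective weights move further toward the fine-tuned ones, so λ bounds how far the spatial attention layers can drift from CLIP.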

Video Embedding Generation: Generate overall video embeddings through mean pooling of final temporal tokens (TMP) and CLS tokens from each frame.
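
As a small illustration of this pooling step, the sketch below averages per-frame CLS tokens and final temporal tokens into one vector. Treating both token types with equal weight is an assumption of this sketch, and the two-dimensional tokens are toy values.

```python
def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def video_embedding(cls_tokens, tmp_tokens):
    """Average the per-frame CLS tokens and final temporal (TMP) tokens."""
    # Equal weighting of the two token types is an assumption of this sketch.
    return mean_pool(cls_tokens + tmp_tokens)


# Two frames with two-dimensional toy tokens.
cls_tokens = [[1.0, 0.0], [0.0, 1.0]]  # one CLS token per frame
tmp_tokens = [[1.0, 1.0], [0.0, 0.0]]  # one final temporal token per frame
embedding = video_embedding(cls_tokens, tmp_tokens)
```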

Training Objective

Employ weighted binary cross-entropy loss:

L(B) = -∑_{v∈B} [∑_{ℓ∈P(v)} log p(ℓ,v) + w∑_{ℓ∈N(v)} log(1-p(ℓ,v))]

Where:

  • p(ℓ,v) = σ(s(ℓ,v)/τ)
  • s(ℓ,v) = (ft(ℓ))^T fv(v)
  • τ is the temperature parameter, w is a weight hyperparameter
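
The loss can be computed directly from these definitions. In the sketch below, each video is represented by a dictionary of similarity scores s(ℓ,v) and its positive label set P(v); treating every other scored label as a negative in N(v) is an assumption, as are the toy scores and the values of τ and w.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_bce(batch, tau, w):
    """batch: list of (scores, positives); scores maps label -> s(l, v)."""
    total = 0.0
    for scores, positives in batch:
        for label, s in scores.items():
            p = sigmoid(s / tau)
            if label in positives:   # label in P(v)
                total += math.log(p)
            else:                    # label in N(v), weighted by w
                total += w * math.log(1.0 - p)
    return -total


# One video whose only positive label is "dog".
good = [({"dog": 0.8, "piano": -0.6}, {"dog"})]   # scores agree with the labels
bad = [({"dog": -0.6, "piano": 0.8}, {"dog"})]    # scores contradict the labels

loss_good = weighted_bce(good, tau=0.1, w=0.5)
loss_bad = weighted_bce(bad, tau=0.1, w=0.5)
```

Negatives usually far outnumber positives in multi-label data, so w controls how strongly the many negative terms contribute relative to the positive ones.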

Experimental Setup

Datasets

Training Datasets:

  • YouTube-8M: Labels are primarily entities; 2429 classes are retained after removing gaming titles
  • Kinetics-400: High-quality manually verified action labels, 400 classes

Evaluation Datasets:

  • TAO (Tracking Any Object): Open vocabulary dataset focused on objects
  • ActivityNet: Dataset focused on actions
  • RareAct: Dataset containing objects, actions, and their uncommon combinations

Evaluation Metrics

  • AUPR (Area Under Precision-Recall curve): Summarizes classification performance across the entire precision-recall tradeoff
  • Peak F1-Score: F1 score achieved at the optimal threshold
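
Both metrics can be computed from a ranked list of scores. The sketch below uses average precision as a step-wise estimate of AUPR, which is a common choice but an assumption here; how the paper aggregates over classes and videos is not stated in this summary. It assumes distinct scores and at least one positive.

```python
def average_precision(scores, labels):
    """Average precision, a common step-wise estimate of AUPR."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at each positive
    return ap / total_pos


def peak_f1(scores, labels):
    """Best F1 over all cut-offs that keep the top-k scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    best, hits = 0.0, 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label
        if hits:
            precision, recall = hits / k, hits / total_pos
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


scores = [0.9, 0.8, 0.7, 0.6]  # predicted probabilities for four (video, label) pairs
labels = [1, 0, 1, 0]          # ground-truth presence
ap = average_precision(scores, labels)
f1 = peak_f1(scores, labels)
```

Peak F1 reports the score at the single best threshold, which is why it is sensitive to whether scores for different concept types fall in comparable ranges.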

Comparison Methods

  1. CoOp: Lightweight adaptation method for learning CLIP text encoder prompts
  2. DualCoOp: Multi-label extension of CoOp, learning both positive and negative prompts
  3. LLM + CLIP (Frozen): Fixed LLM prompting baseline
  4. ViFi-CLIP: Fine-tuning CLIP image and text encoders on training datasets

Experimental Results

Main Results

AUPR Performance Comparison:

Method                   | YouTube-8M | Kinetics | TAO  | ActivityNet | RareAct
CLIP (Class Name Prompt) | 6.3        | 26.2     | 43.8 | 44.2        | 9.5
Fixed LLM Prompt         | 6.9        | 30.6     | 50.2 | 46.8        | 11.5
DualCoOp                 | 8.3        | 23.9     | 47.1 | 33.0        | 7.6
Proposed Method          | 16.7       | 43.2     | 65.5 | 50.2        | 13.2

Peak F1 Performance Comparison:

Method                   | YouTube-8M | Kinetics | TAO  | ActivityNet | RareAct
CLIP (Class Name Prompt) | 14.9       | 34.2     | 44.6 | 47.1        | 17.6
Fixed LLM Prompt         | 21.6       | 37.3     | 50.2 | 51.4        | 19.8
DualCoOp                 | 16.2       | 33.2     | 49.0 | 40.5        | 15.0
Proposed Method          | 32.7       | 46.6     | 56.6 | 53.8        | 25.1

Ablation Studies

Temporal Modeling Component Analysis:

  • Number of temporal blocks: 4 blocks achieve optimal performance
  • Weight regularization: Substantially reduces overfitting and preserves open vocabulary performance
  • Freezing CLIP backbone: Avoids severe overfitting

Label Encoder Component Analysis:

  • Combination of LLM + learnable prompts + prompt transformer achieves best performance
  • Removing CLIP text encoder results in significant performance degradation
  • Learnable prompts outperform fixed prompts

Score Calibration Analysis

The proposed method achieves better score calibration across different concept types, enabling a single threshold to achieve good performance across multiple concepts, which is crucial for practical applications.

Related Work

Vision-Language Representation Learning

  • Success of large-scale image-language models such as CLIP
  • Video-language pre-training typically adapts pre-trained image-language models

Open Vocabulary Classification

  • Regularized fine-tuning and prompt learning are primary approaches
  • Existing work primarily focuses on single-label tasks or image recognition

LLM Applications in Vision

  • LLMs used for generating class descriptors to improve classification
  • Multimodal models align visual representations with LLM input spaces

Conclusions and Discussion

Main Conclusions

  1. Proposes the first method for open vocabulary multi-label video classification
  2. End-to-end trainable LLM-guided architecture significantly improves performance
  3. Temporal modeling and regularization techniques successfully balance fine-tuning performance with open vocabulary capability

Limitations

  1. Depends on the quality of pre-trained VLMs and LLMs
  2. Concept coverage of training datasets remains limited
  3. Computational overhead increases compared to baseline CLIP models

Future Directions

  1. Explore more efficient temporal modeling architectures
  2. Investigate better LLM-VLM alignment methods
  3. Extend to more video understanding tasks

In-Depth Evaluation

Strengths

  1. Innovative Problem Definition: First systematic definition and solution of open vocabulary multi-label video classification
  2. Complete Technical Solution: Addresses both label encoding and video temporal modeling challenges
  3. Comprehensive Experiments: Thorough evaluation on multiple datasets with detailed ablation studies
  4. High Practical Value: Method demonstrates good scalability and supports dynamic addition of novel classes at inference time

Weaknesses

  1. Computational Complexity: Increases computational overhead compared to baseline methods
  2. Data Dependency: Performance still depends on the quality and diversity of training data
  3. Generalization Ability: Performance on extreme out-of-distribution data requires further verification

Impact

  1. Academic Contribution: Provides new research directions and benchmarks for video understanding
  2. Practical Value: Offers feasible technical solutions for real-world video applications
  3. Reproducibility: Provides detailed implementation details and experimental settings

Applicable Scenarios

  • Video content analysis and annotation
  • Video retrieval and recommendation systems
  • Multi-object recognition in security surveillance
  • Automatic classification of educational videos

References

The paper cites 68 relevant references covering important works in vision-language learning, open vocabulary classification, large language model applications, and other related fields, providing a solid theoretical foundation for this research.