Open Vocabulary Multi-Label Video Classification

Gupta, Rizve, Unnikrishnan et al.

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.

Basic Information

Paper ID: 2407.09073
Title: Open Vocabulary Multi-Label Video Classification
Authors: Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi
Category: cs.CV
Publication Date: 13 Oct 2025 (arXiv:2407.09073v2 [cs.CV])
Paper Link: https://arxiv.org/abs/2407.09073

Abstract

Pre-trained Vision-Language Models (VLMs) have achieved significant progress in open vocabulary computer vision tasks, such as image classification, object detection, and image segmentation. Recent work has focused on extending VLMs to open vocabulary single-label action classification in videos. However, previous approaches fall short in comprehensive video understanding and cannot simultaneously recognize multiple actions and entities (e.g., objects) in an open vocabulary setting. This paper defines this problem as open vocabulary multi-label video classification and proposes a method to adapt pre-trained VLMs (such as CLIP) to address this task. We leverage Large Language Models (LLMs) to provide semantic guidance to VLMs regarding class labels, improving their open vocabulary performance through two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt LLMs to generate soft attributes for CLIP's text encoder, enabling it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's visual encoder to effectively model the spatiotemporal dynamics of video concepts, and propose a novel regularized fine-tuning technique that ensures robust open vocabulary classification performance in the video domain.

Research Background and Motivation

Problem Definition

Traditional video classification methods have the following limitations:

Vocabulary Constraints: Classical methods require prior knowledge of all possible classes, with models trained only on labeled datasets
High Annotation Cost: Manual annotation is labor-intensive, resulting in video datasets typically limited to specific domains (e.g., particular sports or simple activities)
Single Concept Recognition: Existing open vocabulary methods primarily focus on single-label classification and cannot simultaneously recognize multiple concepts in videos

Research Motivation

As video applications become widespread, video models need to recognize a broad range of concepts. The paper's core motivations are:

Leveraging the pre-training advantages of VLMs on large-scale image-text pairs
Combining the rich world knowledge of LLMs to enhance semantic understanding
Enabling simultaneous recognition of multiple video concepts (actions, objects, scenes, etc.) in an open vocabulary setting

Technical Challenges

Similarity Score Issues in Multi-Label Settings: VLM similarity scores have different ranges for different concept types (e.g., actions vs. objects)
Temporal Modeling: Image-language pre-trained models lack the ability to model temporal dynamics in videos
Open Vocabulary Performance Preservation: Fine-tuning on video data is prone to overfitting and loss of generalization ability

Core Contributions

End-to-End Trainable Label Encoder: Proposes a method for learning to prompt LLMs to generate soft attributes for VLM text encoders, enabling open vocabulary multi-label video classification
Temporal-Enhanced Visual Encoder: Integrates temporal modeling capabilities into pre-trained VLM image encoders while maintaining strong open vocabulary performance
New Benchmark Datasets: Defines open vocabulary multi-label video classification benchmarks on 5 datasets with comparisons against 6 strong baselines
Significant Performance Improvements: Substantially outperforms baseline methods on multiple benchmark datasets

Methodology Details

Task Definition

Input: Video sequence and a set of class labels from an open vocabulary
Output: Probability of each label's presence in the video
Constraint: The model must handle novel classes unseen during training at inference time

Model Architecture

Overall Framework

The model comprises three main stages:

Training Stage: Simultaneously train the label encoder and video encoder on closed-set training labels
Classifier Vocabulary Expansion Stage: Compute embeddings for novel class labels and save to a label embedding database
Inference Stage: Compute video features and match against the label embedding database
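
The vocabulary expansion and inference stages can be sketched as a lookup against stored label embeddings. This is a minimal illustration, not the paper's implementation: the `LabelEmbeddingDB` class, the toy 3-dimensional embeddings, and the temperature value 0.07 are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit length."""
    return x / (np.linalg.norm(x) + eps)

class LabelEmbeddingDB:
    """Hypothetical in-memory store: label name -> unit-norm embedding."""

    def __init__(self):
        self.names = []
        self.vectors = []

    def add(self, name, embedding):
        # Vocabulary expansion: a new label only needs its embedding stored.
        self.names.append(name)
        self.vectors.append(l2_normalize(np.asarray(embedding, dtype=np.float64)))

    def score(self, video_embedding, temperature=0.07):
        # Inference: similarity to every stored label, squashed by a sigmoid
        # so that each label gets an independent presence probability.
        v = l2_normalize(np.asarray(video_embedding, dtype=np.float64))
        sims = np.stack(self.vectors) @ v
        probs = 1.0 / (1.0 + np.exp(-sims / temperature))
        return dict(zip(self.names, probs))

db = LabelEmbeddingDB()
db.add("dog", [1.0, 0.0, 0.0])            # toy embeddings, not real CLIP outputs
db.add("skateboarding", [0.0, 1.0, 0.0])
db.add("piano", [0.0, 0.0, 1.0])

probs = db.score([0.7, 0.7, -0.1])         # toy video embedding
```

Because each label is scored independently, several labels can exceed a threshold for the same video, which is what distinguishes the multi-label setting from a softmax over classes.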

1. LLM Semantic-Enhanced Label Embedding

Fixed LLM Prompting Method:

Design prompt templates that ask an LLM for features useful in visually distinguishing each class
Parse the LLM output into a list of attributes and feed each attribute, together with the class name, to CLIP's text encoder
Mean-pool the resulting text embeddings into a single attribute-enhanced label embedding
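
The fixed prompting steps above can be sketched as follows. This is an illustration under stated assumptions: `toy_text_encoder` is a deterministic stand-in for CLIP's text encoder, the prompt template `"{class}, which {attribute}"` is hypothetical, and the attribute strings stand in for parsed LLM output.

```python
import zlib
import numpy as np

def toy_text_encoder(text, dim=16):
    """Stand-in for CLIP's text encoder: one pseudo-random vector per string."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.standard_normal(dim)

def attribute_enhanced_embedding(class_name, attributes, encoder=toy_text_encoder):
    """Encode one prompt per attribute, normalize each, then mean-pool."""
    prompts = [f"{class_name}, which {attr}" for attr in attributes]  # hypothetical template
    embs = np.stack([encoder(p) for p in prompts])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return embs.mean(axis=0)

# Attributes as they might be parsed from an LLM answer (illustrative only).
zebra_attributes = ["has black and white stripes", "has four legs", "has a mane"]
emb = attribute_enhanced_embedding("zebra", zebra_attributes)
```

Nothing in this pipeline is trainable: the template and the parsed attributes are fixed text, which is the limitation the learnable variant addresses.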

End-to-End Learnable LLM Prompting: To address the non-trainability of fixed prompting, the following architecture is proposed:

Learnable Prefix: N d-dimensional learnable vectors as prefixes for LLM prompts
Prompt Transformer: Maps LLM output semantic space to CLIP input semantic space
Soft Attribute Generation: Run K·L decoding iterations for each prefix, generating K subsequences of L tokens each as soft attributes

Mathematical Representation:

Input sequence: I ∈ R^(M×d)
Prefix P_i concatenated with prompt template: [P_i; I] ∈ R^((1+M)×d)
Final label embedding: f_t(ℓ) = MeanPool(Normalize(CLIP_text([soft_prompt; tokenize(ℓ)])))
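
A shape check can make the notation concrete. The sizes below are toy values, not the paper's settings, and `fake_outputs` is a random stand-in for the CLIP text encoder outputs; the assumption that one output row is pooled per soft prompt is also made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 4, 6, 8                      # toy sizes: prefixes, template tokens, width

P = rng.standard_normal((N, d))        # learnable prefixes P_1 ... P_N
I = rng.standard_normal((M, d))        # embedded prompt template

# [P_i; I]: prepend prefix i to the template, giving a (1+M, d) LLM input.
llm_inputs = [np.concatenate([P[i][None, :], I], axis=0) for i in range(N)]

def label_embedding(clip_text_outputs):
    """MeanPool(Normalize(.)): unit-normalize each row, then average the rows."""
    unit = clip_text_outputs / np.linalg.norm(clip_text_outputs, axis=1, keepdims=True)
    return unit.mean(axis=0)

# Random stand-in for the text encoder outputs (one row per soft prompt).
fake_outputs = rng.standard_normal((N, d))
f_t = label_embedding(fake_outputs)
```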

2. Regularized Parallel Temporal Modeling

Temporal Modeling Branch:

Add parallel temporal modeling branches to the last T layers of CLIP's visual encoder
Freeze the original CLIP visual branch and train only the newly added temporal branch
Each temporal block contains:
- Spatial attention layer initialized from CLIP weights
- Temporal attention layer with random initialization

Weight Regularization Strategy: To preserve zero-shot performance, apply random weight regularization to spatial attention layers:

θ = αθ_ft + (1-α)θ_frozen, where α ~ U(0, λ)
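
The interpolation above can be sketched in a few lines. This is a minimal sketch, not the authors' code: parameters are represented as a plain dict of arrays, and drawing α once per call is an assumption, since the summary does not say how often α is resampled.

```python
import numpy as np

def regularized_weights(theta_ft, theta_frozen, lam, rng):
    """Return alpha * theta_ft + (1 - alpha) * theta_frozen with alpha ~ U(0, lam)."""
    alpha = rng.uniform(0.0, lam)
    mixed = {
        name: alpha * theta_ft[name] + (1.0 - alpha) * theta_frozen[name]
        for name in theta_ft
    }
    return mixed, alpha

rng = np.random.default_rng(0)
theta_frozen = {"attn.weight": np.zeros((2, 2))}   # stand-in for CLIP weights
theta_ft = {"attn.weight": np.ones((2, 2))}        # stand-in for fine-tuned weights
mixed, alpha = regularized_weights(theta_ft, theta_frozen, lam=0.5, rng=rng)
```

With λ < 1 the effective weights never move all the way to the fine-tuned values, so the spatial attention layers stay close to the frozen CLIP weights.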

Video Embedding Generation: Generate overall video embeddings through mean pooling of final temporal tokens (TMP) and CLS tokens from each frame.
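
One possible reading of this pooling step is sketched below. It assumes one TMP token and one CLS token per frame, equal weighting of both token types, and a final L2 normalization; none of these details is confirmed by the summary.

```python
import numpy as np

def video_embedding(tmp_tokens, cls_tokens):
    """Mean-pool per-frame TMP and CLS tokens, each of shape (T, d), into one vector."""
    tokens = np.concatenate([tmp_tokens, cls_tokens], axis=0)   # (2T, d)
    emb = tokens.mean(axis=0)
    return emb / np.linalg.norm(emb)   # unit norm is an assumption, for cosine similarity

rng = np.random.default_rng(0)
T, d = 8, 4                             # toy sizes: frames, embedding width
tmp = rng.standard_normal((T, d))
cls = rng.standard_normal((T, d))
f_v = video_embedding(tmp, cls)
```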

Training Objective

Employ weighted binary cross-entropy loss:

L(B) = -∑_{v∈B} [∑_{ℓ∈P(v)} log p(ℓ,v) + w∑_{ℓ∈N(v)} log(1-p(ℓ,v))]

Where:

p(ℓ,v) = σ(s(ℓ,v)/τ)
s(ℓ,v) = (f_t(ℓ))^T f_v(v)
τ is the temperature parameter, w is a weight hyperparameter
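
The loss can be written directly from these definitions. In this sketch, N(v) is assumed to be every label not in P(v), and the default values of `tau` and `w` are placeholders rather than the paper's settings.

```python
import numpy as np

def weighted_bce(label_emb, video_emb, targets, tau=0.07, w=1.0, eps=1e-12):
    """Weighted binary cross-entropy over a batch.

    label_emb: (C, d) label embeddings f_t
    video_emb: (B, d) video embeddings f_v
    targets:   (B, C) with 1 for labels in P(v) and 0 for labels in N(v)
    """
    s = video_emb @ label_emb.T               # s(l, v) = f_t(l)^T f_v(v)
    p = 1.0 / (1.0 + np.exp(-s / tau))        # p(l, v) = sigmoid(s / tau)
    pos = targets * np.log(p + eps)           # positive-label term
    neg = w * (1 - targets) * np.log(1 - p + eps)   # down- or up-weighted negatives
    return float(-(pos + neg).sum())

label_emb = np.eye(2)                         # two orthogonal toy labels
video_emb = np.array([[1.0, 0.0]])            # video aligned with label 0
loss_correct = weighted_bce(label_emb, video_emb, np.array([[1, 0]]))
loss_wrong = weighted_bce(label_emb, video_emb, np.array([[0, 1]]))
```

In the toy case the negative label has similarity 0, so it contributes w·log 2 to the loss; lowering w reduces the influence of the many negatives that each video has in a large vocabulary.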

Experimental Setup

Datasets

Training Datasets:

YouTube-8M: Labels are primarily entities; 2429 classes are retained after removing gaming titles
Kinetics-400: High-quality manually verified action labels, 400 classes

Evaluation Datasets:

TAO (Tracking Any Object): Open vocabulary dataset focused on objects
ActivityNet: Dataset focused on actions
RareAct: Dataset containing objects, actions, and their uncommon combinations

Evaluation Metrics

AUPR (Area Under Precision-Recall curve): Summarizes classification performance across the entire precision-recall tradeoff
Peak F1-Score: F1 score achieved at the optimal threshold
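
Both metrics can be computed from a ranked list of scores. The sketch below uses the step-wise average-precision estimate of AUPR and ignores score ties; whether the paper uses this estimator or another (such as trapezoidal integration) is not stated in this summary.

```python
import numpy as np

def precision_recall(scores, labels):
    """Precision and recall after each item, ranked by descending score."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    order = np.argsort(-scores, kind="stable")
    hits = np.cumsum(labels[order])
    ranks = np.arange(1, len(labels) + 1)
    return hits / ranks, hits / labels.sum()

def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve."""
    precision, recall = precision_recall(scores, labels)
    recall_prev = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - recall_prev) * precision))

def peak_f1(scores, labels):
    """Best F1 over all thresholds induced by the ranking."""
    precision, recall = precision_recall(scores, labels)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(f1.max())

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)    # 0.5 * 1.0 + 0.5 * (2/3) = 5/6
pf1 = peak_f1(scores, labels)             # best at the third item: P = 2/3, R = 1
```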

Comparison Methods

CoOp: Lightweight adaptation method for learning CLIP text encoder prompts
DualCoOp: Multi-label extension of CoOp, learning both positive and negative prompts
LLM + CLIP (Frozen): Fixed LLM prompting baseline
ViFi-CLIP: Fine-tuning CLIP image and text encoders on training datasets

Experimental Results

Main Results

AUPR Performance Comparison:

Method	YouTube-8M	Kinetics	TAO	ActivityNet	RareAct
CLIP (Class Name Prompt)	6.3	26.2	43.8	44.2	9.5
Fixed LLM Prompt	6.9	30.6	50.2	46.8	11.5
DualCoOp	8.3	23.9	47.1	33.0	7.6
Proposed Method	16.7	43.2	65.5	50.2	13.2

Peak F1 Performance Comparison:

Method	YouTube-8M	Kinetics	TAO	ActivityNet	RareAct
CLIP (Class Name Prompt)	14.9	34.2	44.6	47.1	17.6
Fixed LLM Prompt	21.6	37.3	50.2	51.4	19.8
DualCoOp	16.2	33.2	49.0	40.5	15.0
Proposed Method	32.7	46.6	56.6	53.8	25.1
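
The absolute gains of the proposed method over the CLIP class-name baseline follow directly from the two tables. The snippet below only restates the table values and subtracts them.

```python
datasets = ["YouTube-8M", "Kinetics", "TAO", "ActivityNet", "RareAct"]

# Values copied from the AUPR and Peak F1 tables above.
aupr_clip = [6.3, 26.2, 43.8, 44.2, 9.5]
aupr_ours = [16.7, 43.2, 65.5, 50.2, 13.2]
f1_clip = [14.9, 34.2, 44.6, 47.1, 17.6]
f1_ours = [32.7, 46.6, 56.6, 53.8, 25.1]

def gains(baseline, proposed):
    """Absolute improvement in points, rounded to one decimal."""
    return {d: round(p - b, 1) for d, b, p in zip(datasets, baseline, proposed)}

aupr_gain = gains(aupr_clip, aupr_ours)
f1_gain = gains(f1_clip, f1_ours)

largest_aupr = max(aupr_gain, key=aupr_gain.get)   # dataset with the largest AUPR gain
largest_f1 = max(f1_gain, key=f1_gain.get)         # dataset with the largest Peak F1 gain
```

The largest AUPR gain is on TAO (21.7 points) and the largest Peak F1 gain is on YouTube-8M (17.8 points); the smallest AUPR gain is on RareAct (3.7 points).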

Ablation Studies

Temporal Modeling Component Analysis:

Number of temporal blocks: 4 blocks achieve optimal performance
Weight regularization: Substantially reduces overfitting and preserves open vocabulary performance
Freezing CLIP backbone: Avoids severe overfitting

Label Encoder Component Analysis:

Combination of LLM + learnable prompts + prompt transformer achieves best performance
Removing CLIP text encoder results in significant performance degradation
Learnable prompts outperform fixed prompts

Score Calibration Analysis

The proposed method achieves better score calibration across different concept types, enabling a single threshold to achieve good performance across multiple concepts, which is crucial for practical applications.

Related Work

Vision-Language Representation Learning

Success of large-scale image-language models such as CLIP
Video-language pre-training typically adapts pre-trained image-language models

Open Vocabulary Classification

Regularized fine-tuning and prompt learning are primary approaches
Existing work primarily focuses on single-label tasks or image recognition

LLM Applications in Vision

LLMs used for generating class descriptors to improve classification
Multimodal models align visual representations with LLM input spaces

Conclusions and Discussion

Main Conclusions

Proposes the first method for open vocabulary multi-label video classification
End-to-end trainable LLM-guided architecture significantly improves performance
Temporal modeling and regularization techniques successfully balance fine-tuning performance with open vocabulary capability

Limitations

Depends on the quality of pre-trained VLMs and LLMs
Concept coverage of training datasets remains limited
Computational overhead increases compared to baseline CLIP models

Future Directions

Explore more efficient temporal modeling architectures
Investigate better LLM-VLM alignment methods
Extend to more video understanding tasks

In-Depth Evaluation

Strengths

Innovative Problem Definition: First systematic definition and solution of open vocabulary multi-label video classification
Complete Technical Solution: Addresses both label encoding and video temporal modeling challenges
Comprehensive Experiments: Thorough evaluation on multiple datasets with detailed ablation studies
High Practical Value: Method demonstrates good scalability and supports dynamic addition of novel classes at inference time

Weaknesses

Computational Complexity: Increases computational overhead compared to baseline methods
Data Dependency: Performance still depends on the quality and diversity of training data
Generalization Ability: Performance on extreme out-of-distribution data requires further verification

Impact

Academic Contribution: Provides new research directions and benchmarks for video understanding
Practical Value: Offers feasible technical solutions for real-world video applications
Reproducibility: Describes the implementation and experimental settings in detail

Applicable Scenarios

Video content analysis and annotation
Video retrieval and recommendation systems
Multi-object recognition in security surveillance
Automatic classification of educational videos

References

The paper cites 68 relevant references covering important works in vision-language learning, open vocabulary classification, large language model applications, and other related fields, providing a solid theoretical foundation for this research.