Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection, and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single-label action classification in videos. However, previous methods fall short in holistic video understanding, which requires the ability to simultaneously recognize multiple actions and entities (e.g., objects) in the video in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance, with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts, and propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
Pre-trained Vision-Language Models (VLMs) have achieved significant progress in open vocabulary computer vision tasks, such as image classification, object detection, and image segmentation. Recent work has focused on extending VLMs to open vocabulary single-label action classification in videos. However, previous approaches fall short in comprehensive video understanding and cannot simultaneously recognize multiple actions and entities (e.g., objects) in an open vocabulary setting. This paper defines this problem as open vocabulary multi-label video classification and proposes a method to adapt pre-trained VLMs (such as CLIP) to address this task. We leverage Large Language Models (LLMs) to provide semantic guidance to VLMs regarding class labels, improving their open vocabulary performance through two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt LLMs to generate soft attributes for CLIP's text encoder, enabling it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's visual encoder to effectively model the spatiotemporal dynamics of video concepts, and propose a novel regularized fine-tuning technique that ensures robust open vocabulary classification performance in the video domain.
Traditional video classification methods have the following limitations:
Vocabulary Constraints: Classical methods require prior knowledge of all possible classes, with models trained only on labeled datasets
High Annotation Cost: Manual annotation is labor-intensive, resulting in video datasets typically limited to specific domains (e.g., particular sports or simple activities)
Single Concept Recognition: Existing open vocabulary methods primarily focus on single-label classification and cannot simultaneously recognize multiple concepts in videos
With the widespread adoption of video applications, video models need to recognize a broad range of concepts. The core motivations of this paper are:
Leveraging the pre-training advantages of VLMs on large-scale image-text pairs
Combining the rich world knowledge of LLMs to enhance semantic understanding
Enabling simultaneous recognition of multiple video concepts (actions, objects, scenes, etc.) in an open vocabulary setting
The main contributions are:
End-to-End Trainable Label Encoder: Proposes a method for learning to prompt LLMs to generate soft attributes for VLM text encoders, enabling open vocabulary multi-label video classification
Temporally Enhanced Visual Encoder: Integrates temporal modeling capabilities into pre-trained VLM image encoders while maintaining strong open vocabulary performance
New Benchmark Datasets: Defines open vocabulary multi-label video classification benchmarks on 5 datasets with comparisons against 6 strong baselines
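To illustrate why a temporal module is listed among the contributions, the sketch below shows the simplest way to apply an image encoder to video: averaging per-frame embeddings. This is not the paper's temporal modeling module, whose design is not described here; the function name `mean_pool_frames` and the toy 2-D vectors are hypothetical and stand in for real frame features.

```python
from typing import List, Sequence


def mean_pool_frames(frame_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame embeddings into one video-level embedding."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    num_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [
        sum(frame[i] for frame in frame_embeddings) / num_frames
        for i in range(dim)
    ]


# Toy 2-D "frame embeddings" (hypothetical values, not real CLIP outputs).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = mean_pool_frames(frames)
backward = mean_pool_frames(list(reversed(frames)))

# Mean pooling ignores frame order: both directions give the same vector.
print(forward == backward)
```

Because the pooled vector is the same when the frames are reversed, a baseline like this has no way to separate concepts that differ only in temporal order, which is one plausible reason to add explicit temporal modeling on top of the image encoder.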
The task is formulated as follows:
Input: Video sequence and a set of class labels from an open vocabulary
Output: Probability of each label's presence in the video
Constraint: The model must handle novel classes unseen during training at inference time
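The input/output contract above can be sketched as a small scoring function. This is a minimal illustration under stated assumptions, not the paper's implementation: the embeddings are hypothetical toy vectors standing in for the outputs of a video encoder and a text encoder, and the `temperature` value and the function names are chosen only for this example.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def label_probabilities(
    video_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
    temperature: float = 0.1,  # assumed value, for illustration only
) -> Dict[str, float]:
    """Return an independent presence probability for each label.

    A sigmoid is applied per label instead of a softmax over labels,
    so several labels can be scored as present at the same time.
    """
    v = l2_normalize(video_embedding)
    probs: Dict[str, float] = {}
    for name, emb in label_embeddings.items():
        t = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(v, t))
        probs[name] = 1.0 / (1.0 + math.exp(-cosine / temperature))
    return probs


# Hypothetical 2-D embeddings; real ones would come from the encoders.
video = [1.0, 0.0]
labels = {"dog": [2.0, 0.0], "swimming": [0.0, 3.0], "car": [-1.0, 0.0]}
scores = label_probabilities(video, labels)
for name, p in scores.items():
    print(f"{name}: {p:.4f}")
```

The label set is just a dictionary passed in at call time, so new class names can be added at inference without changing the function, which mirrors the constraint that the model must handle classes unseen during training.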
The proposed method yields better-calibrated scores across different concept types, so a single threshold performs well across multiple concepts, which is crucial for practical applications.
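The role of a single threshold can be shown with a short sketch. The probabilities below are invented toy values, not results from the paper, and `predict_labels` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List


def predict_labels(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return, in sorted order, the labels whose probability meets one shared threshold."""
    return sorted(name for name, p in probs.items() if p >= threshold)


# Toy scores mixing concept types: an action, two objects, and a scene.
toy_probs = {"running": 0.91, "dog": 0.74, "car": 0.18, "beach": 0.62}
predicted = predict_labels(toy_probs, threshold=0.5)
print(predicted)
```

If scores for actions, objects, and scenes lived on different scales, each concept type (or each class) would need its own tuned threshold, and tuning one for a class never seen in training is generally not possible. Well-calibrated scores avoid that by letting one cutoff apply to every label.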
The paper cites 68 relevant references covering important works in vision-language learning, open vocabulary classification, large language model applications, and other related fields, providing a solid theoretical foundation for this research.