2025-11-25T21:10:18.097119

Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos

Gupta, Roy, Christensen et al.

The recent growth in online media consumption during early childhood calls for data-driven tools that help educators select appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include 'letter names' and 'letter sounds', and math codes include 'counting' and 'sorting'. We pose this as a fine-grained multi-label classification problem, because a video can contain multiple types of educational content and the content classes can be visually similar (e.g., 'letter names' vs. 'letter sounds'). We propose a novel class-prototype-based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a prototype for each class and employ a loss function that minimizes the distances between a class prototype and the samples from that class, while maximizing the distances between the prototype and samples from other classes. Because the alignment between visual and audio cues is crucial for effective comprehension, we use a multimodal transformer network to capture the interaction between visual and audio cues while learning the video embedding. For evaluation, we present a dataset, APPROVE, built from educational YouTube videos labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and on other benchmarks such as YouTube-8M and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE

Basic Information

Paper ID: 2510.11204
Title: Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Authors: Rohit Gupta, Anirban Roy, Claire Christensen, Sujeong Kim, Sarah Gerard, Madeline Cincebeaux, Ajay Divakaran, Todd Grindal, Mubarak Shah
Category: cs.CV (Computer Vision)
Published: October 13, 2025
Paper link: https://arxiv.org/abs/2510.11204v1

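Summary

As children's consumption of online media grows rapidly, educators urgently need data-driven tools for selecting educational content suited to young learners. The paper proposes a method for detecting educational content in online videos, focusing on two widely used content classes: literacy and math. Prominent codes (sub-classes) are chosen based on the Common Core Standards; literacy codes include "letter names" and "letter sounds", and math codes include "counting" and "sorting". Because a video may contain several types of educational content and the classes can be visually similar, the task is modeled as fine-grained multi-label classification. The paper proposes a novel class-prototype-based supervised contrastive learning method that can handle fine-grained samples associated with multiple labels. A prototype is learned for each class, and a loss function minimizes the distance between a class prototype and samples of that class while maximizing its distance to samples of other classes. Given the importance of visual and audio cues for effective comprehension, a multimodal transformer network captures the interaction between the two in each video. Evaluation uses the APPROVE dataset, which contains 193 hours of educational YouTube videos annotated by education researchers across 19 classes.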

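Research Background and Motivation

Problem Definition

Core problem: Automatically identify and classify educational content in online videos, in particular kindergarten-level literacy and math content
Real-world need: 89% of parents of children aged 11 and under report that their child watches YouTube videos; children aged 2-4 watch an average of 2.5 hours per day, and children aged 5-8 an average of 3.0 hours per day
Educational value: Watching appropriate educational videos supports healthy child development and learning, and has been shown to produce meaningful learning gains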

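Challenges

Fine-grained distinctions: Educational codes can be highly similar to one another, e.g., "letter names" vs. "letter sounds"
Multi-label nature: A single video may contain several types of educational content
Multimodal requirement: Understanding educational content requires analyzing visual and audio cues together
Data scarcity: Expert-annotated, fine-grained educational video datasets are lacking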

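Limitations of Existing Methods

Standard supervised contrastive learning: Methods such as SupCon do not extend directly to the multi-label setting
Single-modality methods: Visual cues alone are not enough to distinguish fine-grained educational content
General-purpose video classification: Existing datasets such as UCF101 and Kinetics focus mainly on action recognition and are not suited to educational content analysis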

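Core Contributions

APPROVE dataset: The first fine-grained, multi-label educational video dataset, containing 193 hours of expert-annotated video across 19 classes, with an average of 3 labels per video
Class-prototype contrastive learning framework: A class-prototype-based supervised contrastive learning method suited to multi-label, fine-grained classification
Multimodal fusion architecture: A multimodal transformer network that fuses visual and text (ASR transcript) information
Performance gains: Outperforms strong baselines on the APPROVE, YouTube-8M, and COIN datasets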

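Method Details

Task Definition

Input: An educational video $x$, consisting of a sequence of visual frames and an audio track
Output: A multi-label classification result predicting which educational content classes the video contains
Constraints: Classes differ only at a fine-grained level, and a single video may carry several related labels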
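
As a rough illustration of the multi-label input/output framing above, the sketch below encodes a video's labels as a multi-hot vector and decodes per-class scores back into label names. The four class names come from the abstract and stand in for the full set of 19 codes; the function names and the 0.5 decision threshold are assumptions made for this example, not details taken from the paper.

```python
# Four of the 19 codes, taken from the abstract; the rest are omitted here.
CLASSES = ["letter names", "letter sounds", "counting", "sorting"]


def to_multi_hot(video_labels, classes=CLASSES):
    """Encode a video's label set as a 0/1 vector aligned with `classes`."""
    unknown = set(video_labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in video_labels else 0 for c in classes]


def predict_labels(scores, classes=CLASSES, threshold=0.5):
    """Return every class whose score reaches the (assumed) threshold."""
    return [c for c, s in zip(classes, scores) if s >= threshold]


# A video that teaches both letter names and counting.
target = to_multi_hot(["counting", "letter names"])
predicted = predict_labels([0.9, 0.2, 0.7, 0.1])
```

Unlike single-label classification, more than one entry of the target vector can be 1, and each class is decided independently at prediction time.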

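Model Architecture

1. Class-Prototype Contrastive Learning

Conventional supervised contrastive learning (SupCon) learns representations by minimizing the distance between samples of the same class and maximizing the distance between samples of different classes: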

$L_{SupCon} = \sum_{i \in A} -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\text{sim}(z_i, z_p)/\tau)}{\sum_{a \in A\backslash i} \exp(\text{sim}(z_i, z_a)/\tau)}$
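
The SupCon formula above can be sketched in plain Python for a small single-label batch. This is a minimal illustration, assuming that $\text{sim}$ is cosine similarity, that $A$ is the set of all samples in the batch, and that $P(i)$ is the set of other samples sharing the label of anchor $i$. The function names, the default $\tau = 0.1$, and the choice to skip anchors with no positives are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supcon_loss(embeddings, labels, tau=0.1):
    """Sum of per-anchor SupCon terms; anchors without positives are skipped."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Denominator: every other sample in the batch (A \ i).
        log_denom = math.log(sum(
            math.exp(cosine(embeddings[i], embeddings[a]) / tau)
            for a in range(n) if a != i
        ))
        log_probs = [
            cosine(embeddings[i], embeddings[p]) / tau - log_denom
            for p in positives
        ]
        total += -sum(log_probs) / len(positives)
    return total


# Two tight clusters: labels that match the clusters give a lower loss.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supcon_loss(batch, [0, 0, 1, 1])
loss_mismatched = supcon_loss(batch, [0, 1, 0, 1])
```

The sketch relies on each sample having exactly one label, which is what lets `positives` be defined by label equality.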

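In the multi-label setting, however, sample pairs cannot simply be split into positives and negatives. The paper instead proposes contrastive learning based on class prototypes: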

$L_{mlc}(x) = -\frac{1}{|P_{ml}(x)|} \sum_{c_k^+ \in P_{ml}(x)} \left[ \log \frac{\exp(\text{sim}(z, cp_k)/\tau)}{\sum_{c_j^- \in C\backslash P_{ml}(x)} \exp(\text{sim}(z, cp_j)/\tau)} \right]$
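
The per-sample loss $L_{mlc}(x)$ can be sketched directly from the formula as written here, with the denominator summed over the negative class prototypes only. The sketch assumes that $\text{sim}$ is cosine similarity, that $z$ is the embedding of video $x$, that $cp_k$ is the prototype vector of class $k$, and that $P_{ml}(x)$ is the set of classes labeled on $x$. The function names, the default $\tau = 0.1$, and the error raised when a sample has no positive or no negative class are assumptions for illustration; in the paper the prototypes are learned, whereas here they are fixed vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prototype_loss(z, prototypes, positive_classes, tau=0.1):
    """Multi-label prototype contrastive loss for a single embedding z."""
    positives = sorted(set(positive_classes))
    negatives = [j for j in range(len(prototypes)) if j not in positives]
    if not positives or not negatives:
        # The formula is undefined if either set is empty (assumed handling).
        raise ValueError("need at least one positive and one negative class")
    # Denominator: prototypes of classes NOT labeled on this sample.
    log_denom = math.log(sum(
        math.exp(cosine(z, prototypes[j]) / tau) for j in negatives
    ))
    terms = [cosine(z, prototypes[k]) / tau - log_denom for k in positives]
    return -sum(terms) / len(positives)


# Three class prototypes; the embedding lies between classes 0 and 1.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
z = [1.0, 1.0, 0.0]
loss_correct = prototype_loss(z, prototypes, positive_classes=[0, 1])
loss_wrong = prototype_loss(z, prototypes, positive_classes=[2])
```

Because each term compares the embedding with a class prototype rather than with another sample, a video with several labels contributes one term per labeled class, and no pairwise positive/negative decision between videos is needed. Since the positive prototype does not appear in the denominator, the value can be negative when the embedding is well aligned with its labeled classes.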