2025-11-23T04:13:16.733055

ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

Vuong, Kwak

We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, including single patch images, automatically segmented pathology video clips, and manually segmented pathology videos. This integration closely mirrors the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the ViDRiP-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. ViDRiP-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at: https://github.com/QuIIL/ViDRiP-LLaVA.

academic

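VideoPath-LLaVA: A Multimodal Model for Diagnostic Reasoning from Pathology Videos

Basic Information

Paper ID: 2505.04192
Title: VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
Authors: Trinh Vuong, Jin Tae Kwak (Korea University)
Categories: cs.CV cs.AI cs.CL
Published: arXiv preprint (2025)
Paper link: https://arxiv.org/abs/2505.04192v2

Abstract

VideoPath-LLaVA is the first large multimodal model (LMM) in computational pathology to integrate three distinct image scenarios: single patch images, automatically keyframe-extracted clips, and manually segmented pathology videos. This mirrors the natural diagnostic process of pathologists. By generating detailed histological descriptions and concluding with a definitive diagnosis, VideoPath-LLaVA links visual narrative to diagnostic reasoning. At the core of the approach is the VideoPath-Instruct dataset, which contains 4278 video and diagnosis-specific chain-of-thought instruction pairs sourced from educational histopathology videos on YouTube.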

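Background and Motivation

Core Problems

Limits of single-image diagnosis: most existing medical LMMs answer questions about a single image, which is a poor fit for pathology diagnosis. High-magnification images lack global structural context, while low-magnification images lack fine detail.
Underused video resources: educational YouTube videos follow a structured teaching flow (from a low-magnification overview to high-magnification examination), but they suffer from an alignment problem. A single frame stands in for an entire video segment and its transcript, and the transcript often describes more than that frame shows.
Missing diagnostic reasoning: there is no AI system that reproduces the step-by-step diagnostic reasoning of a pathologist.

Motivation

Use the inherent structure of educational videos to build a chain-of-thought (CoT) reasoning process.
Resolve the alignment problem between video frames and text descriptions.
Build the first pathology video understanding model that provides interpretable diagnostic reasoning.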

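Key Contributions

First-of-its-kind model: VideoPath-LLaVA, the first large multimodal model for video understanding in computational pathology.
High-quality dataset: VideoPath-Instruct, containing 4278 curated pathology videos paired with instruction-following question-answer data.
Training strategy: a four-stage training procedure consisting of alignment, image SFT, mixed SFT, and video SFT.
Strong performance: outperforms advanced models such as GPT-4o on the VideoPath-Instruct test set.
Open-source release: code, data, and model are publicly available as infrastructure for the community.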

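Method

Task Definition

Given a pathology video as input, the model must:

Generate a detailed histological description
Carry out step-by-step diagnostic reasoning
Provide a final pathological diagnosis

Model Architecture

VideoPath-LLaVA is based on the LLaVA-OV (LLaVA-OneVision) architecture and has three main components:

Vision encoder (ViT): a SigLIP encoder extracts image features, $z_v = g(x_v)$
Projector: a 2-layer MLP projects the image features into the word-embedding space, $h_v = p(z_v)$
Language decoder (LLM): Qwen-2.5-7B receives the projected visual features together with the text instruction and generates the response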
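The projection step $h_v = p(z_v)$ can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the toy dimensions, the random weights, the helper names (`linear`, `init_linear`, `project`), and the GELU between the two layers are all assumptions, since the notes above only state that the projector is a 2-layer MLP.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU via the error function (the activation choice is an assumption)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(tokens, weight, bias):
    """Apply y = W x + b to every token; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias, for illustration only."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(z_v, layer1, layer2):
    """h_v = p(z_v): two linear layers with a nonlinearity in between."""
    hidden = [[gelu(v) for v in tok] for tok in linear(z_v, *layer1)]
    return linear(hidden, *layer2)


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_TOKENS = 8, 16, 4  # hypothetical toy sizes

# Stand-in for z_v = g(x_v): one feature vector per visual token.
z_v = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_TOKENS)]

layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)
h_v = project(z_v, layer1, layer2)

print(len(h_v), len(h_v[0]))  # number of tokens, LLM embedding width
```

The number of visual tokens is unchanged by the projector; only the feature width changes, so the projected tokens can sit alongside text embeddings in the LLM input sequence.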

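Training Strategy

Training proceeds in four progressive stages:

Stage 0: Alignment

Pre-train the projector on image-caption pairs
Establish the connection between the LLM and the ViT

Stage 1: Image SFT

Fine-tune the whole model on image instruction-tuning datasets
Uses the Quilt-LLaVA and PathAsst datasets

Stage 2: Mixed SFT (the novel step)

Train on a combination of image instruction data and automatically segmented video instruction data
Smooths the transition from static images to dynamic video content

Stage 3: Video SFT

Final fine-tuning on VideoPath-Instruct
Applies LoRA to tune the LLM and avoid overfitting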
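The four stages can be written down as a small schedule, which makes the progression from images to mixed data to videos explicit. This is a sketch under stated assumptions: the `Stage` class and its field names are hypothetical, and wherever the notes do not say which modules are updated (Stage 2, and everything except the LLM in Stage 3) the entry is a guess and is marked as such in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: tuple    # datasets or data types used in this stage
    trains: tuple  # modules updated in this stage


STAGES = (
    # Stage 0: only the projector is pre-trained (stated in the notes).
    Stage("stage0_alignment", ("image-caption pairs",), ("projector",)),
    # Stage 1: the whole model is fine-tuned (stated in the notes).
    Stage("stage1_image_sft", ("Quilt-LLaVA", "PathAsst"),
          ("vision_encoder", "projector", "llm")),
    # Stage 2: trainable modules are NOT stated; full fine-tuning is assumed here.
    Stage("stage2_mixed_sft",
          ("image instructions", "auto-segmented clip instructions"),
          ("vision_encoder", "projector", "llm")),
    # Stage 3: LoRA on the LLM is stated; whether other modules are also
    # updated is not, so only the LoRA adapters are listed.
    Stage("stage3_video_sft", ("VideoPath-Instruct",), ("llm_lora",)),
)

for stage in STAGES:
    print(f"{stage.name}: data={', '.join(stage.data)} | trains={', '.join(stage.trains)}")
```

Keeping the schedule as data rather than as four separate scripts makes an ablation such as "without Stage 2" a matter of dropping one entry.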

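Technical Innovations

Progressive visual task transfer: the mixed training in Stage 2 bridges the image and video tasks
Chain-of-thought diagnostic reasoning: CoT prompting is used to generate a structured reasoning process
Multi-level video segmentation: automatic keyframe extraction is combined with fine-grained manual segmentation
Visual data refinement: tissue detection and text removal safeguard data quality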

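Experimental Setup

Datasets

VideoPath-Instruct: 4036 training videos and 242 test videos
ClipPath-Instruct: 140k automatically segmented pathology clips
Auxiliary datasets: Quilt-1M, PathAsst, a bladder dataset, and others

Data Preprocessing

Whisper for video transcription
YOLO-Path for tissue detection and masking of people
docTR for text detection and removal
AutoShot for detecting candidate segment boundaries

Evaluation Metrics

Evaluation uses the Video-ChatGPT metrics:

Context (contextual relevance)
Correctness
Detail (level of detail)
Scores range from 0 to 5 and are assigned by GPT-3.5-turbo-0613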
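Once candidate boundaries are available, they still have to be turned into clips. The helper below is a generic sketch of that step and does not call AutoShot or any other tool listed above; the function name `boundaries_to_clips`, the assumption that boundaries arrive as frame indices, and the `min_len` filter are all hypothetical.

```python
def boundaries_to_clips(boundaries, num_frames, min_len=1):
    """Turn shot-boundary frame indices into half-open [start, end) clips.

    Boundaries outside (0, num_frames) and duplicates are ignored, and clips
    shorter than min_len frames are dropped.
    """
    cuts = sorted({b for b in boundaries if 0 < b < num_frames})
    edges = [0, *cuts, num_frames]
    return [(start, end) for start, end in zip(edges, edges[1:]) if end - start >= min_len]


clips = boundaries_to_clips([120, 45, 300], num_frames=360)
print(clips)  # [(0, 45), (45, 120), (120, 300), (300, 360)]
```

Half-open intervals keep adjacent clips from sharing a frame, which avoids showing the same frame in two training samples.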

Compared Methods

Open-source LMMs: LLaVA-OV, LLaVA-Video, InternVL2-8B, Qwen2-VL, Qwen2.5-VL
Proprietary LMMs: GPT-4o, Claude-3.7-Sonnet, Gemini-1.5-Pro, Gemini-2.0-Flash

Results

Main Results

VideoPath-LLaVA achieves the best scores among the models listed below on the VideoPath-Instruct test set:

Model	Context	Correctness	Detail	Avg	Norm-Avg
GPT-4o	2.69	2.69	2.36	2.58	51.60
VideoPath-LLaVA (full)	2.82	2.82	2.67	2.77	55.40
VideoPath-LLaVA (w/o Stage 2)	2.74	2.68	2.69	2.70	54.08
LLaVA-OV (baseline)	1.86	1.40	2.03	1.76	35.21
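The Avg and Norm-Avg columns can be checked with a short calculation. The sketch below assumes that Avg is the plain mean of the three scores and that Norm-Avg rescales it from the 0-5 range to 0-100; this is inferred from the numbers in the table rather than stated in the notes, and the function name `summarize` is hypothetical.

```python
def summarize(context, correctness, detail, max_score=5.0):
    """Return (Avg, Norm-Avg) under the assumed definitions:
    Avg is the mean of the three scores, Norm-Avg is Avg / max_score * 100."""
    avg = (context + correctness + detail) / 3
    return avg, avg / max_score * 100


ROWS = {
    "GPT-4o": (2.69, 2.69, 2.36),
    "VideoPath-LLaVA (full)": (2.82, 2.82, 2.67),
    "VideoPath-LLaVA (w/o Stage 2)": (2.74, 2.68, 2.69),
    "LLaVA-OV (baseline)": (1.86, 1.40, 2.03),
}

for name, scores in ROWS.items():
    avg, norm_avg = summarize(*scores)
    print(f"{name}: Avg={avg:.2f} Norm-Avg={norm_avg:.2f}")
```

Under these assumptions the GPT-4o and full-model rows reproduce the reported values. For the other two rows the recomputed Norm-Avg differs from the table in the second decimal place, presumably because the reported figures were derived from unrounded per-metric scores.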