RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos

Yang, Li, Diao et al.
Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.

Basic Information

  • Paper ID: 2510.08936
  • Title: RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
  • Authors: Zixi Yang, Jiapeng Li, Muxi Diao, Yinuo Jing, Kongming Liang (Beijing University of Posts and Telecommunications)
  • Classification: cs.CV cs.AI
  • Publication Date: 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.08936

Abstract

Recent work has shown that multimodal large language models (MLLMs) are effective across a range of video understanding tasks. However, their robustness when confronted with manipulated video content remains insufficiently explored. This paper introduces RO-Bench, the first benchmark designed to evaluate MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. RO-Bench builds high-quality, diverse, and temporally coherent video data by editing styles, objects, backgrounds, and their combinations. The authors evaluate eight recent video MLLMs and find that current models exhibit significant performance degradation on counterfactual video content. Furthermore, the study demonstrates that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance improvement on RO-Bench and an average 12.78% improvement across 20 tasks on the MVBench dataset.

Research Background and Motivation

Problem Definition

With the widespread application of multimodal large language models in video understanding tasks, particularly in high-risk domains such as video content moderation, autonomous driving, and real-time surveillance, ensuring model robustness has become critically important. While existing models perform well in controlled environments, their ability to maintain performance when facing tampered or manipulated inputs remains unknown.

Research Significance

  1. Practical Application Needs: In high-risk application scenarios, models must maintain stable performance across various visual variations
  2. Safety Considerations: Malicious actors may deceive models through video editing, creating security vulnerabilities
  3. Evaluation Gap: Existing robustness evaluations primarily focus on static images, with systematic evaluation lacking in the video domain

Limitations of Existing Methods

  1. Static Image Limitations: Benchmarks such as LANCE primarily focus on counterfactual generation for static images
  2. Simplistic Perturbations: Existing video robustness evaluations predominantly employ noise or corruption testing, overlooking the rich temporal dynamics of real-world videos
  3. Lack of Systematicity: Absence of comprehensive robustness evaluation frameworks specifically designed for video MLLMs

Research Motivation

This work aims to address two core research questions:

  • RQ1: How do MLLMs perform on counterfactual videos, and what specific challenges do they face in understanding edited video content?
  • RQ2: How does the use of counterfactual videos affect MLLM performance, and can it enhance their ability to understand and interpret complex video content?

Core Contributions

  1. First Video Robustness Benchmark: Proposes RO-Bench, the first counterfactual video test set benchmark specifically designed for evaluating video MLLM robustness
  2. Novel Evaluation Metrics: Introduces four innovative evaluation metrics to assess the impact of text prompts and original videos on editing results, ensuring high-quality data
  3. Comprehensive Robustness Evaluation: Conducts systematic evaluation of mainstream video MLLMs, revealing their robustness deficiencies in video understanding
  4. Training Strategy Validation: Demonstrates that training with counterfactual data can enhance RO-Bench performance and generalize to other benchmark tasks

Methodology Details

Task Definition

RO-Bench aims to evaluate video MLLMs' robustness when confronted with counterfactual video content. The task encompasses:

  • Input: Original videos and corresponding counterfactually edited videos
  • Output: Multiple-choice answers for four video understanding tasks (action recognition, object recognition, object existence judgment, video captioning)
  • Evaluation: Comparing model performance differences between original and edited videos

Data Construction Pipeline

1. Data Source Collection

  • Dataset Sources: DAVIS, TGVE, MSR-VTT, BalanceCC and other public datasets and internet sources
  • Content Categories: Four subject types (humans, animals, landscapes, objects)
  • Task Types: Action Recognition (AR), Object Recognition (OR), Object Existence (OE), Video Captioning (VC)

2. Counterfactual Video Generation

Caption Editing Strategy:

  • Decompose video captions into structured components: object attributes, object actions, background, style
  • Perform caption editing based on these four visual factors
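The caption editing step can be illustrated with a minimal sketch. The `StructuredCaption` class, the `edit_caption` helper, the rendering template, and the example captions are all hypothetical and are not taken from the paper; the sketch only shows how a caption split into the four components named above could be edited one factor at a time, or several at once for a compositional edit.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StructuredCaption:
    """Hypothetical container for the four caption components listed above."""
    object_attributes: str
    object_action: str
    background: str
    style: str

    def render(self) -> str:
        # Flatten the components into one caption; this template is an assumption.
        return (f"{self.object_attributes} {self.object_action} "
                f"{self.background}, {self.style}")


def edit_caption(caption: StructuredCaption, **changes: str) -> StructuredCaption:
    """Return a copy of the caption with one or more components swapped."""
    unknown = set(changes) - set(caption.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown caption component(s): {sorted(unknown)}")
    return replace(caption, **changes)


# Illustrative captions only, not samples from the benchmark.
original = StructuredCaption(
    object_attributes="a brown dog",
    object_action="runs",
    background="on a beach",
    style="realistic footage",
)
background_edit = edit_caption(original, background="in a snowy forest")
compound_edit = edit_caption(
    original, object_attributes="a white cat", style="watercolor painting"
)

print(original.render())
print(background_edit.render())
print(compound_edit.render())
```

Keeping the caption immutable and producing edited copies means the original and counterfactual captions stay paired, which is what a text-driven video editor would need as source and target prompts.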

Video Editing Pipeline:

  • Employ state-of-the-art text-driven video editing models
  • Propose four key evaluation metrics: Fantasy Level (FL), Scene Complexity (SC), Camera Motion (CM), Object Motion (OM)
  • Select the top three performing editing models based on evaluation results
  • Conduct rigorous manual screening to ensure video quality
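The model selection step above can be sketched as a simple ranking. This summary does not say how the four metrics are turned into a per-model score, so the sketch assumes each candidate editing model already has a list of scalar quality scores where higher is better; the function name `select_top_editors`, the model names, and the scores are placeholders, not results from the paper.

```python
from statistics import mean


def select_top_editors(scores_by_model: dict[str, list[float]], k: int = 3) -> list[str]:
    """Rank editing models by mean score (higher is better) and keep the top k."""
    ranked = sorted(
        scores_by_model,
        key=lambda name: mean(scores_by_model[name]),
        reverse=True,
    )
    return ranked[:k]


# Placeholder model names and scores for illustration only.
scores = {
    "editor_a": [0.62, 0.70, 0.66],
    "editor_b": [0.81, 0.79, 0.85],
    "editor_c": [0.55, 0.60, 0.58],
    "editor_d": [0.74, 0.72, 0.77],
    "editor_e": [0.69, 0.71, 0.68],
}

top_three = select_top_editors(scores, k=3)
print(top_three)
```

Videos produced by the selected models would then still go through the manual screening described in the last bullet; the ranking only narrows the set of editors.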

3. QA Pair Generation

Automated Question Generation:

  • Utilize GPT-4o to generate questions for each video based on task definitions
  • Construct corresponding answer options according to different task types

Option Generation Strategy:

  • Annotation-based: Extract correct answers directly from authentic annotations
  • LLM-based generation: Provide "yes," "no," "uncertain" options for object existence tasks
  • Distractor Design: Ensure options are neither overly simple nor excessively difficult, maintaining relevance and diversity
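As a rough illustration of the option generation strategy, the sketch below assembles a multiple-choice item from an annotated correct answer plus distractors, and a fixed three-option item for object existence. The function names, the dictionary layout, and the sample questions are assumptions made for this example; in the described pipeline the questions and distractors would come from GPT-4o and the dataset annotations.

```python
import random

OE_OPTIONS = ("yes", "no", "uncertain")  # fixed options for object existence
LETTERS = "ABCDEFGH"


def build_multiple_choice(question: str, correct: str,
                          distractors: list[str], seed: int = 0) -> dict:
    """Shuffle the correct answer among distractors and record its letter."""
    options = [correct, *distractors]
    if len(set(options)) != len(options):
        raise ValueError("options must be unique")
    if len(options) > len(LETTERS):
        raise ValueError("too many options")
    rng = random.Random(seed)  # seeded so the item is reproducible
    rng.shuffle(options)
    labeled = dict(zip(LETTERS, options))
    answer = next(letter for letter, text in labeled.items() if text == correct)
    return {"question": question, "options": labeled, "answer": answer}


def build_object_existence(question: str, correct: str) -> dict:
    """Build an object existence item with the fixed yes/no/uncertain options."""
    if correct not in OE_OPTIONS:
        raise ValueError(f"answer must be one of {OE_OPTIONS}")
    labeled = dict(zip(LETTERS, OE_OPTIONS))
    answer = LETTERS[OE_OPTIONS.index(correct)]
    return {"question": question, "options": labeled, "answer": answer}


# Illustrative items only, not drawn from the benchmark.
ar_item = build_multiple_choice(
    "What is the person doing?",
    correct="riding a bike",
    distractors=["walking a dog", "swimming", "cooking"],
    seed=1,
)
oe_item = build_object_existence("Is there a dog in the video?", "yes")
print(ar_item)
print(oe_item)
```

Shuffling with a seeded generator avoids a positional bias toward the first option while keeping the generated benchmark reproducible.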

Technical Innovations

  1. Multi-dimensional Editing Strategy: Systematically perform video editing across three dimensions: style, objects, and background
  2. Quality Assessment System: Propose four quantitative metrics to evaluate editing quality, ensuring generation of high-quality counterfactual videos
  3. Task Diversity: Cover four core video understanding tasks, comprehensively evaluating model capabilities
  4. Automated Pipeline: Construct an end-to-end automated data generation and evaluation pipeline

Experimental Setup

Dataset Scale

  • Video Data: 2.1k high-quality video-caption pairs
  • QA Pairs: 8.6k multiple-choice QA pairs
  • Training Set: 332 original videos, 1,328 counterfactual video samples, 6,640 QA pairs

Evaluation Metrics

  • Origin: Test accuracy on original videos
  • Edit: Test accuracy on edited videos
  • Drop: Performance degradation magnitude (Origin - Edit)
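The three metrics reduce to simple arithmetic over multiple-choice predictions. The sketch below assumes predictions and ground-truth answers are lists of option letters; the function names and the toy data are illustrative and do not come from the paper's evaluation code.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Percentage of predictions that match the ground-truth option."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def robustness_report(orig_preds: list[str], orig_answers: list[str],
                      edit_preds: list[str], edit_answers: list[str]) -> dict:
    """Compute Origin, Edit, and Drop = Origin - Edit, all in percentage points."""
    origin = accuracy(orig_preds, orig_answers)
    edit = accuracy(edit_preds, edit_answers)
    return {"origin": origin, "edit": edit, "drop": origin - edit}


# Toy predictions for illustration only.
report = robustness_report(
    orig_preds=["A", "B", "C", "A"], orig_answers=["A", "B", "C", "D"],
    edit_preds=["B", "A", "A", "D"], edit_answers=["B", "B", "C", "D"],
)
print(report)  # {'origin': 75.0, 'edit': 50.0, 'drop': 25.0}
```

A smaller Drop indicates a more robust model; a model with low Origin accuracy can still show a small Drop, so the two accuracies are best read together.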

Comparison Methods

Evaluated eight mainstream video MLLMs:

  • Large or Fine-tuned Video Encoders: VideoChat, VideoChat2, VideoLLaMA2, VideoLLaVA, VideoLLaMA3
  • CLIP ViT-L/14 Encoders: VideoChatGPT, mPLUG-Owl3, LLaVA-Next

Implementation Details

  • Use LLaVA-Next as the base model for fine-tuning
  • Construct LLaVA-NextRo (trained with counterfactual data) and LLaVA-Nextori (trained with original data) for comparison

Experimental Results

Main Results

Overall Robustness Evaluation

Table 1 reveals significant performance degradation across all models on counterfactual videos:

  • Average Performance Drop: 17.57%
  • Best Robustness: VideoChat2 (10.34% drop)
  • Worst Robustness: LLaVA-Nextori (30.85% drop)

Impact of Editing Factors on Model Performance

  1. Task Sensitivity Differences: Action recognition tasks are most affected (23.99% drop), while object existence tasks are least affected (11.54% drop)
  2. Editing Factor Impact: Object changes have greater impact on models than style and background changes
  3. Architecture Impact: Models with larger or fine-tuned video encoders outperform those using frozen CLIP ViT-L/14
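A per-task or per-editing-factor breakdown like the one above can be obtained by grouping per-question results before computing the accuracy gap. The sketch assumes each record holds a group label (a task such as AR or OE, or an editing factor) and whether the model answered correctly on the original and on the edited video; the record layout, the function name `drop_by_group`, and the toy records are assumptions for illustration, not data from the paper.

```python
from collections import defaultdict


def drop_by_group(records: list[tuple[str, bool, bool]]) -> dict[str, float]:
    """Accuracy drop (original minus edited, in percentage points) per group.

    Each record is (group, correct_on_original, correct_on_edited).
    """
    # Per group: [number of questions, hits on original, hits on edited]
    totals = defaultdict(lambda: [0, 0, 0])
    for group, orig_ok, edit_ok in records:
        counts = totals[group]
        counts[0] += 1
        counts[1] += int(orig_ok)
        counts[2] += int(edit_ok)
    return {
        group: 100.0 * (orig_hits - edit_hits) / n
        for group, (n, orig_hits, edit_hits) in totals.items()
    }


# Toy records for illustration only.
records = [
    ("AR", True, False), ("AR", True, False), ("AR", True, True), ("AR", False, False),
    ("OE", True, True), ("OE", True, False), ("OE", True, True), ("OE", False, False),
]
per_task_drop = drop_by_group(records)
print(per_task_drop)  # {'AR': 50.0, 'OE': 25.0}
```

The same function works for editing factors by passing the factor name (style, object, background) as the group label instead of the task.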

Fine-tuned Model Results

RO-Bench Performance Improvement

  • LLaVA-NextRo: Achieves best performance in robustness evaluation with only 4.83% accuracy drop
  • Relative to LLaVA-Next: Robustness metrics significantly improved by 21.73%
  • Relative to LLaVA-Nextori: Demonstrates advantages of counterfactual data training

General Video Understanding Capability Improvement

Across 20 downstream tasks in MVBench:

  • Average Performance Improvement: 12.78%
  • Action and Object-related Tasks: Show more pronounced improvements
  • Best Task Results: Achieves the best performance on multiple MVBench tasks

Ablation Study Findings

  1. Editing Factor Analysis: Object editing has the largest impact on model performance, followed by style and background
  2. Architecture Comparison: More powerful video encoders are crucial for improving robustness
  3. Task Specificity: Temporal reasoning tasks (e.g., action recognition) are more susceptible to visual perturbations

Related Work

Multimodal Large Language Models

Recent years have witnessed significant advances in MLLMs for video understanding tasks, though robustness evaluation remains relatively underdeveloped.

Robustness Evaluation

  • Image Domain: Works such as LANCE employ counterfactual image generation to evaluate model performance
  • Video Domain: Existing work primarily focuses on noise and corruption testing, lacking systematic counterfactual evaluation

Counterfactual Data Augmentation

Counterfactual data has demonstrated potential in improving model generalization, though its application in video MLLMs remains to be explored.

Conclusions and Discussion

Main Conclusions

  1. Insufficient Robustness: Current video MLLMs exhibit significant performance degradation when facing counterfactual video content
  2. Task Differences: Different tasks exhibit varying sensitivity to visual changes, with temporal-related tasks being more susceptible
  3. Architecture Importance: More powerful video encoders are crucial for improving robustness
  4. Training Effectiveness: Fine-tuning with counterfactual data effectively enhances model robustness and general performance

Limitations

  1. Data Scale: The current dataset scale is relatively small, potentially limiting evaluation comprehensiveness
  2. Editing Quality: Despite quality control measures, generated counterfactual videos may still lack naturalness
  3. Evaluation Scope: Primarily focuses on visual editing, not covering other perturbation types (e.g., audio, temporal perturbations)
  4. Model Coverage: The limited number of evaluated models may not fully represent the current state of the art

Future Directions

  1. Extended Editing Types: Explore additional types of video editing and perturbation methods
  2. Large-scale Datasets: Construct larger-scale, more diverse counterfactual video datasets
  3. Theoretical Analysis: Conduct in-depth analysis of fundamental causes of MLLM robustness deficiencies
  4. Defense Mechanisms: Develop specialized defense strategies to enhance model robustness

In-depth Evaluation

Strengths

  1. Novelty: Proposes the first systematic robustness evaluation benchmark for video MLLMs, filling an important research gap
  2. Complete Methodology: Constructs a comprehensive framework spanning data generation, quality control, and evaluation metrics
  3. Thorough Experiments: Evaluates multiple mainstream models and provides a comprehensive performance comparison
  4. High Practical Value: Not only provides an evaluation benchmark but also demonstrates the effectiveness of counterfactual data in improving model performance
  5. Solid Technical Foundation: Employs state-of-the-art video editing technology to ensure generation of high-quality counterfactual videos

Weaknesses

  1. Data Scale Limitations: Relatively small dataset scale compared to other large-scale benchmarks
  2. Limited Editing Dimensions: Primarily focuses on three dimensions (style, objects, background), potentially overlooking other important perturbation types
  3. Single Evaluation Metric: Primarily uses accuracy as the evaluation metric, lacking more fine-grained analysis indicators
  4. Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of fundamental causes of model robustness deficiencies

Impact

  1. Academic Contribution: Provides a benchmark and research framework for video MLLM robustness evaluation
  2. Practical Value: Offers guidance for industrial deployment of video MLLMs
  3. Research Inspiration: Provides a foundation and reference point for subsequent related research
  4. Reproducibility: Commits to open-sourcing code and data, facilitating research community development

Applicable Scenarios

  1. Model Evaluation: Suitable for robustness evaluation of various video MLLMs
  2. Model Improvement: Can guide model architecture design and training strategy optimization
  3. Application Deployment: Provides safety assessment for model deployment in high-risk application scenarios
  4. Research Benchmark: Can serve as a standard evaluation benchmark for future related research

References

This paper cites multiple important related works, including:

  • Video MLLMs: VideoChat, VideoLLaMA, LLaVA-Next, etc.
  • Robustness Evaluation: LANCE, OOD-CV, etc.
  • Video Editing: Tune-a-Video, CCEdit, etc.
  • Evaluation Benchmarks: MVBench, DAVIS, etc.

Overall Assessment: This is a high-quality research paper that systematically addresses the important problem of video MLLM robustness evaluation for the first time. The paper demonstrates excellence in technical innovation, experimental design, and practical value, making significant contributions to the field's development. While there remains room for improvement in dataset scale and theoretical analysis, overall this is a highly valuable research work.