Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection
Basnet, Farabi, Ranasinghe et al.
Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs (BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL) on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. We also assess how well the models generate explanations for sarcastic instances. All experiments use three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model's performance in detecting sarcastic image-caption pairs, and (2) to assess each model's ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific fine-tuning.
This study evaluates seven state-of-the-art open-source vision-language models (VLMs), namely BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL, on multimodal sarcasm detection (MSD). It applies zero-shot, one-shot, and few-shot prompting strategies and also assesses the models' ability to generate sarcasm explanations. Experiments are conducted on three benchmark datasets (Muse, MMSD2.0, and SarcNet). The results indicate that current models achieve moderate success in binary sarcasm detection but do not generate high-quality explanations without task-specific fine-tuning.
Core Problem: Evaluating the capabilities of open-source vision-language models on multimodal sarcasm detection (MSD) tasks, including both detection and explanation of sarcastic content
Challenges: Sarcasm is a complex linguistic phenomenon where intended meaning contradicts literal expression. In multimodal environments, sarcastic effects often arise from mismatches between visual and textual content
Social Media Prevalence: On social platforms, sarcasm is frequently realized through image-text pairings. Understanding such cross-modal inconsistencies is crucial for sentiment analysis and content understanding
Technological Advancement: The development of large-scale vision-language models provides new opportunities for understanding complex subjective multimodal phenomena
Application Value: Highly relevant to social media content moderation, sentiment analysis, and offensive language detection tasks
Insufficient Research: VLMs perform strongly across a wide range of tasks, but their behavior on MSD remains underexplored
Methodological Constraints: Early MSD research primarily relied on separate feature extractors and feature aggregation techniques, lacking end-to-end multimodal understanding
Explanation Capabilities: Existing models focus mainly on classification accuracy, with limited research on generating human-quality explanations
Unified Evaluation Framework: Provides a single in-context learning framework that combines the input image, few-shot examples, and explanation seeds in one prompt and can be applied to all seven VLMs
Systematic Benchmarking: Conducts systematic zero-shot, one-shot, and few-shot evaluations across three MSD benchmark datasets
Explanation Generation Assessment: Evaluates each model's ability to generate free-form sarcasm explanations, filling a research gap in the field
In-depth Analysis: Reveals a gap between classification performance and explanation quality (a model can detect sarcasm reasonably well yet explain it poorly), providing important insights for future research
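The zero-, one-, and few-shot prompting setup described above can be pictured with a small prompt builder. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `<image>` placeholder, the question wording, the explanation seed text, and the names `Example`, `format_example`, and `build_prompt` are all hypothetical, and each real VLM expects its own image token and chat template.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical placeholder; each real VLM uses its own image token/template.
IMAGE_TOKEN = "<image>"
# Hypothetical seed text that starts the explanation for the model to continue.
EXPLANATION_SEED = "Explanation: This is sarcastic because"


@dataclass
class Example:
    image_path: str
    caption: str
    label: Optional[bool] = None        # True = sarcastic; None for the query
    explanation: Optional[str] = None   # optional gold explanation for a shot


def format_example(ex: Example, with_answer: bool, explain: bool = False) -> str:
    """Render one image-caption pair as a block of prompt text."""
    lines = [
        IMAGE_TOKEN,
        f"Caption: {ex.caption}",
        "Is this image-caption pair sarcastic? Answer yes or no.",
    ]
    if with_answer:
        # Demonstration shot: include the label and, if present, an explanation.
        lines.append(f"Answer: {'yes' if ex.label else 'no'}")
        if ex.explanation:
            lines.append(f"Explanation: {ex.explanation}")
    elif explain:
        # Explanation mode: end with a seed the model is asked to continue.
        lines.append(EXPLANATION_SEED)
    else:
        # Detection mode: leave the answer open.
        lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(
    query: Example,
    shots: Sequence[Example] = (),
    k: int = 0,
    explain: bool = False,
) -> Tuple[str, List[str]]:
    """Build a k-shot prompt (k=0 zero-shot, k=1 one-shot, k>1 few-shot).

    Returns the prompt text and the image paths in the order in which their
    placeholders appear in the text.
    """
    if k < 0 or k > len(shots):
        raise ValueError("k must be between 0 and the number of available shots")
    chosen = list(shots[:k])
    blocks = [format_example(s, with_answer=True) for s in chosen]
    blocks.append(format_example(query, with_answer=False, explain=explain))
    images = [s.image_path for s in chosen] + [query.image_path]
    return "\n\n".join(blocks), images
```

Calling `build_prompt(query)` yields a zero-shot detection prompt, while `build_prompt(query, shots, k=2, explain=True)` prepends two labeled demonstrations and ends with the explanation seed. Keeping the image list aligned with the placeholders is one simple way to reuse the same prompt text across models that interleave images and text differently.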
The paper cites 46 relevant references covering important works in sarcasm detection, multimodal learning, and vision-language models across multiple research domains, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality empirical paper that addresses a gap in the evaluation of open-source VLMs on multimodal sarcasm detection. The research design is sound, the experiments are comprehensive, and the conclusions have practical value. While the depth of analysis and the choice of evaluation metrics leave room for improvement, the work makes an important contribution to the field.