Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection

Basnet, Farabi, Ranasinghe et al.

Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs (BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL) on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we evaluate the models' capabilities in generating explanations for sarcastic instances. The evaluation covers three benchmark sarcasm datasets (MuSE, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model's performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.

Basic Information

Paper ID: 2510.11852
Title: Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection
Authors: Saroj Basnet (George Mason University), Shafkat Farabi (Virginia Tech), Tharindu Ranasinghe (Lancaster University), Diptesh Kanojia (University of Surrey), Marcos Zampieri (George Mason University)
Classification: cs.LG (Machine Learning)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11852v1

Abstract

This study evaluates seven state-of-the-art open-source vision-language models (VLMs) on multimodal sarcasm detection (MSD): BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL. The research employs zero-shot, one-shot, and few-shot prompting strategies and also assesses the models' ability to generate sarcasm explanations. Experiments are conducted on three benchmark datasets (MuSE, MMSD2.0, and SarcNet). Results show that while current models achieve moderate success in binary sarcasm detection, they fail to generate high-quality explanations without task-specific fine-tuning.

Research Background and Motivation

Problem Definition

Core Problem: Evaluating the capabilities of open-source vision-language models on multimodal sarcasm detection (MSD) tasks, including both detection and explanation of sarcastic content
Challenges: Sarcasm is a complex linguistic phenomenon where intended meaning contradicts literal expression. In multimodal environments, sarcastic effects often arise from mismatches between visual and textual content

Significance

Social Media Prevalence: On social platforms, sarcasm is frequently realized through image-text pairings. Understanding such cross-modal inconsistencies is crucial for sentiment analysis and content understanding
Technological Advancement: The development of large-scale vision-language models provides new opportunities for understanding complex subjective multimodal phenomena
Application Value: Highly relevant to social media content moderation, sentiment analysis, and offensive language detection tasks

Limitations of Existing Approaches

Insufficient Research: While VLMs demonstrate strong performance across various tasks, their performance on MSD tasks remains underexplored
Methodological Constraints: Early MSD research primarily relied on separate feature extractors and feature aggregation techniques, lacking end-to-end multimodal understanding
Explanation Capabilities: Existing models focus mainly on classification accuracy, with limited research on generating human-quality explanations

Core Contributions

Unified Evaluation Framework: Provides a unified in-context learning framework that combines the image, few-shot examples, and explanation seeds in one prompt and applies to seven different VLMs
Systematic Benchmarking: Conducts systematic zero-shot, one-shot, and few-shot evaluations across three MSD benchmark datasets
Explanation Generation Assessment: Evaluates each model's ability to generate free-form sarcasm explanations, filling a research gap in the field
In-depth Analysis: Reveals a disconnect between classification performance and explanation quality, providing important insights for future research

Methodology Details

Task Definition

Input: Image-text pairs (I, C), where I is the image and C is the caption text

Output:

Binary classification: Determine whether the pair contains sarcasm (Yes/No)
Explanation generation: For sarcastic instances, generate natural language descriptions explaining visual-textual inconsistencies
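
The task definition above can be captured in a small data structure. The following is a minimal sketch only; the class and field names (`SarcasmExample`, `SarcasmPrediction`, `image_path`, and so on) are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SarcasmExample:
    """One image-caption pair (I, C) with optional gold annotations."""

    image_path: str                      # I: location of the image
    caption: str                         # C: the accompanying caption text
    is_sarcastic: Optional[bool] = None  # gold label for the binary task
    explanation: Optional[str] = None    # gold explanation, where a dataset has one


@dataclass(frozen=True)
class SarcasmPrediction:
    """Model output for one example."""

    is_sarcastic: bool                   # Yes/No decision
    explanation: Optional[str] = None    # generated only for sarcastic pairs
```

Keeping the explanation optional reflects that only some datasets provide explanations and that explanations are generated only for sarcastic instances.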

Dataset Description

Dataset	Positive	Negative	Explanations	Multilingual
MuSE	3,510	0	✓	×
MMSD2.0	11,651	12,980	×	×
SarcNet	1,875	1,460	×	✓

Model Architecture

Seven open-source VLMs evaluated:

InstructBLIP: Instruction-tuned model based on FlanT5
BLIP2 2.7B: Frozen image encoder + Q-former + large language model
OpenFlamingo 3B: Lightweight open-source adaptation of Flamingo
LLaVA 7B: Vision encoder connected to a language model and trained with visual instruction tuning
PaliGemma 3B: SigLIP vision encoder paired with a Gemma language model
Qwen-VL 7B: Vision encoder connected to the Qwen language model through a position-aware vision-language adapter
Gemma3 27B: Instruction-tuned multimodal model

Prompting Strategies

Classification Task Prompt Structure:

*<global_instruction>*
Example: (zero-, one-, few-shots)
*<image>*
*Caption:<caption> Answer: Yes/No*
*<image>*
**Context:** {caption}
Is this sarcastic?
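
As a rough illustration of how the classification template above could be assembled for zero-, one-, and few-shot settings, here is a minimal sketch. The function name, the `IMAGE_TOKEN` placeholder, and the wording of the global instruction are assumptions; each VLM has its own image-token convention, and the paper's exact strings are not reproduced here.

```python
from typing import List, Tuple

IMAGE_TOKEN = "<image>"  # placeholder; the real image token differs per model


def build_classification_prompt(
    global_instruction: str,
    shots: List[Tuple[str, bool]],
    caption: str,
) -> str:
    """Assemble a sarcasm classification prompt.

    `shots` holds (caption, is_sarcastic) pairs for the in-context examples;
    an empty list yields the zero-shot prompt.
    """
    lines = [global_instruction]
    for shot_caption, is_sarcastic in shots:
        answer = "Yes" if is_sarcastic else "No"
        lines.append(IMAGE_TOKEN)
        lines.append(f"Caption: {shot_caption} Answer: {answer}")
    # The query pair comes last and leaves the answer open for the model.
    lines.append(IMAGE_TOKEN)
    lines.append(f"Context: {caption}")
    lines.append("Is this sarcastic?")
    return "\n".join(lines)
```

Passing one or several entries in `shots` gives the one-shot and few-shot variants; the images for the shots and the query would be supplied to the model separately, in the same order as the image tokens.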

Explanation Generation Prompt Structure:

*<Context>:*
*<image>*
**Original Caption**: {caption}
**Provided Explanation**: {explanation}
**Task Instruction**
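
The explanation template above can be filled in the same way. This is a sketch under assumptions: the function name and the example task instruction are hypothetical, since the summary does not give the paper's exact instruction text.

```python
def build_explanation_prompt(
    caption: str,
    seed_explanation: str,
    task_instruction: str,
    context: str = "",
) -> str:
    """Assemble an explanation-generation prompt for one sarcastic pair.

    `seed_explanation` is the provided explanation from the template;
    `context` is optional leading text and is skipped when empty.
    """
    lines = []
    if context:
        lines.append(context)
    lines.append("<image>")  # placeholder; the real image token differs per model
    lines.append(f"Original Caption: {caption}")
    lines.append(f"Provided Explanation: {seed_explanation}")
    lines.append(task_instruction)
    return "\n".join(lines)


# Hypothetical usage; the instruction wording is an assumption.
prompt = build_explanation_prompt(
    caption="Best commute ever",
    seed_explanation="The photo shows a traffic jam, contradicting the caption.",
    task_instruction="Explain why this image-caption pair is sarcastic.",
)
```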

Technical Innovations

Unified Prompting Framework: Designs unified prompt templates applicable to different VLM architectures
Multi-granular Evaluation: Combines classification accuracy and explanation quality assessment for comprehensive evaluation
Cross-modal Alignment Assessment: Introduces Δ-CLIPScore to quantify improvements in image-text alignment

Experimental Setup

Data Processing

Randomly sample 3,000 image-caption pairs from MMSD2.0 and SarcNet for evaluation
Use MuSE dataset to provide explanation examples and evaluation benchmarks
Few-shot examples sampled from MuSE (positive) and MMSD2.0 (negative)
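
The sampling steps above can be sketched as follows. The seed, the function names, and in particular the near-balanced split between positive and negative shots are assumptions; the summary does not state how many shots of each class were used.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_eval_subset(pairs: Sequence[T], n: int = 3000, seed: int = 0) -> List[T]:
    """Draw up to `n` evaluation pairs without replacement, reproducibly."""
    if n >= len(pairs):
        return list(pairs)
    return random.Random(seed).sample(pairs, n)


def sample_few_shots(
    positives: Sequence[T],
    negatives: Sequence[T],
    k: int,
    seed: int = 0,
) -> List[Tuple[T, bool]]:
    """Draw `k` labeled shots, split as evenly as possible (assumption).

    Positives stand for sarcastic pairs (e.g. from MuSE) and negatives for
    non-sarcastic pairs (e.g. from MMSD2.0).
    """
    rng = random.Random(seed)
    n_pos = (k + 1) // 2  # the extra shot goes to the positive class
    n_neg = k // 2
    shots = [(x, True) for x in rng.sample(positives, n_pos)]
    shots += [(x, False) for x in rng.sample(negatives, n_neg)]
    rng.shuffle(shots)  # avoid a fixed positive-then-negative ordering
    return shots
```

Fixing the seed keeps the evaluation subset identical across models, which matters when accuracies are compared on only 3,000 pairs.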

Evaluation Metrics

Classification Accuracy: Accuracy of binary classification
Δ-CLIPScore: Quantifies improvement in image-text alignment relative to original caption
```
ΔCLIP = CLIP(IMG, G_exp) - CLIP(IMG, B_exp)
```
where G_exp is the generated explanation and B_exp is the baseline explanation
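
A minimal sketch of the Δ-CLIPScore formula above. The scoring function is passed in as a parameter because the summary does not say which CLIP checkpoint or score scaling was used; `toy_score` below is a stand-in for illustration only and is not CLIP.

```python
from typing import Callable, Iterable


def delta_clip_score(
    image: object,
    generated_explanation: str,
    baseline_explanation: str,
    clip_score: Callable[[object, str], float],
) -> float:
    """Return CLIP(IMG, G_exp) - CLIP(IMG, B_exp) for one example."""
    generated = clip_score(image, generated_explanation)
    baseline = clip_score(image, baseline_explanation)
    return generated - baseline


def toy_score(image: Iterable[str], text: str) -> float:
    """Stand-in scorer (NOT CLIP): counts words shared with a set of image tags."""
    tags = set(image)  # here the "image" is just a collection of tag words
    return float(len(tags & set(text.lower().split())))


# Illustration with the toy scorer: two tags match the first text, none the second.
delta = delta_clip_score(
    {"rain", "umbrella", "beach"},
    "rain on the beach",
    "lovely sunny day",
    toy_score,
)
```

A positive value means the generated explanation aligns better with the image than the baseline text does; a negative value means it aligns worse.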

Implementation Details

All models loaded with 8-bit precision with FlashAttention optimization enabled
Batch size of 1, maximum generation tokens 100-256
Beam search with beam size=3
Temperature parameter set to 0.7
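
For reference, the reported settings could be collected into plain dictionaries like the ones below. The key names are borrowed from Hugging Face Transformers conventions (`num_beams`, `temperature`, `do_sample`, `max_new_tokens`, `load_in_8bit`, `attn_implementation`), but how the authors actually wired them in is an assumption. In particular, Transformers only applies `temperature` when sampling is enabled, and the summary does not say whether sampling was used alongside beam search.

```python
# Loading-related settings as reported; key names are assumptions.
LOAD_SETTINGS = {
    "load_in_8bit": True,                        # 8-bit precision
    "attn_implementation": "flash_attention_2",  # FlashAttention enabled
    "batch_size": 1,
}


def generation_settings(max_new_tokens: int = 256) -> dict:
    """Return decoding settings; the token budget must lie in the reported range."""
    if not 100 <= max_new_tokens <= 256:
        raise ValueError("reported range for generated tokens is 100-256")
    return {
        "num_beams": 3,       # beam search with beam size 3
        "temperature": 0.7,
        "do_sample": True,    # assumption: needed for temperature to take effect
        "max_new_tokens": max_new_tokens,
    }
```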

Experimental Results

Classification Performance

Dataset	Best Model	Setting	Accuracy
SarcNet	Gemma3	One-shot	0.67
SarcNet	InstructBLIP	Zero-shot	0.67
MMSD2.0	Gemma3	One-shot	0.73
MMSD2.0	InstructBLIP	Zero-shot	0.64
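
Accuracies like those in the table presuppose a step that maps free-form model output to a Yes/No label. The sketch below shows one way to do that; the parsing rule (first standalone "yes" or "no" wins, anything else counts as wrong) is an assumption and not necessarily the paper's procedure.

```python
import re
from typing import Iterable, Optional


def parse_yes_no(generated_text: str) -> Optional[bool]:
    """Map model output to True (Yes), False (No), or None if neither appears."""
    match = re.search(r"\b(yes|no)\b", generated_text.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


def accuracy(predictions: Iterable[Optional[bool]], labels: Iterable[bool]) -> float:
    """Fraction of correct predictions; unparseable outputs (None) count as wrong."""
    preds = list(predictions)
    golds = list(labels)
    if not golds or len(preds) != len(golds):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)
```

How unparseable answers are treated can shift accuracy noticeably for models that tend to ramble instead of answering Yes or No, so the choice should be reported alongside the scores.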

Key Findings

Instruction-tuned Model Advantages: The strongest results come from InstructBLIP in the zero-shot setting and Gemma3 in the one-shot setting
Limited Few-shot Effectiveness: Increasing example numbers does not improve performance and sometimes introduces noise
Dataset Differences: Models generally perform better on MMSD2.0 than on SarcNet

Explanation Generation Results

Model	Mean Δ-CLIPScore	Variance
LLaVA	1.966	27.315
BLIP2	0.831	25.532
PaliGemma	0.757	16.234
InstructBLIP	0.583	27.749
OpenFlamingo	-1.750	11.526
Gemma3	-2.063	46.481
Qwen	-7.143	25.515
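
The per-model summary statistics in the table can be computed from per-example Δ-CLIPScore values as sketched below. Whether the paper reports population or sample variance is not stated; population variance is used here as an assumption.

```python
from statistics import fmean, pvariance
from typing import Iterable, Tuple


def summarize_deltas(deltas: Iterable[float]) -> Tuple[float, float]:
    """Return (mean, population variance) of per-example delta scores."""
    values = list(deltas)
    if not values:
        raise ValueError("need at least one delta score")
    return fmean(values), pvariance(values)


# Toy values, not data from the paper.
mean, variance = summarize_deltas([1.0, -1.0, 3.0, 1.0])
```

The reported variances (about 11.5 to 46.5) correspond to standard deviations of roughly 3.4 to 6.8, which is larger than most of the gaps between model means, so the ranking should be read with that spread in mind.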

Important Findings

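Performance Separation: The best-performing classification model (Gemma3) is among the weakest at explanation generation, with a negative mean Δ-CLIPScore (-2.063) and the highest variance (46.481); only Qwen has a lower mean (-7.143)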
Architecture Impact: VQA-style architectures (BLIP2, LLaVA) are better suited for generating high-quality explanations
Training Objective Differences: Discriminatively-trained models excel at classification, while generatively-trained models are better for explanation

Related Work

Sarcasm Detection Research

Text Sarcasm Detection: Traditional research primarily focuses on identifying sarcasm in text-only settings
Multimodal Sarcasm Detection: Schifanella et al. first demonstrated that the visual modality contains cues that help identify sarcastic intent
Feature Aggregation Methods: Early work uses separate encoders to extract features, then designs aggregation techniques

Vision-Language Models

Pre-trained Models: Models like Flamingo and VILA demonstrate zero-shot and few-shot learning capabilities
Multimodal Understanding: Recent models focus on early modeling of cross-modal interactions
Instruction Tuning: Models like InstructBLIP improve multi-task performance through instruction tuning

Conclusions and Discussion

Main Conclusions

Moderate Success: Open-source VLMs achieve moderate success in binary sarcasm detection but have room for improvement
Explanation Challenges: Existing models face significant difficulties in generating high-quality explanations
Architecture Importance: Model architecture and training objectives significantly influence task-specific performance

Limitations

Sample Scale: Evaluation samples are relatively limited (3,000 samples per dataset)
Language Coverage: Primarily focuses on English with limited multilingual evaluation
Explanation Assessment: Explanation quality assessment relies mainly on automated metrics, lacking human evaluation

Future Directions

Hybrid Training Objectives: Develop multi-task learning methods that simultaneously optimize classification and explanation generation
Chain-of-Thought Prompting: Explore CoT and multi-stage prompting to elicit richer model reasoning
Knowledge Enhancement: Integrate RAG techniques or external knowledge to enhance contextual understanding
Multilingual Extension: Extend to sarcasm detection across more languages and cultural backgrounds

In-depth Evaluation

Strengths

Systematic Evaluation: First systematic evaluation of multiple open-source VLMs on MSD tasks
Dual Tasks: Simultaneously assesses classification and explanation capabilities for comprehensive perspective
Practical Value: Provides important reference for researchers selecting appropriate VLMs
Openness: Commits to open-sourcing code and data, promoting reproducible research

Weaknesses

Insufficient Deep Analysis: Limited qualitative analysis of model failure cases
Evaluation Metric Limitations: Explanation quality assessment relies primarily on CLIP alignment, potentially incomplete
Model Recency: Some model versions are relatively outdated and may not represent latest technological advances

Impact

Benchmark Role: Provides important benchmark evaluation for the MSD field
Methodological Inspiration: Unified evaluation framework can be generalized to other multimodal tasks
Practical Guidance: Provides reference for selecting appropriate models in practical applications

Applicable Scenarios

Social Media Analysis: Applicable to content understanding on platforms like Twitter and Facebook
Sentiment Analysis: Can serve as a component in broader sentiment analysis systems
Content Moderation: Helps identify potentially sarcastic content

References

The paper cites 46 relevant references covering important works in sarcasm detection, multimodal learning, and vision-language models across multiple research domains, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality empirical research paper that fills the gap in evaluating open-source VLMs on multimodal sarcasm detection tasks. The research design is sound, experiments are comprehensive, and conclusions have practical value. While there is room for improvement in deep analysis and evaluation metrics, the work makes important contributions to the field's development.