
Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection

Basnet, Farabi, Ranasinghe et al.
Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs - BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL - on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we evaluate the models' capabilities in generating explanations to sarcastic instances. We evaluate the capabilities of VLMs on three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model's performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.

Basic Information

  • Paper ID: 2510.11852
  • Title: Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection
  • Authors: Saroj Basnet (George Mason University), Shafkat Farabi (Virginia Tech), Tharindu Ranasinghe (Lancaster University), Diptesh Kanojia (University of Surrey), Marcos Zampieri (George Mason University)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11852v1

Abstract

This study evaluates seven state-of-the-art open-source vision-language models (VLMs) on multimodal sarcasm detection (MSD): BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL. It applies zero-shot, one-shot, and few-shot prompting and also assesses the models' ability to generate sarcasm explanations. Experiments are conducted on three benchmark datasets (MuSE, MMSD2.0, and SarcNet). The results show that current models achieve moderate success in binary sarcasm detection but fail to generate high-quality explanations without task-specific fine-tuning.

Research Background and Motivation

Problem Definition

  1. Core Problem: Evaluating the capabilities of open-source vision-language models on multimodal sarcasm detection (MSD) tasks, including both detection and explanation of sarcastic content
  2. Challenges: Sarcasm is a complex linguistic phenomenon where intended meaning contradicts literal expression. In multimodal environments, sarcastic effects often arise from mismatches between visual and textual content

Significance

  1. Social Media Prevalence: On social platforms, sarcasm is frequently realized through image-text pairings. Understanding such cross-modal inconsistencies is crucial for sentiment analysis and content understanding
  2. Technological Advancement: The development of large-scale vision-language models provides new opportunities for understanding complex subjective multimodal phenomena
  3. Application Value: Highly relevant to social media content moderation, sentiment analysis, and offensive language detection tasks

Limitations of Existing Approaches

  1. Insufficient Research: While VLMs demonstrate strong performance across various tasks, their performance on MSD tasks remains underexplored
  2. Methodological Constraints: Early MSD research primarily relied on separate feature extractors and feature aggregation techniques, lacking end-to-end multimodal understanding
  3. Explanation Capabilities: Existing models focus mainly on classification accuracy, with limited research on generating human-quality explanations

Core Contributions

  1. Unified Evaluation Framework: Provides a unified in-context learning framework that combines the image, few-shot examples, and explanation seeds in one prompt format applicable to all seven VLMs
  2. Systematic Benchmarking: Conducts systematic zero-shot, one-shot, and few-shot evaluations across three MSD benchmark datasets
  3. Explanation Generation Assessment: Evaluates each model's ability to generate free-form sarcasm explanations, filling a research gap in the field
  4. In-depth Analysis: Reveals a dissociation between classification performance and explanation quality (strong classifiers are not necessarily strong explainers), providing insights for future research

Methodology Details

Task Definition

Input: Image-text pairs (I, C), where I is the image and C is the caption text

Output:

  1. Binary classification: Determine whether the pair contains sarcasm (Yes/No)
  2. Explanation generation: For sarcastic instances, generate natural language descriptions explaining visual-textual inconsistencies
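
The binary output in item 1 implies mapping a free-form generation to a Yes/No label before scoring. A minimal sketch of that step follows; the function names `parse_answer` and `accuracy` are hypothetical, and the choice to count unparseable outputs as wrong is an assumption, since the source does not say how such cases were handled.

```python
import re
from typing import List, Optional


def parse_answer(generation: str) -> Optional[bool]:
    """Map a model generation to True (sarcastic), False, or None if unclear."""
    # Take the first standalone "yes" or "no" in the text, case-insensitively.
    match = re.search(r"\b(yes|no)\b", generation.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


def accuracy(generations: List[str], gold: List[bool]) -> float:
    """Fraction of items whose parsed answer equals the gold label.

    Assumption: unparseable generations (None) are counted as incorrect.
    """
    if len(generations) != len(gold):
        raise ValueError("generations and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(parse_answer(g) == y for g, y in zip(generations, gold))
    return correct / len(gold)
```

Matching on whole words avoids false hits such as the "no" inside "know".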

Dataset Description

Dataset    Positive   Negative   Explanations   Multilingual
MuSE       3,510      0          ✓              ×
MMSD2.0    11,651     12,980     ×              ×
SarcNet    1,875      1,460      ×              ✓

Model Architecture

Seven open-source VLMs evaluated:

  1. InstructBLIP: Instruction-tuned model based on FlanT5
  2. BLIP2 2.7B: Frozen image encoder + Q-former + large language model
  3. OpenFlamingo 3B: Lightweight open-source adaptation of Flamingo
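  4. LLaVA 7B: Vision encoder connected to a language model and trained with visual instruction tuning
  5. PaliGemma 3B: SigLIP vision encoder paired with a Gemma language model
  6. Qwen-VL 7B: Vision encoder connected to the Qwen language model through a position-aware vision-language adapter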
  7. Gemma3 27B: Instruction-tuned multimodal model

Prompting Strategies

Classification Task Prompt Structure:

*<global_instruction>*
Example: (zero-, one-, few-shots)
*<image>*
*Caption:<caption> Answer: Yes/No*
*<image>*
**Context:** {caption}
Is this sarcastic?
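
To make the template concrete, here is a minimal sketch of how a zero-, one-, or few-shot classification prompt could be assembled from it. The `Shot` class, the `build_classification_prompt` function, and the literal `<image>` placeholder string are hypothetical; each real VLM uses its own image token and chat format, and the emphasis markers shown in the template are omitted here.

```python
from dataclasses import dataclass
from typing import Sequence

IMAGE_TOKEN = "<image>"  # assumption: stand-in for a model-specific image token


@dataclass
class Shot:
    """One in-context example: a caption plus its gold label."""
    caption: str
    is_sarcastic: bool


def build_classification_prompt(
    instruction: str, shots: Sequence[Shot], caption: str
) -> str:
    """Assemble instruction, in-context examples, and the query in order."""
    lines = [instruction]
    for shot in shots:
        answer = "Yes" if shot.is_sarcastic else "No"
        lines.append(IMAGE_TOKEN)
        lines.append(f"Caption: {shot.caption} Answer: {answer}")
    # The query item comes last and is left unanswered.
    lines.append(IMAGE_TOKEN)
    lines.append(f"Context: {caption}")
    lines.append("Is this sarcastic?")
    return "\n".join(lines)
```

Passing an empty `shots` sequence yields the zero-shot prompt, one element the one-shot prompt, and so on, so a single function covers all three settings.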

Explanation Generation Prompt Structure:

*<Context>:*
*<image>*
**Original Caption**: {caption}
**Provided Explanation**: {explanation}
**Task Instruction**

Technical Innovations

  1. Unified Prompting Framework: Designs unified prompt templates applicable to different VLM architectures
  2. Multi-granular Evaluation: Combines classification accuracy and explanation quality assessment for comprehensive evaluation
  3. Cross-modal Alignment Assessment: Introduces Δ-CLIPScore to quantify improvements in image-text alignment

Experimental Setup

Data Processing

  • Randomly sample 3,000 image-caption pairs from MMSD2.0 and SarcNet for evaluation
  • Use MuSE dataset to provide explanation examples and evaluation benchmarks
  • Few-shot examples sampled from MuSE (positive) and MMSD2.0 (negative)
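
The sampling steps above can be sketched as follows. This is an illustration only: the function names, the fixed seed, and the roughly even split between positive and negative shots are assumptions, since the source does not state the seed or the per-class shot counts.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_eval_set(pairs: Sequence[T], n: int = 3000, seed: int = 0) -> List[T]:
    """Draw n distinct image-caption pairs uniformly at random."""
    if n > len(pairs):
        raise ValueError("cannot sample more pairs than the dataset contains")
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    return rng.sample(list(pairs), n)


def sample_shots(
    positives: Sequence[T], negatives: Sequence[T], k: int, seed: int = 0
) -> List[Tuple[T, bool]]:
    """Draw k in-context examples from a positive and a negative pool.

    Assumption: shots are split as evenly as possible, with the extra
    example going to the positive pool when k is odd.
    """
    n_pos, n_neg = (k + 1) // 2, k // 2
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)
    shots = [(ex, True) for ex in rng.sample(list(positives), n_pos)]
    shots += [(ex, False) for ex in rng.sample(list(negatives), n_neg)]
    rng.shuffle(shots)  # avoid a fixed positive-then-negative order
    return shots
```

Here `positives` would hold MuSE items and `negatives` MMSD2.0 non-sarcastic items, matching the pools named above.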

Evaluation Metrics

  1. Classification Accuracy: Proportion of image-caption pairs for which the predicted Yes/No label matches the gold label
  2. Δ-CLIPScore: Quantifies improvement in image-text alignment relative to original caption
    ΔCLIP = CLIP(IMG, G_exp) - CLIP(IMG, B_exp)
    
    where G_exp is the generated explanation and B_exp is the baseline explanation
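
As a worked illustration of the formula, the sketch below computes Δ-CLIPScore from embeddings that are assumed to be already extracted. It does not call a CLIP model; the function names are hypothetical, and the scale factor of 100 and the clipping of negative similarities at zero are assumptions, since the source gives only the difference formula.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-norm embedding")
    return dot / norm


def clip_score(
    image_emb: Sequence[float], text_emb: Sequence[float], scale: float = 100.0
) -> float:
    """Scaled, non-negative image-text similarity (scale is an assumption)."""
    return scale * max(cosine(image_emb, text_emb), 0.0)


def delta_clip_score(
    image_emb: Sequence[float],
    generated_emb: Sequence[float],
    baseline_emb: Sequence[float],
    scale: float = 100.0,
) -> float:
    """CLIP(IMG, G_exp) - CLIP(IMG, B_exp), as in the formula above."""
    return clip_score(image_emb, generated_emb, scale) - clip_score(
        image_emb, baseline_emb, scale
    )
```

A positive value means the generated explanation aligns with the image more closely than the baseline explanation does; a negative value means it aligns less closely.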

Implementation Details

  • All models loaded with 8-bit precision with FlashAttention optimization enabled
  • Batch size of 1, maximum generation length of 100-256 tokens
  • Beam search with beam size=3
  • Temperature parameter set to 0.7
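
The decoding settings above can be collected in a small configuration object, sketched below. The `DecodingConfig` class is hypothetical. Its key names mirror keyword arguments accepted by the `generate` method in Hugging Face Transformers, but the source does not name the library, so that mapping is an assumption. In that library, temperature only affects decoding when sampling is enabled; the source does not say whether it was, so `do_sample=True` is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical container for the decoding settings listed above."""

    num_beams: int = 3          # beam size stated in the text
    temperature: float = 0.7    # temperature stated in the text
    max_new_tokens: int = 256   # upper end of the stated 100-256 range
    do_sample: bool = True      # ASSUMPTION: not stated in the source

    def to_generate_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments, dropping temperature when it is unused."""
        if self.num_beams < 1:
            raise ValueError("num_beams must be at least 1")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        kwargs: Dict[str, Any] = {
            "num_beams": self.num_beams,
            "max_new_tokens": self.max_new_tokens,
            "do_sample": self.do_sample,
        }
        if self.do_sample:
            # Temperature rescales logits only when tokens are sampled.
            kwargs["temperature"] = self.temperature
        return kwargs
```

With `do_sample=False` the same settings reduce to plain beam search, which makes explicit that the temperature of 0.7 would then have no effect.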

Experimental Results

Classification Performance

Dataset    Best Model     Setting     Accuracy
SarcNet    Gemma3         One-shot    0.67
SarcNet    InstructBLIP   Zero-shot   0.67
MMSD2.0    Gemma3         One-shot    0.73
MMSD2.0    InstructBLIP   Zero-shot   0.64

Key Findings

  1. Instruction-tuned Model Advantages: Gemma3 and InstructBLIP perform best in zero-shot and one-shot settings
  2. Limited Few-shot Effectiveness: Increasing example numbers does not improve performance and sometimes introduces noise
  3. Dataset Differences: Models generally perform better on MMSD2.0 than on SarcNet

Explanation Generation Results

Model          Mean Δ-CLIPScore   Variance
LLaVA          1.966              27.315
BLIP2          0.831              25.532
PaliGemma      0.757              16.234
InstructBLIP   0.583              27.749
Gemma3         -2.063             46.481
OpenFlamingo   -1.750             11.526
Qwen           -7.143             25.515
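
The two columns above are summary statistics over per-sample Δ-CLIPScore values. A minimal sketch of that aggregation follows; the function name is hypothetical, and the use of population variance rather than sample variance is an assumption, since the source does not say which was reported.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def summarize_deltas(deltas: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population variance) of per-sample delta scores."""
    if not deltas:
        raise ValueError("no scores to summarize")
    return fmean(deltas), pvariance(deltas)
```

Read this way, the variances are large relative to the means: LLaVA's variance of 27.315 corresponds to a standard deviation of roughly 5.2 around a mean of 1.966, so individual explanations often score below the baseline even for the top-ranked model.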

Important Findings

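  1. Performance Separation: The best-performing classification model (Gemma3) is among the weakest in explanation generation, with a negative mean Δ-CLIPScore (-2.063) and the highest variance (46.481); only Qwen has a lower mean (-7.143)
  2. Architecture Impact: VQA-style architectures (BLIP2, LLaVA) achieve the highest mean Δ-CLIPScores (0.831 and 1.966), suggesting they are better suited to explanation generation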
  3. Training Objective Differences: Discriminatively-trained models excel at classification, while generatively-trained models are better for explanation

Related Work

Sarcasm Detection Research

  1. Text Sarcasm Detection: Traditional research primarily focuses on sarcasm identification in pure text environments
  2. Multimodal Sarcasm Detection: Schifanella et al. first demonstrated that visual modality contains clues helpful for identifying sarcastic intent
  3. Feature Aggregation Methods: Early work uses separate encoders to extract features, then designs aggregation techniques

Vision-Language Models

  1. Pre-trained Models: Models like Flamingo and VILA demonstrate zero-shot and few-shot learning capabilities
  2. Multimodal Understanding: Recent models focus on early modeling of cross-modal interactions
  3. Instruction Tuning: Models like InstructBLIP improve multi-task performance through instruction tuning

Conclusions and Discussion

Main Conclusions

  1. Moderate Success: Open-source VLMs achieve moderate success in binary sarcasm detection but have room for improvement
  2. Explanation Challenges: Existing models face significant difficulties in generating high-quality explanations
  3. Architecture Importance: Model architecture and training objectives significantly influence task-specific performance

Limitations

  1. Sample Scale: Evaluation samples are relatively limited (3,000 samples per dataset)
  2. Language Coverage: Primarily focuses on English with limited multilingual evaluation
  3. Explanation Assessment: Explanation quality assessment relies mainly on automated metrics, lacking human evaluation

Future Directions

  1. Hybrid Training Objectives: Develop multi-task learning methods that simultaneously optimize classification and explanation generation
  2. Chain-of-Thought Prompting: Explore CoT and multi-stage prompting to elicit richer model reasoning
  3. Knowledge Enhancement: Integrate RAG techniques or external knowledge to enhance contextual understanding
  4. Multilingual Extension: Extend to sarcasm detection across more languages and cultural backgrounds

In-depth Evaluation

Strengths

  1. Systematic Evaluation: First systematic evaluation of multiple open-source VLMs on MSD tasks
  2. Dual Tasks: Simultaneously assesses classification and explanation capabilities for comprehensive perspective
  3. Practical Value: Provides important reference for researchers selecting appropriate VLMs
  4. Openness: Commits to open-sourcing code and data, promoting reproducible research

Weaknesses

  1. Insufficient Deep Analysis: Limited qualitative analysis of model failure cases
  2. Evaluation Metric Limitations: Explanation quality assessment relies primarily on CLIP alignment, potentially incomplete
  3. Model Recency: Some model versions are relatively outdated and may not represent latest technological advances

Impact

  1. Benchmark Role: Provides important benchmark evaluation for the MSD field
  2. Methodological Inspiration: Unified evaluation framework can be generalized to other multimodal tasks
  3. Practical Guidance: Provides reference for selecting appropriate models in practical applications

Applicable Scenarios

  1. Social Media Analysis: Applicable to content understanding on platforms like Twitter and Facebook
  2. Sentiment Analysis: Can serve as a component in broader sentiment analysis systems
  3. Content Moderation: Helps identify potentially sarcastic content

References

The paper cites 46 relevant references covering important works in sarcasm detection, multimodal learning, and vision-language models across multiple research domains, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality empirical research paper that fills the gap in evaluating open-source VLMs on multimodal sarcasm detection tasks. The research design is sound, experiments are comprehensive, and conclusions have practical value. While there is room for improvement in deep analysis and evaluation metrics, the work makes important contributions to the field's development.