PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning

Kawakami, Egashira, Miyai et al.

In recent years, unlearning techniques, which are methods for inducing a model to "forget" previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, the existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce the PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.

Basic Information

Paper ID: 2507.01271
Title: PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
Authors: Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie, Kiyoharu Aizawa (University of Tokyo)
Classification: cs.LG cs.AI
Publication Date/Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop
Paper Link: https://arxiv.org/abs/2507.01271

Abstract

In recent years, machine unlearning has gained attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While multiple unlearning benchmarks have been established for LLMs, practical unlearning evaluation frameworks for LMMs remain underexplored. Existing LMM unlearning benchmarks only consider scenarios where fine-tuned knowledge is forgotten through a single unlearning operation. This study introduces the PULSE protocol, which adds two key perspectives: (i) pre-trained knowledge unlearning, to analyze how the stage at which knowledge was acquired affects unlearning, and (ii) long-term sustainability evaluation, to address sequential unlearning requests. The results show that while some techniques successfully forget knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Furthermore, methods that effectively forget a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and forgotten sequentially.

Research Background and Motivation

Problem Definition

As large multimodal models achieve tremendous success across various tasks, their training data may contain personal information and copyrighted content, raising concerns about privacy and intellectual property infringement. Machine unlearning aims to enable models to "forget" previously learned information while maintaining performance on other tasks.

Problem Significance

Privacy Protection Requirements: With strengthened data privacy regulations, techniques are needed to remove specific personal information from trained models
Copyright Protection: Addressing potentially copyrighted content in training data
Practical Application Needs: Real-world scenarios may require multiple sequential unlearning operations

Limitations of Existing Methods

Limited Evaluation Scope: Existing LMM unlearning benchmarks (e.g., MLLMU-Bench) only consider forgetting fine-tuned knowledge
Single Operation Assumption: Existing benchmarks evaluate only one-time unlearning operations, ignoring scenarios with sequential unlearning requests
Lack of Pre-trained Knowledge Assessment: Existing benchmarks do not consider forgetting knowledge acquired during pre-training

Research Motivation

This paper aims to establish a more practical and comprehensive evaluation framework for LMM unlearning, filling gaps in existing evaluation methods regarding pre-trained knowledge unlearning and sustainability.

Core Contributions

Proposes PULSE Protocol: Designs a new protocol for evaluating (i) pre-trained knowledge unlearning and (ii) long-term sustainability assessment in LMMs
Reveals Difficulty of Pre-trained Knowledge Unlearning: Using the PULSE protocol, shows that existing unlearning techniques perform poorly when targeting knowledge acquired during pre-training
Identifies Sustainability Issues: Demonstrates that current methods show significant performance degradation when facing multiple sequential unlearning requests
Provides Practical Evaluation Foundation: Offers important insights for future design of LMM unlearning techniques

Methodology Details

Task Definition

Let $D_{unlearn}$ denote data to be forgotten and $D_{retain}$ denote data to be retained. Evaluation of unlearning methods encompasses two aspects:

Effectiveness: How thoroughly the target data $D_{unlearn}$ is forgotten
Generality: How well accuracy is preserved on the unrelated data $D_{retain}$
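
The two aspects above can be sketched as a small evaluation helper. This is a minimal illustration, not the paper's evaluation code: the `(question, gold_answer)` item format, exact-match scoring, and the names `accuracy`, `evaluate_unlearning`, and `toy_predict` are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed item format for illustration: (question, gold_answer).
QAItem = Tuple[str, str]


def accuracy(predict: Callable[[str], str], items: Sequence[QAItem]) -> float:
    """Fraction of items whose predicted answer exactly matches the gold answer."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for question, gold in items if predict(question) == gold)
    return correct / len(items)


def evaluate_unlearning(
    predict: Callable[[str], str],
    d_unlearn: Sequence[QAItem],
    d_retain: Sequence[QAItem],
) -> Dict[str, float]:
    """Report effectiveness (on D_unlearn) and generality (on D_retain)."""
    return {
        "unlearn_acc": accuracy(predict, d_unlearn),  # lower is better
        "retain_acc": accuracy(predict, d_retain),    # higher is better
    }


# Toy stand-in for an unlearned model: it no longer answers q1 and q2.
ANSWERS = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
FORGOTTEN = {"q1", "q2"}


def toy_predict(question: str) -> str:
    return "unknown" if question in FORGOTTEN else ANSWERS[question]


d_unlearn = [("q1", "a"), ("q2", "b")]
d_retain = [("q3", "c"), ("q4", "d")]
scores = evaluate_unlearning(toy_predict, d_unlearn, d_retain)
```

In this toy case the ideal outcome is reached: accuracy on the forget set drops to zero while accuracy on the retain set is untouched.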

PULSE Protocol Architecture

1. Fine-tuned Knowledge Unlearning

Follows standard practice, selecting a subset of fine-tuned knowledge as $D_{unlearn}$
Model forgets this subset in a single operation
Evaluates unlearning effectiveness and generalization performance retention

2. Pre-trained Knowledge Unlearning

Treats knowledge acquired during pre-training as $D_{unlearn}$
Identifies individuals the model "knows" based on actual model behavior
More practical than direct sampling from pre-training data, applicable when pre-training corpus is not fully public
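
One way to read "identifies individuals the model knows based on actual model behavior" is as a filter on per-individual accuracy before any unlearning takes place. The sketch below assumes that reading; the `threshold` value and the function name are hypothetical, and this summary does not state the paper's exact selection criterion.

```python
from typing import Dict, List


def select_known_individuals(
    per_person_accuracy: Dict[str, float], threshold: float
) -> List[str]:
    """Keep individuals whose pre-unlearning QA accuracy meets the threshold.

    The threshold is an assumption for illustration, not a value from the paper.
    """
    return sorted(
        name for name, acc in per_person_accuracy.items() if acc >= threshold
    )


# Hypothetical accuracies measured on the original (not yet unlearned) model.
measured = {"person_a": 0.9, "person_b": 0.4, "person_c": 1.0}
known = select_known_individuals(measured, threshold=0.8)
```

Only individuals the model already answers well are candidates for $D_{unlearn}$ or $D_{retain}$, since forgetting cannot be measured for someone the model never knew.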

3. Long-term Sustainability Evaluation

Divides $D_{unlearn}$ into multiple subsets
Performs sequential unlearning operations on these subsets
Tracks changes in model generalization and effectiveness after each operation
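
The sequential procedure can be sketched as a loop that unlearns one subset at a time and records metrics after every step. This is an illustrative outline under stated assumptions: `unlearn_step` and `evaluate` are placeholder callables, and the toy "model" is just a set of known IDs, so it shows the bookkeeping rather than the degradation the paper reports.

```python
from typing import Any, Callable, Dict, List, Sequence


def split_into_subsets(items: Sequence[Any], num_subsets: int) -> List[List[Any]]:
    """Split items into num_subsets contiguous chunks of near-equal size."""
    if not 0 < num_subsets <= len(items):
        raise ValueError("num_subsets must be between 1 and len(items)")
    base, extra = divmod(len(items), num_subsets)
    subsets, start = [], 0
    for i in range(num_subsets):
        size = base + (1 if i < extra else 0)
        subsets.append(list(items[start:start + size]))
        start += size
    return subsets


def sequential_unlearning(
    model: Any,
    d_unlearn: Sequence[Any],
    num_subsets: int,
    unlearn_step: Callable[[Any, List[Any]], Any],
    evaluate: Callable[[Any], Dict[str, float]],
) -> List[Dict[str, float]]:
    """Apply one unlearning operation per subset and log metrics after each."""
    history = []
    for subset in split_into_subsets(d_unlearn, num_subsets):
        model = unlearn_step(model, subset)
        history.append(evaluate(model))
    return history


# Toy stand-ins: the "model" is the set of individuals it still knows.
individuals = list(range(50))


def toy_unlearn_step(known: frozenset, subset: List[int]) -> frozenset:
    return known - frozenset(subset)


def toy_evaluate(known: frozenset) -> Dict[str, float]:
    return {"unlearn_acc": len(known & frozenset(individuals)) / len(individuals)}


history = sequential_unlearning(
    frozenset(individuals), individuals, 5, toy_unlearn_step, toy_evaluate
)
```

In a real run, `evaluate` would also report accuracy on $D_{retain}$ and a general benchmark score, which is where the sustainability problem shows up.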

Technical Innovations

Multi-dimensional Evaluation Framework: First to simultaneously consider knowledge source types and operation sustainability in LMMs
Practicality-Oriented Design: Evaluation protocol designed based on real-world application scenarios
Cross-modal Consistency Requirements: Requires that models not leak target information in either multimodal or pure text tasks

Experimental Setup

Datasets

Uses publicly released MLLMU-Bench datasets:

Each individual has 1 facial image and 10 question-answer pairs
The 10 pairs comprise 5 multimodal tasks and 5 pure text tasks
Questions involve personal details (e.g., occupation, residence)

Experimental Configuration:

Fine-tuned Knowledge Unlearning: 100 fictional individuals, 50 for $D_{unlearn}$, 50 for $D_{retain}$
Pre-trained Knowledge Unlearning: 45 high-accuracy individuals selected from 153 real celebrities, 20 for $D_{unlearn}$, 25 for $D_{retain}$
Sustainability Evaluation: 50 individuals divided into 5 subsets, 5 sequential unlearning operations

Evaluation Metrics

Effectiveness Metrics: Accuracy on $D_{unlearn}$ (lower is better)
Generality Metrics:
- Accuracy on $D_{retain}$ (higher is better)
- MMBench score (evaluates multimodal capabilities)

Comparison Methods

Gradient Ascent (GA): Uses $D_{unlearn}$ as unlearning data, parameter update direction opposite to standard gradient descent
GA with KL Regularization (GA+KLR): Adds KL divergence penalty term to keep updated model close to original
Negative Preference Optimization (NPO): Preference optimization method treating unlearning data as negative examples
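
The three objectives can be written compactly in terms of sequence log-probabilities. The sketch below uses the commonly published forms of these losses and is not taken from the paper's code: the KL direction and the choice of distributions passed to it, the weight `lam`, and the temperature `beta` are assumptions for illustration.

```python
import math
from typing import Sequence


def ga_loss(logp_theta: Sequence[float]) -> float:
    """Gradient ascent: minimizing mean log-likelihood ascends the usual NLL."""
    return sum(logp_theta) / len(logp_theta)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ga_klr_loss(
    logp_theta: Sequence[float],
    p_ref: Sequence[float],
    p_theta: Sequence[float],
    lam: float,
) -> float:
    """GA plus a KL penalty that keeps the updated model near the original.

    Using KL(p_ref || p_theta) with weight lam is an assumed formulation.
    """
    return ga_loss(logp_theta) + lam * kl_divergence(p_ref, p_theta)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def npo_loss(
    logp_theta: Sequence[float], logp_ref: Sequence[float], beta: float
) -> float:
    """NPO: -(2/beta) * mean(log sigmoid(-beta * (logp_theta - logp_ref)))."""
    terms = [
        log_sigmoid(-beta * (lt - lr)) for lt, lr in zip(logp_theta, logp_ref)
    ]
    return -(2.0 / beta) * sum(terms) / len(terms)
```

Unlike plain GA, whose loss is unbounded below, the NPO loss flattens toward zero once the model assigns much lower probability to the forget data than the reference model does.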

Implementation Details

Base Model: LLaVA-v1.5-13B
Fine-tuning Method: LoRA (Low-Rank Adaptation)
Parameter Updates: Uses LoRA in both fine-tuning and unlearning processes
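
For readers unfamiliar with LoRA, the update it applies to a frozen weight matrix can be shown with a few lines of arithmetic. This is a generic sketch of the standard LoRA formulation, not the paper's configuration: the matrix sizes, `alpha`, and `rank` values are made up for the example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(
    w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int
) -> Matrix:
    """W_eff = W + (alpha / rank) * B @ A, with B: d_out x r and A: r x d_in.

    W stays frozen; only the low-rank factors A and B are trained.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy 2x2 weight with a rank-1 adapter (values are illustrative only).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d_out x r
a = [[0.5, 0.0]]     # r x d_in
w_eff = lora_effective_weight(w, b, a, alpha=2.0, rank=1)
```

Because both fine-tuning and unlearning go through adapters of this form, each stage changes the model only through a low-rank correction to the frozen base weights.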

Experimental Results

Main Results

Pre-trained Knowledge Unlearning Performance

All methods show reduced accuracy on $D_{unlearn}$, indicating that unlearning is effective to some extent
Key Findings:
- Forgetting fine-tuned knowledge: MMBench capability loss ~10% at most
- Forgetting pre-trained knowledge: MMBench capability loss exceeds 90%
- $D_{retain}$ accuracy also significantly decreases, indicating difficulty in selective unlearning

Sustainability Evaluation Results

As the number of unlearning operations grows, generality metrics deteriorate along with accuracy on $D_{unlearn}$
After 5 unlearning operations, generality has almost completely vanished
This indicates that current mainstream unlearning methods are not sustainable under sequential LMM unlearning

In-depth Analysis

Task Modality Differences

When both the projection matrix and the language model are updated:

Multimodal task accuracy: 78.0% → 9.6%
Pure text task accuracy: 76.8% → 35.2%

Important Finding: Pure text tasks are more resistant to unlearning. This suggests that the methods may only be "disrupting image-knowledge alignment" rather than truly removing the target information.
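
The gap between the two modalities is easier to see as absolute and relative drops. The short calculation below uses only the four accuracy figures quoted above; the helper name `accuracy_drop` is introduced here for illustration.

```python
from typing import Tuple


def accuracy_drop(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent), rounded to 0.1."""
    absolute = before - after
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


multimodal_drop = accuracy_drop(78.0, 9.6)   # multimodal tasks
text_only_drop = accuracy_drop(76.8, 35.2)   # pure text tasks
```

Multimodal accuracy falls by about 68.4 points (roughly 88% of its starting value), while pure text accuracy falls by about 41.6 points (roughly 54%), so a sizable share of the target knowledge remains reachable through text alone.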

Parameter Update Strategy Impact

LLM-only Updates: Significant MMBench performance decrease
Simultaneous Updates to Projection Matrix and LLM: Slight MMBench performance decrease
Hypothesis: Allowing projection matrix updates facilitates unlearning by disrupting inter-modal alignment
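
The two update strategies differ only in which parameter groups are allowed to change. A minimal sketch of that selection is shown below; the name prefixes (`language_model.`, `projector.`, `vision_tower.`) are hypothetical placeholders and may not match the parameter names in any particular LLaVA implementation.

```python
from typing import Iterable, List

# Hypothetical parameter-name prefixes, chosen for illustration only.
LLM_PREFIX = "language_model."
PROJECTOR_PREFIX = "projector."


def select_trainable(names: Iterable[str], update_projector: bool) -> List[str]:
    """Pick parameter names to update: LLM only, or LLM plus projection matrix."""
    prefixes = (LLM_PREFIX, PROJECTOR_PREFIX) if update_projector else (LLM_PREFIX,)
    return [name for name in names if name.startswith(prefixes)]


all_names = [
    "vision_tower.layer0.weight",
    "projector.linear.weight",
    "language_model.layer0.weight",
]
llm_only = select_trainable(all_names, update_projector=False)
llm_and_projector = select_trainable(all_names, update_projector=True)
```

In both strategies the vision encoder is left untouched in this sketch; whether the paper freezes it is not stated in this summary.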

Experimental Findings

Pre-trained Knowledge More Difficult to Forget: Possibly because models learn relationships between target individuals and other entities during pre-training
Continuous Unlearning Causes Catastrophic Forgetting: Repeated unlearning updates parameters critical for retained tasks
Cross-modal Inconsistency: Existing methods may fail to ensure consistent unlearning across modalities

Related Work

Unlearning Methodology

Gradient Ascent Variants: GA, GA with regularization, NPO, and related methods have shown some effectiveness in both LLMs and LMMs
LMM-Specific Methods: SIU limited to multimodal tasks, unsuitable for pure text task evaluation

Unlearning Benchmarks

LLM Benchmarks: MUSE, TOFU provide comprehensive evaluation frameworks
LMM Benchmarks: MLLMU-Bench provides basic but incomplete evaluation
This Paper's Contribution: First to provide pre-trained knowledge unlearning and sustainability evaluation in LMMs

Conclusions and Discussion

Main Conclusions

Existing unlearning methods perform poorly on pre-trained knowledge, causing severe model generalization degradation
Continuous unlearning operations lead to gradual performance deterioration; current methods unsuitable for practical deployment
Inconsistency exists between multimodal and pure text tasks in unlearning effectiveness

Limitations

Dataset Scale: Relatively small datasets used in experiments may not fully reflect large-scale application scenarios
Method Coverage: Only evaluates three mainstream unlearning methods, not covering all existing techniques
Evaluation Metrics: May require more fine-grained metrics for comprehensive unlearning assessment

Future Directions

Develop unlearning methods specifically targeting pre-trained knowledge
Design unlearning techniques maintaining long-term sustainability
Research methods for cross-modal consistent unlearning
Explore more refined parameter update strategies

In-depth Evaluation

Strengths

Accurate Problem Identification: Precisely identifies key deficiencies in existing LMM unlearning evaluation
Complete Evaluation Framework: PULSE protocol fills important evaluation gaps
Reasonable Experimental Design: Experimental setup closely aligns with practical scenarios
Insightful Findings: Reveals important issues in pre-trained knowledge unlearning and sustainability
Clear Writing: Well-structured paper with accurate technical descriptions

Weaknesses

Limited Method Innovation: Main contributions in evaluation protocol rather than new unlearning methods
Missing Solutions: Identifies problems but provides no effective solutions
Insufficient Theoretical Analysis: Relatively simple theoretical explanations for observed phenomena
Experimental Scale Limitations: Relatively small experimental scale constrained by existing datasets

Impact

Academic Value: Provides important evaluation benchmark for LMM unlearning research
Practical Value: The identified problems offer important guidance for practical applications
Driving Force: May promote development of more practical unlearning methods
Reproducibility: Clear experimental setup based on public datasets with good reproducibility

Applicable Scenarios

Research Evaluation: Provides standard protocol for evaluating LMM unlearning methods
Method Development: Offers evaluation benchmark for new unlearning method design
Practical Deployment: Provides performance expectations for unlearning needs in practical applications
Policy Making: Provides technical reference for related privacy protection policies

References

The paper cites multiple important related works, including:

LLM unlearning benchmarks such as MUSE and TOFU
LMM unlearning benchmarks such as MLLMU-Bench
Multimodal models such as LLaVA
Parameter-efficient fine-tuning methods such as LoRA

Overall Assessment: This is a high-quality evaluation research paper that, while relatively limited in methodological innovation, makes important contributions in problem identification and evaluation framework establishment. The paper's findings regarding pre-trained knowledge unlearning difficulty and sustainability issues provide important guidance for field development and point toward key future research directions.