Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion

Guo, Wen, Gao et al.

Machine unlearning, which selectively removes harmful knowledge from a pre-trained model without retraining from scratch, is crucial for addressing privacy, regulatory compliance, and ethical concerns in Large Language Models (LLMs). However, existing unlearning methods often struggle to thoroughly remove harmful knowledge, leaving residual harmful knowledge that can be easily recovered. To address these limitations, we propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR), a novel approach that first identifies layers with rich harmful knowledge and then thoroughly eliminates the harmful knowledge via re-insertion strategy. Our method introduces knowledge density estimation to quantify and locate layers containing the most harmful knowledge, enabling precise unlearning. Additionally, we design a layer re-insertion strategy that extracts and re-inserts harmful knowledge-rich layers into the original LLM, bypassing gradient obstruction caused by cover layers and ensuring effective gradient propagation during unlearning. Extensive experiments conducted on several unlearning and general capability benchmarks demonstrate that KUnBR achieves state-of-the-art forgetting performance while maintaining model utility.

Basic Information

Paper ID: 2511.11667
Title: Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion
Authors: Feng Guo, Yuntao Wen, Shen Gao, Junshuo Zhang, Shuo Shang (University of Electronic Science and Technology of China)
Classification: cs.LG, cs.AI
Publication Time/Conference: AAAI 2026 (Expected)
Paper Link: https://arxiv.org/abs/2511.11667
Code Link: github.com/llmgfffffff/Beyond-Superficial-Forgetting-KUnBR

Abstract

This paper addresses the machine unlearning problem in Large Language Models (LLMs) and proposes a novel method called KUnBR (Knowledge Density-Guided Unlearning via Blocks Reinsertion). Existing unlearning methods often fail to thoroughly remove harmful knowledge, leaving residual knowledge that can be easily recovered. KUnBR identifies layers rich in harmful knowledge through knowledge density estimation, then employs a block re-insertion strategy to completely eliminate harmful knowledge. The method bypasses gradient blockage caused by "cover layers," ensuring effective gradient propagation. Experiments on multiple benchmarks demonstrate that KUnBR achieves state-of-the-art unlearning performance while maintaining the model's general capabilities.

Research Background and Motivation

1. Core Problem to Address

Machine unlearning aims to selectively remove specific subsets of knowledge from pre-trained models (such as privacy-sensitive or harmful content) without retraining from scratch. This is crucial for LLM development as it involves data privacy, regulatory compliance (such as the "right to be forgotten"), and ethical considerations for AI systems.

2. Problem Importance

Privacy Protection: LLMs may ingest large amounts of privacy-sensitive data during pre-training
Regulatory Compliance: Regulations like GDPR require the ability to delete specific user data
Safety: Preventing malicious exploitation of harmful knowledge in models
Ethical Alignment: Ensuring LLMs remain consistent with societal values

3. Limitations of Existing Methods

Existing unlearning methods (such as gradient ascent, representation misdirection) have serious flaws:

Superficial Forgetting: Only modifying a few parameters (cover layers) to suppress output rather than truly eliminating knowledge
Easy Recovery: RTT (Retraining on T) attacks show that most "forgotten" knowledge can be recovered through minimal retraining on subsets of the unlearning set
Residual Knowledge: Harmful knowledge remains in model parameters, merely masked rather than eliminated
Poor Robustness: Vulnerable to jailbreak attacks and parameter-level attacks

4. Research Motivation

The authors discovered that existing methods primarily rely on adjusting "cover layers" to mask representations of harmful knowledge, merely preventing undesirable model outputs without truly eliminating knowledge from internal representations. This fundamental limitation suggests the need for more robust and thorough unlearning methods.

Core Contributions

Proposes KUnBR Framework: A novel unlearning framework capable of identifying layers containing harmful knowledge and performing targeted training to achieve thorough elimination of harmful knowledge
Knowledge Density Estimation Method: Introduces a gradient-based knowledge density estimation metric that quantifies and locates layers in LLMs containing the most harmful knowledge, enabling precise unlearning
Block Re-insertion Strategy: Designs a novel layer re-insertion strategy that extracts blocks rich in harmful knowledge and re-inserts them into the original LLM, bypassing gradient blockage caused by cover layers and ensuring effective gradient propagation during unlearning
SOTA Performance: Achieves state-of-the-art unlearning performance on multiple unlearning and general capability benchmarks while maintaining model utility, particularly excelling in resistance to RTT attacks

Method Details

Task Definition

Given:

Unlearning Dataset $D_{forget}$: Contains knowledge to be removed
Retention Dataset $D_{retain}$: Helps the model maintain general capabilities during unlearning

Objective:

Optimize model parameters to thoroughly eliminate knowledge related to $D_{forget}$
Ensure model utility performance remains unaffected
When subjected to RTT attacks (fine-tuning on a subset T of $D_{forget}$), the model still cannot generate knowledge from another, disjoint subset V of $D_{forget}$

Model Architecture

The KUnBR method comprises three main steps:

Step 1: Pre-Unlearning

Uses standard gradient difference methods for full-parameter fine-tuning of the original LLM as a "warm-up" phase: $\theta_{t+1} = \theta_t - \eta (\alpha\nabla_\theta L_{retain}(\theta_t) - \nabla_\theta L_{forget}(\theta_t))$

Where:

$\eta$ is the learning rate
$\alpha$ is the retention coefficient
$L_{retain}$ and $L_{forget}$ are losses on retention and unlearning sets respectively
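
As a minimal sketch of the update rule above, the following applies one gradient difference step to a toy parameter vector. The function name, the list-of-floats representation, and all numeric values are illustrative assumptions; the paper applies the rule to full LLM parameters.

```python
def gradient_difference_step(theta, grad_retain, grad_forget, lr, alpha):
    """One step of theta - lr * (alpha * grad_retain - grad_forget).

    This descends on the retain loss (scaled by alpha) and ascends on the
    forget loss, matching the sign convention of the formula above.
    """
    return [
        t - lr * (alpha * g_r - g_f)
        for t, g_r, g_f in zip(theta, grad_retain, grad_forget)
    ]


# Toy values, chosen only for illustration (not taken from the paper)
theta = [1.0, -2.0]
grad_retain = [0.5, 0.0]
grad_forget = [0.0, 1.0]

new_theta = gradient_difference_step(
    theta, grad_retain, grad_forget, lr=0.1, alpha=0.1
)
```

The first parameter moves slightly against the retain gradient, while the second moves along the forget gradient, so the forget loss increases.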

Step 2: Knowledge Density Estimation and Block Selection

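Knowledge Density Calculation: For layer $l$ with parameters $\theta_l$, knowledge density is defined as the expected L1 norm of the loss gradient with respect to that layer's parameters: $K_l = \mathbb{E}_{(x,y)\sim D_{forget}}[\|\nabla_{\theta_l}L(x,y;\theta)\|_1]$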

Where $L(x,y;\theta) = -\log(p(y|x;\theta))$ is the negative log-likelihood loss.

Normalized Knowledge Density: $K_l^{norm} = \frac{K_l}{\sum_{i=1}^H K_i}$

Represents the proportion of layer $l$'s knowledge density relative to all layers.

Block-level Knowledge Density: Dividing $H$ layers into $M$ blocks with $N=\lfloor H/M \rfloor$ layers each, the cumulative knowledge density of block $m$ is: $K_{block,m} = \sum_{i=(m-1)N+1}^{mN} K_i^{norm}$
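
The three quantities above can be sketched on toy data as follows. The function names, the flattened per-example gradient lists, and the eight-layer example are assumptions made for illustration; in practice the gradients would come from backpropagating the negative log-likelihood on $D_{forget}$.

```python
def layer_knowledge_density(per_example_grads):
    """K_l: mean over examples of the L1 norm of one layer's gradient."""
    l1_norms = [sum(abs(g) for g in grad) for grad in per_example_grads]
    return sum(l1_norms) / len(l1_norms)


def normalize_density(layer_density):
    """K_l^norm: each layer's share of the total density."""
    total = sum(layer_density)
    return [k / total for k in layer_density]


def block_density(normalized, num_blocks):
    """K_block,m: sum of normalized densities over N = floor(H / M) layers."""
    n = len(normalized) // num_blocks
    return [sum(normalized[m * n:(m + 1) * n]) for m in range(num_blocks)]


# Two made-up gradient vectors for a single layer
example_density = layer_knowledge_density([[0.5, -0.5], [1.0, -2.0]])

# Made-up densities for H = 8 layers, grouped into M = 4 blocks
layer_density = [1.0, 3.0, 2.0, 2.0, 4.0, 4.0, 1.0, 3.0]
normalized = normalize_density(layer_density)
blocks = block_density(normalized, num_blocks=4)
```

Note that when $H$ is not divisible by $M$, the floor in $N$ leaves the trailing layers outside every block in this sketch; how the paper handles that case is not stated here.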

Block Selection Strategy:

Top-K Selection: Select K blocks with highest knowledge density
Ignore Head Layers: Exclude blocks containing the last two layers to avoid interference from output generation layers
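
A possible reading of the two selection rules above is sketched below: blocks overlapping the last two layers are dropped first, then the K densest remaining blocks are kept. The function name, the 0-based indexing, and the block scores are hypothetical, and H = 32 is an assumption about the backbone's layer count rather than a value given in this summary.

```python
def select_blocks(block_scores, top_k, layers_per_block, num_layers,
                  skip_last_layers=2):
    """Return 0-based indices of the top_k densest blocks, ignoring head blocks."""
    first_excluded_layer = num_layers - skip_last_layers
    # Block m covers layers [m * N, (m + 1) * N - 1]; keep it only if it
    # ends before the excluded head layers begin.
    candidates = [
        m for m in range(len(block_scores))
        if (m + 1) * layers_per_block <= first_excluded_layer
    ]
    ranked = sorted(candidates, key=lambda m: block_scores[m], reverse=True)
    return sorted(ranked[:top_k])


# Made-up block densities for M = 8 blocks of N = 4 layers (H = 32 assumed)
block_scores = [0.05, 0.10, 0.20, 0.15, 0.12, 0.18, 0.08, 0.12]
selected = select_blocks(block_scores, top_k=6, layers_per_block=4,
                         num_layers=32)
```

With these scores the last block is excluded regardless of its density, and the least dense remaining block (index 0) is the one left out of the top six.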

Step 3: Iterative Block Re-insertion Unlearning

This is the core innovation of KUnBR:

Extract selected high-density knowledge blocks from $LLM_{unlearning}$ (the pre-unlearned model)
Re-insert these blocks into the corresponding positions in $LLM_{original}$ (the original model, before any unlearning)
Freeze other layers, applying gradient difference methods only to inserted blocks
Since other layers in $LLM_{original}$ remain unchanged and frozen, no cover layer interference is produced
After training, place updated blocks back into $LLM_{unlearning}$
Repeat this process for all selected blocks
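
The control flow of these steps might look like the sketch below. It treats a model as a plain list of per-layer weights and takes a hypothetical `train_block` callable standing in for gradient difference training. It also assumes that each block is grafted one at a time into a fresh copy of the original model, which is one interpretation of the description above.

```python
import copy


def reinsertion_unlearning(original_layers, unlearning_layers, selected_blocks,
                           layers_per_block, train_block):
    """Iteratively graft each selected block into the original model and train it."""
    unlearning_layers = copy.deepcopy(unlearning_layers)
    for m in selected_blocks:
        start, stop = m * layers_per_block, (m + 1) * layers_per_block
        # Graft the pre-unlearned block into a fresh copy of the original model
        host = copy.deepcopy(original_layers)
        host[start:stop] = copy.deepcopy(unlearning_layers[start:stop])
        # Only layers in [start, stop) are trainable; the rest stay frozen
        updated = train_block(host, trainable=range(start, stop))
        # Place the updated block back into the unlearning model
        unlearning_layers[start:stop] = updated[start:stop]
    return unlearning_layers


# Toy stand-in for training: subtract 1.0 from every trainable "layer"
seen_hosts = []


def toy_train_block(host, trainable):
    seen_hosts.append(list(host))
    return [w - 1.0 if i in trainable else w for i, w in enumerate(host)]


result = reinsertion_unlearning(
    original_layers=[0.0, 0.0, 0.0, 0.0],
    unlearning_layers=[10.0, 11.0, 12.0, 13.0],
    selected_blocks=[1],
    layers_per_block=2,
    train_block=toy_train_block,
)
```

In a real implementation the frozen layers would have gradient updates disabled, so the block is trained behind unmodified original layers rather than behind layers already altered by pre-unlearning.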

Technical Innovations

1. Identification of Cover Layer Problem

The paper explicitly identifies for the first time the fundamental problem with existing methods: they only modify a few layers (cover layers) to suppress harmful outputs rather than truly eliminating knowledge. This explains why RTT attacks can easily recover "forgotten" knowledge.

2. Rationality of Knowledge Density Estimation

Based on research findings that MLPs serve as neural memory units
Gradient absolute values intuitively reflect the amount of target knowledge contained in a layer
Provides quantitative metrics to precisely locate layers requiring focused unlearning

3. Innovation of Re-insertion Strategy

Bypasses Cover Layers: By inserting blocks to be unlearned into the original model, avoids gradient blockage from cover layers
Deep Unlearning: Enables deeper modification of residual knowledge rather than mere surface suppression
Iterative Processing: Performs deep unlearning independently for each high-density block, ensuring thoroughness

4. Essential Differences from Baselines

GA/GD: Global optimization, easily forms cover layers
RMU: Adjusts intermediate layer representations, but still surface modification
KUnBR: Localization + isolation + deep unlearning, fundamentally alters knowledge structure

Experimental Setup

Datasets

Random Birthdays: Randomly generated names and birth years, suitable for unlearning task testing
WMDP-Deduped: 3,668 multiple-choice questions about harmful knowledge, evaluating LLM's ability to handle sensitive information
Years: Records major 20th-century events and their corresponding years
MMLU: Comprehensive multi-task benchmark containing 3,944 multiple-choice questions across 57 tasks, testing world knowledge and problem-solving ability

Data Partitioning:

$D_{forget}$ / $D_{retain}$ divided according to standard ratios
$D_{forget}$ further divided into set T (for RTT attacks) and set V (for recovery evaluation)
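
A disjoint T/V split of the forget set could be produced as in this sketch. The function name, the 50/50 fraction, and the fixed seed are assumptions, since the summary does not state the actual ratio.

```python
import random


def split_forget_set(forget_examples, t_fraction, seed=0):
    """Shuffle the forget set and cut it into disjoint subsets T and V."""
    shuffled = list(forget_examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * t_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder examples standing in for forget-set items
forget_set = [f"fact_{i}" for i in range(10)]
t_set, v_set = split_forget_set(forget_set, t_fraction=0.5)
```

The attacker fine-tunes on `t_set` only, and recovery is then measured on `v_set`, so any accuracy gain on V reflects knowledge that was still latent in the weights rather than knowledge re-taught directly.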

Evaluation Metrics

Unlearning Performance Metrics:

Forget Accuracy ($A_{Unlearn}$): Model accuracy on the unlearning set after unlearning: $A_{Unlearn} = \frac{1}{N}\sum_{i=1}^N \mathbb{I}(f_{unlearn}(x_i) = y_i)$
RTT Accuracy ($A_{RTT}$): Accuracy after an RTT attack
Recovery Rate ($A_{Recover}$): Accuracy regained through the attack: $A_{Recover} = A_{RTT} - A_{Unlearn}$
Lower values indicate more thorough unlearning
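
As a small worked example of these metrics, the sketch below computes accuracy on made-up multiple-choice predictions and then the recovery rate from the WMDP-Deduped figures reported for LLaMA3-8B-Instruct in Table 1. The function names and the toy predictions are illustrative only.

```python
def accuracy(predictions, labels):
    """Fraction of examples where the prediction equals the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recovery_rate(acc_rtt, acc_unlearn):
    """A_Recover = A_RTT - A_Unlearn."""
    return acc_rtt - acc_unlearn


# Made-up predictions: 2 of 4 answers match the labels
forget_acc = accuracy(["A", "C", "B", "D"], ["A", "B", "B", "A"])

# WMDP-Deduped values from Table 1 (percent), rounded to one decimal
rec_gd = round(recovery_rate(62.4, 30.5), 1)
rec_kunbr = round(recovery_rate(38.8, 29.2), 1)
```

Here GD regains 31.9 points under RTT while KUnBR regains 9.6, matching the Rec column of Table 1.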

General Capability Metrics (RKWU Benchmark):

Reasoning (Rea.): Evaluated on Big-Bench-Hard using 3-shot CoT
Truthfulness (Tru.): Evaluated on TruthfulQA MC1 task, 6-shot accuracy
Factuality (Fac.): Evaluated on TriviaQA, 6-shot F1 score
Fluency (Flu.): Using AlpacaEval instructions, reporting weighted average of bi-gram and tri-gram entropy

Comparison Methods

GA (Gradient Ascent): Achieves unlearning by maximizing loss on unlearning set
GD (Gradient Difference): Gradient ascent on unlearning set, gradient descent on retention set
RMU (Representation Misdirection): Strategically modifies internal representations of intermediate layers
RIA (Random Incorrect Answer): Applies gradient descent to incorrect options
NPO (Negative Preference Optimization): Optimizes model to show negative preference for deleted information

Implementation Details

Models: LLaMA3-8B-Instruct and Zephyr-7B-beta

KUnBR Hyperparameters:

Learning rate: 1.5×10⁻⁷
Retention coefficient: 0.1
Warm-up steps: 24
Number of blocks: M=8
Top-K selection: K=6

Hardware: Single NVIDIA A800 GPU

Experimental Results

Main Results

Performance on LLaMA3-8B-Instruct (Table 1):

Dataset	Method	Forget↓	RTT↓	Rec↓
Random Birthdays	NPO	71.3	78.3	7.0
Random Birthdays	KUnBR	36.9	43.9	7.0
WMDP-Deduped	GD	30.5	62.4	31.9
WMDP-Deduped	KUnBR	29.2	38.8	9.6
Years	GD	25.9	68.3	42.4
Years	KUnBR	25.9	36.0	10.1
MMLU	NPO	31.2	38.8	7.6
MMLU	KUnBR	16.5	28.0	11.5

Key Findings:

Lowest RTT Accuracy: KUnBR achieves the lowest RTT attack accuracy across all 4 datasets
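Recovery Rate: Among the rows shown for LLaMA3, KUnBR has the lowest recovery rate on WMDP-Deduped and Years and ties NPO on Random Birthdays (7.0); on MMLU, NPO recovers less (7.6 vs. 11.5), though KUnBR's RTT accuracy remains lower (28.0 vs. 38.8)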
Cross-model Generalization: Performs excellently on Zephyr-7B, demonstrating method universality

General Capability Preservation (Table 2):

KUnBR achieves best or second-best performance on most general capability tests:

Reasoning Ability: Reaches 41.2 on Random Birthdays (best)
Factuality: Reaches 56.4 on Years (best)
Fluency: Reaches 708.8 on MMLU (best)

In contrast, RIA and NPO, while showing good unlearning effects on some datasets, severely damage general capabilities (e.g., RIA's reasoning ability on WMDP is only 1.20).

Ablation Studies

Effectiveness of Pre-unlearning and Re-insertion Strategy (Table 3):

Variant	WMDP Forget	WMDP RTT
KUnBR	29.2	38.8
- w/o re-insert	30.5	62.4
- w/o pre-unl	29.9	56.6

Analysis:

Removing the re-insertion strategy causes the method to degrade to original GD, with RTT accuracy skyrocketing from 38.8% to 62.4%
Removing pre-unlearning also increases RTT accuracy to 56.6%
Demonstrates both components are necessary

Block Selection Strategy Analysis (Figure 3):

Compares four strategies:

Head layers: Selecting blocks near output layers - poor performance
Bottom layers: Selecting blocks near input layers - limited effectiveness
Average: Uniformly selecting all blocks - moderate effect but unstable
KUnBR (Knowledge Density-driven): Best performance with consistent decrease in unlearning accuracy

Conclusion: Knowledge density metrics accurately quantify harmful knowledge content in each layer, providing effective selection guidance.

Impact of Different Block Numbers (Table 4):

Testing different (M, K) configurations on Years dataset:

M=4 (too few blocks): Limited effectiveness, difficult to isolate knowledge
M=32 (too many blocks): May ignore inter-layer dependencies
M=8, K=6: Optimal configuration
Most configurations significantly outperform baselines, showing method robustness to hyperparameters
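
To make the granularity trade-off concrete, this sketch lists which layers fall into each block for the three values of M discussed above. It assumes H = 32 decoder layers, which matches the public LLaMA3-8B configuration but is not stated in this summary, and uses 0-based layer indices. The function name is hypothetical.

```python
def block_ranges(num_layers, num_blocks):
    """Return (first_layer, last_layer) per block, with N = floor(H / M)."""
    n = num_layers // num_blocks
    return [(m * n, (m + 1) * n - 1) for m in range(num_blocks)]


# H = 32 is an assumption about the backbone, not a value from the paper summary
layouts = {m: block_ranges(32, m) for m in (4, 8, 32)}
```

Under this assumption, M=4 yields blocks of eight layers, M=8 yields blocks of four layers, and M=32 treats every layer as its own block.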

Multi-attack Scenario Evaluation

Constructed 9 adversarial variants:

Prefix injection
Affirmative suffix
Role-playing
Multiple choice
Reverse query
Synonym manipulation
Background prompts
In-context learning
Cross-lingual

Results: Traditional GD method recovers from 18.18% to 21.21% under prefix injection attacks, while KUnBR maintains 18.18%, demonstrating robustness to prompt-level attacks.

Case Study (Table 5)

Question: "When was Julia Brown born?" Correct Answer (to be forgotten): B. 1989

Performance of various methods:

RMU: Outputs meaningless content after unlearning, recovers correct answer after RTT
GA: Outputs confusion after unlearning, recovers correct answer after RTT
GD: Unlearning fails, directly outputs correct answer; continues outputting after RTT
RIA/NPO: Outputs incorrect answer after unlearning, recovers correct answer after RTT
KUnBR: Outputs incorrect answer (C. 1960) with explanation after unlearning, still outputs incorrect answer (D. 1986) after RTT while maintaining complete response format

Conclusion: Only KUnBR successfully achieves thorough unlearning and maintains unlearning state under RTT attacks while preserving good generation capability.

Computational Cost Analysis

Training time on Years dataset (minutes):

GA: 24
GD: 20
RMU: 9
RIA: 8
NPO: 16
KUnBR: 17

KUnBR's time cost is comparable to mainstream methods, 15% faster than current SOTA GD method while achieving better unlearning performance.
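
The relative cost figures can be checked directly from the training times listed above. The helper name is illustrative.

```python
# Training time on the Years dataset, in minutes (from the list above)
minutes = {"GA": 24, "GD": 20, "RMU": 9, "RIA": 8, "NPO": 16, "KUnBR": 17}


def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100


vs_gd = round(relative_change(minutes["KUnBR"], minutes["GD"]), 1)
vs_rmu = round(relative_change(minutes["KUnBR"], minutes["RMU"]), 1)
```

KUnBR takes 15% less time than GD, but 88.9% more than RMU, the second-fastest method listed.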

Related Work

Machine Unlearning Methods

Gradient-based Methods:
- Gradient Ascent (Jang et al. 2022): Maximizes loss on unlearning set
- Gradient Difference (Liu et al. 2022): Balances unlearning and retention
Representation Adjustment Methods:
- RMU (Li et al. 2024): Adjusts intermediate layer representations
- NPO (Zhang et al. 2024): Negative preference optimization
Safety Research:
- Jailbreak attacks (Liu et al. 2023; Zhou et al. 2024)
- Backdoor attacks (Liu et al. 2022)
- RTT attacks (Deeb & Roger 2025): Reveals residual knowledge

Knowledge Localization Research

Geva et al. (2021): MLPs as key-value memory
Hong et al. (2024): Critical role of MLP layers in unlearning process

Advantages of This Work

Theoretical Insight: First to explicitly propose the cover layer problem
Method Innovation: Re-insertion strategy bypasses gradient blockage
Comprehensive Evaluation: Includes RTT attacks and multiple adversarial scenarios
Practicality: Achieves thorough unlearning while maintaining general capabilities

Conclusions and Discussion

Main Conclusions

Cover Layers are Root of Shallow Unlearning: Existing methods primarily suppress output by adjusting a few layers rather than eliminating knowledge
Knowledge Density Estimation is Effective: Gradient-based knowledge density metrics accurately locate layers rich in harmful knowledge
Re-insertion Strategy Enables Deep Unlearning: By isolating high-density blocks and training in original model, bypasses cover layer interference
SOTA Performance: KUnBR achieves best balance between unlearning thoroughness and general capability preservation

Limitations

Computational Overhead: While comparable to baselines, iterative re-insertion still requires additional computation (88.9% higher than RMU)
Hyperparameter Sensitivity: Requires selecting appropriate block number M and Top-K values, though paper shows method is relatively robust
Block Granularity Limitations: Paper doesn't deeply discuss why block-level unlearning doesn't lead to finer-grained shallow unlearning
Evaluation Limitations: Primarily evaluated on multiple-choice datasets, effectiveness on open-ended generation tasks not fully validated
Model Scale: Only tested on models of 8B parameters or fewer (LLaMA3-8B-Instruct and Zephyr-7B-beta), effectiveness on larger models (70B+) unknown

Future Directions

Adaptive Block Selection: Automatically adjust block granularity and quantity based on different knowledge types
Efficiency Optimization: Explore parallelization or approximation methods to reduce computational overhead
Theoretical Analysis: Provide theoretical guarantees for re-insertion strategy effectiveness
Extended Applications: Test effectiveness on larger-scale models and more diverse tasks
Continual Unlearning: Research incremental unlearning during model continuous learning

In-Depth Evaluation

Strengths

1. Deep Problem Identification

First to explicitly propose "cover layer" concept, revealing fundamental flaws in existing methods
Clearly demonstrates shallow unlearning problems through RTT attacks
Clear problem definition with important practical significance

2. Strong Method Innovation

Knowledge Density Estimation: Simple yet effective metric based on solid theoretical foundation (MLPs as memory units)
Re-insertion Strategy: Clever design bypassing cover layers through "grafting"
Iterative Processing: Independent deep unlearning for each high-density block ensures thoroughness

3. Comprehensive Experimental Design

Multiple datasets (4) and two backbone models
Comprehensive evaluation metrics (unlearning performance + general capabilities)
Sufficient ablation studies validating component contributions
Multi-attack scenario evaluation (9 adversarial variants)
Case studies provide intuitive understanding

4. Strong Result Convincingness

Achieves lowest RTT accuracy across all datasets
Significantly outperforms strong baselines (e.g., on Years, RTT accuracy falls from GD's 68.3% to 36.0%)
Simultaneously maintains or improves general capabilities
Good cross-model generalization

5. High Practical Value

Code open-sourced for strong reproducibility
Acceptable computational cost
Relatively robust to hyperparameters
Directly applicable to practical LLM deployment scenarios

Weaknesses

1. Insufficient Theoretical Analysis

Lacks theoretical proof of re-insertion strategy effectiveness
Why doesn't block-level unlearning lead to finer-grained shallow unlearning? Paper only briefly mentions "blocks as constituent memory units"
Theoretical properties of knowledge density estimation (convergence, uniqueness) not discussed

2. Method Complexity

Requires multiple iterations (for each selected block)
Involves multiple hyperparameters (M, K, α, learning rate, etc.)
Higher implementation complexity compared to simple GA/GD

3. Evaluation Limitations

Dataset Bias: Primarily multiple-choice questions, lacking open-ended generation tasks
Model Scale: Only 8B and below, modern LLMs commonly reach 70B+
Unlearning Types: Mainly factual knowledge, effectiveness on conceptual and reasoning knowledge unknown
Long-term Effects: Cumulative impact after multiple unlearning rounds not evaluated

4. Heuristic Nature of Block Selection

"Ignore head layers" based on empirical observation, lacking principled explanation
Is Top-K selection optimal? Are better selection strategies possible?
Different knowledge types may require different selection strategies

5. Cover Layer Relationship Not Fully Resolved

Will training after re-insertion form new cover layers at new positions?
Paper insufficiently discusses this potential issue
How is convergence of iterative process guaranteed?

6. Limitations of General Capability Evaluation

RKWU benchmark, while comprehensive, remains limited
Some tasks (code generation, mathematical reasoning) not covered
Impact of unlearning on model internal representation structure not evaluated

Impact

1. Contribution to Field

Pioneering: First to systematically address cover layer problem, providing new direction for unlearning research
Methodology: Knowledge density estimation and re-insertion strategy can inspire other research
Benchmark Setting: Establishes new performance standards in RTT attack scenarios

2. Practical Value

Immediate Application: Directly applicable to LLM privacy protection and safe deployment
Regulatory Compliance: Helps satisfy requirements like GDPR
Risk Mitigation: Reduces risk of LLMs leaking sensitive information

3. Reproducibility

Code open-sourced
Detailed implementation details and hyperparameter settings
Standardized evaluation protocols

4. Potential Impact

Short-term: Expected to become important baseline in unlearning research
Mid-term: Likely to promote more research on deep unlearning mechanisms
Long-term: Contributes to development of trustworthy AI and responsible AI

Applicable Scenarios

1. Highly Applicable

Privacy-sensitive Applications: Scenarios requiring user data deletion (medical, financial)
Regulatory Compliance: Systems needing to satisfy "right to be forgotten"
Safety-critical Applications: Scenarios requiring removal of harmful knowledge

2. Moderately Applicable

Continual Learning Systems: LLMs requiring periodic knowledge updates
Copyright Protection: Models needing to remove copyrighted content

3. Potentially Inapplicable

Extremely Resource-constrained: Scenarios with very limited computational resources
Real-time Systems: Online services requiring extremely fast response
Ultra-large-scale Models: 100B+ parameter models may require additional optimization

4. Scenarios Requiring Improvement

Open-ended Generation: Requires more evaluation and possible method adjustments
Multimodal Models: Needs extension to vision-language models
Cross-lingual Unlearning: Needs consideration of multilingual knowledge associations

Key References

Deeb & Roger (2025): RTT attack method revealing shallow unlearning problems
Li et al. (2024): WMDP benchmark and RMU method
Geva et al. (2021): Theoretical foundation of MLPs as key-value memory
Hong et al. (2024): Empirical research on layer modification in unlearning
Zhang et al. (2024): NPO method, current SOTA
Liu, Liu, & Stone (2022): Foundational work on gradient difference methods

Overall Evaluation

This is a high-quality research paper making substantial progress on the important problem of machine unlearning. The paper's main strengths are: (1) deep identification of fundamental flaws in existing methods (cover layer problem), (2) innovative and effective solution (knowledge density estimation + re-insertion strategy), (3) comprehensive experimental validation of method effectiveness.

Novelty: ★★★★☆ (4.5/5) - Re-insertion strategy is truly innovative, knowledge density estimation simple but effective

Technical Depth: ★★★★☆ (4/5) - Clever method design, but theoretical analysis could be deeper

Experimental Sufficiency: ★★★★★ (5/5) - Comprehensive experimental design, diverse evaluation metrics, thorough ablation studies

Practical Value: ★★★★★ (5/5) - Directly solves practical problems, code open-sourced, immediately applicable

Writing Quality: ★★★★☆ (4.5/5) - Clear and understandable, rigorous logic, effective visualizations

Overall Score: ★★★★☆ (4.4/5)

Recommended Reading: Strongly recommended for scholars and engineers working on LLM safety, privacy protection, and machine unlearning research. This paper not only provides effective technical solutions but more importantly offers deep insights into unlearning mechanisms.