Machine unlearning, which selectively removes harmful knowledge from a pre-trained model without retraining from scratch, is crucial for addressing privacy, regulatory compliance, and ethical concerns in Large Language Models (LLMs). However, existing unlearning methods often struggle to thoroughly remove harmful knowledge, leaving residual harmful knowledge that can be easily recovered. To address these limitations, we propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR), a novel approach that first identifies layers with rich harmful knowledge and then thoroughly eliminates the harmful knowledge via a re-insertion strategy. Our method introduces knowledge density estimation to quantify and locate layers containing the most harmful knowledge, enabling precise unlearning. Additionally, we design a layer re-insertion strategy that extracts and re-inserts harmful knowledge-rich layers into the original LLM, bypassing gradient obstruction caused by cover layers and ensuring effective gradient propagation during unlearning. Extensive experiments conducted on several unlearning and general capability benchmarks demonstrate that KUnBR achieves state-of-the-art forgetting performance while maintaining model utility.
Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion
- Paper ID: 2511.11667
- Title: Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion
- Authors: Feng Guo, Yuntao Wen, Shen Gao, Junshuo Zhang, Shuo Shang (University of Electronic Science and Technology of China)
- Classification: cs.LG, cs.AI
- Publication Time/Conference: AAAI 2026 (Expected)
- Paper Link: https://arxiv.org/abs/2511.11667
- Code Link: github.com/llmgfffffff/Beyond-Superficial-Forgetting-KUnBR
This paper addresses the machine unlearning problem in Large Language Models (LLMs) and proposes a novel method called KUnBR (Knowledge Density-Guided Unlearning via Blocks Reinsertion). Existing unlearning methods often fail to thoroughly remove harmful knowledge, leaving residual knowledge that can be easily recovered. KUnBR identifies layers rich in harmful knowledge through knowledge density estimation, then employs a block re-insertion strategy to completely eliminate harmful knowledge. The method bypasses gradient blockage caused by "cover layers," ensuring effective gradient propagation. Experiments on multiple benchmarks demonstrate that KUnBR achieves state-of-the-art unlearning performance while maintaining the model's general capabilities.
Machine unlearning aims to selectively remove specific subsets of knowledge from pre-trained models (such as privacy-sensitive or harmful content) without retraining from scratch. This is crucial for LLM development as it involves data privacy, regulatory compliance (such as the "right to be forgotten"), and ethical considerations for AI systems.
- Privacy Protection: LLMs may ingest large amounts of privacy-sensitive data during pre-training
- Regulatory Compliance: Regulations like GDPR require the ability to delete specific user data
- Safety: Preventing malicious exploitation of harmful knowledge in models
- Ethical Alignment: Ensuring LLMs remain consistent with societal values
Existing unlearning methods (such as gradient ascent, representation misdirection) have serious flaws:
- Superficial Forgetting: Only modifying a few parameters (cover layers) to suppress output rather than truly eliminating knowledge
- Easy Recovery: RTT (Retraining on T) attacks show that most "forgotten" knowledge can be recovered through minimal retraining on subsets of the unlearning set
- Residual Knowledge: Harmful knowledge remains in model parameters, merely masked rather than eliminated
- Poor Robustness: Vulnerable to jailbreak attacks and parameter-level attacks
The authors discovered that existing methods primarily rely on adjusting "cover layers" to mask representations of harmful knowledge, merely preventing undesirable model outputs without truly eliminating knowledge from internal representations. This fundamental limitation suggests the need for more robust and thorough unlearning methods.
- Proposes KUnBR Framework: A novel unlearning framework capable of identifying layers containing harmful knowledge and performing targeted training to achieve thorough elimination of harmful knowledge
- Knowledge Density Estimation Method: Introduces a gradient-based knowledge density estimation metric that quantifies and locates layers in LLMs containing the most harmful knowledge, enabling precise unlearning
- Block Re-insertion Strategy: Designs a novel layer re-insertion strategy that extracts blocks rich in harmful knowledge and re-inserts them into the original LLM, bypassing gradient blockage caused by cover layers and ensuring effective gradient propagation during unlearning
- SOTA Performance: Achieves state-of-the-art unlearning performance on multiple unlearning and general capability benchmarks while maintaining model utility, particularly excelling in resistance to RTT attacks
Given:
- Unlearning Dataset Dforget: Contains knowledge to be removed
- Retention Dataset Dretain: Helps the model maintain general capabilities during unlearning
Objective:
- Optimize model parameters to thoroughly eliminate knowledge related to Dforget
- Ensure model utility performance remains unaffected
- When subjected to RTT attacks (fine-tuning on subset T of Dforget), the model still cannot generate knowledge from another disjoint subset V of Dforget
The KUnBR method comprises three main steps:
The first step, pre-unlearning, applies the standard gradient difference method to full-parameter fine-tuning of the original LLM as a "warm-up" phase:
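$$\theta_{t+1} = \theta_t - \eta\left(\alpha \nabla_\theta L_{\text{retain}}(\theta_t) - \nabla_\theta L_{\text{forget}}(\theta_t)\right)$$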
Where:
- η is the learning rate
- α is the retention coefficient
- Lretain and Lforget are losses on retention and unlearning sets respectively
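As a rough illustration of the update rule above, the following minimal sketch applies one gradient difference step to a toy parameter vector. The function name `gradient_difference_step` and the toy gradients are assumptions made for illustration; in practice the gradients would come from backpropagation through the LLM.

```python
def gradient_difference_step(theta, grad_retain, grad_forget, lr, alpha):
    """One step of theta <- theta - lr * (alpha * grad_retain - grad_forget).

    Descends on the retain loss (scaled by alpha) and ascends on the forget loss.
    """
    return [
        t - lr * (alpha * g_r - g_f)
        for t, g_r, g_f in zip(theta, grad_retain, grad_forget)
    ]


# Toy values (hypothetical): two parameters with hand-picked gradients.
theta = [1.0, -2.0]
grad_retain = [0.5, 0.0]   # gradient of the retain loss
grad_forget = [0.0, 1.0]   # gradient of the forget loss

new_theta = gradient_difference_step(theta, grad_retain, grad_forget, lr=0.1, alpha=0.1)
print(new_theta)  # approximately [0.995, -1.9]
```

The second parameter moves in the direction of the forget gradient (the forget loss increases), while the first moves against the retain gradient, scaled down by the retention coefficient.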
Knowledge Density Calculation:
For layer l, knowledge density is defined as:
$$K_l = \mathbb{E}_{(x,y)\sim D_{\text{forget}}}\left[\left\lVert \nabla_{\theta_l} L(x, y; \theta_l) \right\rVert_1\right]$$
Where L(x,y;θ)=−log(p(y∣x;θ)) is the negative log-likelihood loss.
Normalized Knowledge Density:
$$K_l^{\text{norm}} = \frac{K_l}{\sum_{i=1}^{H} K_i}$$
Represents the proportion of layer l's knowledge density relative to all layers.
Block-level Knowledge Density:
Dividing H layers into M blocks with N=⌊H/M⌋ layers each, the cumulative knowledge density of block m is:
$$K_{\text{block},m} = \sum_{i=(m-1)N+1}^{mN} K_i^{\text{norm}}$$
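The three definitions above (per-layer density, normalization, block aggregation) can be traced on a toy example. This is a minimal sketch, assuming the per-sample gradients of each layer are already available as flat lists of numbers; the helper names and the toy gradients are hypothetical and not taken from the paper's code.

```python
def layer_density(per_sample_grads):
    """K_l: mean over forget-set samples of the L1 norm of the layer's gradient."""
    norms = [sum(abs(g) for g in grad) for grad in per_sample_grads]
    return sum(norms) / len(norms)


def normalize(densities):
    """K_l^norm: each layer's share of the total density over all H layers."""
    total = sum(densities)
    return [k / total for k in densities]


def block_densities(norm_densities, num_blocks):
    """K_block,m: sum of normalized densities over the N = H // M layers of block m."""
    n = len(norm_densities) // num_blocks
    return [sum(norm_densities[m * n:(m + 1) * n]) for m in range(num_blocks)]


# Toy setup (hypothetical): H = 4 layers, 2 forget samples, 2 parameters per layer.
grads = [
    [[1.0, -1.0], [3.0, 1.0]],   # layer 1: L1 norms 2 and 4, mean 3
    [[0.5, 0.5], [1.0, 0.0]],    # layer 2: L1 norms 1 and 1, mean 1
    [[2.0, 2.0], [-4.0, 0.0]],   # layer 3: L1 norms 4 and 4, mean 4
    [[0.0, 1.0], [3.0, 0.0]],    # layer 4: L1 norms 1 and 3, mean 2
]

K = [layer_density(g) for g in grads]          # [3.0, 1.0, 4.0, 2.0]
K_norm = normalize(K)                          # approximately [0.3, 0.1, 0.4, 0.2]
K_block = block_densities(K_norm, num_blocks=2)  # approximately [0.4, 0.6]
print(K, K_norm, K_block)
```

The code uses 0-based block indices, so block `m` here covers the same layers as block `m + 1` in the formula.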
Block Selection Strategy:
- Top-K Selection: Select K blocks with highest knowledge density
- Ignore Head Layers: Exclude blocks containing the last two layers to avoid interference from output generation layers
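The selection rule above could be sketched as follows. This is an illustrative reading, assuming a 32-layer model and made-up block densities; `select_blocks` and its parameters are hypothetical names, and M=8, K=6 are the values listed under the hyperparameters.

```python
def select_blocks(block_density, num_layers, top_k, skip_last_layers=2):
    """Return indices (0-based) of the top_k densest blocks, skipping any block
    that contains one of the last `skip_last_layers` layers."""
    num_blocks = len(block_density)
    layers_per_block = num_layers // num_blocks
    first_excluded_layer = num_layers - skip_last_layers  # 0-based layer index

    candidates = []
    for m in range(num_blocks):
        last_layer_in_block = (m + 1) * layers_per_block - 1
        if last_layer_in_block >= first_excluded_layer:
            continue  # block overlaps the head layers
        candidates.append(m)

    ranked = sorted(candidates, key=lambda m: block_density[m], reverse=True)
    return ranked[:top_k]


# Hypothetical densities for M = 8 blocks of a 32-layer model (they sum to 1).
block_density = [0.05, 0.10, 0.20, 0.15, 0.12, 0.08, 0.06, 0.24]
selected = select_blocks(block_density, num_layers=32, top_k=6)
print(selected)  # [2, 3, 4, 1, 5, 6]
```

In this toy case the last block has the highest density but is skipped because it contains the final two layers.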
Block re-insertion, the third step, is the core innovation of KUnBR:
- Extract the selected high-density knowledge blocks from LLMunlearning (the model produced by the pre-unlearning step)
- Re-insert these blocks into the corresponding positions in LLMoriginal (the original model, before any unlearning)
- Freeze other layers, applying gradient difference methods only to inserted blocks
- Since other layers in LLMoriginal remain unchanged and frozen, no cover layer interference is produced
- After training, place updated blocks back into LLMunlearning
- Repeat this process for all selected blocks
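The loop described in the list above might look like the following sketch. Models are represented as plain lists of layers, and `train_block` is a hypothetical callback standing in for gradient difference training restricted to the inserted layers; none of these names come from the paper's released code.

```python
import copy


def reinsert_and_unlearn(original_layers, unlearned_layers, selected_blocks,
                         layers_per_block, train_block):
    """Graft each selected block into a copy of the original model, train only
    that block, then write the updated block back into the unlearned model."""
    result = copy.deepcopy(unlearned_layers)
    for m in selected_blocks:
        start, stop = m * layers_per_block, (m + 1) * layers_per_block
        # Host model: original layers everywhere except the grafted block.
        host = copy.deepcopy(original_layers)
        host[start:stop] = copy.deepcopy(result[start:stop])
        # Only layers in [start, stop) are trainable; the rest stay frozen.
        train_block(host, range(start, stop))
        # Place the updated block back into the unlearned model.
        result[start:stop] = host[start:stop]
    return result


def toy_train_block(host_layers, trainable):
    """Stand-in for training: shift the weights of trainable layers by -1.0."""
    for i in trainable:
        host_layers[i] = [w - 1.0 for w in host_layers[i]]


# Toy models (hypothetical): 4 layers with one weight each, 2 layers per block.
original = [[0.0], [0.0], [0.0], [0.0]]
unlearned = [[1.0], [2.0], [3.0], [4.0]]
final = reinsert_and_unlearn(original, unlearned, selected_blocks=[1],
                             layers_per_block=2, train_block=toy_train_block)
print(final)  # [[1.0], [2.0], [2.0], [3.0]]
```

Working on a fresh copy of the original model for each block keeps the frozen layers identical to the original weights, which is what the summary credits with avoiding cover-layer interference.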
The paper explicitly identifies for the first time the fundamental problem with existing methods: they only modify a few layers (cover layers) to suppress harmful outputs rather than truly eliminating knowledge. This explains why RTT attacks can easily recover "forgotten" knowledge.
- Based on research findings that MLPs serve as neural memory units
- Gradient absolute values intuitively reflect the amount of target knowledge contained in a layer
- Provides quantitative metrics to precisely locate layers requiring focused unlearning
- Bypasses Cover Layers: By inserting blocks to be unlearned into the original model, avoids gradient blockage from cover layers
- Deep Unlearning: Enables deeper modification of residual knowledge rather than mere surface suppression
- Iterative Processing: Performs deep unlearning independently for each high-density block, ensuring thoroughness
- GA/GD: Global optimization, easily forms cover layers
- RMU: Adjusts intermediate layer representations, but still surface modification
- KUnBR: Localization + isolation + deep unlearning, fundamentally alters knowledge structure
- Random Birthdays: Randomly generated names and birth years, suitable for unlearning task testing
- WMDP-Deduped: 3,668 multiple-choice questions about harmful knowledge, evaluating LLM's ability to handle sensitive information
- Years: Records major 20th-century events and their corresponding years
- MMLU: Comprehensive multi-task benchmark containing 3,944 multiple-choice questions across 57 tasks, testing world knowledge and problem-solving ability
Data Partitioning:
- Dforget / Dretain divided according to standard ratios
- Dforget further divided into set T (for RTT attacks) and set V (for recovery evaluation)
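A disjoint T/V split of the forget set, as described above, could be produced with a seeded shuffle. This is only a sketch: the 50/50 ratio, the seed, and the `split_forget_set` helper are assumptions, since the notes above do not state the actual ratio.

```python
import random


def split_forget_set(forget_items, t_fraction, seed=0):
    """Shuffle the forget set and split it into disjoint subsets T and V."""
    items = list(forget_items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * t_fraction)
    return items[:cut], items[cut:]


# Hypothetical forget set of 10 facts, split evenly (assumed ratio).
forget = [f"fact_{i}" for i in range(10)]
T, V = split_forget_set(forget, t_fraction=0.5)
print(len(T), len(V))  # 5 5
```

T would then be used for the RTT fine-tuning attack and V for measuring how much knowledge comes back.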
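- Forget Accuracy (AUnlearn): Model accuracy on the unlearning set after unlearning:
$$A_{\text{Unlearn}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(f_{\text{unlearn}}(x_i) = y_i\right)$$
- RTT Accuracy (ARTT): Accuracy after the RTT attack
- Recovery Rate (ARecover): Accuracy regained through the RTT attack:
$$A_{\text{Recover}} = A_{\text{RTT}} - A_{\text{Unlearn}}$$
Lower values indicate more thorough unlearning.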
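The metrics above reduce to simple arithmetic. The sketch below computes accuracy on a made-up set of four predictions and then reproduces the recovery rates for the WMDP-Deduped rows of the results table (GD: 30.5 and 62.4; KUnBR: 29.2 and 38.8). The function names and the toy predictions are assumptions for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of examples where the predicted option equals the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def recovery_rate(a_rtt, a_unlearn):
    """A_Recover = A_RTT - A_Unlearn (in the same units as its inputs)."""
    return a_rtt - a_unlearn


# Toy multiple-choice predictions (hypothetical).
preds = ["B", "C", "A", "D"]
labels = ["B", "A", "A", "C"]
a_unlearn = accuracy(preds, labels)  # 0.5

# Values from the WMDP-Deduped rows of the results table, in percent.
rec_gd = round(recovery_rate(62.4, 30.5), 1)     # 31.9
rec_kunbr = round(recovery_rate(38.8, 29.2), 1)  # 9.6
print(a_unlearn, rec_gd, rec_kunbr)
```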
- Reasoning (Rea.): Evaluated on Big-Bench-Hard using 3-shot CoT
- Truthfulness (Tru.): Evaluated on TruthfulQA MC1 task, 6-shot accuracy
- Factuality (Fac.): Evaluated on TriviaQA, 6-shot F1 score
- Fluency (Flu.): Using AlpacaEval instructions, reporting weighted average of bi-gram and tri-gram entropy
- GA (Gradient Ascent): Achieves unlearning by maximizing loss on unlearning set
- GD (Gradient Difference): Gradient ascent on unlearning set, gradient descent on retention set
- RMU (Representation Misdirection): Strategically modifies internal representations of intermediate layers
- RIA (Random Incorrect Answer): Applies gradient descent to incorrect options
- NPO (Negative Preference Optimization): Optimizes model to show negative preference for deleted information
Models: LLaMA3-8B-Instruct and Zephyr-7B-beta
KUnBR Hyperparameters:
- Learning rate: 1.5×10⁻⁷
- Retention coefficient: 0.1
- Warm-up steps: 24
- Number of blocks: M=8
- Top-K selection: K=6
Hardware: Single NVIDIA A800 GPU
| Dataset | Method | Forget↓ (%) | RTT↓ (%) | Rec↓ (%) |
|---|---|---|---|---|
| Random Birthdays | NPO | 71.3 | 78.3 | 7.0 |
| Random Birthdays | KUnBR | 36.9 | 43.9 | 7.0 |
| WMDP-Deduped | GD | 30.5 | 62.4 | 31.9 |
| WMDP-Deduped | KUnBR | 29.2 | 38.8 | 9.6 |
| Years | GD | 25.9 | 68.3 | 42.4 |
| Years | KUnBR | 25.9 | 36.0 | 10.1 |
| MMLU | NPO | 31.2 | 38.8 | 7.6 |
| MMLU | KUnBR | 16.5 | 28.0 | 11.5 |
Key Findings:
- Lowest RTT Accuracy: KUnBR achieves the lowest RTT attack accuracy across all 4 datasets
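- Low Recovery Rate: On LLaMA3, the paper reports that KUnBR maintains the lowest recovery rate; in the excerpted table above, the one exception is MMLU, where NPO recovers less (7.6 vs. 11.5) but still ends at a higher RTT accuracy (38.8 vs. 28.0)
- Cross-model Generalization: Also performs well on Zephyr-7B, suggesting the method transfers across backbones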
KUnBR achieves best or second-best performance on most general capability tests:
- Reasoning Ability: Reaches 41.2 on Random Birthdays (best)
- Factuality: Reaches 56.4 on Years (best)
- Fluency: Reaches 708.8 on MMLU (best)
In contrast, RIA and NPO, while showing good unlearning effects on some datasets, severely damage general capabilities (e.g., RIA's reasoning ability on WMDP is only 1.20).
| Variant | WMDP Forget (%) | WMDP RTT (%) |
|---|---|---|
| KUnBR | 29.2 | 38.8 |
| - w/o re-insert | 30.5 | 62.4 |
| - w/o pre-unl | 29.9 | 56.6 |
Analysis:
- Removing the re-insertion strategy reduces the method to plain GD (the numbers match the GD row for WMDP-Deduped), and RTT accuracy rises from 38.8% to 62.4%
- Removing pre-unlearning also increases RTT accuracy to 56.6%
- Demonstrates both components are necessary
Compares four strategies:
- Head layers: Selecting blocks near output layers - poor performance
- Bottom layers: Selecting blocks near input layers - limited effectiveness
- Average: Uniformly selecting all blocks - moderate effect but unstable
- KUnBR (Knowledge Density-driven): Best performance with consistent decrease in unlearning accuracy
Conclusion: Knowledge density metrics accurately quantify harmful knowledge content in each layer, providing effective selection guidance.
Testing different (M, K) configurations on Years dataset:
- M=4 (too few blocks): Limited effectiveness, difficult to isolate knowledge
- M=32 (too many blocks): May ignore inter-layer dependencies
- M=8, K=6: Optimal configuration
- Most configurations significantly outperform baselines, showing method robustness to hyperparameters
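To make the granularity trade-off above concrete, the sketch below lists how many layers each block holds for the tested values of M. It assumes a 32-layer backbone purely for illustration (the layer count is not given in these notes), and `block_layout` is a hypothetical helper.

```python
def block_layout(num_layers, num_blocks):
    """Return (first_layer, last_layer) pairs, 0-based, for each block of
    N = num_layers // num_blocks consecutive layers."""
    n = num_layers // num_blocks
    return [(m * n, (m + 1) * n - 1) for m in range(num_blocks)]


H = 32  # assumed number of transformer layers
sizes = {M: H // M for M in (4, 8, 32)}
print(sizes)                    # {4: 8, 8: 4, 32: 1}
print(block_layout(H, 8)[-1])   # (28, 31)
```

Under this assumption, M=4 yields coarse 8-layer blocks, while M=32 degenerates to single-layer blocks, which is consistent with the concern about ignoring inter-layer dependencies.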
Constructed 9 adversarial variants:
- Prefix injection
- Affirmative suffix
- Role-playing
- Multiple choice
- Reverse query
- Synonym manipulation
- Background prompts
- In-context learning
- Cross-lingual
Results: Under prefix injection attacks, accuracy on the forgotten knowledge rises from 18.18% to 21.21% for the traditional GD method, while KUnBR stays at 18.18%, indicating robustness to prompt-level attacks.
Question: "When was Julia Brown born?"
Correct Answer (to be forgotten): B. 1989
Performance of various methods:
- RMU: Outputs meaningless content after unlearning, recovers correct answer after RTT
- GA: Outputs confusion after unlearning, recovers correct answer after RTT
- GD: Unlearning fails, directly outputs correct answer; continues outputting after RTT
- RIA/NPO: Outputs incorrect answer after unlearning, recovers correct answer after RTT
- KUnBR: Outputs incorrect answer (C. 1960) with explanation after unlearning, still outputs incorrect answer (D. 1986) after RTT while maintaining complete response format
Conclusion: Only KUnBR successfully achieves thorough unlearning and maintains unlearning state under RTT attacks while preserving good generation capability.
Training time on Years dataset (minutes):
- GA: 24
- GD: 20
- RMU: 9
- RIA: 8
- NPO: 16
- KUnBR: 17
KUnBR's time cost is comparable to mainstream methods: at 17 minutes it is 15% faster than GD (20 minutes) while achieving better unlearning performance.
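The relative timings quoted in this summary can be checked directly from the list above. The following sketch recomputes the 15% figure against GD and the 88.9% overhead relative to RMU mentioned under the limitations; `relative_change` is a helper name introduced here.

```python
# Training time on the Years dataset, in minutes (from the list above).
minutes = {"GA": 24, "GD": 20, "RMU": 9, "RIA": 8, "NPO": 16, "KUnBR": 17}


def relative_change(value, reference):
    """Percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


vs_gd = round(relative_change(minutes["KUnBR"], minutes["GD"]), 1)    # -15.0
vs_rmu = round(relative_change(minutes["KUnBR"], minutes["RMU"]), 1)  # 88.9
print(vs_gd, vs_rmu)
```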
- Gradient-based Methods:
  - Gradient Ascent (Jang et al. 2022): Maximizes loss on unlearning set
  - Gradient Difference (Liu et al. 2022): Balances unlearning and retention
- Representation Adjustment Methods:
  - RMU (Li et al. 2024): Adjusts intermediate layer representations
  - NPO (Zhang et al. 2024): Negative preference optimization
- Safety Research:
  - Jailbreak attacks (Liu et al. 2023; Zhou et al. 2024)
  - Backdoor attacks (Liu et al. 2022)
  - RTT attacks (Deeb & Roger 2025): Reveals residual knowledge
- Geva et al. (2021): MLPs as key-value memory
- Hong et al. (2024): Critical role of MLP layers in unlearning process
- Theoretical Insight: First to explicitly propose the cover layer problem
- Method Innovation: Re-insertion strategy bypasses gradient blockage
- Comprehensive Evaluation: Includes RTT attacks and multiple adversarial scenarios
- Practicality: Achieves thorough unlearning while maintaining general capabilities
- Cover Layers are Root of Shallow Unlearning: Existing methods primarily suppress output by adjusting a few layers rather than eliminating knowledge
- Knowledge Density Estimation is Effective: Gradient-based knowledge density metrics accurately locate layers rich in harmful knowledge
- Re-insertion Strategy Enables Deep Unlearning: By isolating high-density blocks and training in original model, bypasses cover layer interference
- SOTA Performance: KUnBR achieves best balance between unlearning thoroughness and general capability preservation
- Computational Overhead: While comparable to baselines, iterative re-insertion still requires additional computation (88.9% higher than RMU)
- Hyperparameter Sensitivity: Requires selecting appropriate block number M and Top-K values, though paper shows method is relatively robust
- Block Granularity Limitations: Paper doesn't deeply discuss why block-level unlearning doesn't lead to finer-grained shallow unlearning
- Evaluation Limitations: Primarily evaluated on multiple-choice datasets, effectiveness on open-ended generation tasks not fully validated
- Model Scale: Only tested on models with 8B parameters or fewer (LLaMA3-8B-Instruct and Zephyr-7B-beta); effectiveness on larger models (70B+) is unknown
- Adaptive Block Selection: Automatically adjust block granularity and quantity based on different knowledge types
- Efficiency Optimization: Explore parallelization or approximation methods to reduce computational overhead
- Theoretical Analysis: Provide theoretical guarantees for re-insertion strategy effectiveness
- Extended Applications: Test effectiveness on larger-scale models and more diverse tasks
- Continual Unlearning: Research incremental unlearning during model continuous learning
- First to explicitly propose "cover layer" concept, revealing fundamental flaws in existing methods
- Clearly demonstrates shallow unlearning problems through RTT attacks
- Clear problem definition with important practical significance
- Knowledge Density Estimation: Simple yet effective metric based on solid theoretical foundation (MLPs as memory units)
- Re-insertion Strategy: Clever design bypassing cover layers through "grafting"
- Iterative Processing: Independent deep unlearning for each high-density block ensures thoroughness
- Multiple datasets (4) and two backbone models
- Comprehensive evaluation metrics (unlearning performance + general capabilities)
- Sufficient ablation studies validating component contributions
- Multi-attack scenario evaluation (9 adversarial variants)
- Case studies provide intuitive understanding
- Achieves lowest RTT accuracy across all datasets
- Significantly outperforms baselines (e.g., on Years, RTT accuracy falls from 68.3% with GD to 36.0% with KUnBR)
- Simultaneously maintains or improves general capabilities
- Good cross-model generalization
- Code open-sourced for strong reproducibility
- Acceptable computational cost
- Relatively robust to hyperparameters
- Directly applicable to practical LLM deployment scenarios
- Lacks theoretical proof of re-insertion strategy effectiveness
- Why doesn't block-level unlearning lead to finer-grained shallow unlearning? Paper only briefly mentions "blocks as constituent memory units"
- Theoretical properties of knowledge density estimation (convergence, uniqueness) not discussed
- Requires multiple iterations (for each selected block)
- Involves multiple hyperparameters (M, K, α, learning rate, etc.)
- Higher implementation complexity compared to simple GA/GD
- Dataset Bias: Primarily multiple-choice questions, lacking open-ended generation tasks
- Model Scale: Only 8B and below, modern LLMs commonly reach 70B+
- Unlearning Types: Mainly factual knowledge, effectiveness on conceptual and reasoning knowledge unknown
- Long-term Effects: Cumulative impact after multiple unlearning rounds not evaluated
- "Ignore head layers" based on empirical observation, lacking principled explanation
- Is Top-K selection optimal? Are better selection strategies possible?
- Different knowledge types may require different selection strategies
- Will training after re-insertion form new cover layers at new positions?
- Paper insufficiently discusses this potential issue
- How is convergence of iterative process guaranteed?
- RWKU benchmark, while comprehensive, remains limited
- Some tasks (code generation, mathematical reasoning) not covered
- Impact of unlearning on model internal representation structure not evaluated
- Pioneering: First to systematically address cover layer problem, providing new direction for unlearning research
- Methodology: Knowledge density estimation and re-insertion strategy can inspire other research
- Benchmark Setting: Establishes new performance standards in RTT attack scenarios
- Immediate Application: Directly applicable to LLM privacy protection and safe deployment
- Regulatory Compliance: Helps satisfy requirements like GDPR
- Risk Mitigation: Reduces risk of LLMs leaking sensitive information
- Code open-sourced
- Detailed implementation details and hyperparameter settings
- Standardized evaluation protocols
- Short-term: Expected to become important baseline in unlearning research
- Mid-term: Likely to promote more research on deep unlearning mechanisms
- Long-term: Contributes to development of trustworthy AI and responsible AI
- Privacy-sensitive Applications: Scenarios requiring user data deletion (medical, financial)
- Regulatory Compliance: Systems needing to satisfy "right to be forgotten"
- Safety-critical Applications: Scenarios requiring removal of harmful knowledge
- Continual Learning Systems: LLMs requiring periodic knowledge updates
- Copyright Protection: Models needing to remove copyrighted content
- Extremely Resource-constrained: Scenarios with very limited computational resources
- Real-time Systems: Online services requiring extremely fast response
- Ultra-large-scale Models: 100B+ parameter models may require additional optimization
- Open-ended Generation: Requires more evaluation and possible method adjustments
- Multimodal Models: Needs extension to vision-language models
- Cross-lingual Unlearning: Needs consideration of multilingual knowledge associations
- Deeb & Roger (2025): RTT attack method revealing shallow unlearning problems
- Li et al. (2024): WMDP benchmark and RMU method
- Geva et al. (2021): Theoretical foundation of MLPs as key-value memory
- Hong et al. (2024): Empirical research on layer modification in unlearning
- Zhang et al. (2024): NPO method, current SOTA
- Liu, Liu, & Stone (2022): Foundational work on gradient difference methods
This is a high-quality research paper making substantial progress on the important problem of machine unlearning. The paper's main strengths are: (1) deep identification of fundamental flaws in existing methods (cover layer problem), (2) innovative and effective solution (knowledge density estimation + re-insertion strategy), (3) comprehensive experimental validation of method effectiveness.
Novelty: ★★★★☆ (4.5/5) - Re-insertion strategy is truly innovative, knowledge density estimation simple but effective
Technical Depth: ★★★★☆ (4/5) - Clever method design, but theoretical analysis could be deeper
Experimental Sufficiency: ★★★★★ (5/5) - Comprehensive experimental design, diverse evaluation metrics, thorough ablation studies
Practical Value: ★★★★★ (5/5) - Directly solves practical problems, code open-sourced, immediately applicable
Writing Quality: ★★★★☆ (4.5/5) - Clear and understandable, rigorous logic, effective visualizations
Overall Score: ★★★★☆ (4.4/5)
Recommended Reading: Strongly recommended for scholars and engineers working on LLM safety, privacy protection, and machine unlearning research. This paper not only provides effective technical solutions but more importantly offers deep insights into unlearning mechanisms.