Teaching Models to Understand (but not Generate) High-risk Data

Wang, Finlayson, Soldaini et al.

Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.

Basic Information

Paper ID: 2505.03052
Title: Teaching Models to Understand (but not Generate) High-risk Data
Authors: Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
Classification: cs.CL cs.LG
Conference: COLM 2025
Paper Link: https://arxiv.org/abs/2505.03052

Research Background and Motivation

Problem Background

Current language model development faces a fundamental contradiction: to prevent models from generating harmful content (such as toxic text, copyrighted material, etc.), developers typically filter out such high-risk content from pre-training data. However, while this approach improves model safety, it limits the model's ability to recognize and appropriately respond to harmful or sensitive content.

Core Issues

Side effects of data filtering: Removing high-risk data entirely weakens the model's ability to understand such content
Coupling of understanding and generation: The traditional next-token prediction objective inherently couples the model's understanding and generation capabilities
Real-world deployment requirements: In practical applications, models need to identify and handle harmful requests, which requires some understanding of harmful content

Research Motivation

The authors propose to achieve "the best of both worlds": training models that can understand high-risk data while avoiding generating such content. This requires moving beyond standard next-token prediction objectives to decouple the model's understanding and generation capabilities.

Core Contributions

Proposes SLUNG framework: A novel pre-training paradigm that decouples understanding from generation through selective loss functions
Technical innovation: Designs differentiated training strategies based on token risk levels, including two implementations: Masked SLUNG and Unlikelihood SLUNG
Experimental validation: Validates the method's effectiveness in two scenarios: toxic content understanding and fictitious entity learning
Theoretical contribution: Provides new frameworks and insights for developing safe yet capable language models

Methodology Details

Task Definition

Given a pre-training document $X = (x_1, x_2, \ldots, x_{|X|})$, each token has a corresponding binary label in $(l_1, l_2, \ldots, l_{|X|})$, where $l_i \in \{0,1\}$ indicates whether the $i$-th token is a high-risk token ($l_i = 1$) or a low-risk token ($l_i = 0$).

The objective is to train a model such that it assigns high perplexity to high-risk spans while maintaining low perplexity for low-risk spans that may be conditioned on high-risk content.
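The objective above can be checked by computing perplexity separately over high-risk and low-risk positions. The following is a minimal sketch, assuming the per-token probabilities $p_\theta(x_i | x_{<i})$ have already been extracted from a model; the function names and the toy numbers are illustrative and do not come from the paper.

```python
import math

def perplexity(probs):
    """Perplexity = exp(mean negative log-likelihood) over the given tokens."""
    if not probs:
        raise ValueError("need at least one token probability")
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def perplexity_by_risk(probs, labels):
    """Split token probabilities by label l_i and report perplexity per group.

    probs[i]  : probability the model assigns to the observed token x_i
    labels[i] : 1 for high-risk, 0 for low-risk
    """
    high = [p for p, l in zip(probs, labels) if l == 1]
    low = [p for p, l in zip(probs, labels) if l == 0]
    return {"high_risk": perplexity(high), "low_risk": perplexity(low)}

# Toy values (made up): the model is confident on low-risk tokens
# and assigns little mass to high-risk ones.
result = perplexity_by_risk([0.5, 0.01, 0.5, 0.01], [0, 1, 0, 1])
```

Under the stated objective, a well-trained model would show a large `high_risk` value alongside a small `low_risk` value.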

Model Architecture

SLUNG Core Concept

The key innovation of SLUNG is applying different loss functions to tokens of different risk levels:

$L(\theta, X) = -\sum_{i=1}^{|X|} \left[ \mathbf{1}[l_i=1] f_\theta(x_i | x_{<i}) + \mathbf{1}[l_i=0] \log p_\theta(x_i | x_{<i}) \right]$

Where:

High-risk tokens ($l_i = 1$) use a custom loss function $f_\theta(x_i | x_{<i})$
Low-risk tokens ($l_i = 0$) use the standard maximum likelihood objective
All tokens remain in the model's context window

Two Specific Implementations

1. Masked SLUNG: Sets $f_\theta(x_i | x_{<i}) = 0$ for high-risk tokens, masking their generation loss while keeping the tokens visible to the attention mechanism.

2. Unlikelihood SLUNG: Applies $f_\theta(x_i | x_{<i}) = \log(1 - p_\theta(x_i | x_{<i}))$ to high-risk tokens, explicitly penalizing the model for assigning high probability to high-risk tokens.
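The two variants differ only in the term applied at high-risk positions, which a short sketch makes concrete. It assumes the per-token probabilities of the observed tokens are already available as plain floats. The function name `slung_loss`, the `mode` argument, and the `eps` clamp are illustrative choices rather than the authors' implementation.

```python
import math

def slung_loss(probs, labels, mode="masked", eps=1e-12):
    """Evaluate L(theta, X) for one document.

    probs[i]  : p_theta(x_i | x_<i), probability of the observed token
    labels[i] : 1 for a high-risk token, 0 for a low-risk token
    mode      : "masked"       -> f = 0
                "unlikelihood" -> f = log(1 - p)
    eps       : clamp to avoid log(0); an assumption for numerical safety
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, l in zip(probs, labels):
        if l == 0:
            # Low-risk token: standard log-likelihood term.
            total += math.log(max(p, eps))
        elif mode == "masked":
            # High-risk token contributes nothing to the loss.
            total += 0.0
        elif mode == "unlikelihood":
            # High-risk token: reward low probability on the observed token.
            total += math.log(max(1.0 - p, eps))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return -total

probs = [0.5, 0.9, 0.25]   # toy values, not from the paper
labels = [0, 1, 0]         # the middle token is high-risk
masked = slung_loss(probs, labels, mode="masked")
unlikelihood = slung_loss(probs, labels, mode="unlikelihood")
```

In both modes the high-risk token still sits in the prefix $x_{<i}$ of every later token, so the low-risk terms that follow it are conditioned on it.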

Technical Innovations

Decoupling design: First to achieve decoupling of understanding and generation capabilities at the pre-training stage
Context preservation: High-risk tokens are either excluded from the loss (Masked SLUNG) or penalized (Unlikelihood SLUNG), but they always remain in context, so the model still learns their representations
Indirect learning mechanism: By learning to predict low-risk tokens following high-risk content, the model is forced to understand the high-risk content
Flexible framework: Can be combined with any risk detection classifier

Experimental Setup

Experiment 1: Toxic Content Understanding

Dataset

Base model: OLMo 1B (continued pre-training from checkpoint 737)
Training data: Last 40 billion tokens of the original Dolma dataset + injected toxic Reddit documents (approximately 212 million tokens, 5%)
Toxicity classification: Uses FastText toxicity classifier, categorizing content into Not Toxic, Possibly Toxic, and Definitely Toxic

Comparison Methods

Control (OLMo 1B): Original model unexposed to toxic data
Low-risk Baseline: Trained only on non-toxic Reddit content
Toxic Baseline: Standard maximum likelihood training on all data including toxic content
Masked SLUNG: Masks loss for Definitely Toxic and Possibly Toxic tokens
Unlikelihood SLUNG: Applies unlikelihood loss to Definitely Toxic tokens
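The comparison methods can be read as different mappings from a token's toxicity category to the loss term applied to it. The sketch below is a hypothetical illustration: the category strings and the function name are made up. The summary above does not say how Unlikelihood SLUNG treats Possibly Toxic tokens, so that choice is exposed as a parameter whose default ("mask") is an assumption.

```python
NOT_TOXIC = "not_toxic"
POSSIBLY_TOXIC = "possibly_toxic"
DEFINITELY_TOXIC = "definitely_toxic"

def token_treatments(categories, method, possibly_toxic_under_unlikelihood="mask"):
    """Map per-token toxicity categories to a loss treatment.

    Returns a list with one of "mle", "mask", or "unlikelihood" per token.
    `possibly_toxic_under_unlikelihood` is an ASSUMPTION, not stated in the text.
    """
    if method == "toxic_baseline":
        # Standard maximum likelihood on everything, toxic or not.
        table = {NOT_TOXIC: "mle", POSSIBLY_TOXIC: "mle", DEFINITELY_TOXIC: "mle"}
    elif method == "masked":
        # Loss is masked for both Possibly Toxic and Definitely Toxic tokens.
        table = {NOT_TOXIC: "mle", POSSIBLY_TOXIC: "mask", DEFINITELY_TOXIC: "mask"}
    elif method == "unlikelihood":
        # Unlikelihood loss on Definitely Toxic tokens.
        table = {
            NOT_TOXIC: "mle",
            POSSIBLY_TOXIC: possibly_toxic_under_unlikelihood,
            DEFINITELY_TOXIC: "unlikelihood",
        }
    else:
        raise ValueError(f"unknown method: {method}")
    return [table[c] for c in categories]

cats = [NOT_TOXIC, POSSIBLY_TOXIC, DEFINITELY_TOXIC]
masked_plan = token_treatments(cats, "masked")
```

The Low-risk Baseline is not represented here because it differs in which documents are used for training, not in the per-token loss.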

Experiment 2: Fictitious Entity Learning

Dataset

TOFU dataset: Question-answer pairs with synthetic author profiles
Training setup: Fine-tuning only on answer columns, with entity names marked as high-risk tokens
Objective: Learn entity-related facts while avoiding generating entity names

Evaluation Metrics

Toxicity Experiment

Generation evaluation: Uses RealToxicityPrompts to assess the model's tendency to generate toxic content, scored via Perspective API
Understanding evaluation: Trains linear probes on CivilComments dataset to evaluate toxicity classification ability of model hidden states (AUROC)
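For readers unfamiliar with the understanding metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. In the setup described above the scores would come from a linear probe on hidden states, whereas here they are toy numbers and the function name is illustrative.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    scores[i] : probe score for example i (higher = more likely toxic)
    labels[i] : 1 for toxic, 0 for non-toxic
    Ties count as half a win. O(P * N) time, fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy probe scores: one toxic example (0.3) is ranked below a non-toxic one (0.8).
value = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to a probe that ranks at chance and 1.0 to perfect separation.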

Entity Learning Experiment

Generation evaluation: Measures the proportion of entity names in model outputs
Understanding evaluation: Uses GPT-4o to assess the correctness of model answers to factual questions
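The generation metric for this experiment can be sketched as a simple string check. The exact matching rule is not given in the text above, so this sketch assumes an output counts as a hit if it contains any entity name as a case-insensitive substring. The function name and the example author name are hypothetical.

```python
def name_generation_rate(outputs, entity_names):
    """Fraction of model outputs that mention at least one entity name.

    ASSUMPTION: case-insensitive substring matching; the paper's exact
    matching rule may differ.
    """
    if not outputs:
        raise ValueError("need at least one output")
    lowered = [name.lower() for name in entity_names]
    hits = sum(
        1 for out in outputs
        if any(name in out.lower() for name in lowered)
    )
    return hits / len(outputs)

outputs = [
    "Jane Example wrote three novels.",   # names the entity
    "She wrote three novels.",            # pronoun instead of the name
    "The author was born in 1970.",       # no name
]
rate = name_generation_rate(outputs, ["Jane Example"])
```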

Experimental Results

Main Findings

Toxicity Experiment Core Results

Pareto optimality: SLUNG methods lie on the Pareto frontier of the understanding-generation tradeoff, improving toxicity understanding over the Control model while generating less toxic text than the Toxic Baseline
Understanding improvement: Masked SLUNG and Unlikelihood SLUNG reach AUROC of approximately 0.825 and 0.820 on CivilComments, compared with 0.810 for the Control baseline
Generation safety: Both SLUNG methods keep toxicity generation scores around 0.165, below the Toxic Baseline's 0.175
Post-instruction-tuning persistence: SLUNG methods remain Pareto-optimal after instruction tuning
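Pareto optimality here means that no other method is at least as good on both axes and strictly better on one. A minimal sketch of that check follows, with understanding treated as higher-is-better and toxic generation as lower-is-better. The method names "A", "B", "C" and their scores are invented for illustration and are not the paper's results.

```python
def pareto_front(points):
    """Return the names of non-dominated methods, sorted alphabetically.

    points: dict mapping method name -> (understanding, generation)
            understanding: higher is better (e.g. probe AUROC)
            generation:    lower is better (e.g. toxicity score)
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both axes and better on at least one.
        no_worse = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return no_worse and strictly_better

    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

# Illustrative values only.
methods = {"A": (0.82, 0.16), "B": (0.81, 0.18), "C": (0.80, 0.15)}
front = pareto_front(methods)  # "B" is dominated by "A"; "A" and "C" trade off
```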

Data Scale Effects

As toxic data increases from 20M to 320M tokens:

Masked SLUNG consistently maintains the best understanding-generation tradeoff
Understanding ability improves linearly with data volume, while toxic generation grows only slowly
This suggests that SLUNG scales well as the amount of high-risk data increases

Entity Learning Experiment Results

Method	Name Generation Rate↓	Fully Correct↑	Partially Correct↑
OLMo 1B	57.5%	3.5%	15.5%
Direct training	34.3±9.2%	28.2±0.6%	51.4±0.7%
Masked SLUNG	4.1±1.2%	20.8±1.9%	44.0±2.1%
Unlikelihood SLUNG	1.5±0.7%	22.3±2.1%	43.6±3.2%
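The name generation rates in the table are easier to compare as reductions relative to direct training. The arithmetic below uses only the mean values reported above and ignores the ± spreads; the helper name is illustrative.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline` (both in the same units)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - value) / baseline

direct = 34.3        # Direct training, name generation rate (%)
masked = 4.1         # Masked SLUNG (%)
unlikelihood = 1.5   # Unlikelihood SLUNG (%)

masked_cut = relative_reduction(direct, masked)              # roughly 0.88
unlikelihood_cut = relative_reduction(direct, unlikelihood)  # roughly 0.96

# Cost in fully correct answers, in percentage points, from the same table.
masked_drop = 28.2 - 20.8
unlikelihood_drop = 28.2 - 22.3
```

Both variants remove most name generation, at a cost of roughly six to seven percentage points in fully correct answers compared with direct training.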

Ablation Studies

Perplexity Analysis

All methods show no significant differences in perplexity on Dolma documents, indicating SLUNG does not harm general language modeling ability
Masked SLUNG achieves lowest perplexity on non-toxic Reddit documents
Unlikelihood SLUNG shows higher perplexity on Reddit domain, possibly because unlikelihood loss affects generation distribution in that domain

Case Analysis

In the TOFU experiments, SLUNG models learned to answer questions by using pronouns ("he", "she") or by omitting the subject, which avoids generating entity names while preserving the factual information.

Related Work

Data Filtering and High-risk Content Handling

Existing work primarily addresses high-risk content through filtering
Grattafiori et al. (2024), Soldaini et al. (2024) and others employ document-level or span-level filtering
While these methods improve safety, they sacrifice data diversity

Training Methods to Prevent Undesirable Generation

Unlikelihood training: Penalizes high probability for undesirable sequences
Contrastive learning: Promotes preferred candidates through contrast
RLHF: Suppresses harmful generation through human feedback
These methods primarily focus on suppressing generation without explicitly evaluating understanding ability

Decoding-time Methods

Classifier-guided decoding: Uses auxiliary classifiers to adjust generation probabilities
Control token methods: Conditions generation through special tokens
DExperts: Uses "good" and "bad" expert models to guide generation

Conclusions and Discussion

Main Conclusions

SLUNG successfully decouples language model understanding from generation capabilities, providing a new paradigm for safe AI development
The method improves understanding while limiting generation in two different scenarios (toxic content and entity learning), which suggests that it generalizes across kinds of high-risk data
SLUNG enables models to benefit from high-risk text that would otherwise be filtered, improving data utilization efficiency

Limitations

Computational budget constraints: Experiments use continued pre-training rather than training from scratch, potentially underestimating the method's full potential
Classifier dependency: Method effectiveness depends on the quality of risk detection classifiers
Evaluation scope: Validation primarily on 1B parameter models; effectiveness on larger models remains to be verified
Domain specificity: Unlikelihood SLUNG may affect generation ability in specific domains

Future Directions

Large-scale pre-training: Evaluate SLUNG effectiveness in complete pre-training settings
Adversarial robustness research: Explore SLUNG's resistance to jailbreak attacks
Classifier improvement: Develop more accurate risk detection systems
Theoretical analysis: Deepen understanding of the theoretical foundations of the decoupling mechanism

In-depth Evaluation

Strengths

Strong novelty: First to decouple understanding from generation at the pre-training stage
High practical value: Addresses important problems in AI safety with broad application prospects
Comprehensive experiments: Validation in two different scenarios with multiple comparison methods and ablation studies
Simple methodology: Relatively straightforward implementation, easy to reproduce and apply
Clear theory: Well-articulated decoupling mechanism with rigorous mathematical formulation

Weaknesses

Scale limitations: Experiments primarily on small-scale models; effectiveness on large models unknown
Evaluation limitations: Toxicity detection depends on specific classifiers, potentially introducing bias
Long-term effects: Lacks an evaluation of the method's impact on the model's long-term behavior
Computational overhead: Requires additional risk annotation, increasing preprocessing costs

Impact

Academic contribution: Provides new insights for AI safety research, potentially inspiring follow-up work
Practical value: Offers direct guidance for industrial language model development
Reproducibility: Authors commit to open-sourcing code, facilitating community verification and extension

Applicable Scenarios

Content moderation systems: Applications requiring identification but not generation of harmful content
Copyright protection: Scenarios involving learning copyrighted content while avoiding direct reproduction
Sensitive information handling: Systems that understand but do not leak private information
Educational applications: Scenarios requiring understanding of inappropriate content for educational purposes without propagation

References

The paper cites multiple important works, including:

Longpre et al. (2023): Research on pre-training data's impact on model capabilities
Welleck et al. (2019): Original work on unlikelihood training
Soldaini et al. (2024): Construction and filtering methods for Dolma dataset
Gehman et al. (2020): RealToxicityPrompts evaluation benchmark

This paper makes important methodological contributions to language model safety training, achieving decoupling of understanding and generation through clever loss function design, laying the foundation for future safe AI research.