2025-11-22T06:10:16.346479

Teaching Models to Understand (but not Generate) High-risk Data

Wang, Finlayson, Soldaini et al.

Basic Information

  • Paper ID: 2505.03052
  • Title: Teaching Models to Understand (but not Generate) High-risk Data
  • Authors: Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
  • Classification: cs.CL cs.LG
  • Conference: COLM 2025
  • Paper Link: https://arxiv.org/abs/2505.03052

Abstract

Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.

Research Background and Motivation

Problem Background

Language model development faces a basic tension: to prevent models from generating harmful content (such as toxic text or copyrighted material), developers typically filter such high-risk content out of the pre-training data. This improves safety, but it also limits the model's ability to recognize and appropriately respond to harmful or sensitive content.

Core Issues

  1. Side effects of data filtering: Completely removing high-risk data weakens the model's ability to understand such content
  2. Coupling of understanding and generation: The traditional next-token prediction objective inherently couples the model's understanding and generation capabilities
  3. Real-world deployment requirements: In practical applications, models need to identify and handle harmful requests, which requires some understanding of harmful content

Research Motivation

The authors propose to achieve "the best of both worlds": training models that can understand high-risk data while avoiding generating such content. This requires moving beyond standard next-token prediction objectives to decouple the model's understanding and generation capabilities.

Core Contributions

  1. Proposes SLUNG framework: A novel pre-training paradigm that decouples understanding from generation through selective loss functions
  2. Technical innovation: Designs differentiated training strategies based on token risk levels, including two implementations: Masked SLUNG and Unlikelihood SLUNG
  3. Experimental validation: Validates the method's effectiveness in two scenarios: toxic content understanding and fictitious entity learning
  4. Theoretical contribution: Provides new frameworks and insights for developing safe yet capable language models

Methodology Details

Task Definition

Given a pre-training document $X = (x_1, x_2, \ldots, x_{|X|})$, each token has a corresponding binary label in $(l_1, l_2, \ldots, l_{|X|})$, where $l_i \in \{0,1\}$ indicates whether the $i$-th token is a high-risk token ($l_i = 1$) or a low-risk token ($l_i = 0$).

The objective is to train a model such that it assigns high perplexity to high-risk spans while maintaining low perplexity for low-risk spans that may be conditioned on high-risk content.
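
One way to make "high perplexity" and "low perplexity" precise is the standard per-span perplexity. The block below is a sketch under that assumption: $p_\theta(x_i | x_{<i})$ denotes the model's next-token probability, and the index sets $S_1$ and $S_0$ are notation introduced here for illustration rather than taken from the paper.

```latex
S_1 = \{\, i : l_i = 1 \,\}, \qquad S_0 = \{\, i : l_i = 0 \,\}

\mathrm{PPL}_{S}(\theta) = \exp\!\left( -\frac{1}{|S|} \sum_{i \in S} \log p_\theta(x_i | x_{<i}) \right),
\qquad S \in \{ S_0, S_1 \}
```

Under this reading, the objective asks for a large $\mathrm{PPL}_{S_1}$ (high-risk tokens are unlikely to be generated) together with a small $\mathrm{PPL}_{S_0}$, where the conditioning context $x_{<i}$ of a low-risk token may itself contain high-risk tokens.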

Model Architecture

SLUNG Core Concept

The key innovation of SLUNG is applying different loss functions to tokens of different risk levels:

$$L(\theta, X) = -\sum_{i=1}^{|X|} \left[ \mathbf{1}[l_i=1] \, f_\theta(x_i | x_{<i}) + \mathbf{1}[l_i=0] \, \log p_\theta(x_i | x_{<i}) \right]$$

Where:

  • High-risk tokens ($l_i = 1$) use a custom loss function $f_\theta(x_i | x_{<i})$
  • Low-risk tokens ($l_i = 0$) use the standard maximum likelihood objective
  • All tokens remain in the model's context window
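
The selective loss can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `slung_loss`, the `high_risk_term` callable (standing in for $f_\theta$), and the toy probabilities are all assumptions made for the example. It takes per-token probabilities that a model has already computed with every token in context.

```python
import math
from typing import Callable, Sequence


def slung_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    high_risk_term: Callable[[float], float],
) -> float:
    """Selective loss for one document.

    probs[i]  : p_theta(x_i | x_<i), computed with ALL tokens in context.
    labels[i] : 1 for a high-risk token, 0 for a low-risk token.
    high_risk_term(p) : hypothetical stand-in for f_theta on high-risk tokens.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, label in zip(probs, labels):
        if label == 1:
            total += high_risk_term(p)   # custom term for high-risk tokens
        else:
            total += math.log(p)         # standard log-likelihood term
    return -total


# Toy document: the middle token is labeled high-risk.
probs = [0.5, 0.1, 0.25]
labels = [0, 1, 0]

# A term that is always zero drops the high-risk token from the loss.
print(slung_loss(probs, labels, lambda p: 0.0))
# Using log p for every token recovers ordinary next-token prediction.
print(slung_loss(probs, labels, math.log))
```

Only the loss changes between the two calls; the probabilities of the low-risk tokens are still conditioned on the high-risk token before them.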

Two Specific Implementations

1. Masked SLUNG: Sets $f_\theta(x_i | x_{<i}) = 0$ for high-risk tokens, masking their generation loss while keeping the tokens visible to the attention mechanism.

2. Unlikelihood SLUNG: Applies $f_\theta(x_i | x_{<i}) = \log(1 - p_\theta(x_i | x_{<i}))$ to high-risk tokens, explicitly penalizing the model for assigning high probability to high-risk tokens.
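
To see how the two variants treat a single high-risk token, the sketch below evaluates each choice of the high-risk term at a few probabilities. The function names and the `eps` clamp (added only to avoid `log(0)`) are assumptions for illustration and are not described in the source.

```python
import math


def masked_term(p: float) -> float:
    """Masked variant: the high-risk token contributes nothing to the loss."""
    return 0.0


def unlikelihood_term(p: float, eps: float = 1e-12) -> float:
    """Unlikelihood variant: log(1 - p); eps is an assumed numerical guard."""
    return math.log(max(1.0 - p, eps))


def penalty(p: float, term) -> float:
    """Loss contribution of one high-risk token (the loss negates the term)."""
    return -term(p)


for p in (0.1, 0.5, 0.9):
    print(p, penalty(p, masked_term), round(penalty(p, unlikelihood_term), 3))
```

The masked penalty stays at zero regardless of the probability. The unlikelihood penalty grows as the model puts more probability on the high-risk token, so it actively pushes that probability down.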

Technical Innovations

  1. Decoupling design: Presented as the first approach to decouple understanding from generation at the pre-training stage
  2. Context preservation: High-risk tokens are either excluded from the loss (Masked SLUNG) or penalized (Unlikelihood SLUNG), but they always remain in context, so the model still learns representations of them
  3. Indirect learning mechanism: By learning to predict low-risk tokens following high-risk content, the model is forced to understand the high-risk content
  4. Flexible framework: Can be combined with any risk detection classifier
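
Because the framework only needs per-token labels, any detector that flags risky spans can feed it. The sketch below assumes a hypothetical detector that returns character spans and uses whitespace tokenization for simplicity; neither is taken from the paper, and a real setup would use the model's tokenizer offsets.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_offsets(text: str) -> List[Span]:
    """Character offsets of whitespace-separated tokens (a simplification)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def labels_from_spans(offsets: List[Span], risky_spans: List[Span]) -> List[int]:
    """Label a token 1 if it overlaps any span flagged by the detector."""
    labels = []
    for t_start, t_end in offsets:
        overlaps = any(
            t_start < s_end and s_start < t_end for s_start, s_end in risky_spans
        )
        labels.append(1 if overlaps else 0)
    return labels


text = "you are a jerk honestly"
risky_spans = [(10, 14)]  # hypothetical detector output covering "jerk"
print(labels_from_spans(whitespace_offsets(text), risky_spans))
```

Swapping in a different classifier only changes how `risky_spans` is produced; the labels it yields plug into the same loss.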

Experimental Setup

Experiment 1: Toxic Content Understanding

Dataset

  • Base model: OLMo 1B (continued pre-training from checkpoint 737)
  • Training data: Last 40 billion tokens of the original Dolma dataset + injected toxic Reddit documents (approximately 212 million tokens, 5%)
  • Toxicity classification: Uses FastText toxicity classifier, categorizing content into Not Toxic, Possibly Toxic, and Definitely Toxic

Comparison Methods

  • Control (OLMo 1B): Original model unexposed to toxic data
  • Low-risk Baseline: Trained only on non-toxic Reddit content
  • Toxic Baseline: Standard maximum likelihood training on all data including toxic content
  • Masked SLUNG: Masks loss for Definitely Toxic and Possibly Toxic tokens
  • Unlikelihood SLUNG: Applies unlikelihood loss to Definitely Toxic tokens

Experiment 2: Fictitious Entity Learning

Dataset

  • TOFU dataset: Question-answer pairs with synthetic author profiles
  • Training setup: Fine-tuning only on answer columns, with entity names marked as high-risk tokens
  • Objective: Learn entity-related facts while avoiding generating entity names
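
Marking entity names as high-risk tokens could look like the sketch below. It assumes whitespace tokens and exact matching on a made-up name ("Jane Doe"), both chosen for illustration; the source does not describe the actual matching procedure.

```python
from typing import List


def mark_entity_tokens(tokens: List[str], name_tokens: List[str]) -> List[int]:
    """Return labels with 1 on every token inside an exact match of the name."""
    labels = [0] * len(tokens)
    n = len(name_tokens)
    if n == 0:
        return labels
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            for j in range(i, i + n):
                labels[j] = 1  # high-risk: excluded from or penalized in the loss
    return labels


answer = "Jane Doe writes mystery novels and Jane Doe lives abroad".split()
print(mark_entity_tokens(answer, ["Jane", "Doe"]))
```

Exact matching misses variants such as possessives or partial names, so a practical version would need a more forgiving matcher.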

Evaluation Metrics

Toxicity Experiment

  • Generation evaluation: Uses RealToxicityPrompts to assess the model's tendency to generate toxic content, scored via Perspective API
  • Understanding evaluation: Trains linear probes on CivilComments dataset to evaluate toxicity classification ability of model hidden states (AUROC)
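
AUROC, the metric reported for the linear probes, can be computed directly from its pairwise definition: the probability that a randomly chosen toxic example receives a higher probe score than a randomly chosen non-toxic one. The sketch below uses that definition on made-up scores; it does not train a probe and is quadratic in the number of examples, so it is meant for illustration only.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probe scores; label 1 = toxic, 0 = non-toxic.
scores = [0.9, 0.8, 0.3, 0.6]
labels = [1, 0, 0, 1]
print(auroc(scores, labels))  # 3 of the 4 pairs are ranked correctly -> 0.75
```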

Entity Learning Experiment

  • Generation evaluation: Measures the proportion of entity names in model outputs
  • Understanding evaluation: Uses GPT-4o to assess the correctness of model answers to factual questions
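
The generation metric can be sketched as follows, under the assumption that it is the fraction of model outputs containing the corresponding entity's name as a case-insensitive substring. The source does not spell out the exact matching rule, and the names and outputs below are invented for the example.

```python
from typing import Sequence


def name_generation_rate(outputs: Sequence[str], entity_names: Sequence[str]) -> float:
    """Fraction of outputs that mention their paired entity name."""
    if len(outputs) != len(entity_names):
        raise ValueError("outputs and entity_names must have the same length")
    if not outputs:
        raise ValueError("need at least one output")
    hits = sum(
        1 for out, name in zip(outputs, entity_names) if name.lower() in out.lower()
    )
    return hits / len(outputs)


outputs = [
    "She writes mystery novels.",
    "Jane Doe was born in 1970.",
    "The author lives abroad.",
    "He won a national award.",
]
names = ["Jane Doe", "Jane Doe", "John Roe", "John Roe"]
print(name_generation_rate(outputs, names))  # one of four outputs names the entity
```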

Experimental Results

Main Findings

Toxicity Experiment Core Results

  1. Pareto optimality: SLUNG methods lie on the Pareto frontier of the understanding-generation tradeoff, improving toxicity understanding while keeping toxic generation below the Toxic Baseline
  2. Understanding improvement: Masked SLUNG and Unlikelihood SLUNG achieve AUROC of approximately 0.825 and 0.820 on CivilComments, compared with 0.810 for the Control baseline
  3. Generation safety: Both SLUNG methods keep toxicity generation scores around 0.165, below the Toxic Baseline's 0.175
  4. Post-instruction-tuning persistence: SLUNG methods maintain Pareto optimality even after instruction tuning
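
Pareto optimality here means that no other method is at least as good on both axes and strictly better on one, with understanding to be maximized and toxic generation to be minimized. The sketch below checks that condition on hypothetical points; the method names and numbers are placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple

# name -> (understanding score, higher is better; generation score, lower is better)
Points = Dict[str, Tuple[float, float]]


def pareto_front(points: Points) -> List[str]:
    """Names of methods that no other method dominates."""
    front = []
    for name, (u, g) in points.items():
        dominated = any(
            (u2 >= u and g2 <= g) and (u2 > u or g2 < g)
            for other, (u2, g2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (AUROC, toxicity) pairs for four placeholder methods.
points = {
    "A": (0.810, 0.160),
    "B": (0.825, 0.165),
    "C": (0.820, 0.175),
    "D": (0.830, 0.180),
}
print(pareto_front(points))  # C is dominated by B; A, B and D remain
```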

Data Scale Effects

As toxic data increases from 20M to 320M tokens:

  • Masked SLUNG consistently maintains the best understanding-generation tradeoff
  • Understanding ability improves linearly with data volume, while toxic generation grows only slowly
  • This suggests that SLUNG scales well as more high-risk data is added

Entity Learning Experiment Results

| Method | Name Generation Rate ↓ | Fully Correct ↑ | Partially Correct ↑ |
|---|---|---|---|
| OLMo 1B | 57.5% | 3.5% | 15.5% |
| Direct training | 34.3±9.2% | 28.2±0.6% | 51.4±0.7% |
| Masked SLUNG | 4.1±1.2% | 20.8±1.9% | 44.0±2.1% |
| Unlikelihood SLUNG | 1.5±0.7% | 22.3±2.1% | 43.6±3.2% |

Ablation Studies

Perplexity Analysis

  • All methods show no significant differences in perplexity on Dolma documents, indicating SLUNG does not harm general language modeling ability
  • Masked SLUNG achieves lowest perplexity on non-toxic Reddit documents
  • Unlikelihood SLUNG shows higher perplexity on Reddit domain, possibly because unlikelihood loss affects generation distribution in that domain
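
Perplexity comparisons like these can be reproduced from per-token log-probabilities. The sketch below assumes natural-log probabilities from some model and an optional 0/1 mask selecting which tokens to score (for example, only low-risk tokens); the values are made up for illustration.

```python
import math
from typing import Optional, Sequence


def perplexity(log_probs: Sequence[float], mask: Optional[Sequence[int]] = None) -> float:
    """exp of the negative mean log-probability over the selected tokens."""
    if mask is None:
        mask = [1] * len(log_probs)
    if len(mask) != len(log_probs):
        raise ValueError("mask and log_probs must have the same length")
    selected = [lp for lp, m in zip(log_probs, mask) if m]
    if not selected:
        raise ValueError("no tokens selected")
    return math.exp(-sum(selected) / len(selected))


# Hypothetical token probabilities: 1/2, 1/4, 1/2, 1/8.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]

print(perplexity(log_probs))                # over all tokens
print(perplexity(log_probs, [1, 0, 1, 0]))  # only tokens 1 and 3, each with p = 1/2
```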

Case Analysis

In TOFU experiments, SLUNG models learned to answer questions using pronouns ("he", "she") or omitting subjects, successfully avoiding generating entity names while preserving factual information.

Related Work

Data Filtering and High-risk Content Handling

  • Existing work primarily addresses high-risk content through filtering
  • Grattafiori et al. (2024), Soldaini et al. (2024) and others employ document-level or span-level filtering
  • While these methods improve safety, they sacrifice data diversity

Training Methods to Prevent Undesirable Generation

  • Unlikelihood training: Penalizes high probability for undesirable sequences
  • Contrastive learning: Promotes preferred candidates through contrast
  • RLHF: Suppresses harmful generation through human feedback
  • These methods primarily focus on suppressing generation without explicitly evaluating understanding ability

Decoding-time Methods

  • Classifier-guided decoding: Uses auxiliary classifiers to adjust generation probabilities
  • Control token methods: Conditions generation through special tokens
  • DExperts: Uses "good" and "bad" expert models to guide generation

Conclusions and Discussion

Main Conclusions

  1. SLUNG successfully decouples language model understanding from generation capabilities, providing a new paradigm for safe AI development
  2. The method performs excellently in two different scenarios (toxic content and entity learning), demonstrating its generality
  3. SLUNG enables models to benefit from high-risk text that would otherwise be filtered, improving data utilization efficiency

Limitations

  1. Computational budget constraints: Experiments use continued pre-training rather than training from scratch, potentially underestimating the method's full potential
  2. Classifier dependency: Method effectiveness depends on the quality of risk detection classifiers
  3. Evaluation scope: Validation primarily on 1B parameter models; effectiveness on larger models remains to be verified
  4. Domain specificity: Unlikelihood SLUNG may affect generation ability in specific domains

Future Directions

  1. Large-scale pre-training: Evaluate SLUNG effectiveness in complete pre-training settings
  2. Adversarial robustness research: Explore SLUNG's resistance to jailbreak attacks
  3. Classifier improvement: Develop more accurate risk detection systems
  4. Theoretical analysis: Deepen understanding of the theoretical foundations of the decoupling mechanism

In-depth Evaluation

Strengths

  1. Strong novelty: Presented as the first approach to decouple understanding from generation at the pre-training stage
  2. High practical value: Addresses important problems in AI safety with broad application prospects
  3. Comprehensive experiments: Validation in two different scenarios with multiple comparison methods and ablation studies
  4. Simple methodology: Relatively straightforward implementation, easy to reproduce and apply
  5. Clear theory: Well-articulated decoupling mechanism with rigorous mathematical formulation

Weaknesses

  1. Scale limitations: Experiments primarily on small-scale models; effectiveness on large models unknown
  2. Evaluation limitations: Toxicity detection depends on specific classifiers, potentially introducing bias
  3. Long-term effects: Lacks evaluation of method's impact on model long-term behavior
  4. Computational overhead: Requires additional risk annotation, increasing preprocessing costs

Impact

  1. Academic contribution: Provides new insights for AI safety research, potentially inspiring follow-up work
  2. Practical value: Offers direct guidance for industrial language model development
  3. Reproducibility: Authors commit to open-sourcing code, facilitating community verification and extension

Applicable Scenarios

  1. Content moderation systems: Applications requiring identification but not generation of harmful content
  2. Copyright protection: Scenarios involving learning copyrighted content while avoiding direct reproduction
  3. Sensitive information handling: Systems that understand but do not leak private information
  4. Educational applications: Scenarios requiring understanding of inappropriate content for educational purposes without propagation

References

The paper cites multiple important works, including:

  • Longpre et al. (2023): Research on pre-training data's impact on model capabilities
  • Welleck et al. (2019): Original work on unlikelihood training
  • Soldaini et al. (2024): Construction and filtering methods for Dolma dataset
  • Gehman et al. (2020): RealToxicityPrompts evaluation benchmark

This paper makes important methodological contributions to language model safety training, achieving decoupling of understanding and generation through clever loss function design, laying the foundation for future safe AI research.