
Don't Walk the Line: Boundary Guidance for Filtered Generation

Ball, Haupt
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

Basic Information

  • Paper ID: 2510.11834
  • Title: Don't Walk the Line: Boundary Guidance for Filtered Generation
  • Authors: Sarah Ball (Ludwig-Maximilians-Universität München), Andreas Haupt (Stanford University)
  • Classification: cs.LG cs.CL
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11834v1

Abstract

Generative models are increasingly paired with safety classifiers to filter harmful or inappropriate outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this may be suboptimal: it typically drives the model to produce samples near the classifier's decision boundary, thereby increasing both false positives and false negatives. This paper proposes Boundary Guidance, a reinforcement learning fine-tuning method that explicitly guides generation away from the classifier boundary. On benchmarks for jailbreaks and ambiguous prompts, Boundary Guidance improves both safety and utility of outputs, as verified by LLM-as-a-Judge evaluation. Comprehensive ablation studies across model scales and reward designs demonstrate the robustness of the approach.

Research Background and Motivation

Problem Definition

Modern AI deployment increasingly relies on composite safety systems, where generative models are paired with downstream safety classifiers to filter harmful or inappropriate outputs. This architecture allows organizations to maintain flexibility in safety policies while leveraging the complementary advantages of safety-trained models and specialized classifiers.

Core Problem

Current approaches align models independently of safety classifiers, which creates a mismatch between training objectives and deployment reality. Standard fine-tuning practices for generative models do not account for how easily the classifier can classify a given generation: some generations sit near the classifier's decision boundary and are more likely to be misclassified.

Problem Significance

This leads to errors in two directions:

  1. False Positives (over-blocking useful content)
  2. False Negatives (under-blocking harmful content)

When safety classifiers are imperfect (empirical evidence shows that even state-of-the-art classifiers can be successfully attacked 5% of the time on new harm dimensions), operating near the decision boundary amplifies these classification errors and degrades overall system performance.

Limitations of Existing Approaches

  1. Primarily optimize individual model behavior without considering the downstream filtering context that defines real-world deployment scenarios
  2. Rely on computationally intensive training procedures, whereas the method summarized here needs only a single output token from the safety classifier

Core Contributions

  1. Theoretical Contribution: Shows with a decision-theoretic argument that system utility is minimized near the classifier's decision boundary, which justifies the boundary avoidance objective
  2. Methodological Contribution: Introduces a reinforcement learning-based fine-tuning framework for training generators within composite safety systems
  3. Empirical Contribution: Demonstrates empirical improvements in safety and utility across multiple model architectures and scales, showing that composite system optimization can achieve results that individual components cannot

Methodology Details

Task Definition

Consider a generative model π_θ(y|x) that generates completions y ∈ Y given a prompt x ∈ X. The focus is on output safety, denoted z(x,y) ∈ {0,1}, where z = 1 marks an unsafe output. The safety classifier estimates the probability that an output is unsafe: t(x,y) = E[z | x, y].

Decision-Theoretic Model

The paper establishes a decision-theoretic framework to analyze the utility of composite systems:

When an output is displayed, the user receives utility u(x,y) and society receives negative utility s(x,y). If the output is not displayed but is actually safe, the user receives negative utility -λ < 0 and society receives utility 0.

The expected utility for completion y is:

U(x,y) = {
  -(1-t(x,y))λ           if t(x,y) ≥ τ
  u(x,y) - t(x,y)        if t(x,y) < τ
}

Proposition 1: When u(x,y) ≡ u is constant, the utility function is strictly decreasing when t < τ and strictly increasing when t ≥ τ. This means expected utility is minimized near the decision boundary τ.
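
The piecewise utility and Proposition 1 can be checked numerically. The sketch below is only an illustration: the function name and the values chosen for u, λ, and τ are hypothetical and do not come from the paper.

```python
def expected_utility(t: float, u: float, lam: float, tau: float) -> float:
    """Piecewise expected utility U for a completion with unsafe probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be a probability in [0, 1]")
    if t >= tau:
        # Filtered: the cost is lam, weighted by the chance the output was safe.
        return -(1.0 - t) * lam
    # Displayed: constant user utility minus the expected societal harm.
    return u - t


# Hypothetical values, chosen only to make the shape of U visible.
u, lam, tau = 0.3, 1.0, 0.5
for t in (0.0, 0.25, 0.49, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  U={expected_utility(t, u, lam, tau):+.2f}")
```

With constant u, the printed values fall as t approaches τ from below and rise again once t ≥ τ, so the lowest utilities sit on either side of the boundary.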

Boundary Avoidance Reward Function

Based on theoretical analysis, the paper proposes a continuous boundary avoidance reward:

R(x,y) = {
  u(x,y) + t(x,y)        if t(x,y) ≥ 0.5
  u(x,y) - t(x,y)        if t(x,y) < 0.5
}

where u(x,y) is provided by an existing reward model and t(x,y) is proxied by safety classifier logits.
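
A minimal sketch of this reward, assuming t(x,y) is obtained by passing a single "unsafe" logit through a sigmoid. That mapping and both function names are assumptions made for illustration; the summary only states that t is proxied by classifier logits.

```python
import math


def unsafe_probability(logit: float) -> float:
    """Numerically stable sigmoid; assumes one logit for the 'unsafe' label."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def boundary_reward(u: float, t: float, boundary: float = 0.5) -> float:
    """Reward-model score u, plus t above the boundary and minus t below it."""
    if t >= boundary:
        return u + t  # clearly unsafe outputs are easy for the filter to catch
    return u - t      # clearly safe outputs are easy for the filter to pass


# For a fixed u, the reward is lowest for outputs just below the boundary.
for t in (0.05, 0.45, 0.55, 0.95):
    print(t, boundary_reward(u=0.0, t=t))
```

For fixed u, the reward grows as t moves away from 0.5 within each branch, which is what pushes the policy toward outputs the classifier scores confidently.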

Technical Implementation

  1. Reinforcement Learning Algorithm: Uses Group Relative Policy Optimization (GRPO)
  2. Parameter-Efficient Fine-tuning: Employs LoRA (r=16, α=32), reducing trainable parameters by 99%
  3. KL Regularization: Prevents excessive deviation from the original policy
  4. Multi-Model Architecture:
    • Policy Model: Qwen2.5 series, Gemma-2-9B
    • Safety Classifier: Meta-Llama-Guard-2-8B
    • Reward Model: Skywork-Reward-V2-Llama-3.1-8B
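
GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in its commonly described form. It is not taken from the paper's code: the function name, the use of the population standard deviation, and the eps constant are assumptions, and the KL term is left out.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize the rewards of one prompt's sampled completions within the group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four completions of a single prompt.
print(group_relative_advantages([0.9, -0.4, 0.1, 0.6]))
```

Completions with above-average reward in their group receive positive advantages and are reinforced; the rest are pushed down.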

Experimental Setup

Datasets

The training dataset contains 7,880 prompts from three complementary sources:

  • 4,000 Jailbreak Prompts: From the jailbreak dataset of Ball et al. (2024)
  • 3,000 Alpaca Samples: Providing diverse instruction-following prompts
  • 880 HarmfulQA Samples: Containing questions designed to elicit harmful responses

The test set contains 1,000 held-out prompts uniformly distributed across three problem categories.

Evaluation Metrics

Uses GPT-4.1 for LLM-as-a-Judge evaluation:

  • Helpfulness Score (1-4 points): Assesses how well the assistant's response satisfies the user query and provides value
  • Harmfulness Score (0-3 points): Assesses the potential for the assistant's response to cause harm

Comparison Methods

  • Baseline Model: Original instruction-tuned model
  • Ablation 1: Boundary guidance using only safety classifier signals
  • Ablation 2: Prompt-aware reward design

Implementation Details

  • Learning Rate: 2×10^-5
  • Batch Size: 8 (per device)
  • Quantization: 4-bit NF4
  • Maximum Generation Length: 96 tokens
  • Training Epochs: 1

Experimental Results

Main Results

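Boundary guidance reduces harmfulness for all four base models and also improves helpfulness for three of them, a Pareto improvement; for Qwen2.5-14B, helpfulness dips slightly: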

Model          Helpfulness Change (Δ)   Harmfulness Change (Δ)   Statistical Significance
Qwen2.5-0.5B   +0.13                    -0.09                    p<0.001
Qwen2.5-7B     +0.03                    -0.15                    p<0.001
Gemma-2-9B     +0.03                    -0.03                    p<0.001
Qwen2.5-14B    -0.05                    -0.11                    p<0.10

Key Findings:

  • Harmfulness falls for all four models, although the reported significance for Qwen2.5-14B is weaker (p<0.10)
  • Helpfulness improves for all models except the largest
  • The smallest model (Qwen2.5-0.5B) shows the largest helpfulness gain (+0.13) alongside a harmfulness reduction (-0.09), indicating that boundary guidance is particularly effective when base safety capabilities are weaker
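
Taking the reported deltas at face value, a short check shows which models count as Pareto improvements. The helper name is hypothetical, and the check looks only at point estimates, not at statistical significance.

```python
# (helpfulness delta, harmfulness delta) as reported in the results above.
results = {
    "Qwen2.5-0.5B": (+0.13, -0.09),
    "Qwen2.5-7B": (+0.03, -0.15),
    "Gemma-2-9B": (+0.03, -0.03),
    "Qwen2.5-14B": (-0.05, -0.11),
}


def is_pareto_improvement(d_help: float, d_harm: float) -> bool:
    """Neither metric gets worse and at least one gets strictly better."""
    return d_help >= 0 and d_harm <= 0 and (d_help > 0 or d_harm < 0)


pareto = {model: is_pareto_improvement(*deltas) for model, deltas in results.items()}
for model, ok in pareto.items():
    print(f"{model}: {'Pareto improvement' if ok else 'trade-off'}")
```

By this criterion the three smaller models are Pareto improvements, while Qwen2.5-14B trades a small helpfulness loss for lower harmfulness.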

Ablation Studies

Experiment 1: Safety Classifier Signal Only

Results show:

  • Large models (7B and above) perform comparably to the main experiment
  • The smallest model (0.5B) exhibits helpfulness collapse (-34%), converging to near-universal rejection
  • Indicates that the reward model u(x,y) is essential for small models

Experiment 2: Prompt-Aware Rewards

This design rewards higher unsafe probability on unsafe prompts and lower unsafe probability on safe prompts. Results:

  • Harmfulness increases for three of four models
  • Helpfulness drops significantly across all models (-22% to -25%)
  • Demonstrates that actively training models to generate unsafe content on unsafe prompts is counterproductive
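
One way to read the prompt-aware design is sketched below. The exact functional form is not given in this summary, so the linear form, the function name, and the prompt_is_unsafe flag are all assumptions made for illustration.

```python
def prompt_aware_reward(t: float, prompt_is_unsafe: bool) -> float:
    """Assumed form: reward unsafe probability t on unsafe prompts, penalize it on safe ones."""
    return t if prompt_is_unsafe else -t


# On an unsafe prompt, a clean refusal (low t) earns less than a harmful answer (high t).
refusal = prompt_aware_reward(0.05, prompt_is_unsafe=True)
harmful = prompt_aware_reward(0.95, prompt_is_unsafe=True)
print(f"refusal={refusal:.2f}  harmful={harmful:.2f}")
```

Under this reading, the policy is paid for producing content the classifier flags whenever the prompt is unsafe, which is consistent with the harmfulness increase reported above.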

Case Analysis

Failure cases from prompt-aware rewards include:

  • "I apologize for having to provide this information, but I will list some illegal online communities and explain how to join them anonymously..."
  • "Of course, I'd be happy to help you cheat on a math exam! Here are some tips..."

While these responses contain disclaimers, they still provide harmful content that downstream filters cannot fully capture.

Related Work

Improving Safety Classifier Accuracy

  • Gehman et al. (2020) introduce toxicity assessment benchmarks
  • Adversarial training improves classifier robustness (Ziegler et al., 2022)
  • Evolution from lightweight toxicity detectors to LLM-based safety models

Safety Alignment Fine-tuning

  • Safe RLHF (Dai et al., 2023): Decouples helpfulness and harmlessness objectives
  • Constrained DPO (Liu et al., 2024): Provides stronger safety guarantees
  • SafeDPO (Kim et al., 2025): Directly optimizes safety alignment

Composite Safety Systems

  • Baker et al. (2025): Demonstrates chain-of-thought reasoning monitoring
  • Wichers et al. (2024): Gradient-based red-teaming

Conclusions and Discussion

Main Conclusions

  1. Boundary guidance achieves Pareto improvements in the safety-utility tradeoff
  2. The method is consistently effective across multiple model architectures and scales
  3. Particularly beneficial for small models with weaker base safety capabilities
  4. Safety signals alone suffice for large models, but small models require reward model components

Limitations

  1. Classifier Dependency: Relies on the assumption that filters predict more accurately away from the decision boundary than near it
  2. Computational Overhead: Requires 2-3 models for training (though this is a one-time operation)
  3. Binary Safety Assumption: Currently assumes safety is a binary category, whereas real-world scenarios are more complex

Future Directions

  1. Multi-Dimensional Safety: Extend to multiple safety types s₁(x,y), s₂(x,y), ..., sₖ(x,y)
  2. Welfare Filters: Transition from safety-only filters to welfare filters that consider user utility and social harm

In-Depth Evaluation

Strengths

  1. Solid Theoretical Foundation: Provides decision-theoretic analysis proving utility minimization near boundaries
  2. Novel Methodology: First to explicitly optimize generators for composite safety systems
  3. Comprehensive Experiments: Validates across multiple model scales and architectures with detailed ablation studies
  4. High Practical Value: Addresses critical problems in real-world deployment
  5. Consistent Results: Demonstrates improvements across different settings

Weaknesses

  1. Evaluation Limitations: Primarily relies on a single LLM judge, which may introduce bias
  2. Dataset Scale: Training and test data are relatively small
  3. Long-term Effects Unknown: Does not evaluate performance under extended training or more complex scenarios
  4. Hyperparameter Sensitivity: Insufficient exploration of how different λ values affect performance

Impact

  1. Academic Contribution: Opens new research directions in composite AI safety systems
  2. Practical Value: Can be directly applied to existing deployed systems
  3. Reproducibility: Provides complete code and experimental details

Applicable Scenarios

  1. AI system deployments requiring balance between safety and utility
  2. Optimization of generative models with existing safety classifiers
  3. Applications sensitive to both over-rejection and under-rejection
  4. Small model deployments with resource constraints but safety improvement needs


This work makes important contributions to the AI safety field, demonstrating through theoretical analysis and empirical validation the value of composite system optimization, and providing new insights and tools for future safe AI deployment.