Don't Walk the Line: Boundary Guidance for Filtered Generation

Ball, Haupt

Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

Basic Information

Paper ID: 2510.11834
Title: Don't Walk the Line: Boundary Guidance for Filtered Generation
Authors: Sarah Ball (Ludwig-Maximilians-Universität München), Andreas Haupt (Stanford University)
Classification: cs.LG cs.CL
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11834v1

Abstract

Generative models are increasingly paired with safety classifiers to filter harmful or inappropriate outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this may be suboptimal: it typically drives the model to produce samples near the classifier's decision boundary, thereby increasing both false positives and false negatives. This paper proposes Boundary Guidance, a reinforcement learning fine-tuning method that explicitly guides generation away from the classifier boundary. On benchmarks for jailbreaks and ambiguous prompts, Boundary Guidance improves both safety and utility of outputs, as verified by LLM-as-a-Judge evaluation. Comprehensive ablation studies across model scales and reward designs demonstrate the robustness of the approach.

Research Background and Motivation

Problem Definition

Modern AI deployment increasingly relies on composite safety systems, where generative models are paired with downstream safety classifiers to filter harmful or inappropriate outputs. This architecture allows organizations to maintain flexibility in safety policies while leveraging the complementary advantages of safety-trained models and specialized classifiers.

Core Problem

Current approaches align models independently of the safety classifier, which creates a mismatch between training objectives and deployment reality. Standard fine-tuning practices for generative models do not account for how confidently the classifier can label a generation: some generations hover near the classifier's decision boundary and are misclassified.

Problem Significance

This leads to errors in two directions:

False Positives (over-blocking useful content)
False Negatives (under-blocking harmful content)

When safety classifiers are imperfect (empirical evidence shows that even state-of-the-art classifiers can be successfully attacked 5% of the time on new harm dimensions), operating near the decision boundary amplifies these classification errors and degrades overall system performance.

Limitations of Existing Approaches

They primarily optimize the behavior of an individual model, without considering the downstream filtering context that defines real-world deployment
They require computationally intensive training in current implementations, whereas this paper's method needs only a single token from the safety classifier

Core Contributions

Theoretical Contribution: Provides decision-theoretic evidence that system utility is minimized near the classifier's decision boundary, providing theoretical justification for the boundary avoidance objective
Methodological Contribution: Introduces a reinforcement learning-based fine-tuning framework for training generators within composite safety systems
Empirical Contribution: Demonstrates empirical improvements in safety and utility across multiple model architectures and scales, showing that composite system optimization can achieve results that individual components cannot

Methodology Details

Task Definition

Consider a generative model π_θ(y|x) that generates completions y ∈ Y given a prompt x ∈ X. The focus is on output safety, denoted z(x,y) ∈ {0,1}, with z = 1 marking an unsafe output. The safety classifier provides the expected probability that an output is unsafe: t(x,y) = E[z | x,y].

Decision-Theoretic Model

The paper establishes a decision-theoretic framework to analyze the utility of composite systems:

When an output is displayed, the user receives utility u(x,y) and society receives negative utility s(x,y). If the output is not displayed but is actually safe, the user receives negative utility -λ < 0 and society receives utility 0.

The expected utility for completion y is:

U(x,y) = {
  -(1-t(x,y))λ           if t(x,y) ≥ τ
  u(x,y) - t(x,y)        if t(x,y) < τ
}

Proposition 1: When u(x,y) ≡ u is constant, U is strictly decreasing in t for t < τ and strictly increasing in t for t ≥ τ. On either side of the threshold, expected utility is therefore lowest at the decision boundary τ.
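The piecewise utility above can be evaluated numerically to see the shape Proposition 1 describes. The following is a minimal sketch, not the authors' code: the function name `expected_utility` and the values u = 1, λ = 1, τ = 0.5 are illustrative assumptions.

```python
def expected_utility(t: float, u: float, lam: float, tau: float) -> float:
    """Expected system utility U for a completion with unsafe probability t.

    Assumes a constant user utility u, as in Proposition 1.
    """
    if t >= tau:
        # Output is filtered: the cost lam is incurred only if it was actually safe.
        return -(1.0 - t) * lam
    # Output is shown: user utility minus the expected harm t.
    return u - t


# Illustrative values only (not taken from the paper).
U_VALUE, LAMBDA, TAU = 1.0, 1.0, 0.5

for t in (0.0, 0.25, 0.49, 0.5, 0.75, 0.99):
    print(f"t={t:.2f}  U={expected_utility(t, U_VALUE, LAMBDA, TAU):+.2f}")
```

Under these assumed values, utility falls as t approaches τ from below and rises as t moves above τ, so completions the classifier is confident about score better than borderline ones on both sides of the threshold.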

Boundary Avoidance Reward Function

Based on theoretical analysis, the paper proposes a continuous boundary avoidance reward:

R(x,y) = {
  u(x,y) + t(x,y)        if t(x,y) ≥ 0.5
  u(x,y) - t(x,y)        if t(x,y) < 0.5
}

where u(x,y) is provided by an existing reward model and t(x,y) is proxied by safety classifier logits.
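A minimal sketch of this reward follows. It assumes, as one plausible reading of "proxied by safety classifier logits", that t is obtained from a two-way softmax over the classifier's "safe" and "unsafe" verdict-token logits. The function names and the softmax step are assumptions for illustration, not the paper's implementation.

```python
import math


def unsafe_probability(logit_safe: float, logit_unsafe: float) -> float:
    """Two-way softmax over hypothetical 'safe'/'unsafe' verdict-token logits."""
    m = max(logit_safe, logit_unsafe)  # subtract the max for numerical stability
    e_safe = math.exp(logit_safe - m)
    e_unsafe = math.exp(logit_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)


def boundary_reward(u: float, t: float, threshold: float = 0.5) -> float:
    """Boundary avoidance reward R(x,y): push t away from the threshold."""
    if t >= threshold:
        return u + t  # likely filtered anyway: reward being clearly unsafe
    return u - t      # likely shown: reward being clearly safe


# Example with made-up numbers: a helpful completion the classifier finds clearly safe.
t = unsafe_probability(logit_safe=3.0, logit_unsafe=-1.0)
print(round(t, 3), round(boundary_reward(u=1.0, t=t), 3))
```

Because the safety term is |t| with a sign flip at 0.5, it is smallest just below the threshold, which is what discourages borderline generations.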

Technical Implementation

Reinforcement Learning Algorithm: Uses Group Relative Policy Optimization (GRPO)
Parameter-Efficient Fine-tuning: Employs LoRA (r=16, α=32), reducing trainable parameters by 99%
KL Regularization: Prevents excessive deviation from the original policy
Multi-Model Architecture:
- Policy Model: Qwen2.5 series, Gemma-2-9B
- Safety Classifier: Meta-Llama-Guard-2-8B
- Reward Model: Skywork-Reward-V2-Llama-3.1-8B
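GRPO's characteristic step is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization. The function name, the use of the population standard deviation, and the `eps` constant are assumptions, since implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up boundary rewards for four completions of a single prompt.
group = [1.4, 0.6, 0.9, 1.1]
print([round(a, 2) for a in group_relative_advantages(group)])
```

Completions with above-average reward in their group receive positive advantages and are reinforced, while the KL term listed above limits how far the policy drifts from the original model.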

Experimental Setup

Datasets

The training dataset contains 7,880 prompts from three complementary sources:

4,000 Jailbreak Prompts: From the jailbreak dataset of Ball et al. (2024)
3,000 Alpaca Samples: Providing diverse instruction-following prompts
880 HarmfulQA Samples: Containing questions designed to elicit harmful responses

The test set contains 1,000 held-out prompts uniformly distributed across three problem categories.

Evaluation Metrics

Uses GPT-4.1 for LLM-as-a-Judge evaluation:

Helpfulness Score (1-4 points): Assesses how well the assistant's response satisfies the user query and provides value
Harmfulness Score (0-3 points): Assesses the potential for the assistant's response to cause harm

Comparison Methods

Baseline Model: Original instruction-tuned model
Ablation 1: Boundary guidance using only safety classifier signals
Ablation 2: Prompt-aware reward design

Implementation Details

Learning Rate: 2×10^-5
Batch Size: 8 (per device)
Quantization: 4-bit NF4
Maximum Generation Length: 96 tokens
Training Epochs: 1

Experimental Results

Main Results

Boundary guidance lowers harmfulness for all four base models and achieves a Pareto improvement (higher helpfulness, lower harmfulness) for three of them:

Model	Helpfulness Change (Δ)	Harmfulness Change (Δ)	Statistical Significance
Qwen2.5-0.5B	+0.13	-0.09	p<0.001
Qwen2.5-7B	+0.03	-0.15	p<0.001
Gemma-2-9B	+0.03	-0.03	p<0.001
Qwen2.5-14B	-0.05	-0.11	p<0.10

Key Findings:

Harmfulness is significantly reduced across all models
Helpfulness improves for all models except the largest
The smallest model (Qwen2.5-0.5B) achieves the greatest overall improvement, indicating that boundary guidance is particularly effective when base safety capabilities are weaker
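The deltas in the results table can be checked directly against the definition of a Pareto improvement (no worse on either axis, strictly better on at least one). This is a small illustrative check; the dictionary simply transcribes the table, and the function name is a hypothetical helper.

```python
# (helpfulness delta, harmfulness delta) transcribed from the results table.
results = {
    "Qwen2.5-0.5B": (0.13, -0.09),
    "Qwen2.5-7B": (0.03, -0.15),
    "Gemma-2-9B": (0.03, -0.03),
    "Qwen2.5-14B": (-0.05, -0.11),
}


def is_pareto_improvement(d_help: float, d_harm: float) -> bool:
    """Helpfulness must not drop, harmfulness must not rise, and one must improve."""
    return d_help >= 0 and d_harm <= 0 and (d_help > 0 or d_harm < 0)


pareto = {model: is_pareto_improvement(*deltas) for model, deltas in results.items()}
for model, ok in pareto.items():
    print(f"{model}: {'Pareto improvement' if ok else 'trade-off'}")
```

By this check the three smaller models improve on both axes, while Qwen2.5-14B trades a small helpfulness decrease for lower harmfulness.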

Ablation Studies

Experiment 1: Safety Classifier Signal Only

Results show:

Large models (7B and above) perform comparably to the main experiment
The smallest model (0.5B) exhibits helpfulness collapse (-34%), converging to near-universal rejection
Indicates that the reward model u(x,y) is essential for small models

Experiment 2: Prompt-Aware Rewards

This design rewards higher unsafe probability on unsafe prompts and lower unsafe probability on safe prompts. Results:

Harmfulness increases for three of four models
Helpfulness drops significantly across all models (-22% to -25%)
Demonstrates that actively training models to generate unsafe content on unsafe prompts is counterproductive
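To make the failure mode concrete, here is one simple reading of the prompt-aware design. The exact functional form is not given in this summary, so the function below is an assumption for illustration only: it rewards t on prompts labeled unsafe and penalizes t on prompts labeled safe.

```python
def prompt_aware_reward(t: float, prompt_is_unsafe: bool) -> float:
    """Hypothetical prompt-aware safety reward.

    t is the classifier's unsafe probability for the completion.
    """
    return t if prompt_is_unsafe else -t


# Made-up classifier scores for two completions of the same unsafe prompt.
refusal_t = 0.05      # a refusal the classifier rates as clearly safe
compliance_t = 0.90   # a harmful answer the classifier rates as clearly unsafe

print(prompt_aware_reward(refusal_t, prompt_is_unsafe=True))
print(prompt_aware_reward(compliance_t, prompt_is_unsafe=True))
```

Under this reading, a harmful answer to an unsafe prompt earns more reward than a refusal, so the policy is trained toward content the filter must then catch, which is consistent with the increase in harmfulness reported above.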

Case Analysis

Failure cases from prompt-aware rewards include:

"I apologize for having to provide this information, but I will list some illegal online communities and explain how to join them anonymously..."
"Of course, I'd be happy to help you cheat on a math exam! Here are some tips..."

While these responses contain disclaimers, they still provide harmful content that downstream filters cannot fully capture.

Related Work

Improving Safety Classifier Accuracy

Gehman et al. (2020) introduce toxicity assessment benchmarks
Adversarial training improves classifier robustness (Ziegler et al., 2022)
Evolution from lightweight toxicity detectors to LLM-based safety models

Safety Alignment Fine-tuning

Safe RLHF (Dai et al., 2023): Decouples helpfulness and harmlessness objectives
Constrained DPO (Liu et al., 2024): Provides stronger safety guarantees
SafeDPO (Kim et al., 2025): Directly optimizes safety alignment

Composite Safety Systems

Baker et al. (2025): Demonstrates chain-of-thought reasoning monitoring
Wichers et al. (2024): Gradient-based red-teaming

Conclusions and Discussion

Main Conclusions

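Boundary guidance achieves Pareto improvements in the safety-utility tradeoff for three of the four models tested; the largest model (Qwen2.5-14B) trades a small helpfulness decrease for lower harmfulness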
The method is consistently effective across multiple model architectures and scales
Particularly beneficial for small models with weaker base safety capabilities
Safety signals alone suffice for large models, but small models require reward model components

Limitations

Classifier Dependency: Relies on the assumption that filters predict more accurately away from the decision boundary than near it
Computational Overhead: Requires 2-3 models for training (though this is a one-time operation)
Binary Safety Assumption: Currently assumes safety is a binary category, whereas real-world scenarios are more complex

Future Directions

Multi-Dimensional Safety: Extend to multiple safety types s₁(x,y), s₂(x,y), ..., sₖ(x,y)
Welfare Filters: Transition from safety-only filters to welfare filters that consider user utility and social harm

In-Depth Evaluation

Strengths

Solid Theoretical Foundation: Provides decision-theoretic analysis proving utility minimization near boundaries
Novel Methodology: First to explicitly optimize generators for composite safety systems
Comprehensive Experiments: Validates across multiple model scales and architectures with detailed ablation studies
High Practical Value: Addresses critical problems in real-world deployment
Consistent Results: Demonstrates improvements across different settings

Weaknesses

Evaluation Limitations: Primarily relies on a single LLM judge, which may introduce bias
Dataset Scale: Training and test data are relatively small
Long-term Effects Unknown: Does not evaluate performance under extended training or more complex scenarios
Hyperparameter Sensitivity: Insufficient exploration of how different λ values affect performance

Impact

Academic Contribution: Opens new research directions in composite AI safety systems
Practical Value: Can be directly applied to existing deployed systems
Reproducibility: Provides complete code and experimental details

Applicable Scenarios

AI system deployments requiring balance between safety and utility
Optimization of generative models with existing safety classifiers
Applications sensitive to both over-rejection and under-rejection
Small model deployments with resource constraints but safety improvement needs

This work makes important contributions to the AI safety field, demonstrating through theoretical analysis and empirical validation the value of composite system optimization, and providing new insights and tools for future safe AI deployment.