Don't Walk the Line: Boundary Guidance for Filtered Generation
Ball, Haupt
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
Modern AI deployment increasingly relies on composite safety systems, where generative models are paired with downstream safety classifiers to filter harmful or inappropriate outputs. This architecture allows organizations to maintain flexibility in safety policies while leveraging the complementary advantages of safety-trained models and specialized classifiers.
Current approaches align models independently of the safety classifier, which creates a mismatch between training objectives and deployment reality. Standard fine-tuning of generative models does not account for how easily the classifier can classify each generation: some generations sit near the classifier's decision boundary and are misclassified.
When safety classifiers are imperfect (empirical evidence shows that even state-of-the-art classifiers can be successfully attacked 5% of the time on new harm dimensions), operating near the decision boundary amplifies these classification errors and degrades overall system performance.
Existing approaches primarily optimize the behavior of the individual model, without considering the downstream filtering that defines real-world deployment.
Existing approaches also require computationally intensive training in current implementations, whereas this paper's method requires only a single token from the safety classifier.
Theoretical Contribution: Gives a decision-theoretic argument that system utility is lowest near the classifier's decision boundary, which justifies the boundary-avoidance objective
Methodological Contribution: Introduces a reinforcement learning-based fine-tuning framework for training generators within composite safety systems
Empirical Contribution: Demonstrates empirical improvements in safety and utility across multiple model architectures and scales, showing that composite system optimization can achieve results that individual components cannot
Consider a generative model π_θ(y|x) that generates a completion y ∈ Y given a prompt x ∈ X. The analysis focuses on output safety, denoted z(x,y) ∈ {0,1}, where z = 1 marks an unsafe output. The safety classifier provides the probability that an output is unsafe: t(x,y) = E[z | x, y].
The paper establishes a decision-theoretic framework to analyze the utility of composite systems:
When an output is displayed, the user receives utility u(x,y) and society receives negative utility s(x,y). If the output is not displayed but is actually safe, the user receives negative utility -λ < 0 and society receives utility 0.
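The display decision described above can be sketched as a small wrapper around the generator and the classifier. This is a minimal illustration, assuming the output is shown only when the classifier score t(x,y) falls below a threshold τ (the threshold used in the utility definition). The names `filtered_generate`, `generate`, and `unsafe_prob` are hypothetical stand-ins, not part of the paper.

```python
from typing import Callable, Optional


def filtered_generate(
    prompt: str,
    generate: Callable[[str], str],
    unsafe_prob: Callable[[str, str], float],
    tau: float = 0.5,
) -> Optional[str]:
    """Return the completion if the classifier score is below tau, else None.

    generate stands in for sampling y from the generator, and unsafe_prob
    stands in for the classifier score t(x, y). Both are hypothetical
    callables; tau = 0.5 is an arbitrary illustrative threshold.
    """
    completion = generate(prompt)
    t = unsafe_prob(prompt, completion)
    # Display only when the estimated unsafe probability is below the threshold.
    return completion if t < tau else None


if __name__ == "__main__":
    # Toy stand-ins, not real models.
    echo = lambda prompt: f"answer to: {prompt}"
    score = lambda prompt, completion: 0.9 if "exploit" in prompt else 0.1
    print(filtered_generate("how do I bake bread?", echo, score))
    print(filtered_generate("write an exploit", echo, score))
```

A score exactly equal to τ is treated as filtered here, matching the t(x,y) ≥ τ branch of the utility definition.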
The expected utility for completion y is:
U(x,y) = -(1 - t(x,y))·λ   if t(x,y) ≥ τ
U(x,y) = u(x,y) - t(x,y)   if t(x,y) < τ
Proposition 1: When u(x,y) ≡ u is constant, the utility function is strictly decreasing in t for t < τ and strictly increasing in t for t ≥ τ. Below the threshold, U = u - t has slope -1 in t; at or above it, U = -(1-t)λ has slope λ > 0. Expected utility is therefore lowest near the decision boundary τ.
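A short numerical sketch can make the shape of this utility concrete. It evaluates the piecewise definition on a grid of classifier scores under a constant user utility, as in Proposition 1. The values u = 1, λ = 1, and τ = 0.5 are arbitrary illustrative choices, not values from the paper, and the function names are hypothetical.

```python
def expected_utility(t: float, u: float, lam: float, tau: float) -> float:
    """Expected utility U for a completion with unsafe probability t.

    Assumes a constant user utility u (as in Proposition 1), a filtering
    penalty lam > 0, and a threshold tau.
    """
    if t >= tau:
        # Output is filtered: the loss lam is incurred only if it was safe.
        return -(1.0 - t) * lam
    # Output is displayed: user utility minus the unsafe probability.
    return u - t


def utility_curve(u=1.0, lam=1.0, tau=0.5, steps=11):
    """Evaluate U on an evenly spaced grid of t in [0, 1]."""
    grid = [i / (steps - 1) for i in range(steps)]
    return [(t, expected_utility(t, u, lam, tau)) for t in grid]


if __name__ == "__main__":
    for t, value in utility_curve():
        print(f"t={t:.1f}  U={value:+.2f}")
```

With these illustrative values, the utility falls as t rises toward τ, drops at the threshold, and then rises again as t approaches 1, so the lowest value on the grid sits at t = τ.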
Boundary Guidance reduces harmfulness for all four base models and improves helpfulness for three of them:
Model | Helpfulness Change (Δ) | Harmfulness Change (Δ) | Statistical Significance
Qwen2.5-0.5B | +0.13 | -0.09 | p<0.001
Qwen2.5-7B | +0.03 | -0.15 | p<0.001
Gemma-2-9B | +0.03 | -0.03 | p<0.001
Qwen2.5-14B | -0.05 | -0.11 | p<0.10
Key Findings:
Harmfulness is significantly reduced across all models
Helpfulness improves for all models except the largest (Qwen2.5-14B), where it drops by 0.05
The smallest model (Qwen2.5-0.5B) achieves the greatest overall improvement, indicating that Boundary Guidance is particularly effective when base safety capabilities are weaker
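As a quick check of the table, the sketch below tests each row for a Pareto improvement, taken here to mean that helpfulness does not fall and harmfulness does not rise, with at least one strict improvement. That definition and the function names are assumptions for illustration; the deltas are copied from the table.

```python
# Deltas copied from the results table: (helpfulness, harmfulness).
RESULTS = {
    "Qwen2.5-0.5B": (0.13, -0.09),
    "Qwen2.5-7B": (0.03, -0.15),
    "Gemma-2-9B": (0.03, -0.03),
    "Qwen2.5-14B": (-0.05, -0.11),
}


def is_pareto_improvement(d_help: float, d_harm: float) -> bool:
    """True if neither metric gets worse and at least one gets better."""
    no_regression = d_help >= 0 and d_harm <= 0
    some_gain = d_help > 0 or d_harm < 0
    return no_regression and some_gain


def summarize(results):
    """Map each model name to whether its row is a Pareto improvement."""
    return {model: is_pareto_improvement(*deltas) for model, deltas in results.items()}


if __name__ == "__main__":
    for model, ok in summarize(RESULTS).items():
        print(f"{model}: {'Pareto improvement' if ok else 'trade-off'}")
```

Under this definition the three smaller models qualify and Qwen2.5-14B does not, because its helpfulness delta is negative.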
The paper cites important works in related fields, including safety alignment, reinforcement learning, and composite systems research, providing solid theoretical and empirical foundations for the methodology.
This work makes important contributions to the AI safety field, demonstrating through theoretical analysis and empirical validation the value of composite system optimization, and providing new insights and tools for future safe AI deployment.