2025-11-23T07:10:16.507917

CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models

Rychkovskiy, GPT-5
We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.

Basic Information

  • Paper ID: 2510.12954
  • Title: CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models
  • Authors: Denis Rychkovskiy ("DZRobo", Independent Researcher), GPT-5 (AI collaborator and co-author, OpenAI)
  • Classification: cs.CV (primary), cs.LG (secondary)
  • Publication Date: October 11, 2025
  • Paper Link: https://arxiv.org/abs/2510.12954

Abstract

This paper proposes CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The core module, ZeResFDG, unifies three techniques: (1) frequency-decoupled guidance, which reweights the low- and high-frequency components of the guidance signal; (2) energy rescaling, which matches the per-sample magnitude of the guided prediction to the positive (conditional) branch; and (3) zero projection, which removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative mode and a detail-seeking mode as structure crystallizes during sampling. The method improves sharpness, prompt adherence, and artifact control at moderate guidance scales without retraining.

Research Background and Motivation

Core Problem

Latent diffusion models (such as SD/SDXL), while capable of generating high-fidelity images, exhibit quality degradation at large classifier-free guidance (CFG) scales, manifesting as oversaturation, color shifts, or texture artifacts. Reducing CFG to avoid these effects often sacrifices clarity and prompt adherence.

Problem Significance

This issue directly impacts the quality of diffusion models in practical applications. Users face a trade-off between image clarity/prompt adherence and artifact control, which limits model utility.

Limitations of Existing Methods

Existing solutions include:

  • Attention-based guidance (SAG/PAG)
  • Schedule-aware or interval-restricted guidance
  • Rescaling heuristics widely used in practice

While these methods show some effectiveness, they lack a unified framework to simultaneously address frequency component processing, energy matching, and directional drift issues.

Research Motivation

This work aims to provide a compact sampler-level solution that reshapes the guidance signal itself to address the issues above while remaining training-free.

Core Contributions

  1. Proposed the unified ZeResFDG framework: Combines frequency decoupling, energy rescaling, and zero projection in a single guidance rule
  2. Designed an adaptive mode-switching mechanism: Switches between conservative and detail-seeking modes using a spectral EMA with hysteresis
  3. Developed QSilk Micrograin Stabilizer: A training-free inference-time stabilizer that improves robustness and produces natural microtextures at high resolution
  4. Implemented a plug-and-play sampler wrapper: Integrates into existing SD/SDXL pipelines without retraining
  5. Noted cross-parameterization compatibility: The same rule is compatible with alternative parameterizations (e.g., velocity), which the paper discusses briefly in the Appendix

Methodology Details

Task Definition

Given conditional prediction y_c and unconditional prediction y_u, standard CFG forms y_cfg = y_u + s(y_c - y_u), where s > 0 is the guidance scale. The objective is to reduce artifacts at high CFG scales while maintaining prompt adherence.
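The CFG formula above can be written as a one-line function. This is a minimal NumPy sketch for illustration only; the function name `cfg` and the toy vectors are assumptions, not code from the paper.

```python
import numpy as np

def cfg(y_c, y_u, s):
    """Standard classifier-free guidance: y_u + s * (y_c - y_u)."""
    return y_u + s * (y_c - y_u)

# Toy predictions (hypothetical values, not model outputs).
y_u = np.array([0.0, 1.0, 2.0])
y_c = np.array([1.0, 1.0, 0.0])

print(cfg(y_c, y_u, 1.0))  # s = 1 reproduces the conditional prediction
print(cfg(y_c, y_u, 4.5))  # larger s extrapolates further along y_c - y_u
```

At s = 1 the result equals y_c; larger scales push the prediction further along y_c - y_u, which is where the artifacts described above originate.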

Model Architecture

1. Frequency-Decoupled Guidance (FDG)

Decomposes the original guidance Δ = y_c - y_u into low-frequency and high-frequency components via Gaussian low-pass filter G_σ:

  • Δ_ℓ = G_σ * Δ (low-frequency component)
  • Δ_h = Δ - Δ_ℓ (high-frequency component)
  • Reweighting: Δ̃ = λ_ℓΔ_ℓ + λ_hΔ_h, where λ_ℓ ∈ [0,1] and λ_h ≳ 1
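The decomposition and reweighting above can be sketched in NumPy as follows. The separable Gaussian blur with reflect padding, the 3σ kernel radius, and all function names are illustrative assumptions; the paper's actual filter implementation may differ.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1-D Gaussian kernel, truncated at about 3 sigma (assumed)."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_lowpass(x, sigma):
    """Separable Gaussian blur over the last two axes, reflect padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    out = x.astype(np.float64)
    for axis in (-2, -1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def fdg_delta(y_c, y_u, sigma=1.0, lam_lo=0.6, lam_hi=1.3):
    """Reweighted guidance: lam_lo * low band + lam_hi * high band."""
    delta = y_c - y_u
    d_lo = gaussian_lowpass(delta, sigma)  # low-frequency component
    d_hi = delta - d_lo                    # high-frequency component
    return lam_lo * d_lo + lam_hi * d_hi

rng = np.random.default_rng(0)
y_c = rng.normal(size=(1, 4, 16, 16))  # toy latent-shaped tensors
y_u = rng.normal(size=(1, 4, 16, 16))
print(fdg_delta(y_c, y_u).shape)
```

Setting both weights to 1 recovers the plain guidance Δ, since the two bands sum to Δ by construction.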

2. Energy Rescaling (RescaleCFG)

After forming y_cfg = y_u + sΔ̃, rescale to match per-sample standard deviation of y_c:

y_res = α · Rescale(y_cfg, std(y_c)) + (1-α)y_cfg

where α ∈ [0,1] is the blending coefficient.
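A minimal sketch of this step, assuming that Rescale multiplies y_cfg by the ratio of per-sample standard deviations (the common RescaleCFG heuristic). The function name and the small `eps` guard are assumptions added for illustration.

```python
import numpy as np

def rescale_cfg(y_cfg, y_c, alpha=0.7, eps=1e-8):
    """Blend y_cfg with a copy whose per-sample std matches that of y_c."""
    axes = tuple(range(1, y_cfg.ndim))            # all non-batch axes
    std_c = y_c.std(axis=axes, keepdims=True)
    std_cfg = y_cfg.std(axis=axes, keepdims=True)
    y_matched = y_cfg * (std_c / (std_cfg + eps))  # assumed form of Rescale
    return alpha * y_matched + (1.0 - alpha) * y_cfg

rng = np.random.default_rng(0)
y_c = rng.normal(size=(2, 4, 8, 8))
y_cfg = 3.0 * y_c + rng.normal(size=y_c.shape)     # toy over-amplified prediction
y_res = rescale_cfg(y_cfg, y_c)
print(y_cfg.std(axis=(1, 2, 3)), y_res.std(axis=(1, 2, 3)))
```

With α = 1 the output standard deviation equals that of y_c for each sample; with α = 0 the input passes through unchanged.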

3. Zero Projection (CFGZero)

To suppress leakage along the unconditional direction, compute:

  • α_∥ = ⟨y_c, y_u⟩/⟨y_u, y_u⟩
  • Use residual r = y_c - α_∥y_u as the guidance signal
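The projection above amounts to one inner-product ratio per sample. The sketch below assumes that inner products are taken over all non-batch axes; the function name and the `eps` guard are illustrative.

```python
import numpy as np

def zero_project(y_c, y_u, eps=1e-8):
    """Return r = y_c - a_par * y_u and the per-sample coefficient a_par."""
    axes = tuple(range(1, y_c.ndim))               # all non-batch axes
    dot_cu = (y_c * y_u).sum(axis=axes, keepdims=True)
    dot_uu = (y_u * y_u).sum(axis=axes, keepdims=True)
    a_par = dot_cu / (dot_uu + eps)
    return y_c - a_par * y_u, a_par

rng = np.random.default_rng(0)
y_c = rng.normal(size=(2, 4, 8, 8))
y_u = rng.normal(size=(2, 4, 8, 8))
r, a_par = zero_project(y_c, y_u)
print((r * y_u).sum(axis=(1, 2, 3)))  # close to zero for each sample
```

The residual is orthogonal to y_u by construction, so no part of the guidance signal points along the unconditional direction.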

4. Adaptive Mode Switching

Monitor the high-frequency ratio r_HF = ∥Δ_h∥²/(∥Δ_ℓ∥² + ∥Δ_h∥²) and track its exponential moving average (EMA) ρ. Two thresholds (τ_lo, τ_hi) with hysteresis switch between the conservative mode (CFGZeroFD) and the detail-seeking mode (RescaleFDG); while ρ lies between the thresholds, the current mode is kept.
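The switching logic can be sketched as a small state machine. Several details here are assumptions that this summary does not specify: the EMA update form ρ ← βρ + (1 − β)·r_HF, the initial mode, and the mapping of a high ratio to the conservative mode. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModeSwitch:
    beta: float = 0.8            # EMA coefficient
    tau_lo: float = 0.45         # lower hysteresis threshold
    tau_hi: float = 0.60         # upper hysteresis threshold
    rho: Optional[float] = None  # EMA of the high-frequency ratio
    mode: str = "CFGZeroFD"      # assumed initial mode (conservative)

    def update(self, e_lo: float, e_hi: float) -> str:
        """e_lo, e_hi: squared norms of the low/high guidance bands."""
        r_hf = e_hi / (e_lo + e_hi + 1e-12)
        if self.rho is None:
            self.rho = r_hf
        else:
            self.rho = self.beta * self.rho + (1.0 - self.beta) * r_hf
        # Assumed direction: high ratio -> conservative, low -> detail-seeking.
        if self.rho >= self.tau_hi:
            self.mode = "CFGZeroFD"
        elif self.rho <= self.tau_lo:
            self.mode = "RescaleFDG"
        # Between the thresholds the previous mode is kept (hysteresis).
        return self.mode

sw = ModeSwitch()
energies = [(1, 9), (9, 1), (9, 1), (9, 1), (9, 1), (1, 9)]
modes = [sw.update(lo, hi) for lo, hi in energies]
print(modes)
```

In this toy sequence the mode does not flip while ρ sits between the two thresholds, which is the point of using two thresholds instead of one.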

QSilk Micrograin Stabilizer

1. Per-Step Quantile Clamping (QClamp)

After each denoising step, apply per-sample quantile clamping to the denoised tensor, restricting values to the (0.1%, 99.9%) quantile range.
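A minimal NumPy sketch of per-sample quantile clamping, assuming the quantiles are computed over all non-batch elements of each sample. The function name `qclamp` is illustrative.

```python
import numpy as np

def qclamp(x, q_lo=0.001, q_hi=0.999):
    """Clamp each sample to its own [q_lo, q_hi] quantile range."""
    flat = x.reshape(x.shape[0], -1)
    lo = np.quantile(flat, q_lo, axis=1)
    hi = np.quantile(flat, q_hi, axis=1)
    shape = (-1,) + (1,) * (x.ndim - 1)  # broadcast bounds over non-batch axes
    return np.clip(x, lo.reshape(shape), hi.reshape(shape))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 32, 32))
x[0, 0, 0, 0] = 1e3                      # inject a single outlier
out = qclamp(x)
print(x.max(), out.max())
```

Only the extreme tails are touched, so a single outlier is pulled back while the bulk of the tensor is unchanged.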

2. Late-Stage Micrograin Injection

In later steps, add small high-frequency residuals:

x'_img = x_img + α(t) · g_edge · g_depth · (x_img - G_σ(x_img))

where α(t) is the step-dependent injection strength (not the rescaling blend α above), and g_edge and g_depth are the edge and depth gating functions, respectively.
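The injection formula can be sketched as follows. This summary does not say how the gates or the strength schedule are computed, so the sketch takes them as inputs: `g_edge` and `g_depth` are assumed to be H×W maps in [0, 1] and `alpha_t` a scalar for the current step. The Gaussian blur helper and all names are illustrative.

```python
import numpy as np

def blur(img, sigma=1.0):
    """Separable Gaussian blur over the first two axes of an H x W x C image."""
    r = int(3 * sigma + 0.5)
    t = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    k /= k.sum()
    out = img.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def inject_micrograin(x_img, g_edge, g_depth, alpha_t, sigma=1.0):
    """x' = x + alpha_t * g_edge * g_depth * (x - G_sigma(x))."""
    hf = x_img - blur(x_img, sigma)       # high-frequency residual
    gate = (g_edge * g_depth)[..., None]  # broadcast the gate over channels
    return x_img + alpha_t * gate * hf

rng = np.random.default_rng(0)
img = rng.random((16, 16, 3))             # toy image, not a decoded sample
ones = np.ones((16, 16))
out = inject_micrograin(img, ones, ones, alpha_t=0.1)
print(np.abs(out - img).max())
```

Where either gate is zero, or where the image is locally flat, the output equals the input, so the added texture is confined to gated, detailed regions.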

Technical Innovations

  1. Unified Framework Design: Combines three distinct guidance improvement techniques in a single framework
  2. Adaptive Switching Mechanism: Switches modes based on a spectral statistic (the EMA of the high-frequency ratio), adapting to structural changes during sampling
  3. Training-Free Design: All components are applied at inference time without model retraining
  4. Frequency-Aware Processing: Handles low- and high-frequency components separately, protecting global structure while enhancing details

Experimental Setup

Dataset

Experiments use the SDXL model at 672×944 resolution with final output at 3688×5192. Testing includes different SDXL models targeting photography and anime styles.

Evaluation Metrics

Primarily qualitative assessment focusing on:

  • Portrait quality (eyes, hair, skin tone)
  • Hand details (fingers, nails)
  • High-frequency textures (skin microtextures)

Experimental Configuration

  • Sampler: Euler (anime) / UniPC (photography)
  • Steps: 25
  • CFG: 4.5
  • Denoising strength: 0.65

Implementation Details

Default parameters:

  • σ = 1.0 (Gaussian separation)
  • (λ_ℓ, λ_h) = (0.6, 1.3)
  • Rescaling blend α = 0.7
  • EMA β = 0.8
  • Hysteresis thresholds (τ_lo, τ_hi) = (0.45, 0.60)
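For reference, the defaults above can be collected in one configuration object. The class name, field names, and validation rules are hypothetical; only the numeric defaults and the ranges λ_ℓ ∈ [0,1] and α ∈ [0,1] come from the text. The check τ_lo < τ_hi is an assumption implied by the hysteresis description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZeResFDGDefaults:
    sigma: float = 1.0          # Gaussian separation
    lam_lo: float = 0.6         # low-frequency weight
    lam_hi: float = 1.3         # high-frequency weight
    rescale_alpha: float = 0.7  # rescaling blend
    ema_beta: float = 0.8       # EMA coefficient
    tau_lo: float = 0.45        # lower hysteresis threshold
    tau_hi: float = 0.60        # upper hysteresis threshold

    def __post_init__(self):
        if not 0.0 <= self.lam_lo <= 1.0:
            raise ValueError("lam_lo must lie in [0, 1]")
        if not 0.0 <= self.rescale_alpha <= 1.0:
            raise ValueError("rescale_alpha must lie in [0, 1]")
        if not self.tau_lo < self.tau_hi:
            raise ValueError("tau_lo must be below tau_hi")

cfg_defaults = ZeResFDGDefaults()
print(cfg_defaults)
```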

Experimental Results

Main Results

Experiments demonstrate CADE 2.5 improvements across multiple aspects:

  1. Anime-Style Portraits: Clearer lines, better color and lighting effects, significant enhancement of eye, nose, and lip details, without flickering artifacts
  2. Photorealistic Portraits: Enhanced microtextures while maintaining global tone, reduced eye artifacts, richer hair details, more natural skin tone and microtextures
  3. High-Frequency Details: Significant enhancement of microtextures in lips, nose, neck and other regions

Case Analysis

The paper provides visual comparisons in which ZeResFDG improves microtexture quality and reduces typical high-CFG artifacts (oversaturation, halo effects) while maintaining global composition and tone.

Experimental Findings

  • The method effectively improves clarity and prompt adherence at moderate guidance scales
  • Successfully controls artifacts, particularly oversaturation and halo problems
  • Produces natural microtexture effects in high-resolution outputs

Related Work

Main Research Directions

  1. Attention-Guided Control: Methods like SAG/PAG improve guidance effects by manipulating attention mechanisms
  2. Schedule-Aware Guidance: Applies guidance within limited intervals to suppress artifacts
  3. Rescaling Heuristics: Energy-matching methods widely used in practice

The paper particularly mentions complementarity with Sadat et al. (2025)'s Adaptive Projection Guidance (APG) framework. APG decomposes classifier-free guidance into parallel and orthogonal components, while this work extends this perspective by incorporating rescaling and zero projection terms specifically for SD/SDXL.

Relative Advantages

  • Provides a more unified solution
  • Incorporates frequency-domain analysis
  • Implements adaptive mode switching
  • Maintains training-free characteristics

Conclusions and Discussion

Main Conclusions

CADE 2.5 targets the quality degradation that SD/SDXL models show at high CFG scales. Its ZeResFDG framework improves sharpness, prompt adherence, and artifact control at moderate guidance scales while remaining training-free. The supporting evidence is primarily qualitative.

Limitations

  1. Limited Evaluation Scope: Authors acknowledge evaluation is primarily qualitative, lacking comprehensive quantitative benchmarking
  2. Parameter Sensitivity: The method involves multiple hyperparameters that may require tuning for different scenarios
  3. Computational Overhead: While claimed to be lightweight, frequency decomposition and multi-mode switching still incur computational costs

Future Directions

  1. More comprehensive quantitative evaluation and ablation studies
  2. Adaptation to other diffusion model architectures
  3. Development of automatic parameter tuning mechanisms
  4. Deeper comparison with other guidance improvement methods

In-Depth Evaluation

Strengths

  1. Strong Method Innovation: Cleverly unifies three distinct improvement techniques in a single framework
  2. High Practical Value: Training-free, plug-and-play characteristics enable easy deployment
  3. Complete Technical Details: Provides detailed algorithm descriptions and implementation details
  4. Significant Visual Improvements: Demonstrated examples show clear improvements

Weaknesses

  1. Incomplete Evaluation: Lacks quantitative metrics and large-scale dataset validation
  2. Limited Theoretical Analysis: Insufficient theoretical explanation for why this combination is effective
  3. Experience-Dependent Parameter Setting: Multiple hyperparameter choices primarily based on empirical experience
  4. Insufficient Comparative Experiments: Limited direct comparisons with other SOTA methods

Impact

This work has significant implications for diffusion model inference optimization:

  • Provides new perspectives on guidance improvement
  • Offers effective tools for practical applications
  • May inspire more training-free optimization methods

Applicable Scenarios

  • Image generation quality enhancement for SD/SDXL models
  • Artistic creation requiring high-quality details
  • Commercial image generation applications
  • Research on diffusion model guidance mechanisms

References

The paper cites important works in the field, including:

  • Attention-guided methods such as SAG/PAG
  • Related research on the APG framework
  • Foundational theory on diffusion model guidance mechanisms
  • Optimization techniques widely used in practice

Overall Assessment: This is a technically strong engineering optimization paper. While it has some limitations in theoretical depth and evaluation comprehensiveness, its practical value is high, providing effective improvement solutions for diffusion model applications. The training-free characteristics and significant visual improvements make it promising for practical deployment.