2025-11-14T15:31:11.541597

Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance

Zhong, Jiang, Tao et al.
Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.

Basic Information

  • Paper ID: 2510.12497
  • Title: Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance
  • Authors: Jincheng Zhong, Boyuan Jiang, Xin Tao, Pengfei Wan, Kun Gai, Mingsheng Long
  • Category: cs.LG (Machine Learning)
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12497

Abstract

Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. This paper identifies a long-overlooked but prevalent issue in such models: the mismatch between predefined noise levels and the actual noise levels encoded in intermediate states during sampling. The authors term this mismatch "noise shift." Through empirical analysis, the authors demonstrate that noise shift is widespread in modern diffusion models and exhibits systematic bias, leading to out-of-distribution artifacts and inaccurate denoising updates, thereby producing suboptimal generation results. To address this issue, the authors propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly guides the sampling trajectory to maintain consistency with predefined noise schedules.

Research Background and Motivation

Problem Identification

Denoising generative models such as diffusion models and flow models have achieved significant success in visual generation tasks like image synthesis and video generation. The core principle of these models is to iteratively recover the target sample from pure noise. However, during the iterative sampling process, the model inevitably accumulates errors from multiple sources, including:

  • Imperfect network approximations
  • Discretization errors in numerical integration
  • Other stochastic factors

Core Problem

The authors find that a key manifestation of these cumulative errors is that the noise level inherently encoded in intermediate states may deviate from the predefined schedule. This phenomenon, termed "noise shift," has long been overlooked by the community but is actually both widespread and rooted in the collective effects of various error sources.

Problem Importance

Noise shift leads to a fundamental mismatch between training and inference in denoising networks, specifically manifesting as:

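  1. Out-of-distribution artifacts: The trained network is applied to intermediate states whose actual noise level differs from the level it is conditioned on
  2. Suboptimal denoising operations: The next state is computed with predefined coefficients that no longer match the actual noise level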

Key Contributions

  1. Identification of Noise Shift: First systematic identification and analysis of the widespread but long-overlooked noise shift problem in denoising generative models
  2. Proposal of NAG Method: Design of Noise Awareness Guidance (NAG) to mitigate the noise shift problem
  3. Development of Classifier-Free Variant: Introduction of a classifier-free variant of NAG that jointly trains noise-conditional and noise-unconditional models through noise-condition dropout
  4. Comprehensive Experimental Validation: Verification of NAG's effectiveness and generality on ImageNet generation and supervised fine-tuning tasks

Method Details

Problem Formalization

Forward Process

For a noise level ( t \in [0, T] ), the continuous-time stochastic interpolation is defined as ( x_t = \alpha_t x_0 + \sigma_t \epsilon ), where ( \epsilon \sim \mathcal{N}(0, I) ), ( \alpha_0 = \sigma_T = 1 ), ( \alpha_T = \sigma_0 = 0 ), ( \alpha_t ) is monotonically decreasing, and ( \sigma_t ) is monotonically increasing.
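
As a minimal sketch of this interpolation, the snippet below assumes the linear schedule ( \alpha_t = 1 - t/T ), ( \sigma_t = t/T ) with ( T = 1 ). This is only one schedule that satisfies the boundary conditions above, and the models in the paper may use others. The function names are illustrative and do not come from the paper.

```python
import random

T = 1.0  # assumed time horizon

def alpha(t: float) -> float:
    """Signal coefficient: 1 at t = 0, 0 at t = T, monotonically decreasing."""
    return 1.0 - t / T

def sigma(t: float) -> float:
    """Noise coefficient: 0 at t = 0, 1 at t = T, monotonically increasing."""
    return t / T

def forward_sample(x0: list[float], t: float, rng: random.Random) -> list[float]:
    """Draw x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, I)."""
    return [alpha(t) * v + sigma(t) * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
x_clean = forward_sample(x0, 0.0, rng)  # no noise is added at t = 0
x_mid = forward_sample(x0, 0.5, rng)    # equal weights on signal and noise
```

At ( t = 0 ) the sample equals ( x_0 ), and at ( t = T ) it is pure Gaussian noise, which matches the boundary conditions stated above.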

Mathematical Description of Noise Shift

The cumulative error ( e ) can be viewed as an additional Gaussian perturbation applied to ( x_t ): ( \hat{x}_t = x_t + e ), where ( e \sim \mathcal{N}(0, \sigma_e^2 I) ).

This perturbation increases the effective variance from ( \sigma_t^2 ) to ( \sigma_t^2 + \sigma_e^2 ), making the perturbed state appear as if sampled at a shifted noise level ( t' = t + \delta ): ( \sigma_{t+\delta}^2 = \sigma_t^2 + \sigma_e^2 )

Statement 1: When the error variance ( \sigma_e^2 ) is small, the first-order approximation of the shift ( \delta ) is: ( \delta \approx \frac{\sqrt{\sigma_t^2 + \sigma_e^2} - \sigma_t}{\dot{\sigma}_t} )
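
The approximation follows from expanding ( \sigma_{t+\delta} \approx \sigma_t + \delta \dot{\sigma}_t ) and solving the variance-matching condition for ( \delta ). The sketch below compares it with the exact shift under an assumed schedule ( \sigma_t = \sin(\pi t / 2) ) on ( [0, 1] ). The schedule and the values of ( t ) and ( \sigma_e ) are illustrative choices, not settings from the paper.

```python
import math

def sigma(t: float) -> float:
    """Assumed noise schedule on [0, 1]: sigma_0 = 0, sigma_1 = 1."""
    return math.sin(0.5 * math.pi * t)

def sigma_dot(t: float) -> float:
    """Time derivative of the assumed schedule."""
    return 0.5 * math.pi * math.cos(0.5 * math.pi * t)

def shift_first_order(t: float, sigma_e: float) -> float:
    """First-order estimate of delta from Statement 1."""
    return (math.sqrt(sigma(t) ** 2 + sigma_e ** 2) - sigma(t)) / sigma_dot(t)

def shift_exact(t: float, sigma_e: float) -> float:
    """Solve sigma_{t+delta}^2 = sigma_t^2 + sigma_e^2 by inverting the schedule."""
    target = math.sqrt(sigma(t) ** 2 + sigma_e ** 2)
    return (2.0 / math.pi) * math.asin(min(target, 1.0)) - t

t, sigma_e = 0.5, 0.05
approx = shift_first_order(t, sigma_e)
exact = shift_exact(t, sigma_e)
print(f"first-order delta = {approx:.5f}, exact delta = {exact:.5f}")
```

Both values are positive, so in this toy setting the added error variance makes the state look noisier than the schedule assumes.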

Noise Awareness Guidance (NAG)

Classifier-Based NAG

The noise-conditional score can be written as: ( s(x|t) = \nabla_x \log p_t(x|t) = \nabla_x \log p_t(x) + \nabla_x \log p_t(t|x) )

Guidance signals ( \nabla \log g_\phi(t|x) ) are provided by an external posterior estimator ( g_\phi ).
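
The Bayes decomposition of the noise-conditional score can be checked numerically on a toy problem. The sketch below assumes 1D data ( x_0 \sim \mathcal{N}(0, 2^2) ), a linear schedule ( \alpha_t = 1 - t ), ( \sigma_t = t ), and a uniform prior over a small grid of noise levels. The exact posterior stands in for a learned estimator ( g_\phi ). All of these choices are illustrative assumptions, not the paper's setup.

```python
import math

S0 = 2.0                                  # assumed std of the toy data
T_GRID = [0.1 * k for k in range(1, 10)]  # assumed uniform prior over noise levels

def var_t(t: float) -> float:
    """Variance of x_t under the assumed schedule alpha_t = 1 - t, sigma_t = t."""
    return ((1.0 - t) * S0) ** 2 + t ** 2

def log_p_x_given_t(x: float, t: float) -> float:
    """Log density of the noise-conditional marginal p(x | t)."""
    v = var_t(t)
    return -0.5 * x * x / v - 0.5 * math.log(2.0 * math.pi * v)

def log_p_x(x: float) -> float:
    """Log density of the noise-unconditional marginal p(x)."""
    total = sum(math.exp(log_p_x_given_t(x, t)) for t in T_GRID)
    return math.log(total / len(T_GRID))

def log_p_t_given_x(x: float, t: float) -> float:
    """Exact posterior over noise levels, standing in for g_phi(t | x)."""
    return log_p_x_given_t(x, t) - math.log(len(T_GRID)) - log_p_x(x)

def grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference derivative with respect to x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x, t = 1.3, T_GRID[6]
cond_score = grad(lambda z: log_p_x_given_t(z, t), x)
uncond_score = grad(log_p_x, x)
posterior_grad = grad(lambda z: log_p_t_given_x(z, t), x)
```

Here `cond_score` agrees with `uncond_score + posterior_grad` up to finite-difference error, which is the identity that classifier-based guidance relies on.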

Classifier-Free NAG

Using ( p_t(t|x) \propto p_t(x|t)/p_t(x) ), score mixing is used to approximate the gradient of the implicit noise predictor: ( s^{w_{nag}}(x|t) = (w_{nag} + 1) s(x|t) - w_{nag} s(x) )

where ( w_{nag} ) is the guidance weight for NAG.
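
The score-mixing rule is a simple element-wise extrapolation. The sketch below applies it to placeholder score vectors. The helper name `nag_score` and the numeric values are hypothetical and serve only to illustrate the formula.

```python
def nag_score(score_cond: list[float], score_uncond: list[float], w_nag: float) -> list[float]:
    """Mix scores as (w_nag + 1) * s(x|t) - w_nag * s(x), element-wise."""
    return [(w_nag + 1.0) * c - w_nag * u for c, u in zip(score_cond, score_uncond)]

s_cond = [0.2, -0.4]    # hypothetical noise-conditional score s(x|t)
s_uncond = [0.1, -0.1]  # hypothetical noise-unconditional score s(x)

plain = nag_score(s_cond, s_uncond, 0.0)   # a weight of 0 recovers s(x|t)
guided = nag_score(s_cond, s_uncond, 3.0)  # extrapolates away from s(x)
```

Rewriting the rule as ( s(x|t) + w_{nag} (s(x|t) - s(x)) ) shows that a positive weight pushes the update further along the direction that separates the noise-conditional score from the noise-unconditional one.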

Implementation Strategy

NAG follows the training strategy of CFG: the noise condition ( t ) is randomly dropped with a fixed probability during training, so a single model shares weights between the noise-conditional and noise-unconditional objectives.
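
A minimal sketch of the dropout step, assuming the dropped condition is represented by a placeholder value. The `NULL_T` token and the helper name are hypothetical. In a real model the placeholder would typically map to a learned null embedding, which the summary does not specify. The 10% rate is the one listed under Implementation Details.

```python
import random

NULL_T = None  # hypothetical placeholder meaning "no noise condition"

def drop_noise_condition(t_batch, p_drop: float, rng: random.Random):
    """Replace each noise level with NULL_T independently with probability p_drop."""
    return [NULL_T if rng.random() < p_drop else t for t in t_batch]

rng = random.Random(0)
t_batch = [rng.random() for _ in range(10_000)]  # sampled noise levels in [0, 1)
conditioned = drop_noise_condition(t_batch, p_drop=0.1, rng=rng)
drop_rate = sum(t is NULL_T for t in conditioned) / len(conditioned)
```

Examples that keep their noise level train the noise-conditional objective, and examples that receive `NULL_T` train the noise-unconditional one, so both share the same network weights.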

Technical Innovations

  1. Direct Targeting of Noise Shift: NAG directly targets the noise level mismatch problem rather than indirectly mitigating it
  2. Orthogonal to CFG: The noise level conditional axis introduced by NAG is orthogonal to CFG's conditional axis, providing complementary control
  3. Simple and Effective: No external classifier is needed; it can be directly integrated into existing models

Experimental Setup

Datasets

  • ImageNet 256×256: Using pretrained Stable Diffusion VAE to obtain 32×32×4 latent vectors
  • Supervised Fine-Tuning Datasets: Food101, SUN397, DF20-Mini, Caltech101, CUB-200-2011, ArtBench-10, Stanford Cars

Model Architectures

  • DiT (Diffusion Transformers): S/2, B/2, L/2, XL/2 variants
  • SiT (Scalable Interpolant Transformers): Same configuration variants

Evaluation Metrics

  • FID (Fréchet Inception Distance): Primary evaluation metric
  • Precision & Recall: Used to assess the fidelity and diversity of generated samples
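
For reference, FID is commonly defined from Gaussian fits to Inception features of real and generated images. The symbols below (means ( \mu ) and covariances ( \Sigma ) with subscripts ( r ) for real and ( g ) for generated) are standard notation and are not taken from the paper:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.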

Implementation Details

  • Sampling Steps: DiT uses 250-step DDPM sampling, SiT uses 250-step SDE-Euler-Maruyama sampling
  • Guidance Weight: ( w_{nag} = 3.0 ) (without CFG), ( w_{nag} = 2.0 ) (with CFG)
  • Noise Dropout: 10% probability of dropping noise conditions during training

Experimental Results

Main Results

ImageNet Generation

Table 1: Generative Model Comparison Results on ImageNet

Model | Training Epochs | FID (No CFG) | FID (With CFG)
DiT-XL/2 | 1400 | 9.62 | 2.27
+NAG | 10+(1400*) | 2.59 | 2.14
SiT-XL/2 | 1400 | 8.61 | 2.06
+NAG | 10+(1400*) | 2.26 | 1.72

Key Findings:

  • NAG alone achieves generation quality close to CFG guidance
  • Combined with CFG, NAG provides additional improvements
  • Only 10 extra epochs of fine-tuning (10/1400, about 0.7% of the pretraining cost) are needed to enable NAG

Supervised Fine-Tuning Results

Table 2: Fine-Tuning Task FID Comparison

Method | Food | SUN | Caltech | CUB | Stanford Car | DF-20M | ArtBench | Avg. FID
Fine-Tune (No CFG) | 16.04 | 21.41 | 31.34 | 9.81 | 11.29 | 17.92 | 22.76 | 18.65
+NAG | 11.18 | 14.95 | 24.32 | 5.68 | 5.92 | 14.79 | 19.22 | 13.72
Fine-Tune (With CFG) | 10.93 | 14.13 | 23.84 | 5.37 | 6.32 | 15.29 | 19.94 | 13.69
+NAG | 5.78 | 8.81 | 21.87 | 3.52 | 3.91 | 12.55 | 15.69 | 10.31
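
The averages in Table 2 can be recomputed from the per-dataset values. The sketch below does so and derives the relative reduction in average FID that NAG brings in each setting. The dictionary keys and helper names are illustrative labels, not names from the paper.

```python
FID = {
    "no_cfg":       [16.04, 21.41, 31.34, 9.81, 11.29, 17.92, 22.76],
    "no_cfg_nag":   [11.18, 14.95, 24.32, 5.68, 5.92, 14.79, 19.22],
    "with_cfg":     [10.93, 14.13, 23.84, 5.37, 6.32, 15.29, 19.94],
    "with_cfg_nag": [5.78, 8.81, 21.87, 3.52, 3.91, 12.55, 15.69],
}

def mean(values: list[float]) -> float:
    """Arithmetic mean over the seven fine-tuning datasets."""
    return sum(values) / len(values)

averages = {name: mean(scores) for name, scores in FID.items()}

def relative_drop(before: str, after: str) -> float:
    """Fractional reduction in average FID when moving from `before` to `after`."""
    return 1.0 - averages[after] / averages[before]

drop_no_cfg = relative_drop("no_cfg", "no_cfg_nag")
drop_with_cfg = relative_drop("with_cfg", "with_cfg_nag")
```

The recomputed averages match the reported column to within rounding. NAG lowers the average FID by roughly a quarter in both settings, and NAG without CFG (13.72) is close to the CFG baseline (13.69).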

Noise Shift Mitigation Effect

Empirical analysis using an external noise estimator ( g_\phi ) shows:

  • Noise shift is widespread in modern diffusion models
  • Manifests as a systematic shift towards higher noise levels
  • NAG effectively reduces this shift, especially in the range where the signal-to-noise ratio is greater than 1

Ablation Studies

  • Guidance Weight Sensitivity: Performance is stable for ( w_{nag} ) in the range 2.0-4.0
  • Impact of Sampling Steps: NAG is effective across different sampling steps
  • Architecture Generality: Shows consistent improvements on both DiT and SiT architectures

Related Work

Denoising Generative Models

  • Diffusion Models: DDPM, DiT, etc., focusing on noise schedules, training objectives, and model architectures
  • Flow Models: Methods like Flow Matching
  • Accelerated Sampling: High-order solvers, improved interval modeling, etc.

Guidance Techniques

  • Classifier Guidance: Using external classifiers for conditional generation
  • Classifier-Free Guidance (CFG): Achieving guidance through mixing conditional and unconditional models
  • Domain Guidance (DoG): Guidance method specifically designed for fine-tuning scenarios

This paper's NAG is the first method to explicitly use the noise level itself as a guidance signal, directly enhancing alignment with the expected noise condition.

Conclusion and Discussion

Main Conclusions

  1. Noise Shift Problem is Widespread: Training-inference mismatch is widely found in modern denoising generative models
  2. NAG Effectively Mitigates the Problem: By directly targeting the noise level mismatch, NAG significantly improves generation quality
  3. Method is Highly General: Shows consistent improvements across different architectures, tasks, and baseline methods

Limitations

  1. Noise Estimator Dependency: Empirical analysis relies on the accuracy of external noise estimators
  2. Simplified Theoretical Analysis: Theoretical analysis based on simplified assumptions may not fully capture real-world complexity
  3. Computational Overhead: Requires training an additional unconditional branch

Future Directions

The authors hope this work will attract researchers to focus on the widespread training-inference mismatch problem in denoising generation, promoting the following research directions:

  • Theoretical or empirical analysis of the noise shift problem
  • Building generative models robust to inference-phase shifts
  • Exploring the boundaries of high-quality generation
  • Faster sampling methods

In-Depth Evaluation

Strengths

  1. Innovative Problem Identification: First systematic identification and analysis of the widespread but overlooked noise shift problem
  2. Simple and Effective Method: NAG is simple in design, easy to integrate into existing models, and highly effective
  3. Comprehensive Experiments: Covers various architectures, datasets, and tasks, validating the method's generality
  4. Theoretical Support: Provides mathematical analysis and approximate formulas for noise shift
  5. High Practical Value: Requires only a small amount of additional training to significantly improve existing model performance

Weaknesses

  1. Limited Theoretical Analysis: Based on simplified assumptions, may not fully explain complex real-world situations
  2. Noise Estimator Issues: Empirical analysis relies on external estimators, potentially introducing additional errors
  3. Computational Cost: Requires training an additional unconditional branch, increasing training and inference costs
  4. Applicability Scope: Mainly validated on visual generation tasks; applicability to other modalities is unknown

Impact

  1. Academic Contribution: Reveals an important problem in denoising generative models, providing new research directions for the field
  2. Practical Value: Can be directly applied to improve existing model performance, with strong practical value
  3. Method Generality: Orthogonal and complementary to existing guidance methods, with broad applicability

Suitable Scenarios

  • Large-scale image generation tasks
  • Supervised fine-tuning of pretrained models
  • Application scenarios requiring high-quality generation
  • Environments with relatively abundant computational resources

References

The paper cites important works in related fields such as diffusion models, flow models, and guidance techniques, including:

  • Ho et al. (2020): Original DDPM paper
  • Peebles & Xie (2023): DiT architecture
  • Ma et al. (2024): SiT architecture
  • Ho & Salimans (2021): Classifier-free guidance
  • Dhariwal & Nichol (2021): Classifier guidance

Overall Evaluation: This is a high-quality research paper that identifies an important but overlooked problem in denoising generative models, proposes a simple and effective solution, and validates the method's effectiveness and generality through comprehensive experiments. This work has significant academic value and practical implications for the field of diffusion models.