Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance

Zhong, Jiang, Tao et al.

Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.

Basic Information

Paper ID: 2510.12497
Title: Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance
Authors: Jincheng Zhong, Boyuan Jiang, Xin Tao, Pengfei Wan, Kun Gai, Mingsheng Long
Category: cs.LG (Machine Learning)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12497

Abstract

Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. This paper identifies a long-overlooked but prevalent issue in such models: the mismatch between predefined noise levels and the actual noise levels encoded in intermediate states during sampling. The authors term this mismatch "noise shift." Through empirical analysis, the authors demonstrate that noise shift is widespread in modern diffusion models and exhibits systematic bias, leading to out-of-distribution artifacts and inaccurate denoising updates, thereby producing suboptimal generation results. To address this issue, the authors propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly guides the sampling trajectory to maintain consistency with predefined noise schedules.

Research Background and Motivation

Problem Identification

Denoising generative models such as diffusion models and flow models have achieved significant success in visual generation tasks like image synthesis and video generation. The core principle of these models is to iteratively recover the target sample from pure noise. However, during the iterative sampling process, the model inevitably accumulates errors from multiple sources, including:

Imperfect network approximations
Discretization errors in numerical integration
Other stochastic factors

Core Problem

The authors find that a key manifestation of these cumulative errors is that the noise level inherently encoded in intermediate states may deviate from the predefined schedule. This phenomenon, termed "noise shift," has long been overlooked by the community but is actually both widespread and rooted in the collective effects of various error sources.

Problem Importance

Noise shift leads to a fundamental mismatch between training and inference in denoising networks, specifically manifesting as:

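Out-of-distribution inputs: The trained network is evaluated on intermediate states whose actual noise level differs from the level it is conditioned on
Suboptimal denoising operations: The next state is computed with predefined coefficients that no longer match the state's actual noise level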

Key Contributions

Identification of Noise Shift: First systematic identification and analysis of the widespread but long-overlooked noise shift problem in denoising generative models
Proposal of NAG Method: Design of Noise Awareness Guidance (NAG) to mitigate the noise shift problem
Development of Classifier-Free Variant: Introduction of a classifier-free variant of NAG, jointly training noise-conditional and noise-unconditional models through noise-conditional dropout
Comprehensive Experimental Validation: Verification of NAG's effectiveness and generality on ImageNet generation and supervised fine-tuning tasks

Method Details

Problem Formalization

Forward Process

For noise level ( t \in [0,T] ), the continuous-time stochastic interpolation is defined as: $x_t = \alpha_t x_0 + \sigma_t \epsilon$ where ( \alpha_0 = \sigma_T = 1 ), ( \alpha_T = \sigma_0 = 0 ), ( \alpha_t ) is monotonically decreasing, and ( \sigma_t ) is monotonically increasing.
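The interpolation above can be illustrated with a minimal NumPy sketch. The linear schedule ( \alpha_t = 1 - t ), ( \sigma_t = t ) with ( T = 1 ) is an assumption chosen because it satisfies the stated boundary conditions; the summary does not say which schedule each model uses, and the names `linear_schedule` and `interpolate` are hypothetical.

```python
import numpy as np

def linear_schedule(t):
    """Assumed schedule with T = 1: alpha_t = 1 - t, sigma_t = t."""
    return 1.0 - t, t

def interpolate(x0, eps, t):
    """Forward interpolation x_t = alpha_t * x0 + sigma_t * eps."""
    alpha_t, sigma_t = linear_schedule(t)
    return alpha_t * x0 + sigma_t * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))   # stand-in for clean data samples
eps = rng.normal(size=(4, 8))  # standard Gaussian noise

x_start = interpolate(x0, eps, 0.0)  # alpha_0 = 1, sigma_0 = 0: pure data
x_mid = interpolate(x0, eps, 0.5)    # equal mix of data and noise
x_end = interpolate(x0, eps, 1.0)    # alpha_T = 0, sigma_T = 1: pure noise
```

At the two endpoints the state reduces to the clean sample and to pure noise respectively, which is what the boundary conditions on ( \alpha_t ) and ( \sigma_t ) require.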

Mathematical Description of Noise Shift

The cumulative error ( e ) can be viewed as an additional Gaussian perturbation applied to ( x_t ): ( \hat{x}_t = x_t + e ), where ( e \sim \mathcal{N}(0, \sigma_e^2 I) ).

This perturbation increases the effective variance from ( \sigma_t^2 ) to ( \sigma_t^2 + \sigma_e^2 ), making the perturbed state appear as if sampled at a shifted noise level ( t' = t + \delta ): $\sigma_{t+\delta}^2 = \sigma_t^2 + \sigma_e^2$

Statement 1: When the error variance ( \sigma_e^2 ) is small, the first-order approximation of the shift ( \delta ) is: $\delta \approx \frac{\sqrt{\sigma_t^2 + \sigma_e^2} - \sigma_t}{\dot{\sigma}_t}$
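Statement 1 follows from the expansion ( \sigma_{t+\delta} \approx \sigma_t + \dot{\sigma}_t \delta ) substituted into ( \sigma_{t+\delta}^2 = \sigma_t^2 + \sigma_e^2 ). The sketch below checks the approximation numerically against the exact shift. The schedule ( \sigma_t = \sin(\pi t / 2) ) with ( T = 1 ) and the values of `t` and `sigma_e` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigma(t):
    """Assumed noise schedule: sigma_t = sin(pi * t / 2), so sigma_0 = 0, sigma_1 = 1."""
    return np.sin(0.5 * np.pi * t)

def sigma_dot(t):
    """Time derivative of the assumed schedule."""
    return 0.5 * np.pi * np.cos(0.5 * np.pi * t)

def shift_first_order(t, sigma_e):
    """First-order shift: (sqrt(sigma_t^2 + sigma_e^2) - sigma_t) / sigma_dot_t."""
    return (np.sqrt(sigma(t) ** 2 + sigma_e ** 2) - sigma(t)) / sigma_dot(t)

def shift_exact(t, sigma_e):
    """Exact shift: invert the schedule at the inflated standard deviation."""
    inflated = np.sqrt(sigma(t) ** 2 + sigma_e ** 2)
    t_shifted = (2.0 / np.pi) * np.arcsin(inflated)
    return t_shifted - t

t, sigma_e = 0.5, 0.05  # illustrative values
delta_approx = shift_first_order(t, sigma_e)
delta_exact = shift_exact(t, sigma_e)
rel_error = abs(delta_approx - delta_exact) / delta_exact
```

Both shifts are positive, meaning the perturbed state looks noisier than the schedule assumes, and for a small error variance the two values agree closely.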

Noise Awareness Guidance (NAG)

Classifier-Based NAG

The noise-conditional score can be written as: $s(x|t) = \nabla_x \log p_t(x|t) = \nabla_x \log p_t(x) + \nabla_x \log p_t(t|x)$

Guidance signals ( \nabla \log g_\phi(t|x) ) are provided by an external posterior estimator ( g_\phi ).
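The decomposition can be verified on a toy problem where every term is available in closed form. In the sketch below, the 1-D Gaussian data distribution, the linear schedule, the uniform prior over a discrete grid of noise levels, and all function names are assumptions made for illustration. The exact posterior stands in for the learned estimator ( g_\phi ), which in the paper is a trained network.

```python
import numpy as np

S0 = 2.0                              # assumed standard deviation of the toy data x_0
T_GRID = np.linspace(0.05, 0.95, 19)  # assumed discrete noise levels with a uniform prior

def variance(t):
    """Variance of x_t = (1 - t) * x_0 + t * eps for x_0 ~ N(0, S0^2)."""
    return (1.0 - t) ** 2 * S0 ** 2 + t ** 2

def log_p_x_given_t(x, t):
    v = variance(t)
    return -0.5 * (np.log(2.0 * np.pi * v) + x ** 2 / v)

def cond_score(x, t):
    """Noise-conditional score: gradient of log p(x | t) with respect to x."""
    return -x / variance(t)

def log_posterior(x):
    """Exact log p(t_i | x) on the grid (a numerically stable log-softmax)."""
    logits = log_p_x_given_t(x, T_GRID)
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

def uncond_score(x):
    """Score of the marginal p(x): posterior-weighted average of conditional scores."""
    return float(np.sum(np.exp(log_posterior(x)) * cond_score(x, T_GRID)))

def classifier_grad(x, k, h=1e-5):
    """Central finite difference of log p(t_k | x) with respect to x."""
    return (log_posterior(x + h)[k] - log_posterior(x - h)[k]) / (2.0 * h)

x, k = 1.3, 4
lhs = cond_score(x, T_GRID[k])                  # conditional score
rhs = uncond_score(x) + classifier_grad(x, k)   # marginal score + classifier gradient
```

Here `lhs` and `rhs` agree up to finite-difference error, matching the identity above. In practice the classifier gradient would come from differentiating the estimator's log-probability at the scheduled noise level.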

Classifier-Free NAG

Using ( p_t(t|x) \propto p_t(x|t)/p_t(x) ), score mixing is used to approximate the gradient of the implicit noise predictor: $s^{w_{nag}}(x|t) = (w_{nag} + 1)s(x|t) - w_{nag}s(x)$

where ( w_{nag} ) is the guidance parameter for NAG.
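As a minimal sketch, the score mixing is a one-line extrapolation from the noise-unconditional score toward the noise-conditional one. The function name `nag_score` and the example arrays are hypothetical; in a real sampler the two inputs would be the outputs of the same network with and without the noise condition.

```python
import numpy as np

def nag_score(s_cond, s_uncond, w_nag):
    """Classifier-free NAG mixing: (w_nag + 1) * s(x|t) - w_nag * s(x)."""
    return (w_nag + 1.0) * s_cond - w_nag * s_uncond

s_cond = np.array([0.2, -1.0, 0.5])    # placeholder for s(x|t)
s_uncond = np.array([0.1, -0.4, 0.9])  # placeholder for s(x)

guided = nag_score(s_cond, s_uncond, w_nag=3.0)
unguided = nag_score(s_cond, s_uncond, w_nag=0.0)  # reduces to s(x|t)
```

Rearranged, the mixed score equals `s_cond + w_nag * (s_cond - s_uncond)`, so the difference of the two scores plays the role of the implicit classifier gradient, and a weight of zero recovers the ordinary conditional score.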

Implementation Strategy

Follow the training strategy of CFG: randomly drop the noise condition ( t ) with a fixed probability during training, allowing the model to share weights between conditional and unconditional objectives.
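The dropout step might look like the sketch below. How the paper encodes a dropped noise condition is not stated in this summary, so the sentinel `NULL_T` and the helper `drop_noise_condition` are hypothetical; a real model would more likely map the dropped case to a learned null embedding.

```python
import numpy as np

NULL_T = -1.0  # hypothetical sentinel meaning "noise level not provided"

def drop_noise_condition(t, drop_prob, rng):
    """Independently replace each noise level with NULL_T with probability drop_prob."""
    drop = rng.random(t.shape) < drop_prob
    return np.where(drop, NULL_T, t), drop

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=10_000)  # per-sample noise levels in a training batch
t_in, dropped = drop_noise_condition(t, drop_prob=0.1, rng=rng)

# t_in is what the network would receive as its noise condition; the noisy
# input x_t is still built from the true t, so the dropped samples train the
# noise-unconditional branch with the same weights.
drop_rate = dropped.mean()
```

Because only the conditioning input changes, the same denoising loss can be applied to every sample in the batch.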

Technical Innovations

Direct Targeting of Noise Shift: NAG directly targets the noise level mismatch problem rather than indirectly mitigating it
Orthogonal to CFG: The noise level conditional axis introduced by NAG is orthogonal to CFG's conditional axis, providing complementary control
Simple and Effective: No external classifier is needed; it can be directly integrated into existing models

Experimental Setup

Datasets

ImageNet 256×256: Using pretrained Stable Diffusion VAE to obtain 32×32×4 latent vectors
Supervised Fine-Tuning Datasets: Food101, SUN397, DF20-Mini, Caltech101, CUB-200-2011, ArtBench-10, Stanford Cars

Model Architectures

DiT (Diffusion Transformers): S/2, B/2, L/2, XL/2 variants
SiT (Scalable Interpolant Transformers): Same configuration variants

Evaluation Metrics

FID (Fréchet Inception Distance): Primary evaluation metric
Precision & Recall: Used to assess sample fidelity (precision) and diversity (recall)

Implementation Details

Sampling Steps: DiT uses 250-step DDPM sampling, SiT uses 250-step SDE-Euler-Maruyama sampling
Guidance Weight: ( w_{nag} = 3.0 ) (without CFG), ( w_{nag} = 2.0 ) (with CFG)
Noise Dropout: 10% probability of dropping noise conditions during training

Experimental Results

Main Results

ImageNet Generation

Table 1: ImageNet 256×256 Generation Results (FID, lower is better)

Model	Training Epochs	FID (No CFG)	FID (With CFG)
DiT-XL/2	1400	9.62	2.27
+NAG	10+(1400*)	2.59	2.14
SiT-XL/2	1400	8.61	2.06
+NAG	10+(1400*)	2.26	1.72

Key Findings:

NAG alone achieves generation quality close to CFG guidance
Combined with CFG, NAG provides additional improvements
Only 10 extra epochs of fine-tuning (about 0.7% of the 1400-epoch pretraining cost) are needed to enable NAG
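The findings above can be restated as simple arithmetic on the Table 1 numbers. The sketch below only recomputes ratios from the values reported in the table; the dictionary layout and helper name are illustrative.

```python
# FID values copied from Table 1 (lower is better).
fid = {
    "DiT-XL/2": {"base": 9.62, "nag": 2.59, "cfg": 2.27, "cfg+nag": 2.14},
    "SiT-XL/2": {"base": 8.61, "nag": 2.26, "cfg": 2.06, "cfg+nag": 1.72},
}

def rel_reduction(before, after):
    """Relative FID reduction, as a fraction of the starting value."""
    return (before - after) / before

# Extra fine-tuning relative to pretraining: the 10-versus-1400 training ratio.
extra_cost = 10 / 1400

summary = {
    name: {
        "nag_vs_base": rel_reduction(row["base"], row["nag"]),
        "cfg+nag_vs_cfg": rel_reduction(row["cfg"], row["cfg+nag"]),
    }
    for name, row in fid.items()
}

for name, row in summary.items():
    print(f"{name}: NAG alone {row['nag_vs_base']:.1%}, on top of CFG {row['cfg+nag_vs_cfg']:.1%}")
print(f"extra training cost: {extra_cost:.2%}")
```

Without CFG, NAG removes roughly three quarters of the baseline FID on both models, while the gain on top of CFG is smaller and differs more between DiT and SiT.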

Supervised Fine-Tuning Results

Table 2: Fine-Tuning Task FID Comparison

Method	Food	SUN	Caltech	CUB	Stanford Car	DF-20M	ArtBench	Avg. FID
Fine-Tune (No CFG)	16.04	21.41	31.34	9.81	11.29	17.92	22.76	18.65
+NAG	11.18	14.95	24.32	5.68	5.92	14.79	19.22	13.72
Fine-Tune (With CFG)	10.93	14.13	23.84	5.37	6.32	15.29	19.94	13.69
+NAG	5.78	8.81	21.87	3.52	3.91	12.55	15.69	10.31
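As a quick consistency check, the Avg. FID column can be recomputed from the seven per-dataset values in Table 2. The row labels below are shorthand introduced for this sketch, and a tolerance of 0.01 is assumed because the per-dataset entries are themselves rounded to two decimals.

```python
# Per-dataset FID from Table 2, in column order:
# Food, SUN, Caltech, CUB, Stanford Car, DF-20M, ArtBench.
rows = {
    "fine-tune": [16.04, 21.41, 31.34, 9.81, 11.29, 17.92, 22.76],
    "fine-tune + NAG": [11.18, 14.95, 24.32, 5.68, 5.92, 14.79, 19.22],
    "fine-tune + CFG": [10.93, 14.13, 23.84, 5.37, 6.32, 15.29, 19.94],
    "fine-tune + CFG + NAG": [5.78, 8.81, 21.87, 3.52, 3.91, 12.55, 15.69],
}
reported_avg = {
    "fine-tune": 18.65,
    "fine-tune + NAG": 13.72,
    "fine-tune + CFG": 13.69,
    "fine-tune + CFG + NAG": 10.31,
}

recomputed_avg = {name: sum(vals) / len(vals) for name, vals in rows.items()}
max_gap = max(abs(recomputed_avg[n] - reported_avg[n]) for n in rows)

# NAG improves every dataset both without and with CFG.
nag_helps_everywhere = all(
    a < b for a, b in zip(rows["fine-tune + NAG"], rows["fine-tune"])
) and all(
    a < b for a, b in zip(rows["fine-tune + CFG + NAG"], rows["fine-tune + CFG"])
)
```

The recomputed means match the reported averages within rounding, and the per-dataset comparison confirms that adding NAG lowers FID on all seven datasets in both settings.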

Noise Shift Mitigation Effect

Empirical analysis using an external noise estimator ( g_\phi ) shows:

Noise shift is widespread in modern diffusion models
Manifests as a systematic shift towards higher noise levels
NAG effectively reduces this shift, especially in the range where the signal-to-noise ratio is greater than 1

Ablation Studies

Guidance Weight Sensitivity: ( w_{nag} ) performs stably in the range of 2.0-4.0
Impact of Sampling Steps: NAG is effective across different sampling steps
Architecture Generality: Shows consistent improvements on both DiT and SiT architectures

Related Work

Denoising Generative Models

Diffusion Models: DDPM, DiT, etc., focusing on noise schedules, training objectives, and model architectures
Flow Models: Methods like Flow Matching
Accelerated Sampling: High-order solvers, improved interval modeling, etc.

Guidance Techniques

Classifier Guidance: Using external classifiers for conditional generation
Classifier-Free Guidance (CFG): Achieving guidance through mixing conditional and unconditional models
Domain Guidance (DoG): Guidance method specifically designed for fine-tuning scenarios

This paper's NAG is the first method to explicitly use the noise level itself as a guidance signal, directly enhancing alignment with the expected noise condition.

Conclusion and Discussion

Main Conclusions

Noise Shift Problem is Widespread: Training-inference mismatch is widely found in modern denoising generative models
NAG Effectively Mitigates the Problem: By directly targeting the noise-level mismatch, NAG significantly improves generation quality
Method is Highly General: Shows consistent improvements across different architectures, tasks, and baseline methods

Limitations

Noise Estimator Dependency: Empirical analysis relies on the accuracy of external noise estimators
Simplified Theoretical Analysis: Theoretical analysis based on simplified assumptions may not fully capture real-world complexity
Computational Overhead: Requires training an additional unconditional branch

Future Directions

The authors hope this work will attract researchers to focus on the widespread training-inference mismatch problem in denoising generation, promoting the following research directions:

Theoretical or empirical analysis of the noise shift problem
Building generative models robust to inference-phase shifts
Exploring the boundaries of high-quality generation
Faster sampling methods

In-Depth Evaluation

Strengths

Innovative Problem Identification: First systematic identification and analysis of the widespread but overlooked noise shift problem
Simple and Effective Method: NAG is simple in design, easy to integrate into existing models, and highly effective
Comprehensive Experiments: Covers various architectures, datasets, and tasks, validating the method's generality
Theoretical Support: Provides mathematical analysis and approximate formulas for noise shift
High Practical Value: Requires only a small amount of additional training to significantly improve existing model performance

Weaknesses

Limited Theoretical Analysis: Based on simplified assumptions, may not fully explain complex real-world situations
Noise Estimator Issues: Empirical analysis relies on external estimators, potentially introducing additional errors
Computational Cost: Requires training an additional unconditional branch, increasing training and inference costs
Applicability Scope: Mainly validated on visual generation tasks; applicability to other modalities is unknown

Impact

Academic Contribution: Reveals an important problem in denoising generative models, providing new research directions for the field
Practical Value: Can be directly applied to improve existing model performance, with strong practical value
Method Generality: Orthogonal and complementary to existing guidance methods, with broad applicability

Suitable Scenarios

Large-scale image generation tasks
Supervised fine-tuning of pretrained models
Application scenarios requiring high-quality generation
Environments with relatively abundant computational resources

References

The paper cites important works in related fields such as diffusion models, flow models, and guidance techniques, including:

Ho et al. (2020): Original DDPM paper
Peebles & Xie (2023): DiT architecture
Ma et al. (2024): SiT architecture
Ho & Salimans (2021): Classifier-free guidance
Dhariwal & Nichol (2021): Classifier guidance

Overall Evaluation: This is a high-quality research paper that identifies an important but overlooked problem in denoising generative models, proposes a simple and effective solution, and validates the method's effectiveness and generality through comprehensive experiments. This work has significant academic value and practical implications for the field of diffusion models.