2025-11-14T15:31:11.541597

Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance

Zhong, Jiang, Tao et al.

Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.

Basic Information

Paper ID: 2510.12497
Title: Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance
Authors: Jincheng Zhong, Boyuan Jiang, Xin Tao, Pengfei Wan, Kun Gai, Mingsheng Long
Category: cs.LG (Machine Learning)
Published: October 14, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.12497

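Abstract

Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. The paper identifies a long-overlooked yet pervasive issue in this family of models: a mismatch between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. The authors call this mismatch noise shift. Through empirical analysis, they show that noise shift is widespread in modern diffusion models and exhibits a systematic bias, which leads to sub-optimal generation through both out-of-distribution generalization and inaccurate denoising updates. To address the problem, the authors propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to stay consistent with the pre-defined noise schedule.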

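Background and Motivation

Problem Identification

Denoising generative models such as diffusion models and flow models have achieved notable success in visual generation tasks, including image synthesis and video generation. Their core principle is to recover a target sample from pure noise step by step through an iterative process. During iterative sampling, however, the model inevitably accumulates errors from several sources, including:

Imperfect network approximation
Discretization error in numerical integration
Other stochastic factors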

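Core Problem

The authors find that a key manifestation of these accumulated errors is that the noise level intrinsically encoded in an intermediate state can deviate from the pre-defined schedule. This phenomenon, termed "noise shift", has long been overlooked by the community, yet it is both widespread and rooted in the collective effect of the various error sources.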

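Why the Problem Matters

Noise shift causes a fundamental mismatch between training and inference for the denoising network, which shows up in two ways:

Out-of-distribution generalization: the trained model is applied to shifted intermediate states
Sub-optimal denoising updates: the next state is computed with pre-defined coefficients that are no longer accurate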

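Core Contributions

Identifying noise shift: the first systematic identification and analysis of noise shift, a pervasive but long-overlooked problem in denoising generative models
Proposing NAG: Noise Awareness Guidance (NAG) is designed to mitigate noise shift
Developing a classifier-free variant: a classifier-free variant of NAG jointly trains noise-conditional and noise-unconditional models via noise-condition dropout
Comprehensive experimental validation: NAG's effectiveness and generality are validated on ImageNet generation and supervised fine-tuning tasks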

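Method Details

Problem Formulation

Forward Process

For a noise level $t \in [0,T]$, the continuous-time stochastic interpolant is defined as $x_t = \alpha_t x_0 + \sigma_t \epsilon$, where $x_0$ is a data sample, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, $\alpha_0 = \sigma_T = 1$, $\alpha_T = \sigma_0 = 0$, $\alpha_t$ is monotonically decreasing, and $\sigma_t$ is monotonically increasing.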
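
The forward process above can be sketched in a few lines. This is a minimal illustration, assuming a linear schedule $\alpha_t = 1 - t/T$, $\sigma_t = t/T$ with $T = 1$; the summary does not say which schedule is used, and this is just one choice that satisfies the stated boundary conditions. The function names are hypothetical.

```python
import numpy as np

T = 1.0  # assumed terminal noise level

def alpha(t):
    """Assumed linear signal coefficient: alpha_0 = 1, alpha_T = 0, decreasing."""
    return 1.0 - t / T

def sigma(t):
    """Assumed linear noise coefficient: sigma_0 = 0, sigma_T = 1, increasing."""
    return t / T

def interpolate(x0, eps, t):
    """Return x_t = alpha_t * x0 + sigma_t * eps."""
    return alpha(t) * x0 + sigma(t) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))   # stand-in for a clean data sample
eps = rng.normal(size=(4,))  # standard Gaussian noise

x_start = interpolate(x0, eps, 0.0)  # t = 0: pure data
x_end = interpolate(x0, eps, T)      # t = T: pure noise
x_mid = interpolate(x0, eps, 0.5)    # halfway: equal mix under this schedule
```

At the two endpoints the interpolant reduces to the data sample and to the noise respectively, which is what the boundary conditions on $\alpha_t$ and $\sigma_t$ require.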

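Mathematical Description of Noise Shift

The accumulated error $e$ can be viewed as an additional Gaussian perturbation applied to $x_t$: $\hat{x}_t = x_t + e$, where $e \sim \mathcal{N}(0, \sigma_e^2 I)$.

This perturbation raises the effective noise variance from $\sigma_t^2$ to $\sigma_t^2 + \sigma_e^2$, so the perturbed state behaves as if it had been sampled at a shifted noise level $t' = t + \delta$: $\sigma_{t+\delta}^2 = \sigma_t^2 + \sigma_e^2$

Statement 1: When the error variance $\sigma_e^2$ is small, the shift $\delta$ has the first-order approximation $\delta \approx \frac{\sqrt{\sigma_t^2 + \sigma_e^2} - \sigma_t}{\dot{\sigma}_t}$, where $\dot{\sigma}_t$ is the time derivative of $\sigma_t$. This follows from linearizing $\sigma_{t+\delta} \approx \sigma_t + \dot{\sigma}_t \delta$ and substituting $\sigma_{t+\delta} = \sqrt{\sigma_t^2 + \sigma_e^2}$.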
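
Statement 1 can be checked numerically on a schedule where the exact shift has a closed form. The sketch below assumes $\sigma_t = \sin(\pi t / 2)$ on $[0, 1]$, an illustrative schedule that is not taken from the paper, and compares the exact $\delta$ solving $\sigma_{t+\delta}^2 = \sigma_t^2 + \sigma_e^2$ with the first-order estimate. The values of `t` and `sigma_e` are arbitrary.

```python
import numpy as np

def sigma(t):
    """Assumed noise schedule sigma_t = sin(pi * t / 2) on [0, 1]."""
    return np.sin(0.5 * np.pi * t)

def sigma_dot(t):
    """Time derivative of the assumed schedule."""
    return 0.5 * np.pi * np.cos(0.5 * np.pi * t)

def exact_shift(t, sigma_e):
    """Solve sigma_{t+delta}^2 = sigma_t^2 + sigma_e^2 for delta in closed form."""
    target = np.sqrt(sigma(t) ** 2 + sigma_e ** 2)
    return (2.0 / np.pi) * np.arcsin(target) - t

def first_order_shift(t, sigma_e):
    """Statement 1: (sqrt(sigma_t^2 + sigma_e^2) - sigma_t) / sigma_dot_t."""
    return (np.sqrt(sigma(t) ** 2 + sigma_e ** 2) - sigma(t)) / sigma_dot(t)

t, sigma_e = 0.5, 0.05  # arbitrary illustrative values
delta_exact = exact_shift(t, sigma_e)
delta_approx = first_order_shift(t, sigma_e)
print(f"exact: {delta_exact:.6f}, first-order: {delta_approx:.6f}")
```

For a small `sigma_e` the two values agree closely, and the gap grows as `sigma_e` increases or as `t` approaches a point where $\dot{\sigma}_t$ is near zero.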

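Noise Awareness Guidance (NAG)

Classifier-Based NAG

By Bayes' rule, the noise-conditional score can be written as $s(x|t) = \nabla_x \log p_t(x|t) = \nabla_x \log p_t(x) + \nabla_x \log p_t(t|x)$

An external posterior estimator $g_\phi$ supplies the guidance signal $\nabla_x \log g_\phi(t|x)$.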

Classifier-Free NAG

Using $p_t(t|x) \propto p_t(x|t)/p_t(x)$, the gradient of the implicit noise predictor is approximated by mixing scores: $s^{w_{nag}}(x|t) = (w_{nag} + 1)s(x|t) - w_{nag}s(x)$

where $w_{nag}$ is the NAG guidance weight.
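
The score mixing can be written as a one-line function. The sketch below is illustrative only: `nag_score` is a hypothetical helper, and the two input arrays are placeholder numbers standing in for the outputs of a noise-conditional and a noise-unconditional network at one sampling step.

```python
import numpy as np

def nag_score(score_cond, score_uncond, w_nag):
    """Mix scores as (w_nag + 1) * s(x|t) - w_nag * s(x)."""
    return (w_nag + 1.0) * score_cond - w_nag * score_uncond

# Placeholder values, not real network outputs.
s_cond = np.array([0.2, -0.4, 1.0])    # s(x|t): noise-conditional score
s_uncond = np.array([0.1, -0.1, 0.5])  # s(x): noise-unconditional score

guided = nag_score(s_cond, s_uncond, w_nag=3.0)

# Equivalent form: conditional score plus w_nag times the difference
# s(x|t) - s(x), which plays the role of the implicit noise-predictor gradient.
guided_alt = s_cond + 3.0 * (s_cond - s_uncond)

# With w_nag = 0 the mixture reduces to the plain conditional score.
unguided = nag_score(s_cond, s_uncond, w_nag=0.0)
```

Writing the mixture as $s(x|t) + w_{nag}\,(s(x|t) - s(x))$ makes the parallel with classifier-free guidance explicit: the difference of the two scores replaces the gradient that an external estimator would otherwise supply.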

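Implementation Strategy

Following the CFG training strategy, the noise condition $t$ is randomly dropped with a fixed probability during training, so that a single model shares weights between the conditional and unconditional objectives.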
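
A minimal sketch of noise-condition dropout on a batch of noise levels follows. The sentinel `NULL_T`, the helper name, and the use of a sentinel value at all are assumptions made for illustration; the summary does not describe how the dropped condition is represented inside the network.

```python
import numpy as np

NULL_T = -1.0  # hypothetical sentinel meaning "no noise condition"

def drop_noise_condition(t, drop_prob, rng):
    """Replace each noise level with NULL_T independently with probability drop_prob."""
    mask = rng.random(t.shape) < drop_prob
    return np.where(mask, NULL_T, t), mask

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=10_000)  # sampled noise levels for a large batch
t_cond, dropped = drop_noise_condition(t, drop_prob=0.1, rng=rng)

# Assumption: the noisy state x_t is still built with the true t; only the
# conditioning input passed to the network is masked.
drop_rate = dropped.mean()
```

At sampling time the same network would be evaluated twice per step, once with the true `t` and once with `NULL_T`, to obtain the two scores that the guidance formula mixes.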

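Technical Innovations

Targets noise shift directly: NAG addresses the noise-level mismatch itself rather than mitigating it indirectly
Orthogonal to CFG: the noise-level conditioning axis introduced by NAG is orthogonal to CFG's conditioning axis, giving complementary control
Simple and effective: the classifier-free variant needs no external classifier and can be integrated directly into existing models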

Experimental Setup

Datasets

ImageNet 256×256: images are encoded with the pretrained Stable Diffusion VAE into 32×32×4 latents
Supervised fine-tuning datasets: Food101, SUN397, DF20-Mini, Caltech101, CUB-200-2011, ArtBench-10, Stanford Cars

Model Architectures

DiT (Diffusion Transformers): S/2, B/2, L/2, and XL/2 variants
SiT (Scalable Interpolant Transformers): the same set of variants

Evaluation Metrics

FID (Fréchet Inception Distance): the primary metric
Precision & Recall: used to evaluate the converged models

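Implementation Details

Sampling steps: DiT uses 250-step DDPM sampling; SiT uses 250-step SDE Euler-Maruyama sampling
Guidance weight: $w_{nag} = 3.0$ (without CFG), $w_{nag} = 2.0$ (with CFG)
Noise dropout: the noise condition is dropped with 10% probability during training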

Experimental Results

Main Results

ImageNet Generation

Table 1: Comparison of converged models

Model	Training epochs	Generation without CFG	Generation with CFG
DiT-XL/2	1400	FID: 9.62	FID: 2.27
+NAG	10+(1400*)	FID: 2.59	FID: 2.14
SiT-XL/2	1400	FID: 8.61	FID: 2.06
+NAG	10+(1400*)	FID: 2.26	FID: 1.72

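Key findings:

NAG alone reaches generation quality close to that of CFG-guided sampling
Combined with CFG, NAG yields further improvement
Enabling NAG takes only 10 additional fine-tuning epochs (about 0.7% of the pre-training cost)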
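
The cost figure and the size of the FID gains can be reproduced from the numbers in Table 1. The sketch below assumes that a fine-tuning epoch costs the same as a pre-training epoch, which the summary does not state explicitly. The dictionary keys are labels chosen here for readability.

```python
pretrain_epochs = 1400
nag_finetune_epochs = 10

# Extra training cost as a fraction of pre-training (assumes equal cost per epoch).
extra_cost = nag_finetune_epochs / pretrain_epochs

# (baseline FID, FID with NAG) pairs copied from Table 1.
table1 = {
    "DiT-XL/2, without CFG": (9.62, 2.59),
    "DiT-XL/2, with CFG": (2.27, 2.14),
    "SiT-XL/2, without CFG": (8.61, 2.26),
    "SiT-XL/2, with CFG": (2.06, 1.72),
}

# Relative FID reduction obtained by adding NAG in each setting.
relative_reduction = {
    name: (baseline - with_nag) / baseline
    for name, (baseline, with_nag) in table1.items()
}

print(f"extra training cost: {extra_cost:.2%}")
for name, reduction in relative_reduction.items():
    print(f"{name}: FID reduced by {reduction:.1%}")
```

The reduction is much larger without CFG than with it, which matches the first two findings: NAG alone closes most of the gap to CFG, and adds a smaller gain on top of it.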

Supervised Fine-Tuning Results

Table 2: FID comparison on fine-tuning tasks

Method	Food	SUN	Caltech	CUB	Stanford Cars	DF-20M	ArtBench	Avg. FID
Fine-tuning (without CFG)	16.04	21.41	31.34	9.81	11.29	17.92	22.76	18.65
+NAG	11.18	14.95	24.32	5.68	5.92	14.79	19.22	13.72
Fine-tuning (with CFG)	10.93	14.13	23.84	5.37	6.32	15.29	19.94	13.69
+NAG	5.78	8.81	21.87	3.52	3.91	12.55	15.69	10.31
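
The averages in Table 2 can be recomputed from the per-dataset values. The sketch below copies the numbers from the table, with row labels chosen here for readability. A recomputed mean may differ from the reported one in the last digit, presumably because the per-dataset values are themselves rounded.

```python
# Per-dataset FID values copied from Table 2, in column order:
# Food, SUN, Caltech, CUB, Stanford Cars, DF-20M, ArtBench.
fid = {
    "fine-tuning": [16.04, 21.41, 31.34, 9.81, 11.29, 17.92, 22.76],
    "fine-tuning + NAG": [11.18, 14.95, 24.32, 5.68, 5.92, 14.79, 19.22],
    "fine-tuning + CFG": [10.93, 14.13, 23.84, 5.37, 6.32, 15.29, 19.94],
    "fine-tuning + CFG + NAG": [5.78, 8.81, 21.87, 3.52, 3.91, 12.55, 15.69],
}
reported_avg = {
    "fine-tuning": 18.65,
    "fine-tuning + NAG": 13.72,
    "fine-tuning + CFG": 13.69,
    "fine-tuning + CFG + NAG": 10.31,
}

mean_fid = {name: sum(values) / len(values) for name, values in fid.items()}

# Does NAG lower FID on every dataset, both without and with CFG?
nag_helps_without_cfg = all(
    a < b for a, b in zip(fid["fine-tuning + NAG"], fid["fine-tuning"])
)
nag_helps_with_cfg = all(
    a < b for a, b in zip(fid["fine-tuning + CFG + NAG"], fid["fine-tuning + CFG"])
)

for name, value in mean_fid.items():
    print(f"{name}: recomputed {value:.2f}, reported {reported_avg[name]:.2f}")
```

On these numbers NAG lowers FID on all seven datasets in both settings, and NAG without CFG lands at roughly the same average as CFG alone.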