2025-11-15T09:52:11.139771

Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Gaur, Trivedi, Kunapuli et al.

Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(Îµ^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

academic

Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Basic Information

Paper ID: 2505.18344
Title: Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access
Authors: Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, and Vaneet Aggarwal
Institutions: Purdue University, University of Central Florida, UC Berkeley
Classification: cs.LG, cs.AI, stat.ML
Publication Date: arXiv:2505.18344v6 cs.LG 12 Nov 2025
Paper Link: https://arxiv.org/abs/2505.18344

Abstract

Diffusion models have demonstrated state-of-the-art performance in vision, language, and scientific domains. Despite empirical success, prior theoretical analyses of sample complexity suffer from two major limitations: exponential growth with input data dimensionality and dependence on unrealistic assumptions (such as access to exact empirical risk minimizers). This paper provides principled analysis of score estimation, establishing a sample complexity bound of $\tilde{O}(\epsilon^{-4})$ . The approach systematically decomposes score estimation error into statistical error, approximation error, and optimization error, eliminating the exponential dependence on neural network parameters from prior analyses. This is the first result achieving sample complexity bounds without assuming access to empirical risk minimizers for score function estimation loss.

Research Background and Motivation

Problem Definition

Diffusion models generate samples from complex distributions by learning to reverse the noise addition process, with the core task being estimation of the score function $\nabla \log p_t(x)$ . Despite excellent practical performance, theoretical understanding remains limited, particularly:

Sample Complexity Problem: How many samples are needed to train a high-quality diffusion model?
Curse of Dimensionality: Existing theoretical results exhibit exponential dependence on data dimension $d$ (e.g., $\tilde{O}(\epsilon^{-d})$ )
Unrealistic Assumptions: All prior work assumes access to empirical risk minimizers (ERM) for score estimation loss, which is infeasible in practice

Research Significance

Understanding sample complexity is important for:

Theoretical Guarantees: Ensuring model efficiency, generalization ability, and scalability
Practical Guidance: Generating high-quality samples with minimal data in resource-constrained scenarios
Bridging Theory and Practice: Bringing diffusion model theory to the level of reinforcement learning, bilevel optimization, and other fields

Limitations of Existing Methods

As shown in Table 1, existing work suffers from the following issues:

Literature	Sample Complexity	ERM Assumption
Zhang et al. (2024)	$\tilde{O}(\epsilon^{-d})$	Yes
Wibisono et al. (2024)	$\tilde{O}(\epsilon^{-d})$	Yes
Gupta et al. (2024)	$\tilde{O}(\epsilon^{-5})$ *	Yes
This Work	$\tilde{O}(\epsilon^{-4})$	No

*Note: Gupta et al. (2024) claims $\tilde{O}(\epsilon^{-3})$ , but does not properly accumulate discretization step errors

Research Motivation

This paper aims to answer the core question:

How many samples are needed for sufficiently expressive neural networks to estimate the score function without access to empirical risk minimizers, enabling high-quality sample generation using the DDPM algorithm?

Core Contributions

First Finite-Time Sample Complexity Bound Without ERM Assumption: Establishes a sample complexity bound of $\tilde{O}(\epsilon^{-4})$ without requiring access to empirical risk minimizers, with no exponential dependence on data dimensionality or neural network parameters
Principled Error Decomposition Framework: Proposes systematic decomposition of score estimation error into three components:
- Approximation Error: Limitations of neural network function class expressiveness
- Statistical Error: Error due to finite samples
- Optimization Error: Error due to finite SGD steps
Novel Technical Analysis:
- Handling unbounded loss functions' statistical error via conditional normality
- Bounding optimization error through Polyak-Łojasiewicz (PL) condition and recursive analysis
- Convergence guarantees supporting both constant and decreasing learning rates
Bridge Between Theory and Practice: Directly connects the quality of learned score functions to total variation distance between generated and target distributions

Method Details

Task Definition

Forward Diffusion Process: Employing Ornstein-Uhlenbeck (OU) process: $dx_t = -x_t dt + \sqrt{2}dB_t, \quad x_0 \sim p_0, \quad x \in \mathbb{R}^d$

Closed-form solution: $x_t \sim e^{-t}x_0 + \sqrt{1-e^{-2t}}\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

As $t \to \infty$ , the process converges to stationary distribution $\mathcal{N}(0, I)$ .

Reverse Diffusion Process: Obtained through time-reversal theory: $dx_{T-t} = (x_{T-t} + 2\nabla \log p_{T-t}(x_{T-t}))dt + \sqrt{2}dB_t$

Discretization: Discretizing at time points $0 < t_0 < t_1 < \cdots < t_K = T$ , implementing DDPM algorithm using estimated score function $\hat{s}_{t_k}$ .

Objective: Quantifying total variation (TV) distance between learned generative model $\hat{p}_{t_0}$ and true data distribution $p$ : $\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq O(\epsilon)$

Core Assumptions

Assumption 1 (Bounded Second Moment Data Distribution): Data distribution $p_0$ is absolutely continuous with support on compact set $\Gamma \subset \mathbb{R}^d$ , and $\mathbb{E}[\|x_0\|^2] \leq C_1$ .

Assumption 2 (Polyak-Łojasiewicz Condition): Loss function $L_k(\theta)$ satisfies PL condition: $\frac{1}{2}\|\nabla L_k(\theta)\|^2 \geq \mu_t(L_k(\theta) - L_k(\theta^*))$

This is significantly weaker than strong convexity and commonly holds in over-parameterized neural networks.

Assumption 3 (Approximation Error): There exists neural network parameter $\theta \in \Theta$ such that: $\mathbb{E}_{x \sim p_t}[\|s_\theta(x,t) - \nabla \log p_t(x)\|^2] \leq \epsilon_{\text{approx}}$

Assumption 4 (Smoothness and Bounded Gradient Variance):

Loss function is $\kappa$ -smooth: $\|\nabla L_k(\theta) - \nabla L_k(\theta')\| \leq \kappa\|\theta - \theta'\|$
Gradient estimate variance is bounded: $\mathbb{E}\|\nabla \hat{L}_k(\theta) - \nabla L_k(\theta)\|^2 \leq \sigma^2$

Error Decomposition Framework

For time step $k$ , score estimation error decomposes as: $\mathbb{E}_{x \sim p_{t_k}}[\|\hat{s}_{t_k}(x,t_k) - \nabla \log p_{t_k}(x)\|^2] \leq 4E_k^{\text{approx}} + 4E_k^{\text{stat}} + 4E_k^{\text{opt}}$

where:

$\theta_k^a = \arg\min_{\theta \in \Theta} \mathbb{E}_{x \sim p_{t_k}}[\|s_\theta(x,t_k) - \nabla \log p_t(x,t_k)\|^2]$ (theoretically optimal)
$\theta_k^b = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \|s_\theta(x_i,t_k) - \nabla \log p_t(x_i,t_k)\|^2$ (empirically optimal)
$\hat{\theta}_k$ = parameters after SGD iterations (actually obtained)

Error Bounds

Lemma 1 (Approximation Error): Directly from Assumption 3: $E_k^{\text{approx}} \leq \epsilon_{\text{approx}}$

Lemma 2 (Statistical Error): Using conditional normality and bounded second moment, with probability at least $1-\delta$ : $E_k^{\text{stat}} \leq O\left(W^D \cdot d \cdot \sqrt{\frac{\log(2/\delta)}{n_k}}\right)$

Key Techniques:

Defining truncated score function to handle unboundedness
Using Rademacher complexity to bound generalization error
Controlling probability mass outside truncation region

Lemma 3 (Optimization Error): Using decreasing learning rate $\eta_i = \frac{\alpha}{i+\gamma}$ (where $\alpha \mu > 1$ , $\gamma > \alpha \kappa$ ), with probability at least $1-\delta$ : $E_k^{\text{opt}} \leq O\left(W^D \cdot d \cdot \sqrt{\frac{\log(2/\delta)}{n_k}}\right)$

Key Techniques:

Exploiting quadratic growth property of PL condition
Recursive analysis of each SGD step's error
Handling gradient clipping under heavy-tailed noise

Main Theoretical Results

Theorem 1 (Total Variation Distance Bound): Under Assumptions 1-4, if sample count satisfies: $n_k = \Omega\left(W^{2D} \cdot d^2 \cdot \log\left(\frac{4K}{\delta}\right) \cdot \epsilon^{-4} \sigma_k^{-4}\right)$

then with probability at least $1-\delta$ : $\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq O(e^{-T}) + O\left(\frac{1}{\sqrt{K}}\right) + O(\epsilon) + \epsilon_{\text{approx}}$

Setting $T = \Omega(\log(1/\epsilon))$ and $K = \Omega(\epsilon^{-2})$ , we obtain: $\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq O(\epsilon) + \epsilon_{\text{approx}}$

Total Sample Complexity: $n_{\text{total}} = \sum_{k=0}^K n_k = \tilde{O}(\epsilon^{-4})$

Proof Strategy

TV Distance Decomposition: $\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq \text{TV}(p_{t_0}, p_{t_0}^{\text{dis}}) + \text{TV}(p_{t_0}^{\text{dis}}, \tilde{p}_{t_0}) + \text{TV}(\tilde{p}_{t_0}, \hat{p}_{t_0})$
Score Error Accumulation: Using Girsanov's theorem: $\text{TV}(p_{t_0}^{\text{dis}}, \tilde{p}_{t_0}) \leq \frac{1}{2}\sqrt{\sum_{k=0}^K \mathbb{E}[\|\hat{s}_{t_k} - \nabla \log p_{t_k}\|^2](t_{k+1}-t_k)}$
Error Summation: Through three-term error bounds, setting appropriate sample counts such that: $\sum_{k=0}^K A(k)(t_{k+1}-t_k) \leq \epsilon^2 T$
Parameter Selection: Balancing discretization error, initialization error, and score estimation error

Experimental Setup

Note: This is a purely theoretical paper without experimental components. The main contributions lie in theoretical analysis and establishment of sample complexity bounds.

Diffusion Model Foundations

DDPM (Ho et al., 2020): Foundational work on denoising diffusion probabilistic models
Score-based SDE (Song et al., 2021): Score-based stochastic differential equation framework
Latent Diffusion (Rombach et al., 2022): Efficient generation in latent space

Iteration Complexity Research

Works assuming bounded score estimation:

Li et al. (2024b), Benton et al. (2024): Iteration complexity guarantees for DDPM
Li & Yan (2024), Huang et al. (2024): Improved convergence rates under low-dimensional data assumptions
Liang et al. (2025a,b): Accelerated denoising schedules

Sample Complexity Research

Exponential Dimension Dependence: Chen et al. (2023), Zhang et al. (2024), Wibisono et al. (2024), Oko et al. (2023) with bounds of $\tilde{O}(\epsilon^{-O(d)})$
Improved Bounds but Requiring ERM: Gupta et al. (2024) actually achieves $\tilde{O}(\epsilon^{-5})$ (requiring joint bounds)

Advantages of This Work

No ERM Assumption: First to establish sample complexity bounds under realistic optimization settings
Better Bounds: Improvement from $\tilde{O}(\epsilon^{-5})$ to $\tilde{O}(\epsilon^{-4})$
Weaker Assumptions: Only requires bounded second moment, no bounded support or sub-Gaussian requirement
Complete Analysis: Explicitly addresses statistical, approximation, and optimization error categories

Conclusions and Discussion

Main Conclusions

Sample Complexity: Achieving $\epsilon$ -accuracy without ERM access requires $\tilde{O}(\epsilon^{-4})$ samples
Error Sources: Systematically identifies and bounds contributions of three error categories
Theoretical Progress: First sample complexity bound for diffusion models under realistic optimization assumptions

Limitations

Approximation Error Constant: Treats $\epsilon_{\text{approx}}$ as constant without analyzing its relationship to network size (practically may require exponentially large networks for small approximation error)
PL Condition: While weaker than strong convexity, may not hold in general non-convex settings (though common in over-parameterized networks)
Early Stopping Time: Bounds $\text{TV}(p_{t_0}, \hat{p}_{t_0})$ rather than $\text{TV}(p_0, \hat{p}_{t_0})$ ; the latter requires additional sub-Gaussian assumptions (Theorem 2)
Unconditional Generation: Analysis only covers unconditional distributions; extension to conditional settings is a future direction
Experimental Validation: As a purely theoretical work, lacks experimental verification of theoretical predictions

Future Directions

Conditional Generation: Extend guarantees to conditional diffusion models (e.g., classifier-free guidance)
Weaker Assumptions: Explore bounds under more general data distributions and optimization conditions
Tightness Analysis: Investigate whether $\epsilon^{-4}$ bound is tight and possible lower bounds
Practical Algorithms: Design practical training algorithms leveraging theoretical insights
Other Architectures: Analyze sample complexity for modern architectures like Transformers

In-Depth Evaluation

Strengths

Important Theoretical Breakthrough:
- First to eliminate ERM assumption, a critical practical limitation
- Improves best known bounds (from $\epsilon^{-5}$ to $\epsilon^{-4}$ )
- No exponential dimension dependence, applicable to high-dimensional settings
Technical Innovation:
- Statistical Error Analysis: Clever use of conditional normality and truncation techniques for unbounded losses
- Optimization Error Analysis: First explicit analysis of finite SGD iterations using PL condition and recursive techniques
- Error Decomposition Framework: Clear three-term decomposition making individual factor contributions transparent
Theoretical Rigor:
- Complete detailed proofs (appendix exceeds 30 pages)
- Clear and relatively mild assumptions (compared to prior work)
- Transparent constant dependencies (though potentially large)
Writing Quality:
- Clear structure with sufficient motivation
- Clear explanation of technical contributions
- Comprehensive comparison with related work (especially Appendix A analysis of Gupta et al.)

Weaknesses

Theory-Practice Gap:
- Sample complexity bound, though polynomial, may hide large constants ( $W^{2D} \cdot d^2$ )
- Practical neural networks far smaller than theoretical requirements
- Lacks experimental verification of theoretical predictions' validity
Assumption Practicality:
- PL Condition: While common in over-parameterized settings, difficult to verify
- Approximation Error Constant: Treating as constant sidesteps network capacity-approximation quality tradeoff
- Smoothness and Bounded Variance: May not strictly hold in actual training
Technical Limitations:
- Analysis depends on OU process; other noise schedules (VP/VE SDE) not covered
- Early stopping time $t_0$ selection impact insufficiently discussed
- Bounds on $p_{t_0}$ rather than $p_0$ require additional assumptions (Theorem 2)
Comparison Fairness:
- Comparison with Gupta et al. (2024) depends on reinterpretation of their results (Appendix A)
- No comparison with other ERM-free methods (e.g., Block et al. 2020)
Missing Content:
- No lower bound analysis; optimality of $\epsilon^{-4}$ unknown
- No algorithm implementation details or pseudocode (only high-level description)
- No numerical experiments verifying theoretical predictions

Impact

Theoretical Contribution:
- Provides important benchmark for diffusion model theory
- Error decomposition framework may inspire analysis of other generative models
- Bridges theory-practice gap
Practical Value:
- Guides practitioners' understanding of sample requirements
- Provides theoretical basis for algorithm design (e.g., learning rate scheduling)
- Identifies key bottlenecks (optimization error)
Reproducibility:
- As theoretical work, proofs are detailed and verifiable
- Assumptions clear, applicable when conditions met
- Lacks code or experiments; practical application requires further work

Applicable Scenarios

Theoretical Research: Provides theoretical foundation for diffusion models, score matching, generative models
Algorithm Design: Guides training strategies (sample size, learning rate, early stopping)
Resource Planning: Estimates computational and data resources needed for target quality
Assumption Verification: Applicable in specific settings satisfying PL conditions

Not Suitable For:

Applications requiring tight constants
General non-convex optimization not satisfying PL condition
Conditional generation tasks (currently uncovered)

Deep Technical Insights

Innovative Statistical Error Handling

Traditional statistical learning theory (e.g., Shalev-Shwartz & Ben-David, 2014) requires bounded loss functions to apply Rademacher complexity. However, score function $\nabla \log p_t(x) = \frac{x - e^{-t}x_0}{\sigma_t^2}$ is unbounded when $x$ is unbounded.

Solution:

Define truncated score function:

(\nabla \log p_t(x))_j & \text{if } \left|\frac{x-e^{-t}x_0}{\sigma_t^2}\right|_j \leq \kappa \\ 0 & \text{otherwise} \end{cases}$$ 2. Control probability mass outside truncation region: Setting $\kappa = \log(dn/\delta)$: $$P\left(\left|\frac{x-e^{-t}x_0}{\sigma_t^2}\right|_j \geq \kappa\right) \leq \frac{\delta}{dn}$$ 3. Bound truncation error: Using conditional normality and Mill's ratio: $$\mathbb{E}[X^2 | |X-\mu| > a] = \mu^2 + \sigma^2 + \sigma a \cdot \frac{\phi(a/\sigma)}{1-\Phi(a/\sigma)}$$ ### Recursive Analysis of Optimization Error Under PL condition, SGD progress can be recursively bounded. For decreasing learning rate $\eta_i = \frac{\alpha}{i+\gamma}$: **Recursive Relation**: $$\mathbb{E}[\Delta_{i+1}] \leq \left(1 - \frac{\alpha\mu}{i+\gamma}\right)\mathbb{E}[\Delta_i] + \frac{\alpha^2 L \sigma^2}{2(i+\gamma)^2}$$ where $\Delta_i = L(\theta_i) - L^*$. **Solution Form**: Through integrating factor technique: $$\mathbb{E}[\Delta_i] \leq \frac{\gamma^{\alpha\mu} \Delta_0}{(i+\gamma)^{\alpha\mu}} + \frac{\alpha^2 L \sigma^2}{2(\alpha\mu - 1)} \cdot \frac{1}{i+\gamma}$$ When $\alpha\mu > 1$, dominant term is $O(1/i)$. ### Heavy-Tailed Noise Under Gradient Clipping The paper also handles gradients with finite $q$-th moment ($q \in (1,2]$) rather than bounded variance: **Clipping Strategy**: $\tilde{g}_t = \text{clip}(g_t, \tau_t)$, where $\tau_t = \Theta(\sigma_q (t+\gamma)^{1/(2q)})$ **Bias Bound**: $$\|\mathbb{E}[\tilde{g}_t | \mathcal{F}_t] - \nabla f(x_t)\| \leq C_q \frac{\sigma_q^q}{\tau_t^{q-1}}$$ **Convergence Rate**: Still maintains $O(1/t)$ since both bias and variance terms decay to $o(1/t)$. ## Detailed Comparison with Related Work ### vs. Gupta et al. (2024) | Aspect | Gupta et al. | This Work | |--------|-------------|-----------| | Sample Complexity | $\tilde{O}(\epsilon^{-5})$* | $\tilde{O}(\epsilon^{-4})$ | | ERM Assumption | Required | **Not Required** | | Error Analysis | Two-term (approx + stat) | Three-term (+ opt) | | Data Assumption | Bounded second moment | Bounded second moment | | Technical Tools | Quantile bounds | Global L2 bounds | *Original claims $\epsilon^{-3}$, but Appendix A of this work shows joint bounds needed ### vs. Block et al. (2020) Block et al. studied Langevin sampling convergence, also assuming ERM access (implicit in their definition). This work explicitly handles optimization error through PL condition. ### vs. Iteration Complexity Literature Li et al. (2024b), Benton et al. (2024) and others study iteration complexity assuming bounded score estimation error. This work's contribution is establishing sample complexity needed for that error bound. ## Open Problems 1. **Tightness**: Is $\epsilon^{-4}$ optimal? What are possible lower bounds? 2. **Constant Optimization**: Can we improve $W^{2D} \cdot d^2$ dependence? 3. **PL Condition Verification**: When does it hold for specific network architectures? 4. **Conditional Generation**: How to extend to classifier-free guidance settings? 5. **Empirical Verification**: How large is the gap between theoretical predictions and actual training? ## Selected References 1. **Ho et al. (2020)**: Denoising Diffusion Probabilistic Models - Foundational DDPM work 2. **Song et al. (2021)**: Score-Based Generative Modeling through SDEs - Continuous-time framework 3. **Gupta et al. (2024)**: Improved Sample Complexity Bounds for Diffusion Model Training - Closest prior work 4. **Liu et al. (2022)**: Loss Landscapes and Optimization in Over-parameterized Networks - Theoretical foundation for PL condition 5. **Shalev-Shwartz & Ben-David (2014)**: Understanding Machine Learning - Statistical learning theory foundation --- ## Summary This is an important theoretical paper making significant progress on sample complexity analysis for diffusion models. The core contribution is eliminating unrealistic ERM assumptions while improving known optimal bounds. Technically, through clever handling of unbounded losses and explicit optimization error analysis, it establishes a complete theoretical framework. **Recommended For**: Machine learning theory researchers, researchers interested in diffusion model theoretical foundations, optimization theory researchers. **Main Value**: Provides solid theoretical foundation for diffusion models, identifies theory-practice gaps, and points directions for future research. While theoretical bounds may not be tight, this represents important progress toward understanding diffusion model sample efficiency.