Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access
Gaur, Trivedi, Kunapuli et al.
Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(ε^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.
academic
Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access
Diffusion models have demonstrated state-of-the-art performance in vision, language, and scientific domains. Despite empirical success, prior theoretical analyses of sample complexity suffer from two major limitations: exponential growth with input data dimensionality and dependence on unrealistic assumptions (such as access to exact empirical risk minimizers). This paper provides principled analysis of score estimation, establishing a sample complexity bound of O~(ϵ−4). The approach systematically decomposes score estimation error into statistical error, approximation error, and optimization error, eliminating the exponential dependence on neural network parameters from prior analyses. This is the first result achieving sample complexity bounds without assuming access to empirical risk minimizers for score function estimation loss.
Diffusion models generate samples from complex distributions by learning to reverse the noise addition process, with the core task being estimation of the score function ∇logpt(x). Despite excellent practical performance, theoretical understanding remains limited, particularly:
Sample Complexity Problem: How many samples are needed to train a high-quality diffusion model?
Curse of Dimensionality: Existing theoretical results exhibit exponential dependence on data dimension d (e.g., O~(ϵ−d))
Unrealistic Assumptions: All prior work assumes access to empirical risk minimizers (ERM) for score estimation loss, which is infeasible in practice
How many samples are needed for sufficiently expressive neural networks to estimate the score function without access to empirical risk minimizers, enabling high-quality sample generation using the DDPM algorithm?
First Finite-Time Sample Complexity Bound Without ERM Assumption: Establishes a sample complexity bound of O~(ϵ−4) without requiring access to empirical risk minimizers, with no exponential dependence on data dimensionality or neural network parameters
Principled Error Decomposition Framework: Proposes systematic decomposition of score estimation error into three components:
Approximation Error: Limitations of neural network function class expressiveness
Statistical Error: Error due to finite samples
Optimization Error: Error due to finite SGD steps
Novel Technical Analysis:
Handling unbounded loss functions' statistical error via conditional normality
Bounding optimization error through Polyak-Łojasiewicz (PL) condition and recursive analysis
Convergence guarantees supporting both constant and decreasing learning rates
Bridge Between Theory and Practice: Directly connects the quality of learned score functions to total variation distance between generated and target distributions
Assumption 1 (Bounded Second Moment Data Distribution): Data distribution p0 is absolutely continuous with support on compact set Γ⊂Rd, and E[∥x0∥2]≤C1.
Assumption 2 (Polyak-Łojasiewicz Condition): Loss function Lk(θ) satisfies PL condition:
21∥∇Lk(θ)∥2≥μt(Lk(θ)−Lk(θ∗))
This is significantly weaker than strong convexity and commonly holds in over-parameterized neural networks.
Assumption 3 (Approximation Error): There exists neural network parameter θ∈Θ such that:
Ex∼pt[∥sθ(x,t)−∇logpt(x)∥2]≤ϵapprox
Assumption 4 (Smoothness and Bounded Gradient Variance):
Loss function is κ-smooth: ∥∇Lk(θ)−∇Lk(θ′)∥≤κ∥θ−θ′∥
Gradient estimate variance is bounded: E∥∇L^k(θ)−∇Lk(θ)∥2≤σ2
Note: This is a purely theoretical paper without experimental components. The main contributions lie in theoretical analysis and establishment of sample complexity bounds.
Approximation Error Constant: Treats ϵapprox as constant without analyzing its relationship to network size (practically may require exponentially large networks for small approximation error)
PL Condition: While weaker than strong convexity, may not hold in general non-convex settings (though common in over-parameterized networks)
Early Stopping Time: Bounds TV(pt0,p^t0) rather than TV(p0,p^t0); the latter requires additional sub-Gaussian assumptions (Theorem 2)
Unconditional Generation: Analysis only covers unconditional distributions; extension to conditional settings is a future direction
Experimental Validation: As a purely theoretical work, lacks experimental verification of theoretical predictions
Traditional statistical learning theory (e.g., Shalev-Shwartz & Ben-David, 2014) requires bounded loss functions to apply Rademacher complexity. However, score function ∇logpt(x)=σt2x−e−tx0 is unbounded when x is unbounded.
Solution:
Define truncated score function:
(\nabla \log p_t(x))_j & \text{if } \left|\frac{x-e^{-t}x_0}{\sigma_t^2}\right|_j \leq \kappa \\
0 & \text{otherwise}
\end{cases}$$
2. Control probability mass outside truncation region: Setting $\kappa = \log(dn/\delta)$:
$$P\left(\left|\frac{x-e^{-t}x_0}{\sigma_t^2}\right|_j \geq \kappa\right) \leq \frac{\delta}{dn}$$
3. Bound truncation error: Using conditional normality and Mill's ratio:
$$\mathbb{E}[X^2 | |X-\mu| > a] = \mu^2 + \sigma^2 + \sigma a \cdot \frac{\phi(a/\sigma)}{1-\Phi(a/\sigma)}$$
### Recursive Analysis of Optimization Error
Under PL condition, SGD progress can be recursively bounded. For decreasing learning rate $\eta_i = \frac{\alpha}{i+\gamma}$:
**Recursive Relation**:
$$\mathbb{E}[\Delta_{i+1}] \leq \left(1 - \frac{\alpha\mu}{i+\gamma}\right)\mathbb{E}[\Delta_i] + \frac{\alpha^2 L \sigma^2}{2(i+\gamma)^2}$$
where $\Delta_i = L(\theta_i) - L^*$.
**Solution Form**: Through integrating factor technique:
$$\mathbb{E}[\Delta_i] \leq \frac{\gamma^{\alpha\mu} \Delta_0}{(i+\gamma)^{\alpha\mu}} + \frac{\alpha^2 L \sigma^2}{2(\alpha\mu - 1)} \cdot \frac{1}{i+\gamma}$$
When $\alpha\mu > 1$, dominant term is $O(1/i)$.
### Heavy-Tailed Noise Under Gradient Clipping
The paper also handles gradients with finite $q$-th moment ($q \in (1,2]$) rather than bounded variance:
**Clipping Strategy**: $\tilde{g}_t = \text{clip}(g_t, \tau_t)$, where $\tau_t = \Theta(\sigma_q (t+\gamma)^{1/(2q)})$
**Bias Bound**:
$$\|\mathbb{E}[\tilde{g}_t | \mathcal{F}_t] - \nabla f(x_t)\| \leq C_q \frac{\sigma_q^q}{\tau_t^{q-1}}$$
**Convergence Rate**: Still maintains $O(1/t)$ since both bias and variance terms decay to $o(1/t)$.
## Detailed Comparison with Related Work
### vs. Gupta et al. (2024)
| Aspect | Gupta et al. | This Work |
|--------|-------------|-----------|
| Sample Complexity | $\tilde{O}(\epsilon^{-5})$* | $\tilde{O}(\epsilon^{-4})$ |
| ERM Assumption | Required | **Not Required** |
| Error Analysis | Two-term (approx + stat) | Three-term (+ opt) |
| Data Assumption | Bounded second moment | Bounded second moment |
| Technical Tools | Quantile bounds | Global L2 bounds |
*Original claims $\epsilon^{-3}$, but Appendix A of this work shows joint bounds needed
### vs. Block et al. (2020)
Block et al. studied Langevin sampling convergence, also assuming ERM access (implicit in their definition). This work explicitly handles optimization error through PL condition.
### vs. Iteration Complexity Literature
Li et al. (2024b), Benton et al. (2024) and others study iteration complexity assuming bounded score estimation error. This work's contribution is establishing sample complexity needed for that error bound.
## Open Problems
1. **Tightness**: Is $\epsilon^{-4}$ optimal? What are possible lower bounds?
2. **Constant Optimization**: Can we improve $W^{2D} \cdot d^2$ dependence?
3. **PL Condition Verification**: When does it hold for specific network architectures?
4. **Conditional Generation**: How to extend to classifier-free guidance settings?
5. **Empirical Verification**: How large is the gap between theoretical predictions and actual training?
## Selected References
1. **Ho et al. (2020)**: Denoising Diffusion Probabilistic Models - Foundational DDPM work
2. **Song et al. (2021)**: Score-Based Generative Modeling through SDEs - Continuous-time framework
3. **Gupta et al. (2024)**: Improved Sample Complexity Bounds for Diffusion Model Training - Closest prior work
4. **Liu et al. (2022)**: Loss Landscapes and Optimization in Over-parameterized Networks - Theoretical foundation for PL condition
5. **Shalev-Shwartz & Ben-David (2014)**: Understanding Machine Learning - Statistical learning theory foundation
---
## Summary
This is an important theoretical paper making significant progress on sample complexity analysis for diffusion models. The core contribution is eliminating unrealistic ERM assumptions while improving known optimal bounds. Technically, through clever handling of unbounded losses and explicit optimization error analysis, it establishes a complete theoretical framework.
**Recommended For**: Machine learning theory researchers, researchers interested in diffusion model theoretical foundations, optimization theory researchers.
**Main Value**: Provides solid theoretical foundation for diffusion models, identifies theory-practice gaps, and points directions for future research. While theoretical bounds may not be tight, this represents important progress toward understanding diffusion model sample efficiency.