2025-11-15T09:52:11.139771

Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Gaur, Trivedi, Kunapuli et al.
Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(ε^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.
academic

Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Basic Information

  • Paper ID: 2505.18344
  • Title: Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access
  • Authors: Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, and Vaneet Aggarwal
  • Institutions: Purdue University, University of Central Florida, UC Berkeley
  • Classification: cs.LG, cs.AI, stat.ML
  • Publication Date: arXiv:2505.18344v6 cs.LG 12 Nov 2025
  • Paper Link: https://arxiv.org/abs/2505.18344

Abstract

Diffusion models have demonstrated state-of-the-art performance in vision, language, and scientific domains. Despite empirical success, prior theoretical analyses of sample complexity suffer from two major limitations: exponential growth with input data dimensionality and dependence on unrealistic assumptions (such as access to exact empirical risk minimizers). This paper provides principled analysis of score estimation, establishing a sample complexity bound of O~(ϵ4)\tilde{O}(\epsilon^{-4}). The approach systematically decomposes score estimation error into statistical error, approximation error, and optimization error, eliminating the exponential dependence on neural network parameters from prior analyses. This is the first result achieving sample complexity bounds without assuming access to empirical risk minimizers for score function estimation loss.

Research Background and Motivation

Problem Definition

Diffusion models generate samples from complex distributions by learning to reverse the noise addition process, with the core task being estimation of the score function logpt(x)\nabla \log p_t(x). Despite excellent practical performance, theoretical understanding remains limited, particularly:

  1. Sample Complexity Problem: How many samples are needed to train a high-quality diffusion model?
  2. Curse of Dimensionality: Existing theoretical results exhibit exponential dependence on data dimension dd (e.g., O~(ϵd)\tilde{O}(\epsilon^{-d}))
  3. Unrealistic Assumptions: All prior work assumes access to empirical risk minimizers (ERM) for score estimation loss, which is infeasible in practice

Research Significance

Understanding sample complexity is important for:

  • Theoretical Guarantees: Ensuring model efficiency, generalization ability, and scalability
  • Practical Guidance: Generating high-quality samples with minimal data in resource-constrained scenarios
  • Bridging Theory and Practice: Bringing diffusion model theory to the level of reinforcement learning, bilevel optimization, and other fields

Limitations of Existing Methods

As shown in Table 1, existing work suffers from the following issues:

LiteratureSample ComplexityERM Assumption
Zhang et al. (2024)O~(ϵd)\tilde{O}(\epsilon^{-d})Yes
Wibisono et al. (2024)O~(ϵd)\tilde{O}(\epsilon^{-d})Yes
Gupta et al. (2024)O~(ϵ5)\tilde{O}(\epsilon^{-5})*Yes
This WorkO~(ϵ4)\tilde{O}(\epsilon^{-4})No

*Note: Gupta et al. (2024) claims O~(ϵ3)\tilde{O}(\epsilon^{-3}), but does not properly accumulate discretization step errors

Research Motivation

This paper aims to answer the core question:

How many samples are needed for sufficiently expressive neural networks to estimate the score function without access to empirical risk minimizers, enabling high-quality sample generation using the DDPM algorithm?

Core Contributions

  1. First Finite-Time Sample Complexity Bound Without ERM Assumption: Establishes a sample complexity bound of O~(ϵ4)\tilde{O}(\epsilon^{-4}) without requiring access to empirical risk minimizers, with no exponential dependence on data dimensionality or neural network parameters
  2. Principled Error Decomposition Framework: Proposes systematic decomposition of score estimation error into three components:
    • Approximation Error: Limitations of neural network function class expressiveness
    • Statistical Error: Error due to finite samples
    • Optimization Error: Error due to finite SGD steps
  3. Novel Technical Analysis:
    • Handling unbounded loss functions' statistical error via conditional normality
    • Bounding optimization error through Polyak-Łojasiewicz (PL) condition and recursive analysis
    • Convergence guarantees supporting both constant and decreasing learning rates
  4. Bridge Between Theory and Practice: Directly connects the quality of learned score functions to total variation distance between generated and target distributions

Method Details

Task Definition

Forward Diffusion Process: Employing Ornstein-Uhlenbeck (OU) process: dxt=xtdt+2dBt,x0p0,xRddx_t = -x_t dt + \sqrt{2}dB_t, \quad x_0 \sim p_0, \quad x \in \mathbb{R}^d

Closed-form solution: xtetx0+1e2tϵ,ϵN(0,I)x_t \sim e^{-t}x_0 + \sqrt{1-e^{-2t}}\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

As tt \to \infty, the process converges to stationary distribution N(0,I)\mathcal{N}(0, I).

Reverse Diffusion Process: Obtained through time-reversal theory: dxTt=(xTt+2logpTt(xTt))dt+2dBtdx_{T-t} = (x_{T-t} + 2\nabla \log p_{T-t}(x_{T-t}))dt + \sqrt{2}dB_t

Discretization: Discretizing at time points 0<t0<t1<<tK=T0 < t_0 < t_1 < \cdots < t_K = T, implementing DDPM algorithm using estimated score function s^tk\hat{s}_{t_k}.

Objective: Quantifying total variation (TV) distance between learned generative model p^t0\hat{p}_{t_0} and true data distribution pp: TV(pt0,p^t0)O(ϵ)\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq O(\epsilon)

Core Assumptions

Assumption 1 (Bounded Second Moment Data Distribution): Data distribution p0p_0 is absolutely continuous with support on compact set ΓRd\Gamma \subset \mathbb{R}^d, and E[x02]C1\mathbb{E}[\|x_0\|^2] \leq C_1.

Assumption 2 (Polyak-Łojasiewicz Condition): Loss function Lk(θ)L_k(\theta) satisfies PL condition: 12Lk(θ)2μt(Lk(θ)Lk(θ))\frac{1}{2}\|\nabla L_k(\theta)\|^2 \geq \mu_t(L_k(\theta) - L_k(\theta^*))

This is significantly weaker than strong convexity and commonly holds in over-parameterized neural networks.

Assumption 3 (Approximation Error): There exists neural network parameter θΘ\theta \in \Theta such that: Expt[sθ(x,t)logpt(x)2]ϵapprox\mathbb{E}_{x \sim p_t}[\|s_\theta(x,t) - \nabla \log p_t(x)\|^2] \leq \epsilon_{\text{approx}}

Assumption 4 (Smoothness and Bounded Gradient Variance):

  • Loss function is κ\kappa-smooth: Lk(θ)Lk(θ)κθθ\|\nabla L_k(\theta) - \nabla L_k(\theta')\| \leq \kappa\|\theta - \theta'\|
  • Gradient estimate variance is bounded: EL^k(θ)Lk(θ)2σ2\mathbb{E}\|\nabla \hat{L}_k(\theta) - \nabla L_k(\theta)\|^2 \leq \sigma^2

Error Decomposition Framework

For time step kk, score estimation error decomposes as: Exptk[s^tk(x,tk)logptk(x)2]4Ekapprox+4Ekstat+4Ekopt\mathbb{E}_{x \sim p_{t_k}}[\|\hat{s}_{t_k}(x,t_k) - \nabla \log p_{t_k}(x)\|^2] \leq 4E_k^{\text{approx}} + 4E_k^{\text{stat}} + 4E_k^{\text{opt}}

where:

  • θka=argminθΘExptk[sθ(x,tk)logpt(x,tk)2]\theta_k^a = \arg\min_{\theta \in \Theta} \mathbb{E}_{x \sim p_{t_k}}[\|s_\theta(x,t_k) - \nabla \log p_t(x,t_k)\|^2] (theoretically optimal)
  • θkb=argminθΘ1ni=1nsθ(xi,tk)logpt(xi,tk)2\theta_k^b = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \|s_\theta(x_i,t_k) - \nabla \log p_t(x_i,t_k)\|^2 (empirically optimal)
  • θ^k\hat{\theta}_k = parameters after SGD iterations (actually obtained)

Error Bounds

Lemma 1 (Approximation Error): Directly from Assumption 3: EkapproxϵapproxE_k^{\text{approx}} \leq \epsilon_{\text{approx}}

Lemma 2 (Statistical Error): Using conditional normality and bounded second moment, with probability at least 1δ1-\delta: EkstatO(WDdlog(2/δ)nk)E_k^{\text{stat}} \leq O\left(W^D \cdot d \cdot \sqrt{\frac{\log(2/\delta)}{n_k}}\right)

Key Techniques:

  1. Defining truncated score function to handle unboundedness
  2. Using Rademacher complexity to bound generalization error
  3. Controlling probability mass outside truncation region

Lemma 3 (Optimization Error): Using decreasing learning rate ηi=αi+γ\eta_i = \frac{\alpha}{i+\gamma} (where αμ>1\alpha \mu > 1, γ>ακ\gamma > \alpha \kappa), with probability at least 1δ1-\delta: EkoptO(WDdlog(2/δ)nk)E_k^{\text{opt}} \leq O\left(W^D \cdot d \cdot \sqrt{\frac{\log(2/\delta)}{n_k}}\right)

Key Techniques:

  1. Exploiting quadratic growth property of PL condition
  2. Recursive analysis of each SGD step's error
  3. Handling gradient clipping under heavy-tailed noise

Main Theoretical Results

Theorem 1 (Total Variation Distance Bound): Under Assumptions 1-4, if sample count satisfies: nk=Ω(W2Dd2log(4Kδ)ϵ4σk4)n_k = \Omega\left(W^{2D} \cdot d^2 \cdot \log\left(\frac{4K}{\delta}\right) \cdot \epsilon^{-4} \sigma_k^{-4}\right)

then with probability at least 1δ1-\delta: TV(pt0,p^t0)O(eT)+O(1K)+O(ϵ)+ϵapprox\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq O(e^{-T}) + O\left(\frac{1}{\sqrt{K}}\right) + O(\epsilon) + \epsilon_{\text{approx}}

Setting T=Ω(log(1/ϵ))T = \Omega(\log(1/\epsilon)) and K=Ω(ϵ2)K = \Omega(\epsilon^{-2}), we obtain: TV(pt0,p^t0)O(ϵ)+ϵapprox\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq O(\epsilon) + \epsilon_{\text{approx}}

Total Sample Complexity: ntotal=k=0Knk=O~(ϵ4)n_{\text{total}} = \sum_{k=0}^K n_k = \tilde{O}(\epsilon^{-4})

Proof Strategy

  1. TV Distance Decomposition: TV(pt0,p^t0)TV(pt0,pt0dis)+TV(pt0dis,p~t0)+TV(p~t0,p^t0)\text{TV}(p_{t_0}, \hat{p}_{t_0}) \leq \text{TV}(p_{t_0}, p_{t_0}^{\text{dis}}) + \text{TV}(p_{t_0}^{\text{dis}}, \tilde{p}_{t_0}) + \text{TV}(\tilde{p}_{t_0}, \hat{p}_{t_0})
  2. Score Error Accumulation: Using Girsanov's theorem: TV(pt0dis,p~t0)12k=0KE[s^tklogptk2](tk+1tk)\text{TV}(p_{t_0}^{\text{dis}}, \tilde{p}_{t_0}) \leq \frac{1}{2}\sqrt{\sum_{k=0}^K \mathbb{E}[\|\hat{s}_{t_k} - \nabla \log p_{t_k}\|^2](t_{k+1}-t_k)}
  3. Error Summation: Through three-term error bounds, setting appropriate sample counts such that: k=0KA(k)(tk+1tk)ϵ2T\sum_{k=0}^K A(k)(t_{k+1}-t_k) \leq \epsilon^2 T
  4. Parameter Selection: Balancing discretization error, initialization error, and score estimation error

Experimental Setup

Note: This is a purely theoretical paper without experimental components. The main contributions lie in theoretical analysis and establishment of sample complexity bounds.

Diffusion Model Foundations

  • DDPM (Ho et al., 2020): Foundational work on denoising diffusion probabilistic models
  • Score-based SDE (Song et al., 2021): Score-based stochastic differential equation framework
  • Latent Diffusion (Rombach et al., 2022): Efficient generation in latent space

Iteration Complexity Research

Works assuming bounded score estimation:

  • Li et al. (2024b), Benton et al. (2024): Iteration complexity guarantees for DDPM
  • Li & Yan (2024), Huang et al. (2024): Improved convergence rates under low-dimensional data assumptions
  • Liang et al. (2025a,b): Accelerated denoising schedules

Sample Complexity Research

  • Exponential Dimension Dependence: Chen et al. (2023), Zhang et al. (2024), Wibisono et al. (2024), Oko et al. (2023) with bounds of O~(ϵO(d))\tilde{O}(\epsilon^{-O(d)})
  • Improved Bounds but Requiring ERM: Gupta et al. (2024) actually achieves O~(ϵ5)\tilde{O}(\epsilon^{-5}) (requiring joint bounds)

Advantages of This Work

  1. No ERM Assumption: First to establish sample complexity bounds under realistic optimization settings
  2. Better Bounds: Improvement from O~(ϵ5)\tilde{O}(\epsilon^{-5}) to O~(ϵ4)\tilde{O}(\epsilon^{-4})
  3. Weaker Assumptions: Only requires bounded second moment, no bounded support or sub-Gaussian requirement
  4. Complete Analysis: Explicitly addresses statistical, approximation, and optimization error categories

Conclusions and Discussion

Main Conclusions

  1. Sample Complexity: Achieving ϵ\epsilon-accuracy without ERM access requires O~(ϵ4)\tilde{O}(\epsilon^{-4}) samples
  2. Error Sources: Systematically identifies and bounds contributions of three error categories
  3. Theoretical Progress: First sample complexity bound for diffusion models under realistic optimization assumptions

Limitations

  1. Approximation Error Constant: Treats ϵapprox\epsilon_{\text{approx}} as constant without analyzing its relationship to network size (practically may require exponentially large networks for small approximation error)
  2. PL Condition: While weaker than strong convexity, may not hold in general non-convex settings (though common in over-parameterized networks)
  3. Early Stopping Time: Bounds TV(pt0,p^t0)\text{TV}(p_{t_0}, \hat{p}_{t_0}) rather than TV(p0,p^t0)\text{TV}(p_0, \hat{p}_{t_0}); the latter requires additional sub-Gaussian assumptions (Theorem 2)
  4. Unconditional Generation: Analysis only covers unconditional distributions; extension to conditional settings is a future direction
  5. Experimental Validation: As a purely theoretical work, lacks experimental verification of theoretical predictions

Future Directions

  1. Conditional Generation: Extend guarantees to conditional diffusion models (e.g., classifier-free guidance)
  2. Weaker Assumptions: Explore bounds under more general data distributions and optimization conditions
  3. Tightness Analysis: Investigate whether ϵ4\epsilon^{-4} bound is tight and possible lower bounds
  4. Practical Algorithms: Design practical training algorithms leveraging theoretical insights
  5. Other Architectures: Analyze sample complexity for modern architectures like Transformers

In-Depth Evaluation

Strengths

  1. Important Theoretical Breakthrough:
    • First to eliminate ERM assumption, a critical practical limitation
    • Improves best known bounds (from ϵ5\epsilon^{-5} to ϵ4\epsilon^{-4})
    • No exponential dimension dependence, applicable to high-dimensional settings
  2. Technical Innovation:
    • Statistical Error Analysis: Clever use of conditional normality and truncation techniques for unbounded losses
    • Optimization Error Analysis: First explicit analysis of finite SGD iterations using PL condition and recursive techniques
    • Error Decomposition Framework: Clear three-term decomposition making individual factor contributions transparent
  3. Theoretical Rigor:
    • Complete detailed proofs (appendix exceeds 30 pages)
    • Clear and relatively mild assumptions (compared to prior work)
    • Transparent constant dependencies (though potentially large)
  4. Writing Quality:
    • Clear structure with sufficient motivation
    • Clear explanation of technical contributions
    • Comprehensive comparison with related work (especially Appendix A analysis of Gupta et al.)

Weaknesses

  1. Theory-Practice Gap:
    • Sample complexity bound, though polynomial, may hide large constants (W2Dd2W^{2D} \cdot d^2)
    • Practical neural networks far smaller than theoretical requirements
    • Lacks experimental verification of theoretical predictions' validity
  2. Assumption Practicality:
    • PL Condition: While common in over-parameterized settings, difficult to verify
    • Approximation Error Constant: Treating as constant sidesteps network capacity-approximation quality tradeoff
    • Smoothness and Bounded Variance: May not strictly hold in actual training
  3. Technical Limitations:
    • Analysis depends on OU process; other noise schedules (VP/VE SDE) not covered
    • Early stopping time t0t_0 selection impact insufficiently discussed
    • Bounds on pt0p_{t_0} rather than p0p_0 require additional assumptions (Theorem 2)
  4. Comparison Fairness:
    • Comparison with Gupta et al. (2024) depends on reinterpretation of their results (Appendix A)
    • No comparison with other ERM-free methods (e.g., Block et al. 2020)
  5. Missing Content:
    • No lower bound analysis; optimality of ϵ4\epsilon^{-4} unknown
    • No algorithm implementation details or pseudocode (only high-level description)
    • No numerical experiments verifying theoretical predictions

Impact

  1. Theoretical Contribution:
    • Provides important benchmark for diffusion model theory
    • Error decomposition framework may inspire analysis of other generative models
    • Bridges theory-practice gap
  2. Practical Value:
    • Guides practitioners' understanding of sample requirements
    • Provides theoretical basis for algorithm design (e.g., learning rate scheduling)
    • Identifies key bottlenecks (optimization error)
  3. Reproducibility:
    • As theoretical work, proofs are detailed and verifiable
    • Assumptions clear, applicable when conditions met
    • Lacks code or experiments; practical application requires further work

Applicable Scenarios

  1. Theoretical Research: Provides theoretical foundation for diffusion models, score matching, generative models
  2. Algorithm Design: Guides training strategies (sample size, learning rate, early stopping)
  3. Resource Planning: Estimates computational and data resources needed for target quality
  4. Assumption Verification: Applicable in specific settings satisfying PL conditions

Not Suitable For:

  • Applications requiring tight constants
  • General non-convex optimization not satisfying PL condition
  • Conditional generation tasks (currently uncovered)

Deep Technical Insights

Innovative Statistical Error Handling

Traditional statistical learning theory (e.g., Shalev-Shwartz & Ben-David, 2014) requires bounded loss functions to apply Rademacher complexity. However, score function logpt(x)=xetx0σt2\nabla \log p_t(x) = \frac{x - e^{-t}x_0}{\sigma_t^2} is unbounded when xx is unbounded.

Solution:

  1. Define truncated score function:
(\nabla \log p_t(x))_j & \text{if } \left|\frac{x-e^{-t}x_0}{\sigma_t^2}\right|_j \leq \kappa \\ 0 & \text{otherwise} \end{cases}$$ 2. Control probability mass outside truncation region: Setting $\kappa = \log(dn/\delta)$: $$P\left(\left|\frac{x-e^{-t}x_0}{\sigma_t^2}\right|_j \geq \kappa\right) \leq \frac{\delta}{dn}$$ 3. Bound truncation error: Using conditional normality and Mill's ratio: $$\mathbb{E}[X^2 | |X-\mu| > a] = \mu^2 + \sigma^2 + \sigma a \cdot \frac{\phi(a/\sigma)}{1-\Phi(a/\sigma)}$$ ### Recursive Analysis of Optimization Error Under PL condition, SGD progress can be recursively bounded. For decreasing learning rate $\eta_i = \frac{\alpha}{i+\gamma}$: **Recursive Relation**: $$\mathbb{E}[\Delta_{i+1}] \leq \left(1 - \frac{\alpha\mu}{i+\gamma}\right)\mathbb{E}[\Delta_i] + \frac{\alpha^2 L \sigma^2}{2(i+\gamma)^2}$$ where $\Delta_i = L(\theta_i) - L^*$. **Solution Form**: Through integrating factor technique: $$\mathbb{E}[\Delta_i] \leq \frac{\gamma^{\alpha\mu} \Delta_0}{(i+\gamma)^{\alpha\mu}} + \frac{\alpha^2 L \sigma^2}{2(\alpha\mu - 1)} \cdot \frac{1}{i+\gamma}$$ When $\alpha\mu > 1$, dominant term is $O(1/i)$. ### Heavy-Tailed Noise Under Gradient Clipping The paper also handles gradients with finite $q$-th moment ($q \in (1,2]$) rather than bounded variance: **Clipping Strategy**: $\tilde{g}_t = \text{clip}(g_t, \tau_t)$, where $\tau_t = \Theta(\sigma_q (t+\gamma)^{1/(2q)})$ **Bias Bound**: $$\|\mathbb{E}[\tilde{g}_t | \mathcal{F}_t] - \nabla f(x_t)\| \leq C_q \frac{\sigma_q^q}{\tau_t^{q-1}}$$ **Convergence Rate**: Still maintains $O(1/t)$ since both bias and variance terms decay to $o(1/t)$. ## Detailed Comparison with Related Work ### vs. Gupta et al. (2024) | Aspect | Gupta et al. | This Work | |--------|-------------|-----------| | Sample Complexity | $\tilde{O}(\epsilon^{-5})$* | $\tilde{O}(\epsilon^{-4})$ | | ERM Assumption | Required | **Not Required** | | Error Analysis | Two-term (approx + stat) | Three-term (+ opt) | | Data Assumption | Bounded second moment | Bounded second moment | | Technical Tools | Quantile bounds | Global L2 bounds | *Original claims $\epsilon^{-3}$, but Appendix A of this work shows joint bounds needed ### vs. Block et al. (2020) Block et al. studied Langevin sampling convergence, also assuming ERM access (implicit in their definition). This work explicitly handles optimization error through PL condition. ### vs. Iteration Complexity Literature Li et al. (2024b), Benton et al. (2024) and others study iteration complexity assuming bounded score estimation error. This work's contribution is establishing sample complexity needed for that error bound. ## Open Problems 1. **Tightness**: Is $\epsilon^{-4}$ optimal? What are possible lower bounds? 2. **Constant Optimization**: Can we improve $W^{2D} \cdot d^2$ dependence? 3. **PL Condition Verification**: When does it hold for specific network architectures? 4. **Conditional Generation**: How to extend to classifier-free guidance settings? 5. **Empirical Verification**: How large is the gap between theoretical predictions and actual training? ## Selected References 1. **Ho et al. (2020)**: Denoising Diffusion Probabilistic Models - Foundational DDPM work 2. **Song et al. (2021)**: Score-Based Generative Modeling through SDEs - Continuous-time framework 3. **Gupta et al. (2024)**: Improved Sample Complexity Bounds for Diffusion Model Training - Closest prior work 4. **Liu et al. (2022)**: Loss Landscapes and Optimization in Over-parameterized Networks - Theoretical foundation for PL condition 5. **Shalev-Shwartz & Ben-David (2014)**: Understanding Machine Learning - Statistical learning theory foundation --- ## Summary This is an important theoretical paper making significant progress on sample complexity analysis for diffusion models. The core contribution is eliminating unrealistic ERM assumptions while improving known optimal bounds. Technically, through clever handling of unbounded losses and explicit optimization error analysis, it establishes a complete theoretical framework. **Recommended For**: Machine learning theory researchers, researchers interested in diffusion model theoretical foundations, optimization theory researchers. **Main Value**: Provides solid theoretical foundation for diffusion models, identifies theory-practice gaps, and points directions for future research. While theoretical bounds may not be tight, this represents important progress toward understanding diffusion model sample efficiency.