Distributionally robust approximation property of neural networks

Ceylan, Prömel
The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.

Basic Information

  • Paper ID: 2510.09177
  • Title: Distributionally robust approximation property of neural networks
  • Authors: Mihriban Ceylan, David J. Prömel
  • Classification: stat.ML cs.LG math.FA math.PR
  • Publication Date: October 13, 2025
  • Paper Link: https://arxiv.org/abs/2510.09177

Abstract

The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is to establish the distributionally robust approximation property of neural networks. Specifically, traditional Universal Approximation Theorems (UATs) only consider approximation in $L^p(\mu)$ spaces under a single fixed distribution $\mu$, whereas this paper aims to prove that neural networks can uniformly approximate functions over weakly compact families of measures $\mathcal{M}$. That is, for a given function $f$ and any $\varepsilon > 0$, there exists a neural network $\eta$ such that: $\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu)} < \varepsilon$

Research Significance

  1. Theoretical Importance: Extends classical universal approximation theorems from single-distribution settings to uniform approximation over families of distributions
  2. Practical Necessity: Distributional uncertainty is a ubiquitous challenge in machine learning practice
  3. Application Value: Provides theoretical foundations for distributionally robust learning, adversarial training, and noisy data handling

Limitations of Existing Methods

Classical universal approximation theorems suffer from the following limitations:

  1. Single-Distribution Restriction: Only establish approximation properties for a fixed single measure $\mu$ in $L^p(\mu)$ spaces
  2. Space Limitations: Primarily confined to the $L^p$ space framework, lacking more general function space theory
  3. Robustness Deficiency: Cannot handle distribution shift or distributional uncertainty scenarios

Research Motivation

The motivation for this research stems from:

  1. The ubiquity of distributional uncertainty in real-world applications (such as Knightian uncertainty and adversarial examples)
  2. The need for theoretical support for the development of distributionally robust optimization and statistical learning
  3. The theoretical requirement to extend neural network theory from $L^p$ spaces to more general Orlicz spaces

Core Contributions

  1. Universal Approximation Theorems in Orlicz Spaces: Proves, for the first time, that several classes of neural networks are dense in Orlicz spaces with respect to the Luxemburg norm, an important generalization of the classical $L^p$ results
  2. Distributionally Robust Approximation Property: Establishes distributionally robust universal approximation theorems for neural networks with respect to weakly compact families of measures, providing theoretical foundations for handling distributional uncertainty
  3. Broad Coverage of Network Architectures: Encompasses multiple important neural network architectures:
    • Feedforward networks with bounded non-constant or continuous non-polynomial activation functions
    • Deep narrow networks with ReLU activation
    • Functional input neural networks
  4. Theoretical Framework Innovation: Through Orlicz space theory, provides a unified mathematical framework for handling different loss functions (such as cross-entropy and KL divergence)

Methodology Details

Task Definition

Given a weakly compact family of measures $\mathcal{M}$ and an appropriate function $f: \mathbb{R}^{N_0} \to \mathbb{R}^{N_L}$, for any $\varepsilon > 0$, find a neural network $\eta$ such that: $\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu)} < \varepsilon$
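
As a rough numerical illustration of this objective, the following Python sketch estimates the worst-case $L^1$ error by Monte Carlo. It is not taken from the paper: the finite family of Gaussian distributions standing in for $\mathcal{M}$, the function name `worst_case_l1_error`, and the toy target and approximant are all assumptions made for illustration.

```python
import math
import random


def worst_case_l1_error(f, eta, samplers, n_samples=2000, seed=0):
    """Monte Carlo estimate of max over the family of E_nu |f(X) - eta(X)|.

    `samplers` is a list of callables; each takes a random.Random instance
    and returns one draw from one distribution nu in the (finite) family.
    """
    rng = random.Random(seed)
    worst = 0.0
    for sample in samplers:
        total = 0.0
        for _ in range(n_samples):
            x = sample(rng)
            total += abs(f(x) - eta(x))
        worst = max(worst, total / n_samples)
    return worst


# Hypothetical family: unit-variance Gaussians with shifted means.
family = [lambda rng, m=m: rng.gauss(m, 1.0) for m in (-1.0, 0.0, 1.0)]

target = math.sin


def shifted(x):
    """Toy 'approximant' that is off by a constant 0.25 everywhere."""
    return math.sin(x) + 0.25


exact_error = worst_case_l1_error(target, target, family)    # exactly 0.0
offset_error = worst_case_l1_error(target, shifted, family)  # close to 0.25
```

Because the toy approximant differs from the target by a constant, every distribution in the family reports the same error. A real distribution shift would typically make the maximum depend on which member of the family is sampled.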

Theoretical Framework

Orlicz Space Framework

The paper constructs its mathematical framework based on Orlicz space theory. For a Young function $\varphi$, the Orlicz space is defined as: $L^{\varphi}(\mu; \mathbb{R}^{N_L}) := \{f: \mathbb{R}^{N_0} \to \mathbb{R}^{N_L} : \int_{\mathbb{R}^{N_0}} \varphi(\alpha\|f\|) \, d\mu < \infty \text{ for some } \alpha > 0\}$

equipped with the gauge norm: $N_{\varphi,\mu}(f) := \inf\{k > 0: \int_{\mathbb{R}^{N_0}} \varphi(\|f\|/k) \, d\mu \leq 1\}$
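
To make the gauge norm concrete, here is a minimal Python sketch that evaluates it for a discrete measure (finitely many atoms with weights) by bisection on $k$. The function name `luxemburg_norm`, the discrete-measure setting, and the tolerance are assumptions for illustration and do not come from the paper. The sketch relies on the modular $k \mapsto \sum_i w_i \varphi(|v_i|/k)$ being non-increasing in $k$, which holds for any Young function.

```python
def luxemburg_norm(values, weights, phi, rel_tol=1e-12):
    """Gauge norm inf{k > 0 : sum_i w_i * phi(|v_i| / k) <= 1}.

    `values` are the values of f on the atoms of a discrete measure,
    `weights` the masses of those atoms, `phi` a finite Young function.
    """
    # f vanishes almost everywhere: the norm is zero.
    if all(v == 0 or w == 0 for v, w in zip(values, weights)):
        return 0.0

    def modular(k):
        return sum(w * phi(abs(v) / k) for v, w in zip(values, weights))

    # Bracket the norm: modular(lo) > 1 >= modular(hi).
    hi = 1.0
    while modular(hi) > 1.0:
        hi *= 2.0
    lo = hi / 2.0
    while modular(lo) <= 1.0:
        hi, lo = lo, lo / 2.0

    # Bisection; the modular is non-increasing in k.
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if modular(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi


# With phi(x) = x**2 and unit weights the gauge norm is the Euclidean norm.
norm_sq = luxemburg_norm([3.0, 4.0], [1.0, 1.0], lambda x: x ** 2)  # about 5
```

Choosing $\varphi(x) = x^p$ recovers the usual $L^p$ norm, which gives a quick way to check the routine before trying a faster-growing Young function.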

Neural Network Definition

  1. Feedforward Neural Networks: $\eta = w_L \circ \varrho \circ w_{L-1} \circ \cdots \circ \varrho \circ w_1$
  2. Functional Input Neural Networks: $\eta(x) = \sum_{n=1}^N y_n \varrho(h_n(x))$, where $h_n \in \mathcal{H}$ form an additive family
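
The feedforward composition above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's notation made executable: it assumes each $w_l$ is an affine map $x \mapsto W_l x + b_l$ and that $\varrho$ acts componentwise, and the helper names `affine`, `relu`, and `feedforward` are hypothetical.

```python
def affine(W, b):
    """Return the affine map x -> W x + b, with W given as a list of rows."""
    def apply(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
                for row, b_i in zip(W, b)]
    return apply


def relu(x):
    """Componentwise ReLU activation."""
    return [max(0.0, t) for t in x]


def feedforward(layers, activation):
    """Compose w_L ∘ ϱ ∘ w_{L-1} ∘ ... ∘ ϱ ∘ w_1.

    The activation is applied after every affine map except the last one.
    """
    def eta(x):
        for w in layers[:-1]:
            x = activation(w(x))
        return layers[-1](x)
    return eta


# Two-layer example: |x| = relu(x) + relu(-x).
abs_net = feedforward(
    [affine([[1.0], [-1.0]], [0.0, 0.0]),   # w_1: R -> R^2
     affine([[1.0, 1.0]], [0.0])],          # w_2: R^2 -> R
    relu,
)
result = abs_net([-3.0])  # [3.0]
```

The example network has $L = 2$ affine maps and hidden width 2, so it represents the absolute value exactly rather than approximately.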

Core Theorems

Theorem 2.3 (Universal Approximation Theorem in Orlicz Spaces)

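For an N-function $\varphi$ and a locally finite Borel measure $\mu$, neural networks are dense in the Orlicz heart $M^{\varphi}(\mu)$ with respect to the gauge norm. The theorem covers the following cases, each under the measure condition given in parentheses: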

  1. Bounded non-constant activation functions (finite measures)
  2. ReLU activation functions (locally finite measures)
  3. Continuous non-polynomial activation functions (compactly supported measures)
  4. Functional input neural networks (satisfying specific conditions)
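
This summary uses the Orlicz heart $M^{\varphi}(\mu)$ without defining it. The block below records the standard definition from Orlicz space theory together with a sanity check for $\varphi(x) = x^p$. The paper's own notation and conventions may differ in detail, so it should be read as background rather than as a quotation.

```latex
% Orlicz heart: integrability must hold for ALL scalings \alpha,
% not just for SOME \alpha as in the definition of L^{\varphi}.
M^{\varphi}(\mu; \mathbb{R}^{N_L})
  := \Bigl\{ f : \mathbb{R}^{N_0} \to \mathbb{R}^{N_L} :
       \int_{\mathbb{R}^{N_0}} \varphi(\alpha \|f\|) \, d\mu < \infty
       \text{ for all } \alpha > 0 \Bigr\}
  \subseteq L^{\varphi}(\mu; \mathbb{R}^{N_L}).

% Sanity check with \varphi(x) = x^p, 1 < p < \infty:
\int_{\mathbb{R}^{N_0}} \varphi(\|f\|/k) \, d\mu
  = k^{-p} \int_{\mathbb{R}^{N_0}} \|f\|^p \, d\mu \le 1
  \iff k \ge \|f\|_{L^p(\mu)},
\qquad \text{hence} \qquad
N_{\varphi,\mu}(f) = \|f\|_{L^p(\mu)}, \quad
M^{\varphi}(\mu) = L^{\varphi}(\mu) = L^p(\mu).
```

For Young functions that grow faster than any power, such as exponential-type functions, the heart can be a strict subspace of $L^{\varphi}(\mu)$, which is why the density statement is formulated on $M^{\varphi}(\mu)$.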

Theorem 3.1 (Distributionally Robust Universal Approximation Theorem)

Let $\mathcal{M}$ be a weakly compact family of measures with associated Young pair $(\varphi_{\mathcal{M}}, \psi_{\mathcal{M}})$. Then for any $f \in M^{\varphi_{\mathcal{M}}}(\mu; \mathbb{R}^{N_L})$ and $\varepsilon > 0$, there exists a neural network $\eta$ of the corresponding class such that: $\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu; \mathbb{R}^{N_L})} < \varepsilon$

Technical Innovations

  1. Young Pair Construction: Utilizes uniform integrability of weakly compact measure families and constructs associated Young pairs via the De la Vallée Poussin theorem
  2. Generalized Hölder Inequality: Employs generalized Hölder inequalities to establish connections between Orlicz spaces and $L^1$ spaces
  3. Density Arguments: Proves density of neural networks through generalized versions of the Hahn-Banach theorem and Riesz representation theorem
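
The way these ingredients combine can be sketched as a short chain of inequalities. This is a sketch under stated assumptions rather than the paper's exact argument. It assumes that every $\nu \in \mathcal{M}$ has a density $d\nu/d\mu$ with respect to a common reference measure $\mu$, and that these densities are bounded in the gauge norm of $\psi_{\mathcal{M}}$ by a constant $C$ (hypothetical notation). It also uses the standard form of the generalized Hölder inequality for gauge norms, which carries a factor 2.

```latex
% Assumption (not stated in this summary):
%   C := \sup_{\nu \in \mathcal{M}} N_{\psi_{\mathcal{M}},\mu}(d\nu/d\mu) < \infty.
\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu)}
  = \sup_{\nu \in \mathcal{M}} \int \|f - \eta\| \, \frac{d\nu}{d\mu} \, d\mu
  % generalized Hoelder inequality for the Young pair (\varphi_M, \psi_M):
  \le \sup_{\nu \in \mathcal{M}}
      2 \, N_{\varphi_{\mathcal{M}},\mu}(f - \eta) \,
      N_{\psi_{\mathcal{M}},\mu}\!\Bigl(\frac{d\nu}{d\mu}\Bigr)
  \le 2 C \, N_{\varphi_{\mathcal{M}},\mu}(f - \eta).

% Density in the Orlicz heart (Theorem 2.3) supplies a network \eta with
%   N_{\varphi_{\mathcal{M}},\mu}(f - \eta) < \varepsilon / (2C),
% so the left-hand side is smaller than \varepsilon.
```

Under these assumptions the uniform statement over $\mathcal{M}$ reduces to a single approximation problem in one Orlicz space, which is where the density results enter.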

Experimental Setup

This is a purely theoretical paper with no numerical experiments. All results are established through rigorous mathematical proofs.

Proof Strategy

  1. Proof by Contradiction: Assumes neural networks are not dense and derives a contradiction using the Hahn-Banach theorem
  2. Constructive Proofs: For ReLU networks, constructs approximating networks explicitly
  3. Approximation Theory Techniques: Combines classical approximation theory results with measure theory
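
The contradiction argument in step 1 follows a classical pattern that can be outlined as follows. The outline is a hedged sketch in the spirit of Cybenko's argument, written for scalar outputs and for a network class $\mathcal{NN}$ that forms a linear subspace (for example, fixed depth with arbitrary width). The duality between the Orlicz heart and $L^{\psi}(\mu)$ and the final discriminatory step are assumptions here, and the paper's actual proof may be organized differently.

```latex
% Step 1 (assumption for contradiction):
%   the closure of \mathcal{NN} in M^{\varphi}(\mu) is a proper closed subspace.
% Step 2 (Hahn-Banach): there is a continuous linear functional \ell \neq 0 with
\ell(\eta) = 0 \quad \text{for all } \eta \in \mathcal{NN}.
% Step 3 (Riesz-type representation, assuming (M^{\varphi}(\mu))^* \cong L^{\psi}(\mu)):
\ell(f) = \int f \, g \, d\mu
  \quad \text{for some } g \in L^{\psi}(\mu), \; g \neq 0.
% Step 4 (discriminatory property of the activation, assumed):
\int \eta \, g \, d\mu = 0 \ \text{ for all } \eta \in \mathcal{NN}
  \;\Longrightarrow\; g = 0 \ \ \mu\text{-a.e.},
% which contradicts g \neq 0, so \mathcal{NN} is dense in M^{\varphi}(\mu).
```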

Experimental Results

Main Theoretical Results

Proposition 2.4 (Bounded Activation Functions)

For bounded non-constant activation functions $\varrho$ and $L \geq 2$, $\mathcal{NN}^{\varrho}_{N_0,N_L,L,\infty}$ is dense in $M^{\varphi}(\mu)$ for any finite Borel measure $\mu$.

Proposition 2.6 (ReLU Activation Functions)

For ReLU activation functions, $\mathcal{NN}^{\varrho}_{N_0,N_L,\infty,N_0+N_L+1}$ is dense in $M^{\varphi}(\mu)$ for any locally finite Borel measure $\mu$.

Proposition 2.8 (Non-Polynomial Activation Functions)

For continuous non-polynomial activation functions, $\mathcal{NN}^{\varrho}_{N_0,N_L,L,\infty}$ is dense in $M^{\varphi}(\mu)$ for compactly supported finite Borel measures $\mu$.

Proposition 2.10 (Functional Input Neural Networks)

Under appropriate conditions, functional input neural networks $\mathcal{NN}^{\mathcal{H},\varrho}_{\mathbb{R}^{N_0},\mathbb{R}^{N_2}}$ are dense in $M^{\varphi}(\mu)$ for finite Borel measures $\mu$.

Theoretical Findings

  1. Space Extension: Successfully generalizes classical $L^p$ results to Orlicz spaces, providing a framework for handling non-standard growth conditions
  2. Measure Generalization: Extends from Lebesgue measures to general locally finite Borel measures
  3. Architecture Unification: Handles multiple neural network architectures within a unified theoretical framework

Related Work

Classical Universal Approximation Theory

  • Cybenko (1989): Establishes universal approximation property for feedforward networks with sigmoid activation functions
  • Hornik (1991): Extends to more general activation functions and Sobolev spaces
  • Leshno et al. (1993): Results for non-polynomial activation functions

Modern Developments

  • Kidger & Lyons (2020): Universal approximation property of deep narrow ReLU networks
  • Cuchiero et al. (2025): Global universal approximation for functional input neural networks
  • Costarelli & Vinti (2019): Kantorovich operators in Orlicz spaces

Distributionally Robust Optimization

  • Ben-Tal et al. (2013): Robust optimization under uncertain probabilities
  • Gao & Kleywegt (2016): Distributionally robust stochastic optimization under Wasserstein distance

Conclusions and Discussion

Main Conclusions

  1. Establishes universal approximation properties of neural networks in Orlicz spaces, significantly extending classical theory
  2. Proves the distributionally robust approximation capability of neural networks, providing theoretical foundations for handling distributional uncertainty
  3. Covers widely used neural network architectures, which keeps the results relevant to practice

Limitations

  1. Measure Conditions: Different network architectures require different measure conditions (finiteness, compact support, etc.)
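  2. Constructivity: The results are existence statements obtained largely through non-constructive density arguments, so they do not yield a practical procedure for building an approximating network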
  3. Computational Complexity: Does not analyze quantitative relationships between required network size and approximation accuracy

Future Directions

  1. Quantitative Analysis: Establish quantitative relationships between approximation error and network complexity
  2. Algorithm Implementation: Develop practical algorithms based on theoretical results
  3. Application Extension: Apply theory to concrete machine learning tasks

In-Depth Evaluation

Strengths

  1. Theoretical Depth: Mathematically rigorous, extending universal approximation theory from $L^p$ spaces to Orlicz spaces and to uniform approximation over families of measures
  2. Unified Framework: The Orlicz space framework provides a unified perspective for addressing multiple problems
  3. Practical Significance: Provides solid theoretical foundations for distributionally robust learning
  4. Technical Innovation: Cleverly combines techniques from functional analysis, measure theory, and approximation theory

Weaknesses

  1. Practical Gap: Purely theoretical results that remain far from practical applications
  2. Condition Restrictions: Different results require different technical conditions, limiting universality
  3. Construction Deficiency: Lacks concrete network construction and training algorithms

Impact

  1. Theoretical Contribution: Establishes new mathematical foundations for neural network theory
  2. Interdisciplinary Value: Connects machine learning, functional analysis, and measure theory
  3. Long-term Significance: Provides theoretical guidance for future research in distributionally robust learning

Applicable Scenarios

  1. Theoretical Research: Provides new tools for neural network theory researchers
  2. Robust Learning: Guides theoretical development of distributionally robust optimization and adversarial training
  3. Non-Standard Losses: Theoretical analysis of non-$L^p$ type loss functions such as cross-entropy and KL divergence

References

The paper includes abundant references covering important works in approximation theory, functional analysis, neural network theory, and distributionally robust optimization, providing readers with comprehensive background knowledge.


Overall Assessment: This is a theoretically rigorous paper that generalizes neural network universal approximation theory from classical $L^p$ spaces to Orlicz spaces and establishes distributionally robust approximation properties. While it remains far from practical applications, it provides important mathematical foundations for neural network theory and distributionally robust learning.