The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.
- Paper ID: 2510.09177
- Title: Distributionally robust approximation property of neural networks
- Authors: Mihriban Ceylan, David J. Prömel
- Classification: stat.ML cs.LG math.FA math.PR
- Publication Date: October 13, 2025
- Paper Link: https://arxiv.org/abs/2510.09177
The paper sets out to establish a distributionally robust approximation property for neural networks. Classical universal approximation theorems (UATs) consider approximation in $L^p(\mu)$ under a single fixed distribution $\mu$, whereas this paper proves that neural networks can approximate functions uniformly over a weakly compact family of measures $\mathcal{M}$. That is, for a given function $f$ and any $\varepsilon > 0$, there exists a neural network $\eta$ such that:
$$\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu)} < \varepsilon$$
- Theoretical Importance: Extends classical universal approximation theorems from single-distribution settings to uniform approximation over families of distributions
- Practical Necessity: Distributional uncertainty is a ubiquitous challenge in machine learning practice
- Application Value: Provides theoretical foundations for distributionally robust learning, adversarial training, and noisy data handling
Classical universal approximation theorems suffer from the following limitations:
- Single-Distribution Restriction: They establish approximation properties only in $L^p(\mu)$ for a single fixed measure $\mu$
- Space Limitations: They are largely confined to the $L^p$ framework and do not cover more general function spaces
- Robustness Deficiency: Cannot handle distribution shift or distributional uncertainty scenarios
The motivation for this research stems from:
- The ubiquity of distributional uncertainty in real-world applications (such as Knightian uncertainty and adversarial examples)
- The need for theoretical support for the development of distributionally robust optimization and statistical learning
- The theoretical requirement to extend neural network theory from $L^p$ spaces to more general Orlicz spaces
- Universal Approximation Theorems in Orlicz Spaces: Proves, for the first time, the density of several classes of neural networks in Orlicz spaces with respect to the Luxemburg norm, an important generalization of the classical $L^p$ results
- Distributionally Robust Approximation Property: Establishes distributionally robust universal approximation theorems for neural networks with respect to weakly compact families of measures, providing theoretical foundations for handling distributional uncertainty
- Broad Coverage of Network Architectures: Encompasses multiple important neural network architectures:
  - Feedforward networks with bounded non-polynomial activation functions
  - Deep narrow networks with ReLU activation
  - Functional input neural networks
- Theoretical Framework Innovation: Through Orlicz space theory, provides a unified mathematical framework for handling different loss functions (such as cross-entropy and KL divergence)
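Given a weakly compact family of measures $\mathcal{M}$ and an appropriate function $f:\mathbb{R}^{N_0}\to\mathbb{R}^{N_L}$, for any $\varepsilon>0$, find a neural network $\eta$ such that:
$$\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu)} < \varepsilon$$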
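As a rough numerical illustration of this criterion, the sketch below estimates the worst-case $L^1$ error over a small finite family of measures by Monte Carlo. Everything in it is an assumption made for illustration and not taken from the paper: the family (three Gaussians with shifted means), the target $f(x)=|x|$, the deliberately crude one-neuron approximant $\eta(x)=\mathrm{ReLU}(x)$, and the helper names `l1_error` and `worst_case_l1_error`.

```python
import numpy as np

def l1_error(f, eta, samples):
    """Monte Carlo estimate of ||f - eta||_{L^1(nu)} from samples drawn from nu."""
    return float(np.mean(np.abs(f(samples) - eta(samples))))

def worst_case_l1_error(f, eta, sample_sets):
    """Largest estimated L^1 error over a finite family of measures."""
    return max(l1_error(f, eta, s) for s in sample_sets)

rng = np.random.default_rng(0)

# Hypothetical family M: Gaussians with means -1, 0, 1 (illustrative only).
family = [rng.normal(loc=m, scale=1.0, size=10_000) for m in (-1.0, 0.0, 1.0)]

f = np.abs                              # target function |x|
eta = lambda x: np.maximum(x, 0.0)      # crude approximant: a single ReLU unit

errors = [l1_error(f, eta, s) for s in family]
worst = worst_case_l1_error(f, eta, family)
print(errors, worst)
```

The estimated error differs markedly between the three measures, since $\eta$ is only wrong on the negative half-line. This is why a bound that holds uniformly over $\mathcal{M}$ is a stronger statement than a bound under one fixed distribution.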
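The paper builds its mathematical framework on Orlicz space theory. For a Young function $\varphi$, the Orlicz space is defined as:
$$L^{\varphi}(\mu;\mathbb{R}^{N_L}) := \left\{ f:\mathbb{R}^{N_0}\to\mathbb{R}^{N_L} \;:\; \int_{\mathbb{R}^{N_0}} \varphi(\alpha\|f\|)\,d\mu < \infty \text{ for some } \alpha>0 \right\}$$
equipped with the gauge norm (also known as the Luxemburg norm):
$$N_{\varphi,\mu}(f) := \inf\left\{ k>0 \;:\; \int_{\mathbb{R}^{N_0}} \varphi\!\left(\frac{\|f\|}{k}\right) d\mu \le 1 \right\}$$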
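To make the gauge norm concrete, the following sketch evaluates it for a discrete measure (a weighted sum of point masses) by bisection on $k$. It relies on the fact that, for a Young function, the modular $k \mapsto \int \varphi(\|f\|/k)\,d\mu$ is non-increasing in $k$. The function name `luxemburg_norm` and the discrete-measure setting are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def luxemburg_norm(values, weights, phi, tol=1e-10):
    """Gauge norm inf{k > 0 : sum_i w_i * phi(|v_i| / k) <= 1} for a discrete measure.

    `phi` must be a Young function (convex, non-decreasing, phi(0) = 0),
    so that the modular below is non-increasing in k.
    """
    values = np.abs(np.asarray(values, dtype=float))
    weights = np.asarray(weights, dtype=float)
    if not np.any(values * (weights > 0)):
        return 0.0  # f vanishes almost everywhere

    def modular(k):
        return float(np.sum(weights * phi(values / k)))

    lo, hi = 0.0, 1.0
    while modular(hi) > 1.0:        # enlarge the bracket until hi is feasible
        hi *= 2.0
    while hi - lo > tol * hi:       # bisect down to the smallest feasible k
        mid = 0.5 * (lo + hi)
        if modular(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

# phi(t) = t^2 recovers the L^2 norm of the vector (3, 4) under counting measure.
l2 = luxemburg_norm([3.0, 4.0], [1.0, 1.0], lambda t: t**2)

# phi(t) = exp(t) - 1 gives an exponential-type Orlicz norm.
exp_norm = luxemburg_norm([1.0], [1.0], np.expm1)
print(l2, exp_norm)
```

With $\varphi(t)=t^p$ the gauge norm coincides with the usual $L^p$ norm, which is the sense in which Orlicz spaces generalize the $L^p$ setting.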
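- Feedforward Neural Networks: $\eta = w_L \circ \varrho \circ w_{L-1} \circ \cdots \circ \varrho \circ w_1$
- Functional Input Neural Networks: $\eta(x) = \sum_{n=1}^{N} y_n\, \varrho(h_n(x))$, where the $h_n \in \mathcal{H}$ form an additive family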
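The feedforward composition in the first item can be sketched in NumPy as below. The function name `feedforward` and the representation of each affine map $w_l$ as a pair `(A, b)` are illustrative choices, not notation from the paper. The example network uses ReLU to represent $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$ exactly.

```python
import numpy as np

def feedforward(x, layers, activation):
    """Evaluate w_L ∘ ϱ ∘ w_{L-1} ∘ ... ∘ ϱ ∘ w_1 at x.

    `layers` is a list [(A_1, b_1), ..., (A_L, b_L)] of affine maps
    w_l(h) = A_l @ h + b_l; the activation is applied after every layer
    except the last one.
    """
    h = np.asarray(x, dtype=float)
    for A, b in layers[:-1]:
        h = activation(A @ h + b)
    A, b = layers[-1]
    return A @ h + b

relu = lambda z: np.maximum(z, 0.0)

# Two-layer network (L = 2) computing |x| = ReLU(x) + ReLU(-x).
abs_net = [
    (np.array([[1.0], [-1.0]]), np.zeros(2)),   # w_1: R -> R^2
    (np.array([[1.0, 1.0]]), np.zeros(1)),      # w_2: R^2 -> R
]

print(feedforward([-3.0], abs_net, relu))
```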
For an N-function $\varphi$ and a locally finite Borel measure $\mu$, neural networks are dense in the Orlicz heart $M^{\varphi}(\mu)$ with respect to the gauge norm, covering:
- Bounded non-constant activation functions (finite measures)
- ReLU activation functions (locally finite measures)
- Continuous non-polynomial activation functions (compactly supported measures)
- Functional input neural networks (satisfying specific conditions)
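For a weakly compact family of measures $\mathcal{M}$ and its associated Young pair $(\varphi_{\mathcal{M}}, \psi_{\mathcal{M}})$, for any $f \in M^{\varphi_{\mathcal{M}}}(\mu;\mathbb{R}^{N_L})$ and $\varepsilon>0$, there exists a neural network $\eta$ of the corresponding class such that:
$$\sup_{\nu \in \mathcal{M}} \|f - \eta\|_{L^1(\nu;\mathbb{R}^{N_L})} < \varepsilon$$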
- Young Pair Construction: Utilizes uniform integrability of weakly compact measure families and constructs associated Young pairs via the De la Vallée Poussin theorem
- Generalized Hölder Inequality: Employs generalized Hölder inequalities to establish connections between Orlicz spaces and $L^1$ spaces
- Density Arguments: Proves density of neural networks through generalized versions of the Hahn-Banach theorem and Riesz representation theorem
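A sketch of how these ingredients plausibly combine, under assumptions that are not spelled out in this summary: every $\nu \in \mathcal{M}$ has a density $\frac{d\nu}{d\mu}$ with respect to a reference measure $\mu$, and the Young function $\psi_{\mathcal{M}}$ is chosen so that $C := \sup_{\nu\in\mathcal{M}} N_{\psi_{\mathcal{M}},\mu}\!\left(\frac{d\nu}{d\mu}\right)$ is finite. The constant 2 is the one in the standard Hölder inequality for gauge norms of a complementary Young pair; the paper's exact hypotheses and constants may differ.

$$
\begin{aligned}
\sup_{\nu\in\mathcal{M}} \|f-\eta\|_{L^1(\nu)}
&= \sup_{\nu\in\mathcal{M}} \int \|f-\eta\|\,\frac{d\nu}{d\mu}\,d\mu \\
&\le 2\, N_{\varphi_{\mathcal{M}},\mu}(f-\eta)\, \sup_{\nu\in\mathcal{M}} N_{\psi_{\mathcal{M}},\mu}\!\left(\frac{d\nu}{d\mu}\right)
= 2C\, N_{\varphi_{\mathcal{M}},\mu}(f-\eta).
\end{aligned}
$$

Density in the Orlicz heart then supplies a network $\eta$ with $N_{\varphi_{\mathcal{M}},\mu}(f-\eta) < \varepsilon/(2C)$, which gives the uniform $L^1$ bound.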
This is a purely theoretical paper with no numerical experiments. All results are established through rigorous mathematical proofs.
- Proof by Contradiction: Assumes neural networks are not dense and derives a contradiction using the Hahn-Banach theorem
- Constructive Proofs: For ReLU networks, constructs approximating networks explicitly
- Approximation Theory Techniques: Combines classical approximation theory results with measure theory
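For a bounded non-constant activation function $\varrho$ and $L \ge 2$, $\mathcal{NN}^{\varrho}_{N_0,N_L,L,\infty}$ is dense in $M^{\varphi}(\mu)$ for any finite Borel measure $\mu$.
For the ReLU activation function, $\mathcal{NN}^{\varrho}_{N_0,N_L,\infty,N_0+N_L+1}$ is dense in $M^{\varphi}(\mu)$ for any locally finite Borel measure $\mu$.
For a continuous non-polynomial activation function, $\mathcal{NN}^{\varrho}_{N_0,N_L,L,\infty}$ is dense in $M^{\varphi}(\mu)$ for any compactly supported finite Borel measure $\mu$.
Under appropriate conditions, the functional input neural networks $\mathcal{NN}^{\mathcal{H},\varrho}_{\mathbb{R}^{N_0},\mathbb{R}^{N_2}}$ are dense in $M^{\varphi}(\mu)$ for finite Borel measures $\mu$.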
- Space Extension: Successfully generalizes classical $L^p$ results to Orlicz spaces, providing a framework for handling non-standard growth conditions
- Measure Generalization: Extends from Lebesgue measures to general locally finite Borel measures
- Architecture Unification: Handles multiple neural network architectures within a unified theoretical framework
- Cybenko (1989): Establishes universal approximation property for feedforward networks with sigmoid activation functions
- Hornik (1991): Extends to more general activation functions and Sobolev spaces
- Leshno et al. (1993): Results for non-polynomial activation functions
- Kidger & Lyons (2020): Universal approximation property of deep narrow ReLU networks
- Cuchiero et al. (2025): Global universal approximation for functional input neural networks
- Costarelli & Vinti (2019): Kantorovich operators in Orlicz spaces
- Ben-Tal et al. (2013): Robust optimization under uncertain probabilities
- Gao & Kleywegt (2016): Distributionally robust stochastic optimization under Wasserstein distance
- Establishes universal approximation properties of neural networks in Orlicz spaces, significantly extending classical theory
- Proves the distributionally robust approximation capability of neural networks, providing theoretical foundations for handling distributional uncertainty
- Covers widely used neural network architectures with good practical value
- Measure Conditions: Different network architectures require different measure conditions (finiteness, compact support, etc.)
- Constructivity: While existence is proven, explicit network construction methods are lacking
- Computational Complexity: Does not analyze quantitative relationships between required network size and approximation accuracy
- Quantitative Analysis: Establish quantitative relationships between approximation error and network complexity
- Algorithm Implementation: Develop practical algorithms based on theoretical results
- Application Extension: Apply theory to concrete machine learning tasks
- Theoretical Depth: Mathematically rigorous, extending universal approximation theory beyond the $L^p$ setting and to families of measures
- Unified Framework: The Orlicz space framework provides a unified perspective for addressing multiple problems
- Practical Significance: Provides solid theoretical foundations for distributionally robust learning
- Technical Innovation: Cleverly combines techniques from functional analysis, measure theory, and approximation theory
- Practical Gap: Purely theoretical results that remain far from practical applications
- Condition Restrictions: Different results require different technical conditions, limiting universality
- Construction Deficiency: Lacks concrete network construction and training algorithms
- Theoretical Contribution: Establishes new mathematical foundations for neural network theory
- Interdisciplinary Value: Connects machine learning, functional analysis, and measure theory
- Long-term Significance: Provides theoretical guidance for future research in distributionally robust learning
- Theoretical Research: Provides new tools for neural network theory researchers
- Robust Learning: Guides theoretical development of distributionally robust optimization and adversarial training
- Non-Standard Losses: Theoretical analysis of loss functions that are not of $L^p$ type, such as cross-entropy and KL divergence
The paper includes abundant references covering important works in approximation theory, functional analysis, neural network theory, and distributionally robust optimization, providing readers with comprehensive background knowledge.
Overall Assessment: This is a rigorous theoretical paper that generalizes universal approximation theory for neural networks from classical $L^p$ spaces to Orlicz spaces and establishes distributionally robust approximation properties. Although it remains far from practical applications, it provides important mathematical foundations for neural network theory and distributionally robust learning.