Distributionally robust approximation property of neural networks

Ceylan, Prömel

The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.

Basic Information

Paper ID: 2510.09177
Title: Distributionally robust approximation property of neural networks
Authors: Mihriban Ceylan, David J. Prömel
Classification: stat.ML cs.LG math.FA math.PR
Publication Date: October 13, 2025
Paper Link: https://arxiv.org/abs/2510.09177

Research Background and Motivation

Problem Definition

The core problem addressed by this research is to establish the distributionally robust approximation property of neural networks. Traditional Universal Approximation Theorems (UATs) consider approximation only in $L^p(μ)$ for a single fixed distribution μ, whereas this paper proves that neural networks approximate functions uniformly over a weakly compact family of measures $\mathcal{M}$. That is, for a given function $f$ and any $ε > 0$, there exists a neural network $η$ such that: $\sup_{ν \in \mathcal{M}} \|f - η\|_{L^1(ν)} < ε$

Research Significance

Theoretical Importance: Extends classical universal approximation theorems from single-distribution settings to uniform approximation over families of distributions
Practical Necessity: Distributional uncertainty is a ubiquitous challenge in machine learning practice
Application Value: Provides theoretical foundations for distributionally robust learning, adversarial training, and noisy data handling

Limitations of Existing Methods

Classical universal approximation theorems suffer from the following limitations:

Single-Distribution Restriction: They establish approximation only in $L^p(μ)$ for a single fixed measure μ
Space Limitations: Primarily confined to the $L^p$ space framework, lacking more general function space theory
Robustness Deficiency: Cannot handle distribution shift or distributional uncertainty scenarios

Research Motivation

The motivation for this research stems from:

The ubiquity of distributional uncertainty in real-world applications (such as Knightian uncertainty and adversarial examples)
The need for theoretical support for the development of distributionally robust optimization and statistical learning
The theoretical requirement to extend neural network theory from $L^p$ spaces to more general Orlicz spaces

Core Contributions

Universal Approximation Theorems in Orlicz Spaces: Proves that several classes of neural networks are dense in Orlicz spaces with respect to the Luxemburg norm, generalizing the classical $L^p$ results
Distributionally Robust Approximation Property: Establishes distributionally robust universal approximation theorems for neural networks with respect to weakly compact families of measures, providing theoretical foundations for handling distributional uncertainty
Broad Coverage of Network Architectures: Encompasses multiple important neural network architectures:
- Feedforward networks with bounded non-constant or continuous non-polynomial activation functions
- Deep narrow networks with ReLU activation
- Functional input neural networks
Theoretical Framework Innovation: Through Orlicz space theory, provides a unified mathematical framework for handling different loss functions (such as cross-entropy and KL divergence)

Methodology Details

Task Definition

Given a weakly compact family of measures $\mathcal{M}$ and an appropriate function $f: \mathbb{R}^{N_0} \to \mathbb{R}^{N_L}$, for any $ε > 0$, find a neural network $η$ such that: $\sup_{ν \in \mathcal{M}} \|f - η\|_{L^1(ν)} < ε$
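To make the quantity in the task definition concrete, here is a minimal Python sketch that evaluates the worst-case $L^1$ error over a finite family of discrete measures. The finite family, the target `f`, the candidate `eta`, and the helper names are hypothetical stand-ins chosen for illustration; the paper works with general weakly compact families and contains no code.

```python
def l1_error(f, eta, points, weights):
    """Integral of |f - eta| against the discrete measure sum_i weights[i] * delta_{points[i]}."""
    return sum(w * abs(f(x) - eta(x)) for x, w in zip(points, weights))


def worst_case_l1_error(f, eta, measures):
    """Supremum of the L^1 error over a finite family of discrete measures.

    Each measure is a pair (points, weights). A finite family is only a toy
    substitute for the weakly compact family M in the text.
    """
    return max(l1_error(f, eta, pts, wts) for pts, wts in measures)


# Hypothetical toy data (not from the paper).
f = lambda x: x * x      # target function
eta = lambda x: abs(x)   # |x| = relu(x) + relu(-x), a one-hidden-layer ReLU network
measures = [
    ([-1.0, 0.0, 1.0], [0.25, 0.5, 0.25]),
    ([-0.5, 0.5], [0.5, 0.5]),
]

worst = worst_case_l1_error(f, eta, measures)
print(worst)
```

In this toy case the first measure sits on points where `f` and `eta` agree, so the second measure alone determines the worst case.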

Theoretical Framework

Orlicz Space Framework

The paper constructs its mathematical framework based on Orlicz space theory. For a Young function φ, the Orlicz space is defined as: $L^φ(μ; \mathbb{R}^{N_L}) := \{f: \mathbb{R}^{N_0} \to \mathbb{R}^{N_L} : \int_{\mathbb{R}^{N_0}} φ(α\|f\|) dμ < ∞ \text{ for some } α > 0\}$

equipped with the gauge norm (also known as the Luxemburg norm): $N_{φ,μ}(f) := \inf\{k > 0: \int_{\mathbb{R}^{N_0}} φ(\|f\|/k) dμ ≤ 1\}$
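As a rough numerical illustration of the gauge norm, the following Python sketch approximates $N_{φ,μ}(f)$ by bisection when μ is a discrete measure with finitely many atoms. The function name, the bisection scheme, and the sample data are assumptions made for this example. It relies on φ being nondecreasing on $[0, ∞)$, so that the integral decreases as $k$ grows.

```python
def luxemburg_norm(values, weights, phi, tol=1e-10):
    """Approximate inf{k > 0 : sum_i weights[i] * phi(|values[i]| / k) <= 1}.

    `values` holds the norms ||f(x_i)|| at the atoms x_i of a discrete measure,
    `weights` the masses of those atoms, and `phi` a Young function.
    """
    if all(v == 0 for v in values):
        return 0.0

    def modular(k):
        return sum(w * phi(abs(v) / k) for v, w in zip(values, weights))

    lo, hi = 0.0, 1.0
    while modular(hi) > 1.0:   # enlarge hi until the constraint holds
        hi *= 2.0
    while hi - lo > tol:       # bisect between infeasible lo and feasible hi
        mid = 0.5 * (lo + hi)
        if modular(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi


# With phi(t) = t^2 the gauge norm reduces to the usual L^2 norm.
norm_l2 = luxemburg_norm([3.0, 4.0], [1.0, 1.0], lambda t: t ** 2)
print(norm_l2)
```

Choosing $φ(t) = t^p$ recovers the $L^p$ norm, which is the sense in which the Orlicz setting contains the classical one.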

Neural Network Definition

Feedforward Neural Networks: $η = w_L ∘ ϱ ∘ w_{L-1} ∘ \cdots ∘ ϱ ∘ w_1$
Functional Input Neural Networks: $η(x) = \sum_{n=1}^N y_n ϱ(h_n(x))$, where the functions $h_n$ are drawn from an additive family $\mathcal{H}$
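The feedforward definition is an alternating composition of affine maps $w_i$ and a componentwise activation ϱ. A minimal pure-Python sketch of that composition follows; the helper names `affine` and `feedforward` and the example weights are hypothetical and serve only to illustrate the formula.

```python
def relu(t):
    return max(t, 0.0)


def affine(W, b):
    """Return the affine map x -> W x + b for a matrix W (list of rows) and a vector b."""
    def apply(x):
        return [sum(wij * xj for wij, xj in zip(row, x)) + bi
                for row, bi in zip(W, b)]
    return apply


def feedforward(layers, activation):
    """Compose w_L, activation, w_{L-1}, ..., activation, w_1 (w_1 is applied first).

    The activation acts componentwise and is applied after every affine map
    except the last one, matching the formula in the text.
    """
    def eta(x):
        for i, w in enumerate(layers):
            x = w(x)
            if i < len(layers) - 1:
                x = [activation(t) for t in x]
        return x
    return eta


# Example with L = 2: |x| = relu(x) + relu(-x).
w1 = affine([[1.0], [-1.0]], [0.0, 0.0])
w2 = affine([[1.0, 1.0]], [0.0])
eta = feedforward([w1, w2], relu)
print(eta([-3.0]), eta([2.0]))
```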

Core Theorems

Theorem 2.3 (Universal Approximation Theorem in Orlicz Spaces)

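For an N-function φ and a Borel measure μ, neural networks are dense with respect to the gauge norm in the Orlicz heart $M^φ(μ)$, the subspace of $L^φ(μ)$ on which $\int φ(α\|f\|) dμ < ∞$ holds for every $α > 0$. The result covers the following cases (the condition on μ is given in parentheses):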

Bounded non-constant activation functions (finite measures)
ReLU activation functions (locally finite measures)
Continuous non-polynomial activation functions (compactly supported measures)
Functional input neural networks (satisfying specific conditions)

Theorem 3.1 (Distributionally Robust Universal Approximation Theorem)

For a weakly compact family of measures $\mathcal{M}$ and its associated Young pair $(φ_\mathcal{M}, ψ_\mathcal{M})$, for any $f \in M^{φ_\mathcal{M}}(μ; \mathbb{R}^{N_L})$ and $ε > 0$, there exists a neural network η of the corresponding class such that: $\sup_{ν \in \mathcal{M}} \|f - η\|_{L^1(ν; \mathbb{R}^{N_L})} < ε$

Technical Innovations

Young Pair Construction: Utilizes uniform integrability of weakly compact measure families and constructs associated Young pairs via the De la Vallée Poussin theorem
Generalized Hölder Inequality: Employs generalized Hölder inequalities to establish connections between Orlicz spaces and $L^1$ spaces
Density Arguments: Proves density of neural networks through generalized versions of the Hahn-Banach theorem and Riesz representation theorem
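The way these three ingredients fit together can be sketched as follows. This is a hedged outline under stated assumptions, not the paper's proof. It assumes that every $ν \in \mathcal{M}$ has a density $g_ν = dν/dμ$ with respect to a common reference measure μ, and that the constant $C$ below is finite, which is what a de la Vallée Poussin type argument is meant to supply. The symbols $g_ν$ and $C$ are illustrative notation introduced here.

```latex
% Sketch only: g_\nu := d\nu/d\mu and
% C := \sup_{\nu \in \mathcal{M}} \int \psi_{\mathcal{M}}(g_\nu)\, d\mu < \infty are assumptions.
\begin{align*}
\|f-\eta\|_{L^1(\nu)}
  &= \int \|f-\eta\|\, g_\nu \, d\mu \\
  &\le 2\, N_{\varphi_{\mathcal{M}},\mu}(f-\eta)\; N_{\psi_{\mathcal{M}},\mu}(g_\nu)
     && \text{(generalized H\"older inequality)} \\
  &\le 2 \max\{1, C\}\; N_{\varphi_{\mathcal{M}},\mu}(f-\eta)
     && \text{(convexity of } \psi_{\mathcal{M}} \text{ and } \psi_{\mathcal{M}}(0)=0\text{)}.
\end{align*}
```

The last bound does not depend on ν, so once the networks are dense in the Orlicz heart for the gauge norm, choosing η with $N_{φ_\mathcal{M},μ}(f-η) < ε / (2\max\{1, C\})$ gives the uniform $L^1$ estimate over $\mathcal{M}$.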

Experimental Setup

This is a purely theoretical paper with no numerical experiments. All results are established through rigorous mathematical proofs.

Proof Strategy

Proof by Contradiction: Assumes neural networks are not dense and derives a contradiction using the Hahn-Banach theorem
Constructive Proofs: For ReLU networks, constructs approximating networks explicitly
Approximation Theory Techniques: Combines classical approximation theory results with measure theory

Experimental Results

Main Theoretical Results

Proposition 2.4 (Bounded Activation Functions)

For bounded non-constant activation functions ϱ and $L ≥ 2$, $\mathcal{NN}^ϱ_{N_0,N_L,L,∞}$ is dense in $M^φ(μ)$ for every finite Borel measure μ.

Proposition 2.6 (ReLU Activation Functions)

For ReLU activation functions, $\mathcal{NN}^ϱ_{N_0,N_L,∞,N_0+N_L+1}$ is dense in $M^φ(μ)$ for every locally finite Borel measure μ.

Proposition 2.8 (Non-Polynomial Activation Functions)

For continuous non-polynomial activation functions, $\mathcal{NN}^ϱ_{N_0,N_L,L,∞}$ is dense in $M^φ(μ)$ for every compactly supported finite Borel measure μ.

Proposition 2.10 (Functional Input Neural Networks)

Under appropriate conditions, functional input neural networks $\mathcal{NN}^{\mathcal{H},ϱ}_{\mathbb{R}^{N_0},\mathbb{R}^{N_2}}$ are dense in $M^φ(μ)$ for finite Borel measures μ.

Theoretical Findings

Space Extension: Successfully generalizes classical $L^p$ results to Orlicz spaces, providing a framework for handling non-standard growth conditions
Measure Generalization: Extends from Lebesgue measures to general locally finite Borel measures
Architecture Unification: Handles multiple neural network architectures within a unified theoretical framework

Classical Universal Approximation Theory

Cybenko (1989): Establishes universal approximation property for feedforward networks with sigmoid activation functions
Hornik (1991): Extends to more general activation functions and Sobolev spaces
Leshno et al. (1993): Results for non-polynomial activation functions

Modern Developments

Kidger & Lyons (2020): Universal approximation property of deep narrow ReLU networks
Cuchiero et al. (2025): Global universal approximation for functional input neural networks
Costarelli & Vinti (2019): Kantorovich operators in Orlicz spaces

Distributionally Robust Optimization

Ben-Tal et al. (2013): Robust optimization under uncertain probabilities
Gao & Kleywegt (2016): Distributionally robust stochastic optimization under Wasserstein distance

Conclusions and Discussion

Main Conclusions

Establishes universal approximation properties of neural networks in Orlicz spaces, significantly extending classical theory
Proves the distributionally robust approximation capability of neural networks, providing theoretical foundations for handling distributional uncertainty
Covers widely used architectures: feedforward networks, deep narrow ReLU networks, and functional input neural networks

Limitations

Measure Conditions: Different network architectures require different measure conditions (finiteness, compact support, etc.)
Constructivity: While existence is proven, explicit network construction methods are lacking
Computational Complexity: Does not analyze quantitative relationships between required network size and approximation accuracy

Future Directions

Quantitative Analysis: Establish quantitative relationships between approximation error and network complexity
Algorithm Implementation: Develop practical algorithms based on theoretical results
Application Extension: Apply theory to concrete machine learning tasks

In-Depth Evaluation

Strengths

Theoretical Depth: Mathematically rigorous, extending universal approximation theory beyond the $L^p$ setting to Orlicz spaces
Unified Framework: The Orlicz space framework provides a unified perspective for addressing multiple problems
Practical Significance: Provides solid theoretical foundations for distributionally robust learning
Technical Innovation: Cleverly combines techniques from functional analysis, measure theory, and approximation theory

Weaknesses

Practical Gap: Purely theoretical results that remain far from practical application
Condition Restrictions: Different results require different technical conditions, limiting universality
Construction Deficiency: Lacks concrete network construction and training algorithms

Impact

Theoretical Contribution: Establishes new mathematical foundations for neural network theory
Interdisciplinary Value: Connects machine learning, functional analysis, and measure theory
Long-term Significance: Provides theoretical guidance for future research in distributionally robust learning

Applicable Scenarios

Theoretical Research: Provides new tools for neural network theory researchers
Robust Learning: Guides theoretical development of distributionally robust optimization and adversarial training
Non-Standard Losses: Theoretical analysis of non-$L^p$-type loss functions such as cross-entropy and KL divergence

Overall Assessment: This is a theoretically rigorous paper that generalizes universal approximation theory for neural networks from classical $L^p$ spaces to Orlicz spaces and establishes distributionally robust approximation properties. Although the results remain far from practical application, they provide mathematical foundations for neural network theory and distributionally robust learning.