
Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Milkert, Hyde, Laine

In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures.

Basic Information

Paper ID: 2311.18022
Title: Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
Authors: Max Milkert, David Hyde, Forrest Laine
Classification: cs.LG cs.AI
Publication Time/Conference: Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025
Paper Link: https://arxiv.org/abs/2311.18022

Abstract

In neural networks with ReLU activation functions, the number of piecewise linear regions in the output can theoretically grow exponentially with depth. However, this rarely occurs when network parameters are randomly sampled, often necessitating unnecessarily large networks. To address this issue, this paper proposes a novel network reparameterization method that constrains weights such that a network of depth $d$ produces exactly $2^d$ linear regions at initialization and maintains these regions during training. The method achieves several orders of magnitude improvement in accuracy compared to randomly initialized networks when learning convex one-dimensional function approximations. The authors also demonstrate preliminary results extending the construction to multidimensional and non-convex functions, enabling this technique to serve as a replacement for conventional dense layers in various architectures.

Research Background and Motivation

Problem Definition

ReLU networks theoretically possess powerful expressive capacity with linear region counts growing exponentially with depth, yet significant gaps exist between theory and practice:

Theory-Practice Gap: A depth-$d$ ReLU network can in theory produce $2^d$ linear regions, but Hanin & Rolnick (2019) proved that for randomly initialized networks the average number of linear regions is independent of depth and depends only on the total neuron count.
Limitations of Gradient Descent: Gradient descent struggles to create new activation regions because the number of linear regions is not a "local" property in parameter space and cannot be directly optimized through gradient-based methods.
Network Redundancy Problem: In practice, approximately 95% of weights can be eliminated without significantly affecting accuracy, indicating inefficiency in conventional training methods.

Research Motivation

The core motivation of this work is to develop mathematical algorithms that circumvent limitations of random initialization, compelling ReLU networks to realize their theoretical expressive capacity, thereby achieving better performance with smaller networks.

Core Contributions

Novel Reparameterization Method: Proposes a reparameterization strategy for 4-neuron-wide ReLU networks of arbitrary depth, ensuring that a depth-$d$ network produces $2^d$ activation regions at initialization.
Pretraining Strategy: Develops a pretraining method that enforces the existence of $2^d$ activation regions during optimization.
Significant Performance Improvement: Achieves orders-of-magnitude improvements in network performance on one-dimensional test cases.
Extended Applications: Extends the method to non-convex and multidimensional functions, and serves as a plug-and-play replacement for dense layers in arbitrary networks.

Methodology Details

Core Idea

The method composes triangular wave functions layer by layer to construct networks with exponentially many linear regions:

Triangular Function Definition

$$
T_i(x) = \begin{cases} \dfrac{x}{a_i}, & 0 \le x \le a_i \\[4pt] 1 - \dfrac{x - a_i}{1 - a_i}, & a_i \le x \le 1 \end{cases}
$$

where $0 < a_i < 1$ is the peak position of the triangular function at layer $i$.
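As a minimal sketch, the piecewise definition above can be written as a plain Python function. The name `triangle` and the input checks are illustrative choices, not part of the source.

```python
def triangle(x: float, a: float) -> float:
    """Triangle map T(x) on [0, 1] with its peak T(a) = 1 at x = a."""
    if not 0.0 < a < 1.0:
        raise ValueError("peak position a must lie strictly between 0 and 1")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x <= a:
        return x / a                      # rising edge: 0 -> 1 on [0, a]
    return 1.0 - (x - a) / (1.0 - a)      # falling edge: 1 -> 0 on [a, 1]


# The map starts at 0, reaches 1 at the peak, and returns to 0 at x = 1.
print(triangle(0.0, 0.3), triangle(0.3, 0.3), triangle(1.0, 0.3))
```

Both branches agree at $x = a_i$, so the function is continuous there.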

Combined Waveforms

Each layer produces triangular waves through function composition:

$$W_i(x) = (T_i \circ T_{i-1} \circ \cdots \circ T_0)(x)$$

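Each composition doubles the number of linear pieces: $W_0 = T_0$ has 2 and $W_i$ has $2^{i+1}$, so composing $d$ triangle maps yields the $2^d$ regions quoted for a depth-$d$ network.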
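The doubling can be checked numerically. The sketch below composes $k$ triangle maps and counts linear pieces by counting slope sign changes on a fine grid. This works here because every kink of a composed triangle wave is a peak or a trough. The helper names and the grid size are assumptions made for illustration.

```python
def triangle(x, a):
    """Triangle map on [0, 1] peaking at x = a."""
    return x / a if x <= a else 1.0 - (x - a) / (1.0 - a)


def waveform(x, peaks):
    """Apply the triangle maps in order: T_0 first, then T_1, and so on."""
    for a in peaks:
        x = triangle(x, a)
    return x


def count_linear_pieces(peaks, grid=4096):
    """Count linear pieces of the composed wave via slope sign changes."""
    values = [waveform(k / grid, peaks) for k in range(grid + 1)]
    pieces, previous_sign = 1, 0
    for left, right in zip(values, values[1:]):
        step = right - left
        if step == 0.0:
            continue                      # flat step straddling a kink
        sign = 1 if step > 0 else -1
        if previous_sign and sign != previous_sign:
            pieces += 1                   # slope flipped: a new piece starts
        previous_sign = sign
    return pieces


for k in (1, 2, 3, 4, 5):
    print(k, count_linear_pieces([0.5] * k))
```

The grid must be fine enough that no two kinks share a grid cell. With five or fewer compositions and peaks away from 0 and 1, 4096 points is ample.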

Network Output

The final network output is a weighted sum of triangular waves from each layer:

$$F(x) = \sum_{i=0}^{\infty} s_i W_i(x)$$

Network Architecture Design

Single Layer Implementation

Each triangular function requires two ReLU neurons:

Neuron t1: Input weight 1, output weight 1/a, always activated
Neuron t2: Bias -a, output weight -1/(a-a²), activated when x>a
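The two-neuron construction above can be checked against the piecewise definition. The sketch below is illustrative only. The function names are assumptions, and the output weights are applied as a plain weighted sum of the two ReLU activations.

```python
def relu(z):
    return max(z, 0.0)


def triangle_from_relus(x, a):
    """Triangle map built from the two ReLU neurons t1 and t2."""
    t1 = relu(x)        # input weight 1, no bias: active for every x > 0
    t2 = relu(x - a)    # bias -a: active only when x > a
    return t1 / a - t2 / (a - a * a)   # output weights 1/a and -1/(a - a^2)


def triangle_piecewise(x, a):
    """Reference piecewise definition of the same triangle map."""
    return x / a if x <= a else 1.0 - (x - a) / (1.0 - a)


# Largest deviation between the two forms over a grid and several peaks.
worst = max(
    abs(triangle_from_relus(k / 200, a) - triangle_piecewise(k / 200, a))
    for a in (0.1, 0.3, 0.5, 0.9)
    for k in range(201)
)
print(f"largest deviation: {worst:.2e}")
```

For $x > a$ the slope is $1/a - 1/(a - a^2) = -1/(1 - a)$, which is the slope of the falling edge of the triangle.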

Multi-layer Combination

Function composition is achieved through depth stacking, with each layer containing:

t1, t2 neurons: Implementing triangular functions
sum neuron: Accumulating triangular wave outputs from previous layers
bias neuron: Handling exponentially decaying biases

Weight Matrix Form

The hidden layer matrix form is:

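$$
\begin{bmatrix}
1 & \pm S_i/a_i & \mp S_i/(a_i - a_i^2) & 0 \\
0 & S_i/a_i & -S_i/(a_i - a_i^2) & 0 \\
0 & S_i/a_i & -S_i/(a_i - a_i^2) & -S_i a_{i+1} \\
0 & 0 & 0 & S_i
\end{bmatrix}
\times
\begin{bmatrix} \text{sum} \\ t_1 \\ t_2 \\ \text{bias} \end{bmatrix}
$$

The two signs in the first row are chosen together: either $(+, -)$ or $(-, +)$.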

Differentiability Constraints

Theorem 3.1

To ensure network output differentiability in the infinite-depth limit, scaling coefficients must satisfy:

$$s_{i+1} = s_i (1 - a_{i+1})\, a_{i+2}$$

This constraint ensures derivative continuity, preventing outputs from becoming fractal curves.
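A quick sanity check of the constraint, under stated assumptions: in the symmetric case $a_i = 1/2$ the recurrence reduces to $s_{i+1} = s_i / 4$. With the starting scale $s_0 = 1/4$ (an illustrative choice, not taken from the source), the series is the classical symmetric triangle-wave expansion of the smooth function $x(1 - x)$. The sketch builds the scales from the recurrence and measures how far the truncated sum is from that target. The function names are hypothetical.

```python
def triangle(x, a):
    """Triangle map on [0, 1] peaking at x = a."""
    return x / a if x <= a else 1.0 - (x - a) / (1.0 - a)


def constrained_scales(peaks, s0):
    """Scales obeying s_{i+1} = s_i * (1 - a_{i+1}) * a_{i+2}."""
    scales = [s0]
    for i in range(len(peaks) - 2):
        scales.append(scales[i] * (1.0 - peaks[i + 1]) * peaks[i + 2])
    return scales


def truncated_output(x, peaks, scales):
    """Partial sum of F(x) = sum_i s_i * W_i(x), one term per scale."""
    w, total = x, 0.0
    for a, s in zip(peaks, scales):
        w = triangle(w, a)      # W_i = T_i applied to W_{i-1}
        total += s * w
    return total


peaks = [0.5] * 10                            # symmetric peaks (assumption)
scales = constrained_scales(peaks, s0=0.25)   # 9 scales: 1/4, 1/16, ...
max_error = max(
    abs(truncated_output(k / 1000, peaks, scales) - (k / 1000) * (1 - k / 1000))
    for k in range(1001)
)
print(f"max |F_truncated - x(1 - x)| = {max_error:.2e}")
```

Each additional term shrinks the worst-case gap by a factor of four in this special case. This is the exponential accuracy in depth that the construction is after.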

Training Algorithm

Three-Stage Training Process

Reparameterization and Initialization: Set network weights according to triangular peak positions
Pretraining: Train the network under reparameterization constraints
Standard Training: Directly optimize network weights

Algorithm Flow

Algorithm 1: Initialization and Pretraining
A ← Random((0,1)^n)  # Triangular peak positions
while Epochs > 0:
    Network ← Set_Weights(A)  # Set weights according to A
    Loss ← (Network(x) - y)²
    Network_Gradient ← ∂Loss/∂Network
    A_Gradient ← ∂Network/∂A  # Backpropagate through the weight setting
    Gradient ← Network_Gradient × A_Gradient  # Chain rule: ∂Loss/∂A
    A ← A - ε × Gradient  # Update A rather than the network weights
    Epochs ← Epochs - 1
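The pretraining loop can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It replaces backpropagation with central finite differences and fixes $s_0 = 1/4$. It derives the remaining scales from the differentiability constraint, clamps the peaks to $[0.05, 0.95]$, and fits the illustrative target $x(1 - x)$. All of these choices, and every function name, are assumptions.

```python
import random


def triangle(x, a):
    """Triangle map on [0, 1] peaking at x = a."""
    return x / a if x <= a else 1.0 - (x - a) / (1.0 - a)


def set_weights(peaks, s0=0.25):
    """Stand-in for Set_Weights(A): derive the scales s_i from the peaks."""
    scales = [s0]                                   # s0 is an assumed constant
    for i in range(len(peaks) - 2):
        scales.append(scales[i] * (1.0 - peaks[i + 1]) * peaks[i + 2])
    return scales


def network(x, peaks):
    """Truncated F(x) = sum_i s_i * W_i(x); the last peak only sets a scale."""
    w, total = x, 0.0
    for a, s in zip(peaks, set_weights(peaks)):
        w = triangle(w, a)
        total += s * w
    return total


def loss(peaks, xs, ys):
    return sum((network(x, peaks) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pretrain(peaks, xs, ys, epochs=200, lr=0.05, h=1e-5):
    """Gradient descent on the peaks A only, tracking the best loss seen."""
    peaks = list(peaks)
    initial = best = loss(peaks, xs, ys)
    best_peaks = list(peaks)
    for _ in range(epochs):
        grad = []
        for j in range(len(peaks)):                 # finite-difference dLoss/dA_j
            up, down = list(peaks), list(peaks)
            up[j] += h
            down[j] -= h
            grad.append((loss(up, xs, ys) - loss(down, xs, ys)) / (2.0 * h))
        # Update A rather than the network weights; keep peaks inside (0, 1).
        peaks = [min(max(a - lr * g, 0.05), 0.95) for a, g in zip(peaks, grad)]
        current = loss(peaks, xs, ys)
        if current < best:
            best, best_peaks = current, list(peaks)
    return best_peaks, initial, best


random.seed(0)
xs = [k / 50 for k in range(51)]
ys = [x * (1.0 - x) for x in xs]                    # illustrative target
start = [random.uniform(0.3, 0.7) for _ in range(6)]
best_peaks, initial_loss, best_loss = pretrain(start, xs, ys)
print(f"loss before: {initial_loss:.3e}, best loss after: {best_loss:.3e}")
```

Because only the peak positions are updated, every iterate stays inside the family of composed triangle waves, which is how the region count is preserved during this stage. An autodiff framework would replace the finite-difference loop in practice.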

Experimental Setup

One-Dimensional Function Experiments

Dataset

Dense Data: 500 equally spaced points on $[0,1]$
Sparse Data: 10 training points, 10 test points (located between training points)

Target Functions

$x^3$, $x^{11}$ (convex functions, subtractive combinations)
$\sin(x)$, $\tanh(3x)$ (approximated through additive combinations)

Network Configuration

4-neuron width, 5 hidden layers
Adam optimizer, learning rate 0.001, 1000 epochs

Comparison Methods

Default Network: Kaiming initialization
RAAI Distribution: Improved weight distribution initialization
Skip Pretraining: Using proposed initialization with standard training only
Unregularized Pretraining: Without differentiability constraints
Complete Method: Pretraining + differentiability constraints

Extended Experiments

Non-convex and Multidimensional Functions

Non-convex Function: $y = x^3 - x$ (difference of two networks)
Two-dimensional Function: $z = r^3$ (sum of two networks)

Image Classification

VGG-16 on ImageNet: Replace dense layers in classifier
CIFAR-10: Apply in CNN architecture

Experimental Results

One-Dimensional Function Approximation Results

Dense Data Performance (Minimum MSE)

Method	$x^3$	$x^{11}$	$\sin(x)$	$\tanh(3x)$
Kaiming Initialization	2.11×10⁻⁵	2.19×10⁻⁵	4.50×10⁻⁵	5.75×10⁻⁵
RAAI Distribution	2.14×10⁻⁵	4.40×10⁻⁵	3.59×10⁻⁵	1.09×10⁻⁵
Skip Pretraining	7.63×10⁻⁷	1.86×10⁻⁵	1.96×10⁻⁷	1.07×10⁻⁶
Unregularized Pretraining	1.64×10⁻⁷	3.20×10⁻⁶	4.41×10⁻⁸	1.49×10⁻⁷
Complete Method	7.86×10⁻⁸	8.86×10⁻⁷	5.06×10⁻⁸	6.82×10⁻⁸

Key Findings

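Orders-of-Magnitude Improvement: Relative to Kaiming initialization, the complete method lowers the minimum MSE by a factor of about 25 on $x^{11}$ and by factors of roughly 270 to 890 on the other three targets, i.e. between one and three orders of magnitude
Initialization Alone Helps: Even when pretraining is skipped, the proposed initialization lowers the error by one to two orders of magnitude on $x^3$, $\sin(x)$, and $\tanh(3x)$; the gain on $x^{11}$ is marginal (1.86×10⁻⁵ vs 2.19×10⁻⁵)
Differentiability Constraint Effect: Enforcing differentiability is reported to improve stability and lowers the error further on three of the four targets; on $\sin(x)$ the unregularized variant is slightly better (4.41×10⁻⁸ vs 5.06×10⁻⁸)
Dead ReLU Problem: Roughly 50% of conventionally initialized networks collapse because of dead ReLUs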

Sparse Data Generalization Capability

Method	$x^3$	$x^{11}$	$\sin(x)$	$\tanh(3x)$
Kaiming Initialization	2.41×10⁻⁴	2.14×10⁻³	2.27×10⁻⁵	1.60×10⁻⁴
Complete Method	5.65×10⁻⁶	6.53×10⁻⁴	7.92×10⁻⁷	5.09×10⁻⁶
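The improvement factors implied by the dense and sparse tables can be recomputed from the reported values. The sketch below copies the Kaiming and Complete Method rows as given. The dictionary and function names are chosen here for illustration.

```python
import math

# (Kaiming initialization MSE, Complete Method MSE) per target function.
dense_mse = {
    "x^3": (2.11e-5, 7.86e-8),
    "x^11": (2.19e-5, 8.86e-7),
    "sin(x)": (4.50e-5, 5.06e-8),
    "tanh(3x)": (5.75e-5, 6.82e-8),
}
sparse_mse = {
    "x^3": (2.41e-4, 5.65e-6),
    "x^11": (2.14e-3, 6.53e-4),
    "sin(x)": (2.27e-5, 7.92e-7),
    "tanh(3x)": (1.60e-4, 5.09e-6),
}


def improvement(table):
    """Ratio baseline / proposed for every target in the table."""
    return {name: base / proposed for name, (base, proposed) in table.items()}


for label, table in (("dense", dense_mse), ("sparse", sparse_mse)):
    for name, factor in improvement(table).items():
        print(f"{label:6s} {name:9s} x{factor:7.1f}  "
              f"({math.log10(factor):.2f} orders of magnitude)")
```

On these numbers, the dense-data gain stays just under three orders of magnitude. The sparse-data gain ranges from about 3x on $x^{11}$ to about 43x on $x^3$.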

Extended Application Results

Non-convex and Multidimensional Functions

$x^3-x$ Approximation: Proposed method error 5.52×10⁻⁷ vs standard 8×5 network error 8×10⁻⁶
$z=r^3$ Approximation: Proposed method error 3.5×10⁻⁶ vs standard network error 1.5×10⁻⁴ (a roughly 40-fold improvement)

Image Classification Performance

ImageNet VGG-16: Advantages in early training, comparable final accuracy (73.3%)
CIFAR-10: Comparable performance to standard methods, demonstrating generality

Function Approximation Theory

This work builds upon classical neural network approximation theory:

Universal Approximation Theorem: Approximation capacity of infinitely wide or deep networks
Depth Advantage Theory: Certain functions can be represented by deep networks with a subexponential number of neurons but require exponentially many neurons in shallow networks

Triangular Wave Construction

Builds upon work by Telgarsky (2015) and Yarotsky (2017):

Symmetric Triangular Waves: Used for constructing exponential-precision approximations of $x^2$
Function Composition: Achieving complex function representation through inter-layer composition

Network Initialization Methods

Comparison with existing initialization approaches:

Kaiming/Xavier Initialization: Homogeneous methods based on statistical distributions
Dead ReLU Problem: Inherent issue of random initialization in deep networks
This Work's Contribution: Heterogeneous initialization based on mathematical construction

Conclusions and Discussion

Main Conclusions

Theoretical Breakthrough: First practical method to force ReLU networks to produce exponentially many linear regions
Significant Improvement: Achieves orders-of-magnitude accuracy improvements on one-dimensional function approximation tasks
Extension Potential: Demonstrates applicability of the method to multidimensional and non-convex functions
Practical Value: Can serve as plug-and-play replacement for dense layers in existing architectures

Limitations

Architectural Constraints: Current method limited to specific 4-neuron-wide structures
Function Class Limitations: Directly applicable only to one-dimensional convex functions; multidimensional extensions require combinatorial strategies
Limited Effect on Classification: Improvements not significant on image classification tasks
Theoretical Completeness: Lacks universal theoretical framework for arbitrary ReLU networks

Future Directions

Theoretical Extension: Identify dense sets of one-dimensional functions that can be efficiently represented
Multidimensional Methods: Develop more natural multidimensional function representation approaches
Sparse Structures: Overcome current limitation of creating only sparse block-diagonal matrices
Application Exploration: Identify more suitable practical regression tasks

In-Depth Evaluation

Strengths

Theoretical Innovation: Bridges theoretical expressive capacity with practical implementation
Mathematical Rigor: Complete differentiability analysis and convergence proofs
Comprehensive Experiments: Full validation from one-dimensional to multidimensional, from regression to classification
Practical Value: Directly applicable to existing architectures without redesign

Weaknesses

Limited Applicability: Main advantages concentrated in specific function approximation tasks
Scalability Issues: Multidimensional extensions rely on simple combinations lacking theoretical guarantees
Limited Practical Effectiveness: Improvements limited on real classification tasks
Implementation Complexity: The additional pretraining stage before standard training makes the method more complex to implement

Impact

Theoretical Contribution: Provides new perspectives and tools for deep learning theory
Methodological Significance: Demonstrates value of mathematical construction in neural network design
Practical Potential: May have important value in scientific computing and engineering applications
Inspirational Value: Provides new ideas and directions for subsequent research

Applicable Scenarios

Scientific Computing: Numerical computation tasks requiring high-precision function approximation
Engineering Applications: Control systems, signal processing and other domains requiring precise modeling
Small Data Scenarios: Tasks with scarce training data but requiring good generalization
Theoretical Research: Tool for studying neural network expressive capacity

References

Hanin, B. & Rolnick, D. (2019). Deep ReLU networks have surprisingly few activation patterns.
Telgarsky, M. (2015). Representation benefits of deep feedforward networks.
Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks.
Montúfar, G. F. et al. (2014). On the number of linear regions of deep neural networks.
Perekrestenko, D. et al. (2018). The universal approximation power of finite-width deep ReLU networks.

Overall Assessment: This is an excellent paper balancing theory and practice, achieving important breakthroughs in realizing the expressive capacity of ReLU networks. While current applications are limited, it provides valuable contributions and insights for both deep learning theory and practice.