2025-11-25T01:46:17.329771

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Milkert, Hyde, Laine
In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures.
academic

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Basic Information

  • Paper ID: 2311.18022
  • Title: Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
  • Authors: Max Milkert, David Hyde, Forrest Laine
  • Classification: cs.LG cs.AI
  • Publication Time/Conference: Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025
  • Paper Link: https://arxiv.org/abs/2311.18022

Abstract

In neural networks with ReLU activation functions, the number of piecewise linear regions in the output can theoretically grow exponentially with depth. However, this rarely occurs when network parameters are randomly sampled, often necessitating unnecessarily large networks. To address this issue, this paper proposes a novel network reparameterization method that constrains weights such that a network of depth dd produces exactly 2d2^d linear regions at initialization and maintains these regions during training. The method achieves several orders of magnitude improvement in accuracy compared to randomly initialized networks when learning convex one-dimensional function approximations. The authors also demonstrate preliminary results extending the construction to multidimensional and non-convex functions, enabling this technique to serve as a replacement for conventional dense layers in various architectures.

Research Background and Motivation

Problem Definition

ReLU networks theoretically possess powerful expressive capacity with linear region counts growing exponentially with depth, yet significant gaps exist between theory and practice:

  1. Theory-Practice Gap: Although theoretically a depth-dd ReLU network can produce 2d2^d linear regions, Hanin & Rolnick (2019) proved that randomly initialized networks have average linear region counts independent of depth, depending only on total neuron count.
  2. Limitations of Gradient Descent: Gradient descent struggles to create new activation regions because the number of linear regions is not a "local" property in parameter space and cannot be directly optimized through gradient-based methods.
  3. Network Redundancy Problem: In practice, approximately 95% of weights can be eliminated without significantly affecting accuracy, indicating inefficiency in conventional training methods.

Research Motivation

The core motivation of this work is to develop mathematical algorithms that circumvent limitations of random initialization, compelling ReLU networks to realize their theoretical expressive capacity, thereby achieving better performance with smaller networks.

Core Contributions

  1. Novel Reparameterization Method: Proposes a reparameterization strategy for 4-neuron-wide ReLU networks of arbitrary depth, ensuring that depth-dd networks produce 2d2^d activation regions at initialization.
  2. Pretraining Strategy: Develops a pretraining method that enforces the existence of 2d2^d activation regions during optimization.
  3. Significant Performance Improvement: Achieves orders-of-magnitude improvements in network performance on one-dimensional test cases.
  4. Extended Applications: Extends the method to non-convex and multidimensional functions, and serves as a plug-and-play replacement for dense layers in arbitrary networks.

Methodology Details

Core Idea

The method is based on combinations of triangular wave functions to construct networks with exponentially many linear regions:

Triangular Function Definition

Ti(x) = {
    x/ai,           0 ≤ x ≤ ai
    1-(x-ai)/(1-ai), ai ≤ x ≤ 1
}

where 0<ai<10 < a_i < 1 is the peak position of the triangular function at layer ii.

Combined Waveforms

Each layer produces triangular waves through function composition:

Wi(x) = Ti ∘ Ti-1 ∘ ... ∘ T0(x)

These waveforms possess 2i2^i linear regions, doubling at each layer.

Network Output

The final network output is a weighted sum of triangular waves from each layer:

F(x) = Σ(i=0 to ∞) si * Wi(x)

Network Architecture Design

Single Layer Implementation

Each triangular function requires two ReLU neurons:

  • Neuron t1: Input weight 1, output weight 1/a, always activated
  • Neuron t2: Bias -a, output weight -1/(a-a²), activated when x>a

Multi-layer Combination

Function composition is achieved through depth stacking, with each layer containing:

  • t1, t2 neurons: Implementing triangular functions
  • sum neuron: Accumulating triangular wave outputs from previous layers
  • bias neuron: Handling exponentially decaying biases

Weight Matrix Form

The hidden layer matrix form is:

[1  ±[Si/ai  -Si/(ai-ai²)]  0    ]   [sum ]
[0   Si/ai   -Si/(ai-ai²)   0    ] × [t1  ]
[0   Si/ai   -Si/(ai-ai²)  -Siai+1]   [t2  ]
[0   0       0              Si   ]   [bias]

Differentiability Constraints

Theorem 3.1

To ensure network output differentiability in the infinite-depth limit, scaling coefficients must satisfy:

si+1 = si(1-ai+1)ai+2

This constraint ensures derivative continuity, preventing outputs from becoming fractal curves.

Training Algorithm

Three-Stage Training Process

  1. Reparameterization and Initialization: Set network weights according to triangular peak positions
  2. Pretraining: Train the network under reparameterization constraints
  3. Standard Training: Directly optimize network weights

Algorithm Flow

Algorithm 1: Initialization and Pretraining
A ← Random((0,1)^n)  # Triangular peak positions
while Epochs > 0:
    Network ← Set_Weights(A)  # Set weights according to A
    Loss ← (Network(x) - y)²
    Network_Gradient ← ∂Loss/∂Network
    A_Gradient ← ∂Network/∂A  # Backprop through weight setting
    Gradient ← Network_Gradient × A_Gradient
    A ← A - ε × Gradient  # Update A rather than network weights

Experimental Setup

One-Dimensional Function Experiments

Dataset

  • Dense Data: 500 equally-spaced points on 0,1
  • Sparse Data: 10 training points, 10 test points (located between training points)

Target Functions

  • x3x^3, x11x^{11} (convex functions, subtractive combinations)
  • sin(x)\sin(x), tanh(3x)\tanh(3x) (approximated through additive combinations)

Network Configuration

  • 4-neuron width, 5 hidden layers
  • Adam optimizer, learning rate 0.001, 1000 epochs

Comparison Methods

  • Default Network: Kaiming initialization
  • RAAI Distribution: Improved weight distribution initialization
  • Skip Pretraining: Using proposed initialization with standard training only
  • Unregularized Pretraining: Without differentiability constraints
  • Complete Method: Pretraining + differentiability constraints

Extended Experiments

Non-convex and Multidimensional Functions

  • Non-convex Function: y=x3xy = x^3 - x (difference of two networks)
  • Two-dimensional Function: z=r3z = r^3 (sum of two networks)

Image Classification

  • VGG-16 on ImageNet: Replace dense layers in classifier
  • CIFAR-10: Apply in CNN architecture

Experimental Results

One-Dimensional Function Approximation Results

Dense Data Performance (Minimum MSE Error)

Methodx3x^3x11x^{11}sin(x)\sin(x)tanh(3x)\tanh(3x)
Kaiming Initialization2.11×10⁻⁵2.19×10⁻⁵4.50×10⁻⁵5.75×10⁻⁵
RAAI Distribution2.14×10⁻⁵4.40×10⁻⁵3.59×10⁻⁵1.09×10⁻⁵
Skip Pretraining7.63×10⁻⁷1.86×10⁻⁵1.96×10⁻⁷1.07×10⁻⁶
Unregularized Pretraining1.64×10⁻⁷3.20×10⁻⁶4.41×10⁻⁸1.49×10⁻⁷
Complete Method7.86×10⁻⁸8.86×10⁻⁷5.06×10⁻⁸6.82×10⁻⁸

Key Findings

  1. Orders-of-Magnitude Improvement: Complete method achieves 3 orders of magnitude higher precision than default networks
  2. Importance of Pretraining: Even skipping pretraining, initialization alone provides significant improvements
  3. Differentiability Constraint Effect: Enforcing differentiability further enhances stability and accuracy
  4. Dead ReLU Problem: Conventional methods suffer ~50% network collapse due to dead ReLU phenomenon

Sparse Data Generalization Capability

Methodx3x^3x11x^{11}sin(x)\sin(x)tanh(3x)\tanh(3x)
Kaiming Initialization2.41×10⁻⁴2.14×10⁻³2.27×10⁻⁵1.60×10⁻⁴
Complete Method5.65×10⁻⁶6.53×10⁻⁴7.92×10⁻⁷5.09×10⁻⁶

Extended Application Results

Non-convex and Multidimensional Functions

  • x3xx^3-x Approximation: Proposed method error 5.52×10⁻⁷ vs standard 8×5 network error 8×10⁻⁶
  • z=r3z=r^3 Approximation: Proposed method error 3.5×10⁻⁶ vs standard network error 1.5×10⁻⁴ (nearly two orders of magnitude improvement)

Image Classification Performance

  • ImageNet VGG-16: Advantages in early training, comparable final accuracy (73.3%)
  • CIFAR-10: Comparable performance to standard methods, demonstrating generality

Function Approximation Theory

This work builds upon classical neural network approximation theory:

  • Universal Approximation Theorem: Approximation capacity of infinitely wide or deep networks
  • Depth Advantage Theory: Certain functions require subexponential neurons in deep networks but exponential neurons in shallow networks

Triangular Wave Construction

Builds upon work by Telgarsky (2015) and Yarotsky (2017):

  • Symmetric Triangular Waves: Used for constructing exponential-precision approximations of x2x^2
  • Function Composition: Achieving complex function representation through inter-layer composition

Network Initialization Methods

Comparison with existing initialization approaches:

  • Kaiming/Xavier Initialization: Homogeneous methods based on statistical distributions
  • Dead ReLU Problem: Inherent issue of random initialization in deep networks
  • This Work's Contribution: Heterogeneous initialization based on mathematical construction

Conclusions and Discussion

Main Conclusions

  1. Theoretical Breakthrough: First practical method to force ReLU networks to produce exponentially many linear regions
  2. Significant Improvement: Achieves orders-of-magnitude accuracy improvements on one-dimensional function approximation tasks
  3. Extension Potential: Demonstrates applicability of the method to multidimensional and non-convex functions
  4. Practical Value: Can serve as plug-and-play replacement for dense layers in existing architectures

Limitations

  1. Architectural Constraints: Current method limited to specific 4-neuron-wide structures
  2. Function Class Limitations: Direct applicability to one-dimensional convex functions; multidimensional extensions require combinatorial strategies
  3. Limited Effect on Classification: Improvements not significant on image classification tasks
  4. Theoretical Completeness: Lacks universal theoretical framework for arbitrary ReLU networks

Future Directions

  1. Theoretical Extension: Identify dense sets of one-dimensional functions that can be efficiently represented
  2. Multidimensional Methods: Develop more natural multidimensional function representation approaches
  3. Sparse Structures: Overcome current limitation of creating only sparse block-diagonal matrices
  4. Application Exploration: Identify more suitable practical regression tasks

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: Bridges theoretical expressive capacity with practical implementation
  2. Mathematical Rigor: Complete differentiability analysis and convergence proofs
  3. Comprehensive Experiments: Full validation from one-dimensional to multidimensional, from regression to classification
  4. Practical Value: Directly applicable to existing architectures without redesign

Weaknesses

  1. Limited Applicability: Main advantages concentrated in specific function approximation tasks
  2. Scalability Issues: Multidimensional extensions rely on simple combinations lacking theoretical guarantees
  3. Limited Practical Effectiveness: Improvements limited on real classification tasks
  4. Computational Complexity: Two-stage training increases implementation complexity

Impact

  1. Theoretical Contribution: Provides new perspectives and tools for deep learning theory
  2. Methodological Significance: Demonstrates value of mathematical construction in neural network design
  3. Practical Potential: May have important value in scientific computing and engineering applications
  4. Inspirational Value: Provides new ideas and directions for subsequent research

Applicable Scenarios

  1. Scientific Computing: Numerical computation tasks requiring high-precision function approximation
  2. Engineering Applications: Control systems, signal processing and other domains requiring precise modeling
  3. Small Data Scenarios: Tasks with scarce training data but requiring good generalization
  4. Theoretical Research: Tool for studying neural network expressive capacity

References

  1. Hanin, B. & Rolnick, D. (2019). Deep ReLU networks have surprisingly few activation patterns.
  2. Telgarsky, M. (2015). Representation benefits of deep feedforward networks.
  3. Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks.
  4. Montúfar, G. F. et al. (2014). On the number of linear regions of deep neural networks.
  5. Perekrestenko, D. et al. (2018). The universal approximation power of finite-width deep ReLU networks.

Overall Assessment: This is an excellent paper balancing theory and practice, achieving important breakthroughs in realizing the expressive capacity of ReLU networks. While current applications are limited, it provides valuable contributions and insights for both deep learning theory and practice.