Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
Milkert, Hyde, Laine
In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures.
academic
Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
In neural networks with ReLU activation functions, the number of piecewise linear regions in the output can theoretically grow exponentially with depth. However, this rarely occurs when network parameters are randomly sampled, often necessitating unnecessarily large networks. To address this issue, this paper proposes a novel network reparameterization method that constrains weights such that a network of depth d produces exactly 2d linear regions at initialization and maintains these regions during training. The method achieves several orders of magnitude improvement in accuracy compared to randomly initialized networks when learning convex one-dimensional function approximations. The authors also demonstrate preliminary results extending the construction to multidimensional and non-convex functions, enabling this technique to serve as a replacement for conventional dense layers in various architectures.
ReLU networks theoretically possess powerful expressive capacity with linear region counts growing exponentially with depth, yet significant gaps exist between theory and practice:
Theory-Practice Gap: Although theoretically a depth-d ReLU network can produce 2d linear regions, Hanin & Rolnick (2019) proved that randomly initialized networks have average linear region counts independent of depth, depending only on total neuron count.
Limitations of Gradient Descent: Gradient descent struggles to create new activation regions because the number of linear regions is not a "local" property in parameter space and cannot be directly optimized through gradient-based methods.
Network Redundancy Problem: In practice, approximately 95% of weights can be eliminated without significantly affecting accuracy, indicating inefficiency in conventional training methods.
The core motivation of this work is to develop mathematical algorithms that circumvent limitations of random initialization, compelling ReLU networks to realize their theoretical expressive capacity, thereby achieving better performance with smaller networks.
Novel Reparameterization Method: Proposes a reparameterization strategy for 4-neuron-wide ReLU networks of arbitrary depth, ensuring that depth-d networks produce 2d activation regions at initialization.
Pretraining Strategy: Develops a pretraining method that enforces the existence of 2d activation regions during optimization.
Significant Performance Improvement: Achieves orders-of-magnitude improvements in network performance on one-dimensional test cases.
Extended Applications: Extends the method to non-convex and multidimensional functions, and serves as a plug-and-play replacement for dense layers in arbitrary networks.
Algorithm 1: Initialization and Pretraining
A ← Random((0,1)^n) # Triangular peak positions
while Epochs > 0:
Network ← Set_Weights(A) # Set weights according to A
Loss ← (Network(x) - y)²
Network_Gradient ← ∂Loss/∂Network
A_Gradient ← ∂Network/∂A # Backprop through weight setting
Gradient ← Network_Gradient × A_Gradient
A ← A - ε × Gradient # Update A rather than network weights
Hanin, B. & Rolnick, D. (2019). Deep ReLU networks have surprisingly few activation patterns.
Telgarsky, M. (2015). Representation benefits of deep feedforward networks.
Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks.
Montúfar, G. F. et al. (2014). On the number of linear regions of deep neural networks.
Perekrestenko, D. et al. (2018). The universal approximation power of finite-width deep ReLU networks.
Overall Assessment: This is an excellent paper balancing theory and practice, achieving important breakthroughs in realizing the expressive capacity of ReLU networks. While current applications are limited, it provides valuable contributions and insights for both deep learning theory and practice.