Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
Han
Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
Weight initialization controls signal propagation and gradient flow at the beginning of training. This paper provides a theoretically grounded and empirically validated study covering two domains: compact ReLU multilayer perceptrons and GPT-2-style Transformers. First, through logarithmic sweeps over initial standard deviations, the paper maps regions of vanishing and exploding gradients, identifying a broad stability band with standard deviations between 1e-2 and 1e-1. Second, controlled comparisons demonstrate that under ReLU activation, Kaiming (fan-in) initialization converges faster and more stably than Xavier initialization, consistent with variance preservation theory. Third, in a 12-layer GPT-2-style model built from scratch, the paper tracks variance changes in Q/K/V weight matrices across layers during pretraining, observing depth-dependent equilibrium phenomena: shallow layers expand rapidly while deeper layers change more gradually.
The core problems addressed by this research concern the impact of weight initialization on training stability and convergence in deep neural networks and large language models. Specifically:
Initialization Scale Sensitivity: How different initial standard deviations affect training stability
Activation Function Specificity: Whether activation functions like ReLU and GELU require specific initialization strategies
Variance Dynamics in Modern Transformers: Whether variance stabilization persists in large Transformer models
Although classical initialization methods (LeCun, Xavier/Glorot, He/Kaiming) rest on intuitive variance-preservation arguments, several gaps remain in practice:
Sensitivity to deviations from ideal scales has not been adequately quantified
How specific activation functions (e.g., ReLU, GELU) interact with the initialization scale remains unclear
Systematic studies of performance in large Transformers are lacking
Unified Variance Analysis Framework: Derives forward and backward variance propagation conditions for common activation functions (ReLU, GELU), explaining how fan-in scaling preserves signal amplitude and the origin of the factor of 2 in ReLU
Quantification of Scale Sensitivity: Through logarithmic sweeps over 25 standard deviation values, maps vanishing/exploding gradient regions and identifies a stable training band σ ∈ [10⁻², 10⁻¹]
Activation-Aware Initialization Verification: In controlled ReLU MLP training, confirms that Kaiming normal (fan-in) converges faster with smaller loss variance compared to Xavier normal
Transformer Variance Dynamics Analysis: In a 12-layer GPT-2-style model built from scratch, discovers clear depth-dependent patterns: shallow layer weight standard deviations expand rapidly while deeper layers change more gradually, ultimately stabilizing in a narrow variance band
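The scale sweep listed among the contributions can be illustrated with a small forward-pass sketch. This is not the paper's experiment: the width (64), depth (6), sweep range (1e-4 to 1), and the helper names relu_forward_rms and log_sweep are all assumptions chosen for illustration. The snippet pushes one random input through a bias-free ReLU MLP whose weights are drawn from N(0, σ²) and reports the RMS of the final activations for 25 log-spaced values of σ.

```python
import math
import random

def relu_forward_rms(sigma, width=64, depth=6, seed=0):
    """Propagate one random input through a bias-free ReLU MLP.

    Every weight is drawn i.i.d. from N(0, sigma^2). Returns the RMS of
    the activations after the last layer. Width and depth are assumed
    toy values, not the paper's configuration.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # One dense layer followed by ReLU; weights are sampled on the fly.
        x = [max(0.0, sum(rng.gauss(0.0, sigma) * xi for xi in x))
             for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)

def log_sweep(lo_exp=-4.0, hi_exp=0.0, n=25):
    """Return n log-spaced values from 10**lo_exp to 10**hi_exp (assumed range)."""
    return [10 ** (lo_exp + i * (hi_exp - lo_exp) / (n - 1)) for i in range(n)]

sweep = [(s, relu_forward_rms(s)) for s in log_sweep()]
for s, rms in sweep:
    print(f"sigma={s:.2e}  final RMS={rms:.3e}")
```

Because ReLU is positively homogeneous, the final RMS grows roughly as σ raised to the depth, so values of σ well below sqrt(2/width) shrink the signal toward zero and values well above it blow it up. Where the stable band falls therefore depends on layer width, and this toy will not necessarily reproduce the band reported for the paper's networks.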
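Forward and backward preservation conditions typically cannot be satisfied simultaneously unless n_in ≈ n_out and c_φ ≈ d_φ, where c_φ and d_φ denote the activation-dependent factors in the forward and backward conditions, respectively. In practice, maintaining forward signal stability is usually more important, which helps explain why fan-in He/Kaiming converges faster than Xavier under ReLU.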
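The forward condition under ReLU can be made concrete with a short calculation. The sketch below uses the standard formulas for the two schemes, a Kaiming (fan-in) standard deviation of sqrt(2 / n_in) and a Xavier standard deviation of sqrt(2 / (n_in + n_out)), together with the usual result that ReLU halves the second moment of a zero-mean symmetric pre-activation. The layer width of 512, the depth of 12, and the helper names are assumptions for illustration only.

```python
import math

def kaiming_std(fan_in):
    """Kaiming/He normal standard deviation for ReLU (fan-in mode)."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal standard deviation."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def relu_forward_gain(std, fan_in):
    """Per-layer ratio of pre-activation variances under ReLU.

    Each layer multiplies the variance by fan_in * std^2, and ReLU
    keeps half of the second moment, giving fan_in * std^2 / 2.
    """
    return fan_in * std ** 2 / 2.0

fan_in = fan_out = 512   # assumed square layers
depth = 12               # assumed depth
for name, std in [("kaiming", kaiming_std(fan_in)),
                  ("xavier", xavier_std(fan_in, fan_out))]:
    gain = relu_forward_gain(std, fan_in)
    print(f"{name}: std={std:.4f}  per-layer gain={gain:.3f}  "
          f"after {depth} layers={gain ** depth:.3e}")
```

For square layers the Kaiming gain is 1 and the Xavier gain is 1/2, so under these assumptions the forward signal variance is preserved with Kaiming and halves at every ReLU layer with Xavier. This is the origin of the factor of 2.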
Depth-Dependent Patterns: Shallow layers display rapid and significant weight standard deviation expansion in early training, while deeper layers expand more slowly and smoothly
Variance Equilibrium: All layers eventually stabilize in a narrow variance band
Distribution Sparsification: Post-training weight distributions become sparser, with many entries near zero remaining unchanged while a few large-magnitude weights dominate
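One way to measure the quantities described above is to record, for each layer, the standard deviation of a weight matrix and the fraction of entries close to zero. The sketch below is a hypothetical helper, not the paper's tracking code: the name layer_stats, the near-zero threshold eps, the matrix size, and the synthetic Gaussian weights with standard deviation 0.02 are all assumptions standing in for real Q/K/V checkpoints.

```python
import random
import statistics

def layer_stats(weights, eps=1e-3):
    """Return (std, near_zero_fraction) for a weight matrix given as nested lists.

    eps is an assumed threshold for calling an entry "near zero".
    """
    flat = [w for row in weights for w in row]
    std = statistics.pstdev(flat)
    near_zero = sum(abs(w) < eps for w in flat) / len(flat)
    return std, near_zero

rng = random.Random(0)
# Synthetic stand-ins for per-layer query matrices (not the paper's weights).
layers = {
    f"layer{i}.q": [[rng.gauss(0.0, 0.02) for _ in range(64)] for _ in range(64)]
    for i in range(3)
}
for name, w in layers.items():
    std, frac = layer_stats(w)
    print(f"{name}: std={std:.4f}  near-zero fraction={frac:.3f}")
```

Calling such a helper on each layer's Q/K/V matrices at regular checkpoints would produce per-layer curves of standard deviation over training, and comparing the near-zero fraction before and after training gives a simple view of how sparse the distribution has become.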
The paper cites key works in the initialization field, including foundational research by LeCun, Glorot, He and others, as well as recent advances in Transformer optimization, providing a solid theoretical foundation for this research.