Stability of Transformers under Layer Normalization
Kan, Li, Zhang et al.
Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.
Title: Stability of Transformers under Layer Normalization
Authors: Kelvin Kan (UCLA), Xingjian Li (UT Austin), Benjamin J. Zhang (UNC Chapel Hill), Tuhin Sahai (SRI International), Stanley Osher (UCLA), Krishna Kumar (UT Austin), Markos A. Katsoulakis (UMass Amherst)
Although Transformers are widely used, training deep Transformers can be unstable. Layer normalization (LN), a standard component, improves training stability, but its placement is often ad hoc. This paper provides a principled investigation of the forward stability (hidden states) and backward stability (gradients) of Transformers under different layer normalization placements. The theoretical analysis addresses a key question about training dynamics: whether training drives the Transformer toward regular solutions or toward pathological behavior. For forward stability, the paper derives explicit bounds on hidden state growth in trained Transformers. For backward stability, it analyzes how layer normalization affects gradient backpropagation, which explains the training dynamics of each placement. The analysis also guides the scaling of residual step sizes in Transformer blocks, where appropriate choices further improve stability and performance.
The core problem addressed in this research is the mechanism by which different layer normalization placements affect Transformer training stability. Specifically, this includes:
Forward Stability Problem: Controlling hidden state growth in deep networks
Backward Stability Problem: Gradient stability during backpropagation
Architecture Design Guidance: Providing theoretical guidance for new Transformer variants
Practical Value: Transformers are fundamental to modern deep learning, and their training stability directly impacts model performance and training efficiency
The authors adopt a novel perspective based on continuous-time dynamics and optimal control theory. They model Transformer training as a mean-field control problem, which makes it possible to analyze model properties after training has converged rather than only at initialization.
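As a generic illustration of the continuous-time view, a residual update with step size h can be read as one forward Euler step of an ordinary differential equation. This is the standard residual-network-as-ODE sketch, not necessarily the authors' exact formulation; the symbols f, θ, h, and T are assumptions introduced here for illustration.

```latex
x_{l+1} = x_l + h\, f(x_l, \theta_l), \quad l = 0, \dots, D-1
\qquad\longleftrightarrow\qquad
\dot{x}(t) = f\bigl(x(t), \theta(t)\bigr), \quad t \in [0, T], \quad h = T/D .
```

In this reading, choosing the parameters θ(t) to minimize a training loss is an optimal control problem over the trajectory x(t).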
Theoretical Framework Innovation: Proposes a novel framework based on optimal control theory to systematically analyze Transformer stability under different layer normalization positions
Forward Stability Analysis: Derives explicit bounds on hidden state growth, proving that Pre-LN leads to unbounded growth while Peri-LN maintains controlled growth
Backward Stability Analysis: Reveals the mechanism by which layer normalization affects gradient backpropagation
Residual Step Size Scaling: Proposes residual step size scaling methods to improve stability and performance
Experimental Validation: Validates theoretical findings on GPT-2 series models
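The placements discussed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's experimental setup: the sublayer is a hypothetical random two-layer ReLU map standing in for attention or an MLP, LayerNorm has no learnable gain or bias, Peri-LN is taken to mean normalizing both the input and the output of the sublayer, and the names `block`, `make_sublayer`, `hidden_norms`, and the step size `h` are introduced here for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learnable gain or bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def make_sublayer(d, rng, scale=1.0):
    """Hypothetical stand-in for attention/MLP: a random two-layer ReLU map."""
    w1 = rng.normal(0.0, scale / np.sqrt(d), size=(d, d))
    w2 = rng.normal(0.0, scale / np.sqrt(d), size=(d, d))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def block(x, f, placement, h=1.0):
    """One residual block with residual step size h (assumed placement of h)."""
    if placement == "pre":    # normalize the sublayer input only
        return x + h * f(layer_norm(x))
    if placement == "post":   # normalize after the residual sum
        return layer_norm(x + h * f(x))
    if placement == "peri":   # normalize sublayer input and output
        return x + h * layer_norm(f(layer_norm(x)))
    raise ValueError(f"unknown placement: {placement}")

def hidden_norms(placement, depth=24, n=8, d=64, h=1.0, seed=0):
    """Frobenius norm of the hidden state after each block (index 0 is the input)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    norms = [np.linalg.norm(x)]
    for _ in range(depth):
        f = make_sublayer(d, rng, scale=2.0)  # fresh random weights per block
        x = block(x, f, placement, h)
        norms.append(np.linalg.norm(x))
    return np.array(norms)

if __name__ == "__main__":
    for placement in ("pre", "post", "peri"):
        norms = hidden_norms(placement)
        print(placement, "first:", norms[0], "last:", norms[-1])
```

With random rather than trained weights, this only shows the structural difference. In the Peri-LN variant each block can add at most h·sqrt(nd) to the Frobenius norm, because the normalized sublayer output has row norm at most sqrt(d). In the Pre-LN variant the sublayer output is unnormalized, so the growth depends on the weights.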
Theorem 2: The optimal solution of the Pre-LN training problem is unbounded in magnitude.
Proof Strategy: By analyzing the Hamilton-Jacobi-Bellman (HJB) partial differential equation, it is proven that the corresponding Hamiltonian does not exist, causing the training problem to degenerate.
Theorem 3: Even with weight decay, Pre-LN Transformer hidden states exhibit exponential growth:
$$\mathrm{MA}(X_D) \le (1 + C(\lambda))^{D}\, nd\, \|X_0\|_F = O(e^{D})$$
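To see why a factor of this form is exponential in the depth D, consider the following elementary step. It assumes that C = C(λ) is positive and does not depend on D, which the summary does not state explicitly.

```latex
(1 + C)^{D} = \exp\bigl(D \log(1 + C)\bigr) \le \exp(C D),
\qquad \text{since } \log(1 + C) \le C \text{ for } C > -1 .
```

Under that assumption the bound grows like e^{cD} with c = log(1 + C), so a smaller C(λ) slows the exponential rate but does not remove it.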
The paper cites multiple important works, including:
Ba et al. (2016): Original Layer Normalization paper
Xiong et al. (2020): Pre-LN vs Post-LN comparative study
Kim et al. (2025): Empirical study of Peri-LN
He et al. (2016): Pioneering work on residual connections
Overall Assessment: This is a high-quality paper that effectively combines theory and practice, providing a new mathematical framework for Transformer stability analysis with significant academic value and practical implications. The theoretical analysis is rigorous and in-depth, experimental validation is comprehensive, and it provides valuable guidance for deep learning architecture design.