2025-11-19T10:07:13.697330

Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis

Oikonomidis, Quan, Patrinos
We study nonlinearly preconditioned gradient methods for smooth nonconvex optimization problems, focusing on sigmoid preconditioners that inherently perform a form of gradient clipping akin to the widely used gradient clipping technique. Building upon this idea, we introduce a novel heavy ball-type algorithm and provide convergence guarantees under a generalized smoothness condition that is less restrictive than traditional Lipschitz smoothness, thus covering a broader class of functions. Additionally, we develop a stochastic variant of the base method and study its convergence properties under different noise assumptions. We compare the proposed algorithms with baseline methods on diverse tasks from machine learning including neural network training.
academic

Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis

Basic Information

  • Paper ID: 2510.11312
  • Title: Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis
  • Authors: Konstantinos Oikonomidis, Jan Quan, Panagiotis Patrinos (KU Leuven)
  • Classification: math.OC (Optimization and Control)
  • Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
  • Paper Link: https://arxiv.org/abs/2510.11312

Abstract

This paper investigates nonlinearly preconditioned gradient methods for smooth nonconvex optimization problems, with emphasis on sigmoid preconditioning that essentially performs gradient clipping—a widely used technique. Building on this idea, the authors introduce a novel heavy-ball algorithm and provide convergence guarantees under generalized smoothness conditions that are less restrictive than traditional Lipschitz smoothness, thereby covering a broader class of functions. Furthermore, the authors develop stochastic variants of the base method and investigate their convergence properties under different noise assumptions.

Research Background and Motivation

  1. Problem to be Addressed: Traditional gradient descent (GD) and stochastic gradient descent (SGD) methods require careful parameter tuning or expensive line search strategies when handling modern machine learning applications that do not satisfy the global Lipschitz gradient assumption.
  2. Problem Importance: Most cost functions in modern deep learning applications do not satisfy the traditional Lipschitz gradient assumption, and gradient clipping techniques have become standard practice in tasks such as language model training to stabilize neural network training.
  3. Limitations of Existing Methods:
    • Standard GD/SGD methods struggle with convergence when handling problems beyond Lipschitz smoothness
    • Theoretical analysis of existing gradient clipping methods is primarily limited to specific smoothness conditions
    • Lack of momentum method analysis in more general settings
  4. Research Motivation: To unify gradient clipping methods within a nonlinear preconditioning framework and extend to more general theoretical analysis including momentum and stochastic variants.

Core Contributions

  1. Extended Anisotropic Gradient Descent Methods: By incorporating heavy-ball momentum into the base iteration, convergence guarantees are studied in general nonconvex settings.
  2. Proposed Stochastic Extensions: Analyzed stochastic versions of the base method under different noise assumptions, including conditions more relaxed than bounded variance.
  3. Theoretical Analysis Contributions:
    • Proved convergence of momentum algorithms under anisotropic descent inequalities
    • Established linear convergence rates under generalized PL conditions
    • Analyzed stochastic methods under new noise assumptions
  4. Experimental Validation: Demonstrated good performance of the proposed method on various machine learning tasks, including neural network training and matrix factorization.

Method Details

Problem Formulation

Consider the general minimization problem: minxRnf(x)\min_{x \in \mathbb{R}^n} f(x) where f:RnRf: \mathbb{R}^n \to \mathbb{R} is a smooth and possibly nonconvex function.

Core Framework: Nonlinearly Preconditioned Gradient Methods

Base Method: xk+1=xkγϕ(f(xk))x^{k+1} = x^k - \gamma \nabla \phi^*(\nabla f(x^k))

where ϕ:RnR\phi: \mathbb{R}^n \to \mathbb{R} is a convex reference function, ϕ\phi^* is its convex conjugate, and ϕ\nabla \phi^* generates the preconditioner.

Key Idea: By choosing a strongly convex reference function ϕ\phi with bounded domain, the mapping ϕ\nabla \phi^* maps Rn\mathbb{R}^n to the unit nn-ball, naturally implementing gradient clipping.

Algorithm 1: Nonlinearly Preconditioned Gradient Method with Momentum (m-NPGM)

Input: Choose x⁰ ∈ ℝⁿ, γ, β > 0, set m⁻¹ = 0ⁿ
Repeat k = 0, 1, ... until convergence:
1. Compute mᵏ = βmᵏ⁻¹ + (1-β)∇φ*(∇f(xᵏ))
2. Compute xᵏ⁺¹ = xᵏ - γmᵏ

Equivalent Form: xk+1=xk(1β)γϕ(f(xk))+β(xkxk1)x^{k+1} = x^k - (1-\beta)\gamma\nabla\phi^*(\nabla f(x^k)) + \beta(x^k - x^{k-1})

Anisotropic Descent Inequality

Definition: A function ff satisfies the anisotropic descent property with respect to ϕ\phi if for all x,xˉRnx, \bar{x} \in \mathbb{R}^n: f(x)f(xˉ)+1Lϕ(xyˉ)1Lϕ(xˉyˉ)f(x) \leq f(\bar{x}) + \frac{1}{L} \star \phi(x - \bar{y}) - \frac{1}{L} \star \phi(\bar{x} - \bar{y}) where yˉ=xˉ1Lϕ(f(xˉ))\bar{y} = \bar{x} - \frac{1}{L}\nabla\phi^*(\nabla f(\bar{x})).

Technical Innovations

  1. Momentum Design: Unlike standard methods, the momentum estimate in this paper consists of a convex combination of preconditioned gradients, rather than aggregating gradients first and then preconditioning.
  2. Generalized Smoothness: Anisotropic smoothness is less restrictive than (L0,L1)(L_0, L_1)-smoothness, covering a broader class of functions.
  3. Unified Analysis Framework: Provides unified convergence analysis based on the convexity of the reference function ϕ\phi.

Theoretical Results

Main Convergence Theorems

Theorem 2.2: Under anisotropic smoothness conditions, for β[0,0.5)\beta \in [0, 0.5) and γ=α/L\gamma = \alpha/L, α1\alpha \leq 1: min0kKϕ(ϕ(f(xk)))L(f(x0)f)α(K+1)(12β)\min_{0 \leq k \leq K} \phi(\nabla\phi^*(\nabla f(x^k))) \leq \frac{L(f(x^0) - f^*)}{α(K+1)(1-2\beta)}

Theorem 2.4: Under generalized PL conditions, for 2-homogeneous reference functions: f(xk)fαk(f(x0)f)f(x^k) - f^* \leq \alpha^k(f(x^0) - f^*) where α=max{1γμ(β2β2),β+2β2}\alpha = \max\{1 - \gamma\mu(\beta - 2\beta^2), \beta + 2\beta^2\}.

Stochastic Method Analysis

Theorem 3.1: Under noise condition E[ϕ(ϕ(f(x))ϕ(g(x)))]σ2\mathbb{E}[\phi(\nabla\phi^*(\nabla f(x)) - \nabla\phi^*(g(x)))] \leq \sigma^2: E[1Kk=0K1ϕ(ϕ(f(xk)))]f(x0)fγK+σ2\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} \phi(\nabla\phi^*(\nabla f(x^k)))\right] \leq \frac{f(x^0) - f^*}{\gamma K} + \sigma^2

Experimental Setup

Datasets

  1. MNIST: Handwritten digit classification using two-layer fully connected networks
  2. CIFAR-10/100: Image classification using ResNet-18/34 architectures
  3. MovieLens 100K: Matrix factorization problem
  4. Phase Retrieval: Nonconvex optimization problem

Evaluation Metrics

  • Training loss convergence speed
  • Test accuracy
  • Gradient norm f(xk)\|\nabla f(x^k)\|

Comparison Methods

  • SGD/SGDm: Standard stochastic gradient descent and its momentum variant
  • Adam: Adaptive learning rate method
  • GD/GDm: Standard gradient descent and its momentum variant
  • AdGD-accel: Accelerated variant of adaptive gradient method

Implementation Details

  • Fixed step size
  • Hyperbolic Gradient Descent (HGD): ϕ(x)=cosh(x)1\phi(x) = \cosh(\|x\|) - 1
  • Separable variant: ϕ(x)=i=1ncosh(xi)1\phi(x) = \sum_{i=1}^n \cosh(x_i) - 1

Experimental Results

Main Results

  1. MNIST Classification: iHGD quickly achieves small training loss, outperforming SGD and Adam
  2. CIFAR-10 Classification: Proposed method performs comparably to SGD and SGDm, which are state-of-the-art for this problem
  3. Matrix Factorization: iHGDm significantly outperforms other methods and shows greater stability across different random initializations
  4. Phase Retrieval: sHGD performs similarly to gradient clipping methods

Key Findings

  1. Adaptive Step Size: For reference functions with growth rate exceeding quadratic, the preconditioner naturally forms a sigmoid shape, providing an implicit adaptive step size rule
  2. Stability: On nonconvex problems such as matrix factorization, the proposed method exhibits better stability
  3. Broad Applicability: The method performs well across different types of machine learning tasks

Dual-Space Preconditioning/Anisotropic Gradient Descent

  • Originally introduced in 32 for convex essentially smooth problems
  • Anisotropic descent inequalities introduced in 24
  • 36 showed this method encompasses many popular algorithms

Gradient Clipping and Generalized Smoothness

  • (L0,L1)(L_0, L_1)-smoothness concept introduced in 48
  • General clipping framework with momentum analyzed in 47
  • Extensive work on studying such methods under relaxed noise and smoothness assumptions

Conclusions and Discussion

Main Conclusions

  1. Successfully extended anisotropic gradient descent methods to include heavy-ball momentum
  2. Provided convergence guarantees under conditions less restrictive than traditional Lipschitz smoothness
  3. Developed stochastic variants and analyzed them under new noise assumptions
  4. Experimental validation demonstrated method effectiveness on various machine learning tasks

Limitations

  1. Momentum parameter restricted to β[0,0.5)\beta \in [0, 0.5), cannot be extended to β[0,1)\beta \in [0, 1)
  2. Preconditioner Lipschitz continuity assumption is more restrictive than anisotropic smoothness
  3. Complete analysis of stochastic momentum methods not provided

Future Directions

  1. Unified analysis of momentum algorithms under relaxed reference function assumptions
  2. Extension to arbitrary momentum parameters β[0,1)\beta \in [0, 1)
  3. Extension of complete proximal gradient-type algorithms to include momentum
  4. Removal of batch size dependence for stochastic algorithms and inclusion of momentum

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: Provides the first momentum method analysis under anisotropic smoothness conditions
  2. Unified Framework: Unifies multiple methods including gradient clipping within the nonlinear preconditioning framework
  3. Practical Value: Method performs well on practical machine learning tasks
  4. Analysis Depth: Provides complete theoretical analysis in both deterministic and stochastic settings

Weaknesses

  1. Parameter Restrictions: Momentum parameter restrictions (β<0.5\beta < 0.5) are more stringent than standard analysis
  2. Assumption Strength: Some theoretical results require additional technical assumptions
  3. Experimental Scope: Experiments primarily focus on standard machine learning tasks, lacking broader application validation

Impact

  1. Theoretical Contribution: Provides new tools and insights for theoretical analysis of nonlinear preconditioning methods
  2. Practical Value: Offers new methods for handling optimization problems beyond standard smoothness assumptions
  3. Reproducibility: Authors provide publicly available code implementation

Applicable Scenarios

  1. Neural network training, particularly scenarios where gradients may be large
  2. Nonconvex optimization problems such as matrix factorization
  3. Applications requiring gradient clipping or normalization
  4. Optimization problems beyond standard Lipschitz smoothness

References

The paper includes 48 references covering important works in optimization theory, machine learning, and numerical methods, providing a solid theoretical foundation for the research.