2025-11-19T10:07:13.697330

Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis

Oikonomidis, Quan, Patrinos

We study nonlinearly preconditioned gradient methods for smooth nonconvex optimization problems, focusing on sigmoid preconditioners that inherently perform a form of gradient clipping akin to the widely used gradient clipping technique. Building upon this idea, we introduce a novel heavy ball-type algorithm and provide convergence guarantees under a generalized smoothness condition that is less restrictive than traditional Lipschitz smoothness, thus covering a broader class of functions. Additionally, we develop a stochastic variant of the base method and study its convergence properties under different noise assumptions. We compare the proposed algorithms with baseline methods on diverse tasks from machine learning including neural network training.

Basic Information

Paper ID: 2510.11312
Title: Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis
Authors: Konstantinos Oikonomidis, Jan Quan, Panagiotis Patrinos (KU Leuven)
Classification: math.OC (Optimization and Control)
Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: https://arxiv.org/abs/2510.11312

Abstract

This paper investigates nonlinearly preconditioned gradient methods for smooth nonconvex optimization problems, with emphasis on sigmoid preconditioning that essentially performs gradient clipping—a widely used technique. Building on this idea, the authors introduce a novel heavy-ball algorithm and provide convergence guarantees under generalized smoothness conditions that are less restrictive than traditional Lipschitz smoothness, thereby covering a broader class of functions. Furthermore, the authors develop stochastic variants of the base method and investigate their convergence properties under different noise assumptions.

Research Background and Motivation

Problem to be Addressed: Traditional gradient descent (GD) and stochastic gradient descent (SGD) methods require careful parameter tuning or expensive line search strategies when handling modern machine learning applications that do not satisfy the global Lipschitz gradient assumption.
Problem Importance: Most cost functions in modern deep learning applications do not satisfy the traditional Lipschitz gradient assumption, and gradient clipping techniques have become standard practice in tasks such as language model training to stabilize neural network training.
Limitations of Existing Methods:
- Standard GD/SGD methods struggle with convergence when handling problems beyond Lipschitz smoothness
- Theoretical analysis of existing gradient clipping methods is primarily limited to specific smoothness conditions
- Lack of momentum method analysis in more general settings
Research Motivation: To unify gradient clipping methods within a nonlinear preconditioning framework and extend to more general theoretical analysis including momentum and stochastic variants.

Core Contributions

Extended Anisotropic Gradient Descent Methods: Incorporates heavy-ball momentum into the base iteration and establishes convergence guarantees in general nonconvex settings.
Proposed Stochastic Extensions: Analyzed stochastic versions of the base method under different noise assumptions, including conditions more relaxed than bounded variance.
Theoretical Analysis Contributions:
- Proved convergence of momentum algorithms under anisotropic descent inequalities
- Established linear convergence rates under generalized PL conditions
- Analyzed stochastic methods under new noise assumptions
Experimental Validation: Demonstrated good performance of the proposed method on various machine learning tasks, including neural network training and matrix factorization.

Method Details

Problem Formulation

Consider the general minimization problem: $\min_{x \in \mathbb{R}^n} f(x)$ where $f: \mathbb{R}^n \to \mathbb{R}$ is a smooth and possibly nonconvex function.

Core Framework: Nonlinearly Preconditioned Gradient Methods

Base Method: $x^{k+1} = x^k - \gamma \nabla \phi^*(\nabla f(x^k))$

where $\phi: \mathbb{R}^n \to \mathbb{R}$ is a convex reference function, $\phi^*$ is its convex conjugate, and $\nabla \phi^*$ generates the preconditioner.

Key Idea: When the reference function $\phi$ is strongly convex with bounded domain, the mapping $\nabla \phi^*$ sends $\mathbb{R}^n$ into that domain, for instance the unit $n$-ball, which naturally implements a form of gradient clipping.
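The clipping effect can be illustrated with a minimal sketch. It assumes the reference function $\phi(x) = -\sqrt{1 - \|x\|^2}$ on the unit ball, which this summary does not state explicitly; for that choice $\nabla\phi^*(y) = y / \sqrt{1 + \|y\|^2}$. The function names are hypothetical.

```python
import math

def grad_phi_conj(y):
    """grad phi^*(y) = y / sqrt(1 + ||y||^2) for the ASSUMED phi(x) = -sqrt(1 - ||x||^2)."""
    norm_sq = sum(v * v for v in y)
    scale = 1.0 / math.sqrt(1.0 + norm_sq)
    return [scale * v for v in y]

def base_step(x, grad, gamma):
    """One base-method step: x - gamma * grad phi^*(grad f(x))."""
    p = grad_phi_conj(grad)
    return [xi - gamma * pi for xi, pi in zip(x, p)]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

small = grad_phi_conj([1e-3, 0.0])   # small gradients pass through almost unchanged
large = grad_phi_conj([3e4, -4e4])   # large gradients keep direction, norm stays below 1
x_next = base_step([1.0, 2.0], [3e4, -4e4], gamma=0.1)
```

Because the preconditioned gradient always has norm below 1, each step moves the iterate by less than `gamma`, however large the raw gradient is.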

Algorithm 1: Nonlinearly Preconditioned Gradient Method with Momentum (m-NPGM)

Input: Choose x⁰ ∈ ℝⁿ, γ, β > 0, set m⁻¹ = 0ⁿ
Repeat k = 0, 1, ... until convergence:
1. Compute mᵏ = βmᵏ⁻¹ + (1-β)∇φ*(∇f(xᵏ))
2. Compute xᵏ⁺¹ = xᵏ - γmᵏ

Equivalent Form: $x^{k+1} = x^k - (1-\beta)\gamma\nabla\phi^*(\nabla f(x^k)) + \beta(x^k - x^{k-1})$
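The following sketch runs Algorithm 1 next to the equivalent heavy-ball form and checks numerically that the two produce the same iterates. It assumes the convention $x^{-1} = x^0$ (matching $m^{-1} = 0$), a coordinate-wise $\operatorname{arcsinh}$ preconditioner (the gradient of the conjugate of a separable $\cosh$ reference function), and a hypothetical toy objective $f(x) = \frac{1}{4}\sum_i x_i^4$.

```python
import math

def precond(g):
    """Assumed separable preconditioner: arcsinh applied coordinate-wise."""
    return [math.asinh(v) for v in g]

def grad_f(x):
    """Gradient of the hypothetical toy objective f(x) = sum(x_i^4) / 4."""
    return [v ** 3 for v in x]

def m_npgm(x0, gamma, beta, iters):
    """Algorithm 1: m = beta*m + (1-beta)*precond(grad f(x)); x = x - gamma*m."""
    x, m = list(x0), [0.0] * len(x0)
    traj = [list(x)]
    for _ in range(iters):
        p = precond(grad_f(x))
        m = [beta * mi + (1.0 - beta) * pi for mi, pi in zip(m, p)]
        x = [xi - gamma * mi for xi, mi in zip(x, m)]
        traj.append(list(x))
    return traj

def heavy_ball_form(x0, gamma, beta, iters):
    """Equivalent form, started with x^{-1} = x^0."""
    x_prev, x = list(x0), list(x0)
    traj = [list(x)]
    for _ in range(iters):
        p = precond(grad_f(x))
        x_next = [xi - (1.0 - beta) * gamma * pi + beta * (xi - xpi)
                  for xi, pi, xpi in zip(x, p, x_prev)]
        x_prev, x = x, x_next
        traj.append(list(x))
    return traj

traj_a = m_npgm([2.0, -3.0], gamma=0.1, beta=0.3, iters=50)
traj_b = heavy_ball_form([2.0, -3.0], gamma=0.1, beta=0.3, iters=50)
# Largest coordinate-wise difference between the two trajectories.
max_gap = max(abs(a - b) for xa, xb in zip(traj_a, traj_b) for a, b in zip(xa, xb))
```

Note that the preconditioner is applied to each gradient before averaging, so the momentum buffer is a weighted average of already-clipped directions.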

Anisotropic Descent Inequality

Definition: A function $f$ satisfies the anisotropic descent property with respect to $\phi$ if, for all $x, \bar{x} \in \mathbb{R}^n$: $f(x) \leq f(\bar{x}) + \frac{1}{L} \star \phi(x - \bar{y}) - \frac{1}{L} \star \phi(\bar{x} - \bar{y})$, where $\bar{y} = \bar{x} - \frac{1}{L}\nabla\phi^*(\nabla f(\bar{x}))$ and $\lambda \star \phi := \lambda\,\phi(\cdot/\lambda)$ denotes epi-scaling.
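For intuition, the inequality can be specialized to the Euclidean reference function. This is a sketch under two assumptions: that $\star$ denotes epi-scaling, $(\lambda \star \phi)(x) = \lambda\,\phi(x/\lambda)$, and that $\phi = \tfrac{1}{2}\|\cdot\|^2$, so $\nabla\phi^*$ is the identity:

$$
\begin{aligned}
\left(\tfrac{1}{L} \star \phi\right)(x) &= \tfrac{1}{L}\,\phi(Lx) = \tfrac{L}{2}\|x\|^2, \qquad \bar{y} = \bar{x} - \tfrac{1}{L}\nabla f(\bar{x}), \\
f(x) &\leq f(\bar{x}) + \tfrac{L}{2}\|x - \bar{y}\|^2 - \tfrac{L}{2}\|\bar{x} - \bar{y}\|^2 \\
&= f(\bar{x}) + \langle \nabla f(\bar{x}),\, x - \bar{x} \rangle + \tfrac{L}{2}\|x - \bar{x}\|^2 .
\end{aligned}
$$

The last line follows by expanding $\|x - \bar{y}\|^2 = \|x - \bar{x}\|^2 + \tfrac{2}{L}\langle \nabla f(\bar{x}), x - \bar{x}\rangle + \tfrac{1}{L^2}\|\nabla f(\bar{x})\|^2$ and noting that $\|\bar{x} - \bar{y}\|^2 = \tfrac{1}{L^2}\|\nabla f(\bar{x})\|^2$. Under these assumptions the anisotropic descent property reduces to the classical descent lemma for functions with $L$-Lipschitz gradient.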

Technical Innovations

Momentum Design: Unlike standard methods, the momentum estimate in this paper consists of a convex combination of preconditioned gradients, rather than aggregating gradients first and then preconditioning.
Generalized Smoothness: Anisotropic smoothness is less restrictive than $(L_0, L_1)$-smoothness, covering a broader class of functions.
Unified Analysis Framework: Provides unified convergence analysis based on the convexity of the reference function $\phi$.

Theoretical Results

Main Convergence Theorems

Theorem 2.2: Under anisotropic smoothness conditions, for $\beta \in [0, 0.5)$ and $\gamma = \alpha/L$, $\alpha \leq 1$: $\min_{0 \leq k \leq K} \phi(\nabla\phi^*(\nabla f(x^k))) \leq \frac{L(f(x^0) - f^*)}{\alpha(K+1)(1-2\beta)}$
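To see how the constants in this bound interact, the right-hand side can be inverted for the number of iterations needed to reach a tolerance `eps`. The helper below is hypothetical and only rearranges the stated inequality; it describes the worst-case bound, not observed performance.

```python
import math

def iterations_for_tolerance(L, f_gap, alpha, beta, eps):
    """Smallest K with L*f_gap / (alpha*(K+1)*(1-2*beta)) <= eps."""
    if not (0.0 <= beta < 0.5) or not (0.0 < alpha <= 1.0):
        raise ValueError("requires beta in [0, 0.5) and alpha in (0, 1]")
    # Rearranged bound: K + 1 >= L*f_gap / (alpha*(1-2*beta)*eps).
    needed = L * f_gap / (alpha * (1.0 - 2.0 * beta) * eps)
    return max(0, math.ceil(needed) - 1)

k_no_momentum = iterations_for_tolerance(L=10.0, f_gap=5.0, alpha=1.0, beta=0.0, eps=1e-2)
k_momentum = iterations_for_tolerance(L=10.0, f_gap=5.0, alpha=1.0, beta=0.25, eps=1e-2)
```

The factor $1/(1-2\beta)$ means the guaranteed iteration count grows as $\beta$ approaches $0.5$: with $\beta = 0.25$ the bound needs roughly twice as many iterations as with $\beta = 0$.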

Theorem 2.4: Under generalized PL conditions, for 2-homogeneous reference functions: $f(x^k) - f^* \leq \alpha^k(f(x^0) - f^*)$, where $\alpha = \max\{1 - \gamma\mu(\beta - 2\beta^2), \beta + 2\beta^2\}$.

Stochastic Method Analysis

Theorem 3.1: Under the noise condition $\mathbb{E}[\phi(\nabla\phi^*(\nabla f(x)) - \nabla\phi^*(g(x)))] \leq \sigma^2$: $\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} \phi(\nabla\phi^*(\nabla f(x^k)))\right] \leq \frac{f(x^0) - f^*}{\gamma K} + \sigma^2$
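A stochastic variant can be sketched by replacing the exact gradient with a mini-batch estimate $g(x^k)$ inside the preconditioner, i.e. $x^{k+1} = x^k - \gamma\nabla\phi^*(g(x^k))$. This update form is an assumption inferred from the noise condition above. The least-squares data, the coordinate-wise $\operatorname{arcsinh}$ preconditioner, and all function names are hypothetical choices for illustration.

```python
import math
import random

def make_data(n, rng):
    """Synthetic least-squares data b_i = <a_i, x_true> + noise (hypothetical problem)."""
    x_true = [2.0, -1.0]
    rows = []
    for _ in range(n):
        a = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        b = sum(ai * xi for ai, xi in zip(a, x_true)) + 0.1 * rng.gauss(0.0, 1.0)
        rows.append((a, b))
    return rows

def loss(x, rows):
    """Mean squared residual, halved."""
    return sum((sum(ai * xi for ai, xi in zip(a, x)) - b) ** 2 for a, b in rows) / (2 * len(rows))

def minibatch_grad(x, batch):
    """Average gradient of the per-sample losses over the batch."""
    g = [0.0] * len(x)
    for a, b in batch:
        r = sum(ai * xi for ai, xi in zip(a, x)) - b
        for j, aj in enumerate(a):
            g[j] += r * aj / len(batch)
    return g

def stochastic_npgm(rows, x0, gamma, batch_size, iters, rng):
    """Assumed update: x <- x - gamma * arcsinh(g), applied coordinate-wise."""
    x = list(x0)
    for _ in range(iters):
        g = minibatch_grad(x, rng.sample(rows, batch_size))
        x = [xi - gamma * math.asinh(gi) for xi, gi in zip(x, g)]
    return x

rng = random.Random(0)
rows = make_data(200, rng)
x_final = stochastic_npgm(rows, [0.0, 0.0], gamma=0.05, batch_size=16, iters=500, rng=rng)
initial_loss, final_loss = loss([0.0, 0.0], rows), loss(x_final, rows)
```

The preconditioner acts on the noisy gradient itself, which is why the noise condition in the theorem is stated in terms of $\nabla\phi^*(g(x))$ rather than $g(x)$.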

Experimental Setup

Datasets

MNIST: Handwritten digit classification using two-layer fully connected networks
CIFAR-10/100: Image classification using ResNet-18/34 architectures
MovieLens 100K: Matrix factorization problem
Phase Retrieval: Nonconvex optimization problem

Evaluation Metrics

Training loss convergence speed
Test accuracy
Gradient norm $\|\nabla f(x^k)\|$

Comparison Methods

SGD/SGDm: Standard stochastic gradient descent and its momentum variant
Adam: Adaptive learning rate method
GD/GDm: Standard gradient descent and its momentum variant
AdGD-accel: Accelerated variant of adaptive gradient method

Implementation Details

Fixed step size
Hyperbolic Gradient Descent (HGD), isotropic variant (iHGD): $\phi(x) = \cosh(\|x\|) - 1$
Separable variant (sHGD): $\phi(x) = \sum_{i=1}^n (\cosh(x_i) - 1)$
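For these reference functions the preconditioner has a closed form: $\nabla\phi^*(y) = \operatorname{arcsinh}(\|y\|)\, y/\|y\|$ in the isotropic case and coordinate-wise $\operatorname{arcsinh}$ in the separable case, reading the separable sum as $\sum_i (\cosh(x_i) - 1)$. The sketch below implements both and checks the isotropic one against $\nabla\phi$. The function names are hypothetical and are not taken from the authors' code.

```python
import math

def ihgd_precond(y):
    """Isotropic: grad phi^*(y) = arcsinh(||y||) * y / ||y|| for phi(x) = cosh(||x||) - 1."""
    n = math.sqrt(sum(v * v for v in y))
    if n == 0.0:
        return [0.0] * len(y)
    scale = math.asinh(n) / n
    return [scale * v for v in y]

def shgd_precond(y):
    """Separable: coordinate-wise arcsinh for phi(x) = sum_i (cosh(x_i) - 1)."""
    return [math.asinh(v) for v in y]

def grad_phi_isotropic(x):
    """grad phi(x) = sinh(||x||) * x / ||x||, used only to check the inverse relation."""
    n = math.sqrt(sum(v * v for v in x))
    if n == 0.0:
        return [0.0] * len(x)
    scale = math.sinh(n) / n
    return [scale * v for v in x]

y = [30.0, -40.0]
# grad phi and grad phi^* are inverse maps, so the round trip should recover y.
roundtrip = grad_phi_isotropic(ihgd_precond(y))
# Ratio ||grad phi^*(y)|| / ||y|| for ||y|| = 50: the effective step-size scaling.
effective_step = math.asinh(50.0) / 50.0
```

The ratio $\operatorname{arcsinh}(t)/t$ is close to 1 for small gradient norms $t$ and decays for large ones, which gives the implicit adaptive step size without a hard clipping threshold.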

Experimental Results

Main Results

MNIST Classification: iHGD quickly achieves small training loss, outperforming SGD and Adam
CIFAR-10 Classification: Proposed method performs comparably to SGD and SGDm, which are state-of-the-art for this problem
Matrix Factorization: iHGDm significantly outperforms other methods and shows greater stability across different random initializations
Phase Retrieval: sHGD performs similarly to gradient clipping methods

Key Findings

Adaptive Step Size: For reference functions with growth rate exceeding quadratic, the preconditioner naturally forms a sigmoid shape, providing an implicit adaptive step size rule
Stability: On nonconvex problems such as matrix factorization, the proposed method exhibits better stability
Broad Applicability: The method performs well across different types of machine learning tasks

Related Work

Dual-Space Preconditioning/Anisotropic Gradient Descent

Originally introduced in [32] for convex essentially smooth problems
Anisotropic descent inequalities introduced in [24]
[36] showed that this method encompasses many popular algorithms

Gradient Clipping and Generalized Smoothness

$(L_0, L_1)$-smoothness concept introduced in [48]
General clipping framework with momentum analyzed in [47]
Extensive work on studying such methods under relaxed noise and smoothness assumptions

Conclusions and Discussion

Main Conclusions

Successfully extended anisotropic gradient descent methods to include heavy-ball momentum
Provided convergence guarantees under conditions less restrictive than traditional Lipschitz smoothness
Developed stochastic variants and analyzed them under new noise assumptions
Experimental validation demonstrated method effectiveness on various machine learning tasks

Limitations

Momentum parameter restricted to $\beta \in [0, 0.5)$; the current analysis does not cover the full range $\beta \in [0, 1)$
Preconditioner Lipschitz continuity assumption is more restrictive than anisotropic smoothness
Complete analysis of stochastic momentum methods not provided

Future Directions

Unified analysis of momentum algorithms under relaxed reference function assumptions
Extension to arbitrary momentum parameters $\beta \in [0, 1)$
Extension of complete proximal gradient-type algorithms to include momentum
Removal of batch size dependence for stochastic algorithms and inclusion of momentum

In-Depth Evaluation

Strengths

Theoretical Innovation: Provides the first momentum method analysis under anisotropic smoothness conditions
Unified Framework: Unifies multiple methods including gradient clipping within the nonlinear preconditioning framework
Practical Value: Method performs well on practical machine learning tasks
Analysis Depth: Provides convergence analysis in both deterministic and stochastic settings, although stochastic momentum is not covered

Weaknesses

Parameter Restrictions: Momentum parameter restrictions ($\beta < 0.5$) are more stringent than standard analysis
Assumption Strength: Some theoretical results require additional technical assumptions
Experimental Scope: Experiments primarily focus on standard machine learning tasks, lacking broader application validation

Impact

Theoretical Contribution: Provides new tools and insights for theoretical analysis of nonlinear preconditioning methods
Practical Value: Offers new methods for handling optimization problems beyond standard smoothness assumptions
Reproducibility: Authors provide publicly available code implementation

Applicable Scenarios

Neural network training, particularly scenarios where gradients may be large
Nonconvex optimization problems such as matrix factorization
Applications requiring gradient clipping or normalization
Optimization problems beyond standard Lipschitz smoothness

References

The paper includes 48 references covering important works in optimization theory, machine learning, and numerical methods, providing a solid theoretical foundation for the research.