2025-11-11T07:19:09.204233

Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

Cichocki

IIn this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and applying the Bregman divergence with a two--parameter deformation of the logarithm as a link function. This link function (referred here to as the Euler logarithm) is associated with a relatively wide class of trace--form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function, which closely approximates the inverse of the Euler two--parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse--deformed exponential functions, are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. In the literature, there exist nowadays more than fifty mathematically well-established entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a class of trace-form entropies and the associated deformed two--parameters logarithms.

academic

Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

Basic Information

Paper ID: 2502.17500
Title: Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
Author: Andrzej Cichocki (Polish Academy of Science, UMK Torun Poland, Tokyo University of Agriculture and Technology, Riken AIP)
Classification: cs.LG cs.AI
Publication Date: arXiv preprint (February 2025)
Paper Link: https://arxiv.org/abs/2502.17500

Abstract

This paper proposes and investigates a class of novel generalized exponentiated gradient (GEG) algorithms that employ mirror descent (MD) updates and apply Bregman divergence with a two-parameter logarithmic deformation as a link function. This link function, termed the Euler logarithm, is associated with a relatively broad class of trace-form entropies. To derive new GEG/MD updates, the authors estimate a deformed exponential function that closely approximates the inverse of the Euler two-parameter deformed logarithm. By learning these hyperparameters, the algorithm can adapt to the distribution of training data and be adjusted to achieve desirable properties of gradient descent algorithms.

Research Background and Motivation

Problem Definition

Existing gradient descent methods suffer from the following limitations:

Standard additive gradient descent is inapplicable when all weights must be non-negative
Vanishing and exploding gradient problems require precise learning rate adjustment
Lack of adaptivity: Existing EG updates cannot adapt to data of different distributions and lack hyperparameters to control convergence performance

Research Motivation

Biological plausibility: Recent neuronal synaptic studies suggest that EG updates are more consistent with biological learning processes than additive GD
Geometric adaptivity: By selecting appropriate link functions, mirror descent can adapt to the geometric structure of optimization problems
Theoretical richness: The literature contains over 50 mathematically mature entropy functions and associated deformed logarithms, providing a rich theoretical foundation for algorithm design

Core Contributions

Proposed generalized EG algorithms based on Euler two-parameter logarithm: First application of Euler (a,b)-logarithm to mirror descent and exponentiated gradient updates
Established approximation theory for deformed exponential functions: Provided two solution methods via Lagrange inversion theorem and Lambert-Tsallis W function
Unified multiple known algorithms: Demonstrated that various existing algorithms (Tsallis, Kaniadakis, Amari, etc.) are special cases of this framework
Extended to bipolar weights: Proposed normalized MD/GEG algorithms for handling bipolar weight vectors
Provided complete mathematical theoretical foundation: Including function properties, convergence analysis, and computational stability considerations

Method Details

Task Definition

The optimization problem is defined as: $w_{t+1} = \arg\min_{w \in \mathbb{R}_+^N} \left\{ L(w_t) + \langle\nabla L(w_t), w - w_t\rangle + \frac{1}{\eta} D_F(w||w_t) \right\}$

where $D_F(w||w_t)$ is the Bregman divergence and $L(w)$ is a differentiable loss function.

Core Mathematical Framework

Euler (a,b)-Logarithm

$\log^E_{a,b}(x) = \frac{x^a - x^b}{a - b}, \quad x > 0, a \neq b$

Parameter constraints: $a < 0, 0 < b < 1$ or $b < 0, 0 < a < 1$

Deformed Exponential Function

Power series approximation obtained via Lagrange inversion theorem: $\exp_{a,b}(x) \approx \exp(x) - \frac{1}{2}(a+b)x^2 - \frac{1}{6}(3a+3b-2a^2-5ab-2b^2)x^3 + O(x^4)$

Algorithm Architecture

Non-normalized GEG Update

$w_{t+1} = \exp_{a,b}[\log_{a,b}(w_t) - \eta_t \nabla L(w_t)] = w_t \otimes_{a,b} \exp_{a,b}[-\eta_t \nabla L(w_t)]$

where $\otimes_{a,b}$ is the deformed multiplication operation.

Normalized GEG Update

For unit simplex constraints: $\tilde{w}_{t+1} = w_t \otimes_{a,b} \exp_{a,b}(-\eta_t \nabla \hat{L}(w_t))$ $w_{t+1} = \frac{\tilde{w}_{t+1}}{||\tilde{w}_{t+1}||_1}$

where $\hat{L}(w) = L(w/||w||_1)$ is the normalized loss function.

Technical Innovations

Two-parameter flexibility: Parameters (a,b) adjust the algorithm to adapt to different data distributions
Unified framework: Incorporates multiple known algorithms into a unified mathematical framework
Numerical stability: Provides computationally stable implementation methods
Theoretical completeness: Establishes complete mathematical theory including function properties and convergence analysis

Experimental Setup

Theoretical Verification

The paper primarily conducts theoretical analysis and mathematical derivation, including:

Function property verification: Monotonicity, concavity, normalization, and other basic properties
Special case verification: Validates the correctness of known algorithms as special cases
Numerical stability analysis: Analyzes parameter sensitivity and computational stability

Parameter Range Analysis

Valid parameter domain: $a < 0, 0 < b < 1$ or $b < 0, 0 < a < 1$
Numerically stable region: Most stable when $x \to 1$ , requires special handling when far from 1
Convergence properties: Requires L'Hospital's rule for handling singular cases

Experimental Results

Theoretical Results

Function Property Verification

Domain: $\log_{a,b}(x): \mathbb{R}_+ \to \mathbb{R}$
Monotonicity: $\frac{d\log_{a,b}(x)}{dx} > 0$
Concavity: $\frac{d^2\log_{a,b}(x)}{dx^2} < 0$ (within specified parameter ranges)
Normalization: $\log_{a,b}(1) = 0$ , $\frac{d\log_{a,b}(x)}{dx}|_{x=1} = 1$

Special Case Recovery

Successfully verified the following special cases:

$a = b = 0$ : Standard natural logarithm $\ln(x)$
$a = 0, b = -\alpha$ : Amari α-logarithm
$a = 1-q, b = 0$ : Tsallis q-logarithm
$a = \kappa, b = -\kappa$ : Kaniadakis κ-logarithm

Numerical Analysis Results

Computational Stability

Parameter sensitivity: Small $x$ values are more sensitive to parameter changes
Numerical stability: Algorithm is most stable when $x \to 1$
Convergence properties: Limiting behavior requires special computational treatment

Power Series Approximation Accuracy

Comparison with exact solutions validates that the power series approximation exhibits good accuracy within reasonable parameter ranges.

Optimization Algorithm Development

Classical methods: Additive gradient descent (GD), stochastic gradient descent (SGD)
Multiplicative updates: Exponentiated gradient (EG) descent, mirror descent (MD)
Information geometric methods: Amari's natural gradient, α-divergence

Deformed Logarithm Research

Physics applications: Tsallis entropy, Kaniadakis entropy in statistical physics
Information theory development: Sharma-Taneja-Mittal entropy, generalized information measures
Mathematical theory: Abel exponentials, Tempesta multi-parameter logarithms

Paper Positioning

This paper is the first to apply Euler two-parameter logarithm to machine learning optimization, filling a theoretical gap in this field.

Conclusions and Discussion

Main Conclusions

Theoretical completeness: Establishes a complete GEG theoretical framework based on Euler logarithm
Algorithm flexibility: Two-parameter design provides adaptability to different data distributions
Unification: Multiple known algorithms become special cases of this framework
Practicality: Provides numerically stable implementation methods

Limitations

Parameter selection: Lacks systematic hyperparameter optimization methods
Convergence analysis: Requires further establishment of convergence theory across different parameter domains
Practical application verification: Paper is primarily theoretical work lacking experimental validation on concrete application scenarios
Computational complexity: Deformed function computation is more complex than standard functions

Future Directions

Hyperparameter learning: Develop systematic parameter optimization methods
Convergence theory: Establish comprehensive convergence analysis
Application verification: Verify effectiveness on concrete tasks such as deep learning and portfolio selection
Computational optimization: Develop more efficient numerical implementation methods

In-Depth Evaluation

Strengths

Theoretical Innovation

Mathematical rigor: Provides complete mathematical derivations and theoretical analysis
Unified framework: Unifies seemingly unrelated algorithms under a single theoretical framework
Historical connection: Connects Euler's 1779 mathematical work with modern machine learning

Method Completeness

Multiple implementation approaches: Provides both Lambert-Tsallis function and power series solution methods
Strong extensibility: Supports bipolar weights and various constraint conditions
Numerical considerations: Adequately addresses computational stability issues

Weaknesses

Missing Experimental Validation

Lack of practical applications: Paper is primarily theoretical work lacking verification on real problems
Missing performance comparisons: No performance comparison with existing methods
Parameter sensitivity: Lacks systematic guidance for parameter selection

Theoretical Limitations

Incomplete convergence analysis: Requires more rigorous convergence proofs
Restrictive applicability conditions: Parameter constraints are relatively strict
Computational complexity: Higher computational overhead compared to standard methods

Impact

Academic Value

Theoretical contribution: Provides new mathematical tools for optimization algorithm theory
Interdisciplinary connection: Connects statistical physics, information geometry, and machine learning
Inspirational value: Provides rich theoretical foundation for subsequent research

Practical Potential

Adaptive optimization: Has potential value in scenarios requiring adaptation to different data distributions
Sparse learning: May have advantages in sparse representation learning
Bio-inspired: Aligns with biological plausibility discovered in neuroscience

Applicable Scenarios

Non-negative constrained optimization: Optimization problems where weights must be non-negative
Sparse learning: Machine learning tasks requiring sparse solutions
Probability distribution optimization: Optimization on probability simplex such as online portfolio selection
Deep learning: May have advantages in certain neural network training scenarios

References

The paper cites abundant relevant literature, including:

Classical optimization theory: Nemirovsky & Yudin (1983), Beck & Teboulle (2003)
Information geometry foundations: Amari & Nagaoka (2000), Bregman (1967)
Deformed logarithm theory: Tsallis (1988), Kaniadakis (2002), Tempesta (2015)
Machine learning applications: Kivinen & Warmuth (1997), Cichocki et al. (2009)

Overall Assessment: This is a highly theoretical paper that provides a new mathematical framework for optimization algorithms. Although lacking practical application validation, its theoretical contributions and unifying nature make it academically significant. The paper's primary value lies in establishing a bridge connecting historical mathematical theory with modern machine learning, providing rich theoretical tools for subsequent research.