2025-11-11T07:19:09.204233

Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

Cichocki
IIn this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and applying the Bregman divergence with a two--parameter deformation of the logarithm as a link function. This link function (referred here to as the Euler logarithm) is associated with a relatively wide class of trace--form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function, which closely approximates the inverse of the Euler two--parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse--deformed exponential functions, are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. In the literature, there exist nowadays more than fifty mathematically well-established entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a class of trace-form entropies and the associated deformed two--parameters logarithms.
academic

Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

Basic Information

  • Paper ID: 2502.17500
  • Title: Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
  • Author: Andrzej Cichocki (Polish Academy of Science, UMK Torun Poland, Tokyo University of Agriculture and Technology, Riken AIP)
  • Classification: cs.LG cs.AI
  • Publication Date: arXiv preprint (February 2025)
  • Paper Link: https://arxiv.org/abs/2502.17500

Abstract

This paper proposes and investigates a class of novel generalized exponentiated gradient (GEG) algorithms that employ mirror descent (MD) updates and apply Bregman divergence with a two-parameter logarithmic deformation as a link function. This link function, termed the Euler logarithm, is associated with a relatively broad class of trace-form entropies. To derive new GEG/MD updates, the authors estimate a deformed exponential function that closely approximates the inverse of the Euler two-parameter deformed logarithm. By learning these hyperparameters, the algorithm can adapt to the distribution of training data and be adjusted to achieve desirable properties of gradient descent algorithms.

Research Background and Motivation

Problem Definition

Existing gradient descent methods suffer from the following limitations:

  1. Standard additive gradient descent is inapplicable when all weights must be non-negative
  2. Vanishing and exploding gradient problems require precise learning rate adjustment
  3. Lack of adaptivity: Existing EG updates cannot adapt to data of different distributions and lack hyperparameters to control convergence performance

Research Motivation

  1. Biological plausibility: Recent neuronal synaptic studies suggest that EG updates are more consistent with biological learning processes than additive GD
  2. Geometric adaptivity: By selecting appropriate link functions, mirror descent can adapt to the geometric structure of optimization problems
  3. Theoretical richness: The literature contains over 50 mathematically mature entropy functions and associated deformed logarithms, providing a rich theoretical foundation for algorithm design

Core Contributions

  1. Proposed generalized EG algorithms based on Euler two-parameter logarithm: First application of Euler (a,b)-logarithm to mirror descent and exponentiated gradient updates
  2. Established approximation theory for deformed exponential functions: Provided two solution methods via Lagrange inversion theorem and Lambert-Tsallis W function
  3. Unified multiple known algorithms: Demonstrated that various existing algorithms (Tsallis, Kaniadakis, Amari, etc.) are special cases of this framework
  4. Extended to bipolar weights: Proposed normalized MD/GEG algorithms for handling bipolar weight vectors
  5. Provided complete mathematical theoretical foundation: Including function properties, convergence analysis, and computational stability considerations

Method Details

Task Definition

The optimization problem is defined as: wt+1=argminwR+N{L(wt)+L(wt),wwt+1ηDF(wwt)}w_{t+1} = \arg\min_{w \in \mathbb{R}_+^N} \left\{ L(w_t) + \langle\nabla L(w_t), w - w_t\rangle + \frac{1}{\eta} D_F(w||w_t) \right\}

where DF(wwt)D_F(w||w_t) is the Bregman divergence and L(w)L(w) is a differentiable loss function.

Core Mathematical Framework

Euler (a,b)-Logarithm

loga,bE(x)=xaxbab,x>0,ab\log^E_{a,b}(x) = \frac{x^a - x^b}{a - b}, \quad x > 0, a \neq b

Parameter constraints: a<0,0<b<1a < 0, 0 < b < 1 or b<0,0<a<1b < 0, 0 < a < 1

Deformed Exponential Function

Power series approximation obtained via Lagrange inversion theorem: expa,b(x)exp(x)12(a+b)x216(3a+3b2a25ab2b2)x3+O(x4)\exp_{a,b}(x) \approx \exp(x) - \frac{1}{2}(a+b)x^2 - \frac{1}{6}(3a+3b-2a^2-5ab-2b^2)x^3 + O(x^4)

Algorithm Architecture

Non-normalized GEG Update

wt+1=expa,b[loga,b(wt)ηtL(wt)]=wta,bexpa,b[ηtL(wt)]w_{t+1} = \exp_{a,b}[\log_{a,b}(w_t) - \eta_t \nabla L(w_t)] = w_t \otimes_{a,b} \exp_{a,b}[-\eta_t \nabla L(w_t)]

where a,b\otimes_{a,b} is the deformed multiplication operation.

Normalized GEG Update

For unit simplex constraints: w~t+1=wta,bexpa,b(ηtL^(wt))\tilde{w}_{t+1} = w_t \otimes_{a,b} \exp_{a,b}(-\eta_t \nabla \hat{L}(w_t))wt+1=w~t+1w~t+11w_{t+1} = \frac{\tilde{w}_{t+1}}{||\tilde{w}_{t+1}||_1}

where L^(w)=L(w/w1)\hat{L}(w) = L(w/||w||_1) is the normalized loss function.

Technical Innovations

  1. Two-parameter flexibility: Parameters (a,b) adjust the algorithm to adapt to different data distributions
  2. Unified framework: Incorporates multiple known algorithms into a unified mathematical framework
  3. Numerical stability: Provides computationally stable implementation methods
  4. Theoretical completeness: Establishes complete mathematical theory including function properties and convergence analysis

Experimental Setup

Theoretical Verification

The paper primarily conducts theoretical analysis and mathematical derivation, including:

  1. Function property verification: Monotonicity, concavity, normalization, and other basic properties
  2. Special case verification: Validates the correctness of known algorithms as special cases
  3. Numerical stability analysis: Analyzes parameter sensitivity and computational stability

Parameter Range Analysis

  • Valid parameter domain: a<0,0<b<1a < 0, 0 < b < 1 or b<0,0<a<1b < 0, 0 < a < 1
  • Numerically stable region: Most stable when x1x \to 1, requires special handling when far from 1
  • Convergence properties: Requires L'Hospital's rule for handling singular cases

Experimental Results

Theoretical Results

Function Property Verification

  • Domain: loga,b(x):R+R\log_{a,b}(x): \mathbb{R}_+ \to \mathbb{R}
  • Monotonicity: dloga,b(x)dx>0\frac{d\log_{a,b}(x)}{dx} > 0
  • Concavity: d2loga,b(x)dx2<0\frac{d^2\log_{a,b}(x)}{dx^2} < 0 (within specified parameter ranges)
  • Normalization: loga,b(1)=0\log_{a,b}(1) = 0, dloga,b(x)dxx=1=1\frac{d\log_{a,b}(x)}{dx}|_{x=1} = 1

Special Case Recovery

Successfully verified the following special cases:

  • a=b=0a = b = 0: Standard natural logarithm ln(x)\ln(x)
  • a=0,b=αa = 0, b = -\alpha: Amari α-logarithm
  • a=1q,b=0a = 1-q, b = 0: Tsallis q-logarithm
  • a=κ,b=κa = \kappa, b = -\kappa: Kaniadakis κ-logarithm

Numerical Analysis Results

Computational Stability

  1. Parameter sensitivity: Small xx values are more sensitive to parameter changes
  2. Numerical stability: Algorithm is most stable when x1x \to 1
  3. Convergence properties: Limiting behavior requires special computational treatment

Power Series Approximation Accuracy

Comparison with exact solutions validates that the power series approximation exhibits good accuracy within reasonable parameter ranges.

Optimization Algorithm Development

  1. Classical methods: Additive gradient descent (GD), stochastic gradient descent (SGD)
  2. Multiplicative updates: Exponentiated gradient (EG) descent, mirror descent (MD)
  3. Information geometric methods: Amari's natural gradient, α-divergence

Deformed Logarithm Research

  1. Physics applications: Tsallis entropy, Kaniadakis entropy in statistical physics
  2. Information theory development: Sharma-Taneja-Mittal entropy, generalized information measures
  3. Mathematical theory: Abel exponentials, Tempesta multi-parameter logarithms

Paper Positioning

This paper is the first to apply Euler two-parameter logarithm to machine learning optimization, filling a theoretical gap in this field.

Conclusions and Discussion

Main Conclusions

  1. Theoretical completeness: Establishes a complete GEG theoretical framework based on Euler logarithm
  2. Algorithm flexibility: Two-parameter design provides adaptability to different data distributions
  3. Unification: Multiple known algorithms become special cases of this framework
  4. Practicality: Provides numerically stable implementation methods

Limitations

  1. Parameter selection: Lacks systematic hyperparameter optimization methods
  2. Convergence analysis: Requires further establishment of convergence theory across different parameter domains
  3. Practical application verification: Paper is primarily theoretical work lacking experimental validation on concrete application scenarios
  4. Computational complexity: Deformed function computation is more complex than standard functions

Future Directions

  1. Hyperparameter learning: Develop systematic parameter optimization methods
  2. Convergence theory: Establish comprehensive convergence analysis
  3. Application verification: Verify effectiveness on concrete tasks such as deep learning and portfolio selection
  4. Computational optimization: Develop more efficient numerical implementation methods

In-Depth Evaluation

Strengths

Theoretical Innovation

  1. Mathematical rigor: Provides complete mathematical derivations and theoretical analysis
  2. Unified framework: Unifies seemingly unrelated algorithms under a single theoretical framework
  3. Historical connection: Connects Euler's 1779 mathematical work with modern machine learning

Method Completeness

  1. Multiple implementation approaches: Provides both Lambert-Tsallis function and power series solution methods
  2. Strong extensibility: Supports bipolar weights and various constraint conditions
  3. Numerical considerations: Adequately addresses computational stability issues

Weaknesses

Missing Experimental Validation

  1. Lack of practical applications: Paper is primarily theoretical work lacking verification on real problems
  2. Missing performance comparisons: No performance comparison with existing methods
  3. Parameter sensitivity: Lacks systematic guidance for parameter selection

Theoretical Limitations

  1. Incomplete convergence analysis: Requires more rigorous convergence proofs
  2. Restrictive applicability conditions: Parameter constraints are relatively strict
  3. Computational complexity: Higher computational overhead compared to standard methods

Impact

Academic Value

  1. Theoretical contribution: Provides new mathematical tools for optimization algorithm theory
  2. Interdisciplinary connection: Connects statistical physics, information geometry, and machine learning
  3. Inspirational value: Provides rich theoretical foundation for subsequent research

Practical Potential

  1. Adaptive optimization: Has potential value in scenarios requiring adaptation to different data distributions
  2. Sparse learning: May have advantages in sparse representation learning
  3. Bio-inspired: Aligns with biological plausibility discovered in neuroscience

Applicable Scenarios

  1. Non-negative constrained optimization: Optimization problems where weights must be non-negative
  2. Sparse learning: Machine learning tasks requiring sparse solutions
  3. Probability distribution optimization: Optimization on probability simplex such as online portfolio selection
  4. Deep learning: May have advantages in certain neural network training scenarios

References

The paper cites abundant relevant literature, including:

  • Classical optimization theory: Nemirovsky & Yudin (1983), Beck & Teboulle (2003)
  • Information geometry foundations: Amari & Nagaoka (2000), Bregman (1967)
  • Deformed logarithm theory: Tsallis (1988), Kaniadakis (2002), Tempesta (2015)
  • Machine learning applications: Kivinen & Warmuth (1997), Cichocki et al. (2009)

Overall Assessment: This is a highly theoretical paper that provides a new mathematical framework for optimization algorithms. Although lacking practical application validation, its theoretical contributions and unifying nature make it academically significant. The paper's primary value lies in establishing a bridge connecting historical mathematical theory with modern machine learning, providing rich theoretical tools for subsequent research.