IIn this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and applying the Bregman divergence with a two--parameter
deformation of the logarithm as a link function. This link function (referred here to as the Euler logarithm) is associated with a relatively wide class of trace--form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function, which closely approximates the inverse of the Euler two--parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse--deformed exponential functions, are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. In the literature, there exist nowadays more than fifty mathematically well-established entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a class of trace-form entropies and the associated deformed two--parameters logarithms.
- Paper ID: 2502.17500
- Title: Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
- Author: Andrzej Cichocki (Polish Academy of Science, UMK Torun Poland, Tokyo University of Agriculture and Technology, Riken AIP)
- Classification: cs.LG cs.AI
- Publication Date: arXiv preprint (February 2025)
- Paper Link: https://arxiv.org/abs/2502.17500
This paper proposes and investigates a class of novel generalized exponentiated gradient (GEG) algorithms that employ mirror descent (MD) updates and apply Bregman divergence with a two-parameter logarithmic deformation as a link function. This link function, termed the Euler logarithm, is associated with a relatively broad class of trace-form entropies. To derive new GEG/MD updates, the authors estimate a deformed exponential function that closely approximates the inverse of the Euler two-parameter deformed logarithm. By learning these hyperparameters, the algorithm can adapt to the distribution of training data and be adjusted to achieve desirable properties of gradient descent algorithms.
Existing gradient descent methods suffer from the following limitations:
- Standard additive gradient descent is inapplicable when all weights must be non-negative
- Vanishing and exploding gradient problems require precise learning rate adjustment
- Lack of adaptivity: Existing EG updates cannot adapt to data of different distributions and lack hyperparameters to control convergence performance
- Biological plausibility: Recent neuronal synaptic studies suggest that EG updates are more consistent with biological learning processes than additive GD
- Geometric adaptivity: By selecting appropriate link functions, mirror descent can adapt to the geometric structure of optimization problems
- Theoretical richness: The literature contains over 50 mathematically mature entropy functions and associated deformed logarithms, providing a rich theoretical foundation for algorithm design
- Proposed generalized EG algorithms based on Euler two-parameter logarithm: First application of Euler (a,b)-logarithm to mirror descent and exponentiated gradient updates
- Established approximation theory for deformed exponential functions: Provided two solution methods via Lagrange inversion theorem and Lambert-Tsallis W function
- Unified multiple known algorithms: Demonstrated that various existing algorithms (Tsallis, Kaniadakis, Amari, etc.) are special cases of this framework
- Extended to bipolar weights: Proposed normalized MD/GEG algorithms for handling bipolar weight vectors
- Provided complete mathematical theoretical foundation: Including function properties, convergence analysis, and computational stability considerations
The optimization problem is defined as:
wt+1=argminw∈R+N{L(wt)+⟨∇L(wt),w−wt⟩+η1DF(w∣∣wt)}
where DF(w∣∣wt) is the Bregman divergence and L(w) is a differentiable loss function.
loga,bE(x)=a−bxa−xb,x>0,a=b
Parameter constraints: a<0,0<b<1 or b<0,0<a<1
Power series approximation obtained via Lagrange inversion theorem:
expa,b(x)≈exp(x)−21(a+b)x2−61(3a+3b−2a2−5ab−2b2)x3+O(x4)
wt+1=expa,b[loga,b(wt)−ηt∇L(wt)]=wt⊗a,bexpa,b[−ηt∇L(wt)]
where ⊗a,b is the deformed multiplication operation.
For unit simplex constraints:
w~t+1=wt⊗a,bexpa,b(−ηt∇L^(wt))wt+1=∣∣w~t+1∣∣1w~t+1
where L^(w)=L(w/∣∣w∣∣1) is the normalized loss function.
- Two-parameter flexibility: Parameters (a,b) adjust the algorithm to adapt to different data distributions
- Unified framework: Incorporates multiple known algorithms into a unified mathematical framework
- Numerical stability: Provides computationally stable implementation methods
- Theoretical completeness: Establishes complete mathematical theory including function properties and convergence analysis
The paper primarily conducts theoretical analysis and mathematical derivation, including:
- Function property verification: Monotonicity, concavity, normalization, and other basic properties
- Special case verification: Validates the correctness of known algorithms as special cases
- Numerical stability analysis: Analyzes parameter sensitivity and computational stability
- Valid parameter domain: a<0,0<b<1 or b<0,0<a<1
- Numerically stable region: Most stable when x→1, requires special handling when far from 1
- Convergence properties: Requires L'Hospital's rule for handling singular cases
- Domain: loga,b(x):R+→R
- Monotonicity: dxdloga,b(x)>0
- Concavity: dx2d2loga,b(x)<0 (within specified parameter ranges)
- Normalization: loga,b(1)=0, dxdloga,b(x)∣x=1=1
Successfully verified the following special cases:
- a=b=0: Standard natural logarithm ln(x)
- a=0,b=−α: Amari α-logarithm
- a=1−q,b=0: Tsallis q-logarithm
- a=κ,b=−κ: Kaniadakis κ-logarithm
- Parameter sensitivity: Small x values are more sensitive to parameter changes
- Numerical stability: Algorithm is most stable when x→1
- Convergence properties: Limiting behavior requires special computational treatment
Comparison with exact solutions validates that the power series approximation exhibits good accuracy within reasonable parameter ranges.
- Classical methods: Additive gradient descent (GD), stochastic gradient descent (SGD)
- Multiplicative updates: Exponentiated gradient (EG) descent, mirror descent (MD)
- Information geometric methods: Amari's natural gradient, α-divergence
- Physics applications: Tsallis entropy, Kaniadakis entropy in statistical physics
- Information theory development: Sharma-Taneja-Mittal entropy, generalized information measures
- Mathematical theory: Abel exponentials, Tempesta multi-parameter logarithms
This paper is the first to apply Euler two-parameter logarithm to machine learning optimization, filling a theoretical gap in this field.
- Theoretical completeness: Establishes a complete GEG theoretical framework based on Euler logarithm
- Algorithm flexibility: Two-parameter design provides adaptability to different data distributions
- Unification: Multiple known algorithms become special cases of this framework
- Practicality: Provides numerically stable implementation methods
- Parameter selection: Lacks systematic hyperparameter optimization methods
- Convergence analysis: Requires further establishment of convergence theory across different parameter domains
- Practical application verification: Paper is primarily theoretical work lacking experimental validation on concrete application scenarios
- Computational complexity: Deformed function computation is more complex than standard functions
- Hyperparameter learning: Develop systematic parameter optimization methods
- Convergence theory: Establish comprehensive convergence analysis
- Application verification: Verify effectiveness on concrete tasks such as deep learning and portfolio selection
- Computational optimization: Develop more efficient numerical implementation methods
- Mathematical rigor: Provides complete mathematical derivations and theoretical analysis
- Unified framework: Unifies seemingly unrelated algorithms under a single theoretical framework
- Historical connection: Connects Euler's 1779 mathematical work with modern machine learning
- Multiple implementation approaches: Provides both Lambert-Tsallis function and power series solution methods
- Strong extensibility: Supports bipolar weights and various constraint conditions
- Numerical considerations: Adequately addresses computational stability issues
- Lack of practical applications: Paper is primarily theoretical work lacking verification on real problems
- Missing performance comparisons: No performance comparison with existing methods
- Parameter sensitivity: Lacks systematic guidance for parameter selection
- Incomplete convergence analysis: Requires more rigorous convergence proofs
- Restrictive applicability conditions: Parameter constraints are relatively strict
- Computational complexity: Higher computational overhead compared to standard methods
- Theoretical contribution: Provides new mathematical tools for optimization algorithm theory
- Interdisciplinary connection: Connects statistical physics, information geometry, and machine learning
- Inspirational value: Provides rich theoretical foundation for subsequent research
- Adaptive optimization: Has potential value in scenarios requiring adaptation to different data distributions
- Sparse learning: May have advantages in sparse representation learning
- Bio-inspired: Aligns with biological plausibility discovered in neuroscience
- Non-negative constrained optimization: Optimization problems where weights must be non-negative
- Sparse learning: Machine learning tasks requiring sparse solutions
- Probability distribution optimization: Optimization on probability simplex such as online portfolio selection
- Deep learning: May have advantages in certain neural network training scenarios
The paper cites abundant relevant literature, including:
- Classical optimization theory: Nemirovsky & Yudin (1983), Beck & Teboulle (2003)
- Information geometry foundations: Amari & Nagaoka (2000), Bregman (1967)
- Deformed logarithm theory: Tsallis (1988), Kaniadakis (2002), Tempesta (2015)
- Machine learning applications: Kivinen & Warmuth (1997), Cichocki et al. (2009)
Overall Assessment: This is a highly theoretical paper that provides a new mathematical framework for optimization algorithms. Although lacking practical application validation, its theoretical contributions and unifying nature make it academically significant. The paper's primary value lies in establishing a bridge connecting historical mathematical theory with modern machine learning, providing rich theoretical tools for subsequent research.