
On the impact of the parametrization of deep convolutional neural networks on post-training quantization

Houache, Aujol, Traonmilin
This paper introduces novel theoretical approximation bounds for the output of quantized neural networks, with a focus on convolutional neural networks (CNN). By considering layerwise parametrization and focusing on the quantization of weights, we provide bounds that gain several orders of magnitude compared to state-of-the-art results on classical deep convolutional neural networks such as MobileNetV2 or ResNets. These gains are achieved by improving the behaviour of the approximation bounds with respect to the depth parameter, which has the most impact on the approximation error induced by quantization. To complement our theoretical result, we provide a numerical exploration of our bounds on MobileNetV2 and ResNets.

Basic Information

  • Paper ID: 2502.01156
  • Title: On the impact of the parametrization of deep convolutional neural networks on post-training quantization
  • Authors: Samy Houache (Univ. Bordeaux, Thales AVS), Jean-François Aujol (Univ. Bordeaux), Yann Traonmilin (Univ. Bordeaux)
  • Classification: cs.IT (Information Theory), math.IT (Mathematical Information Theory)
  • Publication Date: February 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2502.01156

Abstract

This paper introduces novel theoretical approximation bounds for the output of quantized neural networks, with particular focus on convolutional neural networks (CNNs). By considering layer-wise parametrization and focusing on weight quantization, the authors provide bounds that achieve several orders of magnitude improvement over existing state-of-the-art results on classical deep convolutional neural networks (such as MobileNetV2 or ResNets). These improvements are achieved through enhanced approximation bounds with respect to the depth parameter, which has the most significant impact on approximation errors induced by quantization. To complement the theoretical results, the authors provide numerical exploration on MobileNetV2 and ResNets.

Research Background and Motivation

Problem Definition

  1. Core Problem: When deploying deep neural networks in resource-constrained environments, quantization techniques introduce performance degradation. Theoretical bounds are needed to quantify this degradation.
  2. Significance:
    • Growing demand for neural network deployment on mobile devices and embedded systems
    • Safety-critical applications require robust theoretical guarantees
    • Quantization is a key technique for reducing model size and computational cost
  3. Limitations of Existing Methods:
    • Bounds from Gonon et al. (2023) are overly pessimistic with limited practical value
    • Strict assumptions requiring maximum parameter norm r > 1 limit applicability
    • Constant C exhibits O(NL²) dependence, impractical for modern deep architectures
  4. Research Motivation:
    • Existing bounds are too conservative for deep networks
    • Tighter theoretical bounds are needed to guide practical quantization strategies
    • Weight regularization commonly results in r < 1, requiring relaxed constraints

Core Contributions

  1. Tighter Approximation Bounds: Improves the NL² factor from Gonon et al. to ∑_{l=1}^{L} N_{l-1}, which simplifies to NL for constant-width networks
  2. Relaxed Norm Constraints: Allows arbitrary positive values for r_l (operator norm of layer l), making results applicable to networks with smaller parameter norms
  3. Improved Geometric Mean Term: Replaces maximum parameter norm r with r_mean, providing less pessimistic estimates
  4. Convolutional Network Specialization: Provides specialized bounds for convolutional structures, considering only filter size and channel count
  5. Practical Validation: Verifies theoretical improvements on classical pre-trained CNN models, demonstrating several orders of magnitude improvement

Methodology Details

Task Definition

For a neural network R_θ and its quantized version R_θ', the goal is a bound of the form:

\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty \le C \, \| \theta - \theta' \|_\infty

where Ω is the input domain and C is a constant depending on network architecture.

Core Theoretical Results

General Approximation Bound (Theorem 4.1)

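For an architecture (L,N) with depth L and layer widths N_0, ..., N_L, assuming the two networks have identical biases and differ only in their quantized weights:

\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty \le \max(D, 1) \left( \sum_{l=1}^{L} N_{l-1} \right) r_{\mathrm{mean}}^{L-1} \, \| \theta - \theta' \|_\infty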

where the geometric mean term is defined as:

r_{\mathrm{mean}} := \left( \max_{l=1,\dots,L} \; \max_{i=1,\dots,l-1} \; \prod_{j=i,\, j \neq l}^{L} r_j \right)^{1/(L-1)}
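
As a sanity check on the two definitions above, it may help to work out one special case. The setting below (constant width N_l = N for all l, equal layer norms r_l = r ≥ 1, and L ≥ 2) is an illustrative assumption, not a case taken from the paper.

```latex
% Assumed special case: N_l = N for all l, r_l = r >= 1 for all l, L >= 2.
\begin{align*}
\sum_{l=1}^{L} N_{l-1} &= N L, \\
\prod_{j=i,\, j \neq l}^{L} r_j &= r^{L-i}
  \quad \text{(there are } L-i+1 \text{ indices from } i \text{ to } L \text{, minus the excluded } l\text{)}, \\
\max_{l}\, \max_{i}\, r^{L-i} &= r^{L-1}
  \quad \text{(attained at } i = 1 \text{ since } r \ge 1\text{)}, \\
r_{\mathrm{mean}} &= \left( r^{L-1} \right)^{1/(L-1)} = r, \\
\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty
  &\le \max(D, 1)\, N L\, r^{L-1}\, \| \theta - \theta' \|_\infty .
\end{align*}
```

In this case the geometric mean term reduces to the maximum norm r, so the only gain over a bound with an NL² factor is one factor of L. The larger gains come from layers whose norms r_l are unequal.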

Convolutional Network Specialized Bound (Theorem 4.4)

For pure convolutional networks (without biases), applying c_l filters of size p_l × p_l at each layer:

\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty \le D \left( \sum_{l=1}^{L} p_l^2 \, c_{l-1} \right) r_{\mathrm{conv}}^{L-1} \, \| \theta - \theta' \|_\infty

where:

r_{\mathrm{conv}} := \left( \max_{l=1,\dots,L} \; \prod_{k=1,\, k \neq l}^{L} r^{\mathrm{conv}}_k \right)^{1/(L-1)}
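
The convolutional bound can be evaluated with a few lines of arithmetic. The sketch below is a minimal illustration, not the authors' code: the function names, the toy 3-layer architecture, and the per-layer norm values are all hypothetical, and D is taken here to be a given scalar associated with the input domain.

```python
from math import prod


def r_conv_power(layer_norms):
    """Return r_conv^(L-1): the largest product of all layer norms except one."""
    L = len(layer_norms)
    if L < 2:
        raise ValueError("need at least two layers")
    return max(
        prod(r for k, r in enumerate(layer_norms) if k != l)
        for l in range(L)
    )


def r_conv(layer_norms):
    """Return r_conv, the (L-1)-th root of the quantity above."""
    return r_conv_power(layer_norms) ** (1.0 / (len(layer_norms) - 1))


def conv_bound_constant(filter_sizes, in_channels, layer_norms, D=1.0):
    """Constant multiplying ||theta - theta'||_inf in the convolutional bound.

    filter_sizes[l] is p_l, in_channels[l] is c_{l-1} (hypothetical inputs).
    """
    if not (len(filter_sizes) == len(in_channels) == len(layer_norms)):
        raise ValueError("all per-layer lists must have the same length")
    width_term = sum(p * p * c for p, c in zip(filter_sizes, in_channels))
    return D * width_term * r_conv_power(layer_norms)


# Toy example: three 3x3 conv layers with 3, 16 and 32 input channels.
toy_norms = [2.0, 0.5, 4.0]
toy_constant = conv_bound_constant([3, 3, 3], [3, 16, 32], toy_norms)
print(toy_constant, r_conv(toy_norms), max(toy_norms))
```

In this toy case the layer with norm 0.5 is the one dropped from the product, so r_conv^(L-1) = 8, whereas using the maximum norm r = 4 would give r^(L-1) = 16.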

Technical Innovations

  1. Layer-wise Parametrization Method: Analyzes parameter norms layer-by-layer, avoiding global maximum values
  2. Sparse Structure Exploitation: Effectively utilizes sparsity of convolutional matrices, replacing full N_{l-1} with p_l² c_{l-1}
  3. Geometric Mean Strategy: r_mean accounts for variability in cross-layer parameter norms, providing more precision than simple maximum values

Experimental Setup

Datasets

  • Tiny ImageNet: 110,000 64×64 images across 200 classes
  • MNIST: Handwritten digit recognition for MLP experiments
  • CIFAR-10: 32×32 color images, 10 classes

Model Architectures

  • ResNet18/50: Residual networks with BatchNorm removed
  • MobileNetV2: Lightweight network with BatchNorm removed
  • Multi-layer Perceptron: Various depths (5, 7, 9, 11 layers) for depth impact analysis

Quantization Methods

  1. Uniform Quantization: Q_unif(θ) = ⌊θ/η⌋η
  2. Rounding Quantization: Q_round(θ) = round(θ/η)η
  3. AdaRound: Adaptive rounding optimizing rounding offsets
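
The first two quantizers can be written directly from their formulas. The sketch below is a plain-Python illustration with hypothetical function names and toy weights. It uses Python's built-in `round`, which breaks ties toward the even integer, an implementation detail the source does not specify. AdaRound is not sketched because it relies on a data-driven optimization.

```python
import math


def quantize_floor(theta, eta):
    """Q_unif: map each weight down to the nearest multiple of eta."""
    return [math.floor(t / eta) * eta for t in theta]


def quantize_round(theta, eta):
    """Q_round: map each weight to the nearest multiple of eta."""
    return [round(t / eta) * eta for t in theta]


def sup_norm_diff(a, b):
    """||a - b||_inf for two equally long lists of weights."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy weights and step size (hypothetical values).
theta = [0.30, -0.30, 0.13, 0.99]
eta = 0.25

theta_floor = quantize_floor(theta, eta)
theta_round = quantize_round(theta, eta)

# The sup-norm perturbation is below eta for flooring
# and at most eta / 2 for rounding.
print(theta_floor, sup_norm_diff(theta, theta_floor))
print(theta_round, sup_norm_diff(theta, theta_round))
```

The value returned by `sup_norm_diff` is the ||θ - θ'||∞ factor that the approximation bounds multiply, so halving the step size η halves the guaranteed worst-case output error.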

Evaluation Metrics

  • Tightness comparison of theoretical bounds
  • Quantized model accuracy
  • Performance across different bit widths

Experimental Results

Main Results

Bound Improvement Effects

  • ResNet18: New bounds are 10⁸ times tighter than Gonon et al.'s results
  • MobileNetV2: Improvement reaches 10⁵⁶ times
  • ResNet50: Improvement reaches 10²⁷ times

Parameter Analysis Comparison

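Model       | Depth L | Previous bound: width | Previous bound: norm r | New bound: width | New bound: norm r_conv | Improvement ratio
------------|---------|-----------------------|------------------------|------------------|------------------------|------------------
MobileNetV2 | 53      | 1.2×10⁶               | ≈101                   | 8641             | ≈9                     | ≈10⁵⁶
ResNet18    | 18      | 8×10⁵                 | ≈84                    | 4609             | ≈44                    | ≈10⁸
ResNet50    | 50      | 8×10⁵                 | ≈108                   | 4609             | ≈37                    | ≈10²⁷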

Depth Impact Analysis

MLP experiments verify exponential growth of bound improvement with depth:

  • Depth 5: Improvement approximately 10³ times
  • Depth 11: Improvement approximately 10⁸ times

Quantization Performance Analysis

Performance of different quantization methods on Tiny ImageNet:

  • AdaRound performs best under extreme quantization (≤4 bits)
  • MobileNetV2 demonstrates better tolerance to quantization than ResNets
  • Depth significantly affects quantization error, validating theoretical predictions

Weight Distribution Impact

Experiments demonstrate importance of weight norm distribution:

  • MobileNetV2: r≈101 vs r_conv≈9 (11-fold improvement)
  • ResNet50: r≈108 vs r_conv≈37 (3-fold improvement)
  • Greater variability in weight distribution yields larger advantages of r_conv over r
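
Because the norm enters the bounds raised to the power L-1, a modest per-layer ratio compounds into many orders of magnitude. The sketch below is a rough check that uses only the rounded values of r, r_conv, and L reported in this summary. The function name is hypothetical. It isolates the norm term and ignores the width term, so it is not expected to reproduce the reported improvement ratios exactly.

```python
import math


def log10_norm_gain(r_max, r_layerwise, depth):
    """log10 of (r_max / r_layerwise)^(depth - 1), the gain from the norm term alone."""
    return (depth - 1) * math.log10(r_max / r_layerwise)


# (r, r_conv, L) as rounded in the summary above.
reported = {
    "MobileNetV2": (101, 9, 53),
    "ResNet18": (84, 44, 18),
    "ResNet50": (108, 37, 50),
}

for name, (r, r_c, depth) in reported.items():
    gain = log10_norm_gain(r, r_c, depth)
    print(f"{name}: norm term alone contributes about 10^{gain:.1f}")
```

With these rounded inputs the norm term accounts for most of each reported improvement. The remainder comes from the width term and from rounding in the reported norms.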

Related Work

Approximation Bound Research

  • Gonon et al. (2023): Provides general upper bounds for ReLU networks, but overly pessimistic for deep networks
  • Neyshabur et al. (2018): Addresses specific cases with controlled perturbations, not applicable to arbitrary quantization
  • Berner et al. (2020): L∞ norm case, but restricted to d_out=1

Quantization Techniques

  • AdaRound (Nagel et al. 2020): Data-driven adaptive rounding
  • Cross-Layer Equalization: Homogenizing cross-layer weight distribution
  • Low-bit Quantization: Binary weights, ultra-low precision inference

Theoretical Analysis

  • Topological Properties: Lipschitz continuity of implementation mappings
  • Approximation Capacity: Extensions of universal approximation theorems for neural networks

Conclusions and Discussion

Main Conclusions

  1. Significant Theoretical Improvement: New bounds are several orders of magnitude tighter than existing results on practical networks
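  2. Optimized Depth Dependency: The width-depth factor improves from NL² to ∑_{l=1}^{L} N_{l-1} (NL for constant-width networks), and the norm term uses r_mean or r_conv in place of the maximum norm r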
  3. Enhanced Practicality: Relaxed parameter constraints, applicable to regularized networks
  4. Architecture-Aware: Effectively exploits sparsity of convolutional structures

Limitations

  1. Still Conservative: Bounds remain several orders of magnitude away from actual observed errors
  2. Worst-Case Analysis: Theoretical bounds based on extreme cases rarely occurring in practice
  3. Architectural Constraints: Primarily focused on CNNs, not extended to modern architectures like Transformers
  4. BatchNorm Handling: Experiments removed BatchNorm to satisfy theoretical conditions

Future Directions

  1. Transformer Extension: Handle layer normalization and multi-head attention mechanisms
  2. Probabilistic Methods: Develop probabilistic bounds reflecting typical operating conditions
  3. Tighter Bounds: Further narrow the gap between theoretical bounds and actual errors
  4. Practical Tools: Convert theoretical results into quantization strategy guidance tools

In-Depth Evaluation

Strengths

  1. Outstanding Theoretical Contribution: Achieves significant progress in quantization theory bounds with orders of magnitude improvement
  2. Mathematical Rigor: Complete proofs with rigorous and reliable mathematical derivations
  3. Practical Value: Relaxes strict assumptions of existing methods, improving applicability
  4. Sufficient Experimental Validation: Verifies theoretical improvements across multiple classical architectures
  5. Clear Writing: Well-structured paper with accurate technical exposition

Weaknesses

  1. Still Loose Bounds: Despite significant improvements, theoretical bounds remain substantially distant from actual errors
  2. Architectural Limitations: Primarily focuses on CNNs with limited extensibility to modern Transformer architectures
  3. Assumption Constraints: Removal of BatchNorm and other components may impact practical application value
  4. Missing Probabilistic Analysis: Lacks probabilistic analysis of performance under typical conditions

Impact

  1. Theoretical Value: Provides new analytical frameworks and tools for quantization theory
  2. Practical Guidance: Can guide quantization strategy design, particularly for techniques like Cross-Layer Equalization
  3. Research Inspiration: Provides improvement directions and foundation for subsequent research
  4. Reproducibility: Clear experimental setup with reproducible results

Applicable Scenarios

  1. Safety-Critical Applications: Quantization deployment requiring theoretical guarantees
  2. Embedded Systems: Model compression in resource-constrained environments
  3. Quantization Strategy Design: Guiding layer-wise quantization and preprocessing techniques
  4. Theoretical Research: Providing foundation for further quantization theory research

References

  1. Gonon, A., et al. (2023). Approximation speed of quantized vs. unquantized ReLU neural networks and beyond. IEEE Transactions on Information Theory.
  2. Nagel, M., et al. (2020). Up or down? Adaptive rounding for post-training quantization. ICML.
  3. Sandler, M., et al. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. CVPR.
  4. He, K., et al. (2016). Deep residual learning for image recognition. CVPR.

Summary: This paper achieves important progress in theoretical analysis of neural network quantization. Through more refined layer-wise analysis and geometric mean strategies, it significantly improves existing approximation bounds. While bounds remain relatively conservative, the orders of magnitude improvement and relaxed constraints provide important theoretical value and practical significance.