On the impact of the parametrization of deep convolutional neural networks on post-training quantization

Houache, Aujol, Traonmilin

This paper introduces novel theoretical approximation bounds for the output of quantized neural networks, with a focus on convolutional neural networks (CNN). By considering layerwise parametrization and focusing on the quantization of weights, we provide bounds that gain several orders of magnitude compared to state-of-the-art results on classical deep convolutional neural networks such as MobileNetV2 or ResNets. These gains are achieved by improving the behaviour of the approximation bounds with respect to the depth parameter, which has the most impact on the approximation error induced by quantization. To complement our theoretical result, we provide a numerical exploration of our bounds on MobileNetV2 and ResNets.

Basic Information

Paper ID: 2502.01156
Title: On the impact of the parametrization of deep convolutional neural networks on post-training quantization
Authors: Samy Houache (Univ. Bordeaux, Thales AVS), Jean-François Aujol (Univ. Bordeaux), Yann Traonmilin (Univ. Bordeaux)
Classification: cs.IT (Information Theory), math.IT (Mathematical Information Theory)
Publication Date: February 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2502.01156

Abstract

This paper introduces novel theoretical approximation bounds for the output of quantized neural networks, with particular focus on convolutional neural networks (CNNs). By considering layer-wise parametrization and focusing on weight quantization, the authors provide bounds that achieve several orders of magnitude improvement over existing state-of-the-art results on classical deep convolutional neural networks (such as MobileNetV2 or ResNets). These improvements are achieved through enhanced approximation bounds with respect to the depth parameter, which has the most significant impact on approximation errors induced by quantization. To complement the theoretical results, the authors provide numerical exploration on MobileNetV2 and ResNets.

Research Background and Motivation

Problem Definition

Core Problem: When deploying deep neural networks in resource-constrained environments, quantization techniques introduce performance degradation. Theoretical bounds are needed to quantify this degradation.
Significance:
- Growing demand for neural network deployment on mobile devices and embedded systems
- Safety-critical applications require robust theoretical guarantees
- Quantization is a key technique for reducing model size and computational cost
Limitations of Existing Methods:
- Bounds from Gonon et al. (2023) are overly pessimistic with limited practical value
- The assumption that the maximum parameter norm satisfies r > 1 limits applicability
- The constant C carries an NL² factor (N: network width, L: depth), which is impractical for modern deep architectures
Research Motivation:
- Existing bounds are too conservative for deep networks
- Tighter theoretical bounds are needed to guide practical quantization strategies
- Weight regularization commonly results in r < 1, requiring relaxed constraints

Core Contributions

Tighter Approximation Bounds: Improves the NL² factor from Gonon et al. to ∑ᴸₗ₌₁Nₗ₋₁, simplifying to NL for constant-width networks
Relaxed Norm Constraints: Allows arbitrary positive values for rₗ (operator norm of layer l), making results applicable to networks with smaller parameter norms
Improved Geometric Mean Term: Replaces the maximum parameter norm r with the geometric-mean-type quantity r_mean, giving less pessimistic estimates
Convolutional Network Specialization: Provides specialized bounds for convolutional structures, considering only filter size and channel count
Practical Validation: Verifies theoretical improvements on classical pre-trained CNN models, demonstrating several orders of magnitude improvement

Methodology Details

Task Definition

For neural network Rθ and its quantized version Rθ', seek bounds of the form:

$$\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty \le C \, \| \theta - \theta' \|_\infty$$

where Ω is the input domain and C is a constant depending on network architecture.

Core Theoretical Results

General Approximation Bound (Theorem 4.1)

For architecture (L,N), assuming two networks have identical biases with only weight quantization:

$$\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty \le \max(D, 1) \left( \sum_{l=1}^{L} N_{l-1} \right) r_{\mathrm{mean}}^{L-1} \, \| \theta - \theta' \|_\infty$$

where the geometric mean term is defined as:

$$r_{\mathrm{mean}} := \left( \max_{l=1,\dots,L} \; \max_{i=1,\dots,l-1} \; \prod_{j=i,\, j \neq l}^{L} r_j \right)^{1/(L-1)}$$

Convolutional Network Specialized Bound (Theorem 4.4)

For pure convolutional networks (without biases), applying cₗ filters of size pₗ×pₗ at each layer:

$$\sup_{x \in \Omega} \| R_\theta(x) - R_{\theta'}(x) \|_\infty \le D \left( \sum_{l=1}^{L} p_l^2 \, c_{l-1} \right) r_{\mathrm{conv}}^{L-1} \, \| \theta - \theta' \|_\infty$$

where:

$$r_{\mathrm{conv}} := \left( \max_{l=1,\dots,L} \; \prod_{k=1,\, k \neq l}^{L} r^{\mathrm{conv}}_k \right)^{1/(L-1)}$$
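
As a rough illustration of how the convolutional bound above is assembled, the following Python sketch evaluates its right-hand side from per-layer quantities. The function name `conv_bound` and the toy layer values are hypothetical. The per-layer norms r^conv_k are taken as given inputs, since how they are computed from the filters is not detailed in this summary, and D is simply the constant that appears in the theorem.

```python
import math

def conv_bound(D, filter_sizes, in_channels, norms, delta):
    """Right-hand side of the convolutional bound (illustrative sketch).

    D            -- constant D from the bound
    filter_sizes -- p_l for l = 1..L
    in_channels  -- c_{l-1} for l = 1..L
    norms        -- r_k^conv for k = 1..L (assumed positive, given as input)
    delta        -- ||theta - theta'||_inf
    """
    L = len(norms)
    if not (len(filter_sizes) == len(in_channels) == L):
        raise ValueError("expected one entry per layer")
    # Width-like factor: sum over layers of p_l^2 * c_{l-1}.
    width_term = sum(p * p * c for p, c in zip(filter_sizes, in_channels))
    # r_conv^(L-1) equals the max over l of the product of all norms except layer l.
    norm_term = max(
        math.prod(r for k, r in enumerate(norms) if k != l)
        for l in range(L)
    )
    return D * width_term * norm_term * delta

# Toy two-layer example with made-up numbers (not taken from the paper).
bound = conv_bound(D=1.0, filter_sizes=[3, 3], in_channels=[3, 16],
                   norms=[2.0, 0.5], delta=0.01)
print(bound)
```

Because the norm term drops exactly one layer from the product, it is the layer with the smallest norm that gets excluded when all norms are positive.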

Technical Innovations

Layer-wise Parametrization Method: Analyzes parameter norms layer-by-layer, avoiding global maximum values
Sparse Structure Exploitation: Effectively utilizes sparsity of convolutional matrices, replacing full Nₗ₋₁ with p²ₗcₗ₋₁
Geometric Mean Strategy: r_mean accounts for the variability of parameter norms across layers, giving a tighter estimate than the single maximum norm r

Experimental Setup

Datasets

Tiny ImageNet: 110,000 64×64 images across 200 classes
MNIST: Handwritten digit recognition for MLP experiments
CIFAR-10: 32×32 color images, 10 classes

Model Architectures

ResNet18/50: Residual networks with BatchNorm removed
MobileNetV2: Lightweight network with BatchNorm removed
Multi-layer Perceptron: Various depths (5, 7, 9, 11 layers) for depth impact analysis

Quantization Methods

Uniform Quantization: Q_unif(θ) = ⌊θ/η⌋η
Rounding Quantization: Q_round(θ) = round(θ/η)η
AdaRound: Adaptive rounding optimizing rounding offsets
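
The two closed-form quantizers listed above can be sketched in a few lines of Python. This is a minimal illustration on made-up weights, with η passed in as the step size `eta`; how η is derived from a bit width is not specified here, so it is left as a free parameter. AdaRound is omitted because it requires a data-driven optimization. The sketch also reports the quantity ||θ - Q(θ)||∞ that enters the approximation bounds: it stays below η for the floor quantizer and at most η/2 for rounding.

```python
import math

def q_unif(theta, eta):
    """Uniform (floor) quantization: floor(theta / eta) * eta."""
    return math.floor(theta / eta) * eta

def q_round(theta, eta):
    """Rounding quantization: round(theta / eta) * eta.

    Python's round() sends exact ties to the nearest even integer.
    """
    return round(theta / eta) * eta

weights = [0.37, -0.82, 0.05, 1.249, -0.001]  # made-up values
eta = 0.25

# Sup-norm distance between the original and the quantized weights.
err_unif = max(abs(w - q_unif(w, eta)) for w in weights)
err_round = max(abs(w - q_round(w, eta)) for w in weights)
print(err_unif, err_round)
```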

Evaluation Metrics

Tightness comparison of theoretical bounds
Quantized model accuracy
Performance across different bit widths

Experimental Results

Main Results

Bound Improvement Effects

ResNet18: New bounds are 10⁸ times tighter than Gonon et al.'s results
MobileNetV2: Improvement reaches 10⁵⁶ times
ResNet50: Improvement reaches 10²⁷ times

Parameter Analysis Comparison

Model	Depth L	Previous Bound Width	Previous Bound Norm r	New Bound Width	New Bound Norm r_conv	Improvement Ratio
MobileNetV2	53	1.2×10⁶	≈101	8641	≈9	≈10⁵⁶
ResNet18	18	8×10⁵	≈84	4609	≈44	≈10⁸
ResNet50	50	8×10⁵	≈108	4609	≈37	≈10²⁷
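
The improvement ratios in the table can be sanity-checked with a short Python sketch. It rests on two assumptions that are not stated in this summary: that the previous bound has the form width × L² × r^(L-1), and that the new bound can be approximated by width × L × r_conv^(L-1), i.e. the layer sum is replaced by L times the tabulated width. The helper name `log10_ratio` is hypothetical, and the inputs are the rounded table entries.

```python
import math

def log10_ratio(depth, width_old, r_old, width_new, r_new):
    """log10 of (old bound / new bound) under the assumed forms
    old = width_old * L^2 * r_old^(L-1) and new = width_new * L * r_new^(L-1)."""
    return (math.log10(width_old / width_new)
            + math.log10(depth)
            + (depth - 1) * math.log10(r_old / r_new))

# (depth L, previous width, previous norm r, new width, new norm r_conv)
rows = {
    "MobileNetV2": (53, 1.2e6, 101, 8641, 9),
    "ResNet18": (18, 8e5, 84, 4609, 44),
    "ResNet50": (50, 8e5, 108, 4609, 37),
}
exponents = {name: log10_ratio(*vals) for name, vals in rows.items()}
for name, e in exponents.items():
    print(f"{name}: ratio ~ 10^{e:.1f}")
```

Under these assumptions the ResNet rows land close to the reported 10⁸ and 10²⁷. The MobileNetV2 estimate comes out roughly two orders of magnitude above the reported 10⁵⁶, which is plausible given that rounded norms are raised to the power 52, so small rounding differences in r_conv are strongly amplified.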

Depth Impact Analysis

MLP experiments verify exponential growth of bound improvement with depth:

Depth 5: Improvement approximately 10³ times
Depth 11: Improvement approximately 10⁸ times

Quantization Performance Analysis

Performance of different quantization methods on Tiny ImageNet:

AdaRound performs best under extreme quantization (≤4 bits)
MobileNetV2 demonstrates better tolerance to quantization than ResNets
Depth significantly affects quantization error, validating theoretical predictions

Weight Distribution Impact

Experiments demonstrate importance of weight norm distribution:

MobileNetV2: r≈101 vs r_conv≈9 (11-fold improvement)
ResNet50: r≈108 vs r_conv≈37 (3-fold improvement)
Greater variability in weight distribution yields larger advantages of r_conv over r
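
The effect of norm variability can be illustrated with a small Python sketch that evaluates r_conv for two made-up sets of layer norms. It assumes, as a simplification, that the norm r of the previous bound is the largest per-layer norm; the helper name `r_conv` and the numbers are hypothetical.

```python
import math

def r_conv(norms):
    """(max over l of the product of all norms except layer l)^(1/(L-1))."""
    L = len(norms)
    best = max(
        math.prod(r for k, r in enumerate(norms) if k != l)
        for l in range(L)
    )
    return best ** (1.0 / (L - 1))

uniform = [2.0, 2.0, 2.0, 2.0]  # identical norms in every layer
varied = [0.5, 1.0, 8.0, 2.0]   # same depth, widely spread norms

for name, norms in (("uniform", uniform), ("varied", varied)):
    # Compare the assumed previous norm r = max_l r_l with r_conv.
    print(name, "r =", max(norms), "r_conv =", round(r_conv(norms), 3))
```

With identical norms, r_conv coincides with the maximum, so nothing is gained. With spread-out norms, r_conv falls well below the maximum, and that gap is then raised to the power L-1 in the bound.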

Related Work

Approximation Bound Research

Gonon et al. (2023): Provides general upper bounds for ReLU networks, but overly pessimistic for deep networks
Neyshabur et al. (2018): Addresses specific cases with controlled perturbations, not applicable to arbitrary quantization
Berner et al. (2020): Treats the L∞-norm case, but only for networks with scalar output (d_out = 1)

Quantization Techniques

AdaRound (Nagel et al. 2020): Data-driven adaptive rounding
Cross-Layer Equalization: Homogenizing cross-layer weight distribution
Low-bit Quantization: Binary weights, ultra-low precision inference

Theoretical Analysis

Topological Properties: Lipschitz continuity of the realization map θ ↦ Rθ
Approximation Capacity: Extensions of universal approximation theorems for neural networks

Conclusions and Discussion

Main Conclusions

Significant Theoretical Improvement: New bounds are several orders of magnitude tighter than existing results on practical networks
Optimized Depth Dependency: The architecture-dependent factor improves from NL² to ∑ᴸₗ₌₁Nₗ₋₁, which is linear in depth (NL) for constant-width networks
Enhanced Practicality: Relaxed parameter constraints, applicable to regularized networks
Architecture-Aware: Effectively exploits sparsity of convolutional structures

Limitations

Still Conservative: Bounds remain several orders of magnitude away from actual observed errors
Worst-Case Analysis: Theoretical bounds based on extreme cases rarely occurring in practice
Architectural Constraints: Primarily focused on CNNs, not extended to modern architectures like Transformers
BatchNorm Handling: Experiments removed BatchNorm to satisfy theoretical conditions

Future Directions

Transformer Extension: Handle layer normalization and multi-head attention mechanisms
Probabilistic Methods: Develop probabilistic bounds reflecting typical operating conditions
Tighter Bounds: Further narrow the gap between theoretical bounds and actual errors
Practical Tools: Convert theoretical results into quantization strategy guidance tools

In-Depth Evaluation

Strengths

Outstanding Theoretical Contribution: Achieves significant progress in quantization theory bounds with orders of magnitude improvement
Mathematical Rigor: Complete proofs with rigorous and reliable mathematical derivations
Practical Value: Relaxes strict assumptions of existing methods, improving applicability
Sufficient Experimental Validation: Verifies theoretical improvements across multiple classical architectures
Clear Writing: Well-structured paper with accurate technical exposition

Weaknesses

Still Loose Bounds: Despite significant improvements, theoretical bounds remain substantially distant from actual errors
Architectural Limitations: Primarily focuses on CNNs with limited extensibility to modern Transformer architectures
Assumption Constraints: Removal of BatchNorm and other components may impact practical application value
Missing Probabilistic Analysis: Lacks probabilistic analysis of performance under typical conditions

Impact

Theoretical Value: Provides new analytical frameworks and tools for quantization theory
Practical Guidance: Can guide quantization strategy design, particularly for techniques like Cross-Layer Equalization
Research Inspiration: Provides improvement directions and foundation for subsequent research
Reproducibility: Clear experimental setup with reproducible results

Applicable Scenarios

Safety-Critical Applications: Quantization deployment requiring theoretical guarantees
Embedded Systems: Model compression in resource-constrained environments
Quantization Strategy Design: Guiding layer-wise quantization and preprocessing techniques
Theoretical Research: Providing foundation for further quantization theory research

References

Gonon, A., et al. (2023). Approximation speed of quantized vs. unquantized ReLU neural networks and beyond. IEEE Transactions on Information Theory.
Nagel, M., et al. (2020). Up or down? Adaptive rounding for post-training quantization. ICML.
Sandler, M., et al. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. CVPR.
He, K., et al. (2016). Deep residual learning for image recognition. CVPR.

Summary: This paper achieves important progress in theoretical analysis of neural network quantization. Through more refined layer-wise analysis and geometric mean strategies, it significantly improves existing approximation bounds. While bounds remain relatively conservative, the orders of magnitude improvement and relaxed constraints provide important theoretical value and practical significance.