Hardware optimization on Android for inference of AI models
Gherasim, Sánchez
Artificial Intelligence models are now pervasive in mobile computing, with use cases ranging from virtual assistants to advanced image processing. A good mobile user experience requires minimal latency and high responsiveness from deployed AI models, and the challenges range from execution strategies that meet real-time constraints to the exploitation of heterogeneous hardware architectures. In this paper, we investigate and propose optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations cover various model quantization schemes and the use of on-device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.
This paper investigates hardware optimization for AI model inference on Android systems. Given the widespread integration of AI models in mobile computing, from virtual assistants to advanced image processing, the authors focus on two critical tasks: object detection (YOLO series) and image classification (ResNet). By evaluating different model quantization schemes and on-device accelerators (GPU and NPU), the paper aims to determine empirically which configurations achieve the best trade-off between minimal accuracy loss and maximal inference speed-up.
As AI models become increasingly prevalent on mobile devices, achieving low-latency, high-responsiveness inference while maintaining model accuracy presents a key challenge. Specific concerns include:
How to fully leverage heterogeneous hardware architectures (CPU, GPU, NPU) on mobile devices
How to select appropriate model quantization schemes that balance accuracy and speed
How to optimize execution configurations for different AI tasks (classification vs. detection)
Energy Consumption: Google estimates that AI-related tasks accounted for 10-15% of its total energy consumption between 2019 and 2021, with inference consuming 60% of that energy; Meta reports that inference accounts for 70% of its AI energy consumption
Growth Trends: Google's energy consumption grows by 21% annually and Meta's by 32%
User Experience: Mobile AI performance has become a core differentiator, imposing strict real-time and accuracy requirements
Systematic Hardware Evaluation: First comprehensive evaluation of CPU, GPU, and NPU performance on commercial Android devices (Samsung Galaxy Tab S9) for AI inference tasks
Quantization Scheme Analysis: Comprehensive comparison of 7 quantization schemes (FP32, FP16, INT8, INT16, FINT8, FINT16, Dynamic) across different hardware platforms for accuracy-speed trade-offs
Task-Specific Optimization Recommendations:
For ResNet classification: NPU + INT8 quantization achieves 130× acceleration with <3% accuracy loss
For YOLO detection: NPU + FP16 quantization is optimal, avoiding 6.5 mAP accuracy loss from INT8
Pareto Frontier Analysis: Provides a multi-objective optimization perspective, identifying the optimal trade-off points in accuracy-latency space among the evaluated configurations
Practical Findings:
NPU demonstrates superior performance across all configurations, achieving up to 298× acceleration (YOLOv8x)
Dynamic quantization fails on NPU, revealing hardware compatibility issues
CPU multi-threading scalability is limited (maximum 3.4×), attributed to asymmetric core architecture
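The Pareto frontier analysis listed among the contributions can be illustrated with a minimal sketch. It assumes each configuration is reduced to one latency value (lower is better) and one accuracy value (higher is better). The `Config` type, the labels A to E, and all numbers are hypothetical placeholders, not measurements from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str          # hypothetical label for a model/device/scheme combination
    latency_ms: float  # lower is better
    accuracy: float    # higher is better


def pareto_frontier(configs):
    """Return the configurations that no other configuration dominates."""
    # Sort by latency ascending; on equal latency, higher accuracy comes first.
    ordered = sorted(configs, key=lambda c: (c.latency_ms, -c.accuracy))
    frontier = []
    best_accuracy = float("-inf")
    for c in ordered:
        # A slower configuration is kept only if it is strictly more accurate
        # than every faster configuration seen so far.
        if c.accuracy > best_accuracy:
            frontier.append(c)
            best_accuracy = c.accuracy
    return frontier


# Placeholder values for illustration only.
candidates = [
    Config("A", 5.0, 70.0),
    Config("B", 8.0, 76.0),
    Config("C", 9.0, 74.0),   # dominated by B (slower and less accurate)
    Config("D", 20.0, 78.0),
    Config("E", 25.0, 77.0),  # dominated by D
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
print(frontier_names)  # ['A', 'B', 'D']
```

Sorting first keeps the scan linear after an O(n log n) sort, which is more than sufficient for a configuration space of this size.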
This study evaluates 7 quantization configurations (see Table II):
Scheme Name | I/O Data Type | Operation Precision | Activations | Weights
FP32        | FP32          | FP32                | FP32        | FP32
FP16        | FP32          | FP32                | FP32        | FP16
INT8        | FP32          | INT8                | INT8        | INT8
INT16       | FP32          | INT8                | INT16       | INT16
FINT8       | INT8          | INT8                | INT8        | INT8
FINT16      | INT16         | INT8                | INT16       | INT16
DYN         | FP32          | Mixed               | FP32        | Mixed
Key Technical Points:
Static Quantization: Weights converted offline to target data type (e.g., INT8) with fixed storage
Dynamic Quantization (DYN): Weights are stored as 8-bit values, while activations are quantized at runtime; this adds runtime overhead but preserves accuracy better
Hybrid Framework Approach: To work around software compatibility constraints, the study combines LiteRT Next (CPU/GPU) with standard LiteRT (NPU) so that all three devices can be evaluated
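The static versus dynamic distinction above can be made concrete with a small sketch. It assumes symmetric per-tensor INT8 quantization with a single scale factor, which is one common scheme and not necessarily what LiteRT kernels do internally. The function names and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Static part: weights are quantized once, offline, and stored as 8-bit codes.
weights = [0.50, -1.27, 0.03, 0.90]
w_codes, w_scale = quantize_int8(weights)


def dynamic_dot(w_codes, w_scale, activations):
    """Dot product where the activation scale is computed per input at runtime."""
    a_codes, a_scale = quantize_int8(activations)       # the runtime overhead
    acc = sum(w * a for w, a in zip(w_codes, a_codes))  # integer accumulation
    return acc * w_scale * a_scale                      # rescale to float


activations = [1.0, 2.0, -0.5, 0.25]
exact = sum(w * a for w, a in zip(weights, activations))
approx = dynamic_dot(w_codes, w_scale, activations)
print(w_codes)
print(abs(exact - approx))
```

The extra `quantize_int8` call inside `dynamic_dot` is the per-inference cost that a fully static scheme avoids by fixing activation scales ahead of time from calibration data.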
Systematic Configuration Space Exploration:
3 hardware × 7 quantization schemes × multiple model sizes
Covers 5 ResNet variants (18/34/50/101/152)
Covers 5 YOLOv8 variants (n/s/m/l/x)
Covers 5 YOLO11 variants (n/s/m/l/x)
Pareto Optimization Perspective: Rather than pursuing a single optimum, the paper provides a Pareto frontier of accuracy-latency trade-offs that supports multi-objective decision-making
Framework Conversion Loss Quantification: Explicitly measures accuracy loss introduced by PyTorch to LiteRT conversion (ResNet: 0.83-1.77%; YOLO11: 0.2-0.4 mAP)
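The size of the nominal configuration space described above can be sketched by enumerating it. The model name strings and the `config_grid` helper are assumptions for illustration; the only exclusion modeled is the NPU plus dynamic quantization combination that the summary reports as failing.

```python
from itertools import product

DEVICES = ["CPU", "GPU", "NPU"]
SCHEMES = ["FP32", "FP16", "INT8", "INT16", "FINT8", "FINT16", "DYN"]
MODELS = (
    [f"ResNet{depth}" for depth in (18, 34, 50, 101, 152)]
    + [f"YOLOv8{size}" for size in "nsmlx"]
    + [f"YOLO11{size}" for size in "nsmlx"]
)


def config_grid(unsupported=frozenset()):
    """Yield (model, device, scheme) triples, skipping unsupported (device, scheme) pairs."""
    for model, device, scheme in product(MODELS, DEVICES, SCHEMES):
        if (device, scheme) not in unsupported:
            yield model, device, scheme


full = list(config_grid())
runnable = list(config_grid(unsupported={("NPU", "DYN")}))
print(len(full))      # 15 models * 3 devices * 7 schemes = 315
print(len(runnable))  # 315 minus 15 NPU+DYN combinations = 300
```

Whether every remaining combination was actually measured in the paper is not stated in this summary, so the counts describe the nominal grid only.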
MLPerf Benchmarking: The paper adopts MLPerf principles for evaluating ML inference systems, which span embedded devices to data centers and are designed to be neutral with respect to software framework and hardware architecture
Mobile AI Framework Evolution:
PyTorch, ONNX, TensorFlow: General-purpose AI development frameworks
TensorFlow Lite → LiteRT: Lightweight mobile runtime
LiteRT Next: Native accelerator offloading support
Heterogeneous Computing Paradigm:
Edge-to-Cloud model: Local edge processing optimizes latency, complex tasks offloaded to cloud
DSA (Domain-Specific Architecture): NPU as specialized tensor computation accelerator
Quantization Techniques:
Post-training quantization (adopted in this paper)
NPU is Optimal Execution Device: Achieves up to 120× acceleration compared to CPU single-core baseline, confirming its critical role in low-latency edge AI
Optimal Quantization is Trade-off Problem:
ResNet: INT8 optimal, speed gains on NPU outweigh accuracy loss
YOLO: FP16 optimal, INT8 accuracy loss (6.5 mAP) unacceptable
GPU: Quantization minimally affects speed, FP16 balances accuracy and speed
Model Performance and Scalability:
YOLO11s excels on Pareto frontier, providing best speed/accuracy trade-off with FP16 quantization
YOLO11 achieves higher accuracy on small models, with slightly increased complexity
System Limitation Identification:
Dynamic quantization fails on NPU (lacks native support)
CPU multi-threading scalability poor (maximum 3.4×), attributed to asymmetric core architecture
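One way to operationalize the task-specific trade-off in these conclusions is to pick the fastest configuration whose accuracy loss stays within a budget. This is a sketch of that rule under stated assumptions, not the selection procedure used in the paper. The function name, the dictionary keys, and all latency and accuracy values are hypothetical placeholders.

```python
def pick_fastest_within_budget(results, baseline_accuracy, max_loss):
    """Return the lowest-latency result whose accuracy loss is at most max_loss, or None."""
    eligible = [
        r for r in results
        if baseline_accuracy - r["accuracy"] <= max_loss
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r["latency_ms"])


# Placeholder values for illustration only, not measurements from the paper.
results = [
    {"scheme": "FP32", "latency_ms": 40.0, "accuracy": 50.0},
    {"scheme": "FP16", "latency_ms": 12.0, "accuracy": 49.9},
    {"scheme": "INT8", "latency_ms": 8.0, "accuracy": 43.5},
]

strict = pick_fastest_within_budget(results, baseline_accuracy=50.0, max_loss=1.0)
loose = pick_fastest_within_budget(results, baseline_accuracy=50.0, max_loss=7.0)
print(strict["scheme"])  # FP16: INT8 is faster but exceeds the 1.0 loss budget
print(loose["scheme"])   # INT8: the larger budget admits the fastest scheme
```

With a tight budget the rule lands on a higher-precision scheme, and with a loose one it lands on the fastest scheme, which mirrors the qualitative difference the summary reports between the detection and classification tasks.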
MLPerf Benchmark: Reddi et al. (2020) - "MLPerf inference benchmark," defines evaluation principles adopted in this paper
Energy Research:
Google Environmental Report (2023): AI accounts for 10-15% of energy consumption
Meta Sustainability Report (2023): Inference accounts for 70% of AI energy consumption
ResNet: He et al. (2016) - "Deep Residual Learning for Image Recognition," ILSVRC 2015 champion
YOLO: Ramos & Sappa (2025) - "A decade of you only look once (yolo) for object detection: A review"
Edge-to-Cloud: Moreschini et al. (2024) - "Edge to cloud tools: A multivocal literature review"
Overall Assessment: This is a solid empirical paper that provides valuable configuration guidance for mobile AI inference optimization. Its main strengths are a systematic experimental design and comprehensive quantitative results, which clearly reveal the advantages of the NPU and the need for task-specific quantization strategies. Its primary limitations are that the results come from a single hardware platform, which limits generalizability, and that no energy analysis is included. The paper has high reference value for Android developers and edge AI researchers, though its conclusions need validation across a broader range of hardware and tasks. Future work should add energy measurements, extend the evaluation to other platforms and tasks, and open-source the experimental code to improve reproducibility.