
Hardware optimization on Android for inference of AI models

Gherasim, Sánchez

The pervasive integration of Artificial Intelligence models into contemporary mobile computing is notable across numerous use cases, from virtual assistants to advanced image processing. Optimizing the mobile user experience requires minimal latency and high responsiveness from deployed AI models, with challenges ranging from execution strategies that meet real-time constraints to the exploitation of heterogeneous hardware architectures. In this paper, we research and propose the optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations evaluate various model quantization schemes and the utilization of on-device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.


Basic Information

Paper ID: 2511.13453
Title: Hardware optimization on Android for inference of AI models
Authors: Iulius Gherasim, Carlos García Sánchez (Complutense University of Madrid)
Classification: cs.LG (Machine Learning), cs.PF (Performance)
Publication Date: November 17, 2025 (arXiv submission)
Paper Link: https://arxiv.org/abs/2511.13453

Abstract

This paper investigates hardware optimization for AI model inference on Android systems. Given the widespread integration of AI models in mobile computing, from virtual assistants to advanced image processing, the authors focus on two critical tasks: object detection (YOLO series) and image classification (ResNet). By evaluating different model quantization schemes and on-device accelerators (GPU and NPU), the paper aims to empirically determine the configurations that achieve the best trade-off between minimal accuracy loss and maximum inference acceleration.

Research Background and Motivation

1. Problem Statement

As AI models become increasingly prevalent on mobile devices, achieving low-latency, high-responsiveness inference while maintaining model accuracy presents a key challenge. Specific concerns include:

How to fully leverage heterogeneous hardware architectures (CPU, GPU, NPU) on mobile devices
How to select appropriate model quantization schemes that balance accuracy and speed
How to optimize execution configurations for different AI tasks (classification vs. detection)

2. Problem Significance

Energy Consumption: Google estimates that AI-related tasks accounted for 10-15% of its total energy consumption between 2019 and 2021, with inference consuming 60% of that energy; Meta reports that inference accounts for 70% of its AI energy consumption
Growth Trends: Google's energy consumption grows 21% annually, while Meta experiences 32% growth
User Experience: Mobile AI performance has become a core differentiator, imposing strict real-time and accuracy requirements

3. Limitations of Existing Approaches

Early solutions primarily relied on GPU offloading without fully leveraging specialized NPU accelerators
Lack of systematic optimization research targeting mobile heterogeneous architectures
Quantization scheme selection lacks empirical guidance for different tasks and hardware

4. Research Motivation

Adopt MLPerf benchmark principles for systematic performance evaluation on commercial Android devices
Select industry-standard models (ResNet for classification, YOLO for detection) as representative benchmarks
Fill the gap in empirical research on mobile AI inference optimization

Core Contributions

Systematic Hardware Evaluation: First comprehensive evaluation of CPU, GPU, and NPU performance on commercial Android devices (Samsung Galaxy Tab S9) for AI inference tasks
Quantization Scheme Analysis: Comprehensive comparison of 7 quantization schemes (FP32, FP16, INT8, INT16, FINT8, FINT16, Dynamic) across different hardware platforms for accuracy-speed trade-offs
Task-Specific Optimization Recommendations:
- For ResNet classification: NPU + INT8 quantization achieves 130× acceleration with <3% accuracy loss
- For YOLO detection: NPU + FP16 quantization is optimal, avoiding 6.5 mAP accuracy loss from INT8
Pareto Frontier Analysis: Provides multi-objective optimization perspective, identifying optimal trade-off points in accuracy-latency space for different configurations
Practical Findings:
- NPU demonstrates superior performance across all configurations, achieving up to 298× acceleration (YOLOv8x)
- Dynamic quantization fails on NPU, revealing hardware compatibility issues
- CPU multi-threading scalability is limited (maximum 3.4×), attributed to asymmetric core architecture

Methodology Details

Task Definition

This research focuses on two core computer vision tasks:

Image Classification: Input single image, output class label and confidence score (using ResNet series)
Object Detection: Input single image, output multiple bounding boxes, classes, and confidence scores (using YOLO series)

The objective is to identify optimal hardware configuration and quantization scheme combinations for Android mobile devices.

Experimental Architecture

Hardware Platform

Device: Samsung Galaxy Tab S9
SoC: Qualcomm Snapdragon 8 Gen 2 (SM8550-AC)

CPU (Kryo): 8-core big.LITTLE configuration

3 small cores: ARM Cortex-A510 @ 2.0 GHz
4 medium cores: 2×Cortex-A710 + 2×Cortex-A715 @ 2.8 GHz
1 large core: Cortex-X3 @ 3.36 GHz

GPU: Qualcomm Adreno 740

12 shader processing units @ 719 MHz
Supports FP32 and FP16 precision execution

NPU (Hexagon Processor):

Dedicated tensor, scalar, and vector computation units
Shared internal memory architecture
Supports Micro Tile Inferencing technology (partitions and executes model layers in parallel)

Software Environment

Framework: LiteRT (the rebranded TensorFlow Lite)

CPU/GPU: LiteRT Next 2.0.2
NPU: LiteRT 1.4.0 (due to NPU pipeline issues in 2.0.2)

Model Conversion Pipeline:

PyTorch Model → ONNX Format → TFLite Format

PyTorch built-in export tools generate ONNX
Katsuya Hyodo's onnx2tf package converts to TFLite
Quantization completed during onnx2tf conversion

Quantization Schemes Detailed

This study evaluates 7 quantization configurations (see Table II):

Scheme Name	I/O Data Type	Operation Precision	Activations	Weights
FP32	FP32	FP32	FP32	FP32
FP16	FP32	FP32	FP32	FP16
INT8	FP32	INT8	INT8	INT8
INT16	FP32	INT8	INT16	INT16
FINT8	INT8	INT8	INT8	INT8
FINT16	INT16	INT8	INT16	INT16
DYN	FP32	Mixed	FP32	Mixed

Key Technical Points:

Static Quantization: Weights converted offline to target data type (e.g., INT8) with fixed storage
Dynamic Quantization (DYN): Weights stored as 8-bit and activations quantized at runtime, which adds runtime overhead but preserves accuracy better than static INT8
INT16 Limitation: LiteRT lacks optimized INT16 kernel implementations, resulting in poor performance
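
The difference between static and dynamic quantization can be illustrated with a minimal Python sketch. It assumes per-tensor symmetric quantization, which is a simplification chosen for illustration; the paper's onnx2tf/LiteRT pipeline may use different schemes (for example per-channel scales or zero points). All function names and sample values below are hypothetical.

```python
def symmetric_scale(values, num_bits=8):
    """Per-tensor symmetric scale: maps the largest |value| to the largest signed integer."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer step and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in q_values]


# Static part: the weight scale is computed once, offline, and stored with the model.
weights = [0.42, -1.27, 0.05, 0.88]  # illustrative values, not from the paper
w_scale = symmetric_scale(weights)
w_q = quantize(weights, w_scale)


def dynamic_dot(activations, w_q, w_scale):
    """Dynamic part: the activation scale is derived from the actual input at runtime."""
    a_scale = symmetric_scale(activations)   # extra work on every inference
    a_q = quantize(activations, a_scale)
    acc = sum(a * w for a, w in zip(a_q, w_q))  # integer multiply-accumulate
    return acc * a_scale * w_scale              # rescale to a real-valued result


activations = [0.5, -0.25, 1.0, 0.75]  # illustrative input
approx = dynamic_dot(activations, w_q, w_scale)
exact = sum(a * w for a, w in zip(activations, weights))
```

Comparing `approx` with `exact` shows the small rounding error introduced by 8-bit arithmetic; the per-input call to `symmetric_scale` is the runtime overhead that a fully static scheme avoids.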

Technical Innovations

Hybrid Framework Approach: To work around software compatibility constraints, the study combines LiteRT Next (CPU/GPU) with standard LiteRT (NPU) so that all three processors can be evaluated
Systematic Configuration Space Exploration:
- 3 hardware × 7 quantization schemes × multiple model sizes
- Covers 5 ResNet variants (18/34/50/101/152)
- Covers 5 YOLOv8 variants (n/s/m/l/x)
- Covers 5 YOLO11 variants (n/s/m/l/x)
Pareto Optimization Perspective: Rather than pursuing a single optimum, provides the Pareto frontier of accuracy-latency trade-offs to support multi-objective decision-making
Framework Conversion Loss Quantification: Explicitly measures accuracy loss introduced by PyTorch to LiteRT conversion (ResNet: 0.83-1.77%; YOLO11: 0.2-0.4 mAP)

Experimental Setup

Datasets

ResNet Classification: Standard ImageNet validation set
YOLO Detection: COCO validation set

Evaluation Metrics

Inference Latency: Average inference time (milliseconds)
Acceleration Ratio: Speed improvement relative to FP32 CPU single-thread baseline
Classification Accuracy: Top-1 accuracy (ResNet)
Detection Accuracy: mean Average Precision (mAP) @ IoU=0.5:0.95 (YOLO)
Accuracy Loss: Accuracy degradation percentage relative to FP32 baseline
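
The "IoU=0.5:0.95" notation in the detection metric refers to ten Intersection-over-Union thresholds from 0.50 to 0.95 in steps of 0.05, over which average precision is averaged. The Python sketch below only illustrates the IoU test at those thresholds; it is not a full mAP implementation (the study uses pycocotools for that), and the box format and helper names are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten COCO-style thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which a prediction counts as a match for the ground-truth box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# Hypothetical boxes: a prediction shifted by one pixel in each direction.
gt = (0.0, 0.0, 10.0, 10.0)
pred = (1.0, 1.0, 11.0, 11.0)
passed = thresholds_passed(pred, gt)
```

A box that is slightly displaced passes the loose thresholds but fails the strict ones, which is one way small localization errors from quantization can lower mAP even when the predicted class is unchanged.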

Comparison Configurations

Execution Devices:

CPU-SC: CPU single-thread
CPU-MC: CPU multi-thread (8 cores)
GPU32: GPU FP32 mode
GPU16: GPU FP16 mode
NPU: Neural Processing Unit

Quantization Schemes: FP32, FP16, INT8, INT16, FINT8, FINT16, DYN

Implementation Details

Develop custom Android application to execute models and record results
Execute multiple inferences per configuration and average results
Use pycocotools to compute mAP
Use standard top-1 method for classification accuracy evaluation
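
The "multiple inferences, averaged" procedure can be sketched as a small timing harness. This is a Python illustration only: the study's measurements come from a custom Android application, and the warm-up count, run count, and the stand-in workload below are assumptions, not values reported in the paper.

```python
import statistics
import time


def benchmark(run_inference, warmup_runs=5, timed_runs=50):
    """Time a callable repeatedly and report mean and standard deviation in ms."""
    # Warm-up iterations are discarded so one-off costs (caches, lazy init)
    # do not distort the average. ASSUMPTION: the paper does not state a count.
    for _ in range(warmup_runs):
        run_inference()

    latencies_ms = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": timed_runs,
        "mean_ms": statistics.mean(latencies_ms),
        # stdev needs at least two samples
        "stdev_ms": statistics.stdev(latencies_ms) if timed_runs > 1 else 0.0,
    }


def fake_inference():
    """Hypothetical stand-in for a model invocation."""
    return sum(i * i for i in range(10_000))


result = benchmark(fake_inference, warmup_runs=2, timed_runs=10)
```

Reporting the standard deviation alongside the mean costs nothing extra and makes it easier to judge whether two configurations differ by more than run-to-run noise.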

Experimental Results

Main Results

ResNet Performance

ResNet18 Inference Time (milliseconds):

Configuration	CPU-SC	CPU-MC	GPU32	GPU16	NPU
FP32	79.06	26.34	13.68	5.54	1.20
INT8	23.26	5.63	21.77	22.68	0.61

Key Findings:

NPU achieves 65.9× acceleration on FP32, reaching 129.6× acceleration on INT8
INT16 quantization performs extremely poorly (>800ms), excluded from subsequent analysis
FINT8 quantization shows catastrophic accuracy degradation to 0.08% Top-1, also excluded
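
The acceleration ratios quoted above follow directly from the ResNet18 latency table, using the FP32 CPU single-thread time as the baseline. A minimal Python sketch with the reported latencies (the dictionary layout and variable names are illustrative):

```python
# ResNet18 inference time in milliseconds, as reported in the table above.
latency_ms = {
    "FP32": {"CPU-SC": 79.06, "CPU-MC": 26.34, "GPU32": 13.68, "GPU16": 5.54, "NPU": 1.20},
    "INT8": {"CPU-SC": 23.26, "CPU-MC": 5.63, "GPU32": 21.77, "GPU16": 22.68, "NPU": 0.61},
}

# Baseline defined in the evaluation metrics: FP32 on a single CPU thread.
baseline_ms = latency_ms["FP32"]["CPU-SC"]

# Acceleration ratio = baseline latency / configuration latency.
speedup = {
    scheme: {device: baseline_ms / ms for device, ms in row.items()}
    for scheme, row in latency_ms.items()
}

for scheme, row in speedup.items():
    for device, ratio in row.items():
        print(f"{scheme:5s} {device:7s} {ratio:6.1f}x")
```

The same table also shows that INT8 is slower than FP32 on the GPU (21.77 ms vs. 13.68 ms in GPU32 mode), so the benefit of a quantization scheme depends on the processor it runs on.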

ResNet50 Performance Analysis:

NPU + INT8: 121.5× acceleration, accuracy loss only 0.41%
GPU16 mode provides approximately 2× acceleration compared to GPU32
CPU multi-threading achieves maximum 3.4× acceleration (INT8), far below theoretical 8×

Quantization Impact (Table X):

Model	INT8 Accuracy Loss	DYN Accuracy Loss
ResNet18	2.94%	0.10%
ResNet50	0.41%	0.19%
ResNet152	0.20%	0.07%

Trend: Larger models are more robust to INT8 quantization, with accuracy loss decreasing from 2.94% to 0.20%

YOLO Performance

YOLOv8n Inference Time Comparison:

NPU demonstrates best performance
FP32: 29× acceleration
INT8: 46.8× acceleration
Latency higher than ResNet (higher task complexity)

YOLOv8 Accuracy Loss (Table XII):

Model	INT8 Loss (mAP)	DYN Loss (mAP)
YOLOv8n	6.5	0.1
YOLOv8s	6.2	0.0
YOLOv8x	6.1	0.1

Key Insights:

INT8 significantly harms detection tasks (average 6.5 mAP loss)
Dynamic quantization nearly lossless (≤0.1 mAP)
Detection tasks require more information (localization + classification), more sensitive to quantization

YOLO11 vs YOLOv8:

YOLO11 achieves higher accuracy on small models
NPU execution slightly slower (more complex architecture)
Dynamic quantization completely fails on NPU
INT8 loss increases slightly to average 7.2 mAP

Ablation Studies

CPU Multi-threading Scalability (Table XV)

Model	FP32	FP16	INT8	DYN
ResNet18	3.0×	3.0×	14.0×	10.6×
ResNet50	2.0×	2.0×	9.5×	7.2×
YOLOv8x	2.7×	2.1×	13.4×	10.1×

Analysis:

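The ResNet18 entries match the ratio of the FP32 single-thread latency to the multi-thread latency of each scheme (79.06 / 26.34 ≈ 3.0×, 79.06 / 5.63 ≈ 14.0×), so the values appear to be speed-ups over the FP32 CPU-SC baseline and the INT8 and DYN columns combine quantization and threading gains
INT8 provides the best multi-threaded acceleration
Floating-point schemes scale poorly (2-3×, far below the 8× ideal)
The asymmetric core architecture limits parallel efficiency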
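
One way to see why an asymmetric CPU cannot reach 8× is a back-of-the-envelope bound based on clock frequencies alone. The Python sketch below rests on two assumptions that are not from the paper: per-core throughput is proportional to clock frequency (ignoring IPC differences, memory bandwidth, and scheduling), and the single-thread baseline runs on the fastest core. Under those assumptions the bound is optimistic; real scaling would be expected to fall below it.

```python
# (name, core count, clock in GHz) as listed in the hardware description.
CORE_CLUSTERS = [
    ("Cortex-A510", 3, 2.0),
    ("Cortex-A710/A715", 4, 2.8),
    ("Cortex-X3", 1, 3.36),
]

total_cores = sum(count for _, count, _ in CORE_CLUSTERS)

# ASSUMPTION: throughput of a core scales with its clock frequency only.
total_ghz = sum(count * ghz for _, count, ghz in CORE_CLUSTERS)

# ASSUMPTION: the single-thread baseline is scheduled on the fastest core.
fastest_ghz = max(ghz for _, _, ghz in CORE_CLUSTERS)

# Upper bound on multi-thread speed-up relative to that single fast core.
ideal_speedup_bound = total_ghz / fastest_ghz

print(f"{total_cores} cores, frequency-only speed-up bound: {ideal_speedup_bound:.2f}x")
```

Even this optimistic estimate stays well under 8×, because the slower cores each contribute less than one "fast-core equivalent" of work.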

GPU Precision Mode Impact (Table VIII)

GPU32 vs GPU16 on ResNet50:

Quantization schemes have minimal impact on GPU speed
GPU16 mode provides stable 2× acceleration
Larger models show greater advantages in GPU16 mode

NPU Dynamic Quantization Failure Analysis

Dynamic quantization models contain mixed-precision layers
NPU lacks native runtime data type conversion support
Requires frequent NPU-CPU data transfers
Results in severe performance degradation (ResNet50: only 2.3× acceleration vs. 121.5× for INT8)

Pareto Frontier Analysis

ResNet Pareto Frontier (Figure 6):

INT8 configurations dominate frontier: significant latency reduction with acceptable accuracy loss
Optimal configuration: NPU + INT8, applicable to all ResNet sizes
FP16 on GPU provides accuracy-speed balance point
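
A Pareto frontier over latency and accuracy loss can be computed with a simple dominance check. The Python sketch below uses the ResNet18 latencies and the ResNet18 INT8 accuracy loss (2.94%) reported earlier. It assumes, for illustration only, that accuracy loss depends on the quantization scheme and not on the device, and that FP32 has zero loss by definition of the baseline. The function and point names are hypothetical.

```python
def pareto_front(points):
    """Return non-dominated points; lower is better for both latency and loss.

    points: list of (name, latency_ms, accuracy_loss_pct)
    """
    front = []
    for name, lat, loss in points:
        dominated = any(
            o_lat <= lat and o_loss <= loss and (o_lat < lat or o_loss < loss)
            for _, o_lat, o_loss in points
        )
        if not dominated:
            front.append((name, lat, loss))
    return sorted(front, key=lambda p: p[1])  # order by latency


FP32_LOSS = 0.0   # baseline by definition
INT8_LOSS = 2.94  # ResNet18 INT8 accuracy loss; ASSUMED identical on every device

points = [
    ("CPU-SC/FP32", 79.06, FP32_LOSS),
    ("CPU-MC/FP32", 26.34, FP32_LOSS),
    ("GPU32/FP32", 13.68, FP32_LOSS),
    ("GPU16/FP32", 5.54, FP32_LOSS),
    ("NPU/FP32", 1.20, FP32_LOSS),
    ("CPU-SC/INT8", 23.26, INT8_LOSS),
    ("CPU-MC/INT8", 5.63, INT8_LOSS),
    ("GPU32/INT8", 21.77, INT8_LOSS),
    ("GPU16/INT8", 22.68, INT8_LOSS),
    ("NPU/INT8", 0.61, INT8_LOSS),
]

front = pareto_front(points)
```

With these inputs only the two NPU configurations survive: every CPU and GPU point is matched or beaten on both axes by an NPU point, which is consistent with the summary's statement that NPU configurations define the frontier.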

YOLO Pareto Frontier (Figure 7):

FP16 configurations dominate frontier: INT8 accuracy loss too large
Optimal configuration: NPU + FP16
YOLO11s excels among small models
Differences between YOLOv8 and YOLO11 diminish in large models (l/x)

Experimental Findings Summary

NPU Absolute Advantage: NPU provides best performance across all scenarios, achieving up to 298× acceleration (YOLOv8x + INT8)
Task-Specific Quantization Strategies:
- Classification tasks (ResNet): INT8 optimal
- Detection tasks (YOLO): FP16 optimal
Hardware Characteristics:
- GPU: Quantization has minimal impact, FP16 mode critical
- CPU: Multi-threading scalability limited, INT8 provides best parallelism
- NPU: Does not support dynamic quantization, requires static optimization
Model Size Effects:
- Larger models more robust to quantization
- GPU achieves higher acceleration ratios on large models (YOLOv8x: 39×)
Framework Conversion Loss: Non-negligible accuracy degradation (1-2%), must be considered in optimization

Major Research Directions

MLPerf Benchmarking: This paper adopts MLPerf principles for evaluating ML inference systems, from embedded devices to data centers, which enables framework-neutral and architecture-neutral evaluation
Mobile AI Framework Evolution:
- PyTorch, ONNX, TensorFlow: General-purpose AI development frameworks
- TensorFlow Lite → LiteRT: Lightweight mobile runtime
- LiteRT Next: Native accelerator offloading support
Heterogeneous Computing Paradigm:
- Edge-to-Cloud model: Local edge processing optimizes latency, complex tasks offloaded to cloud
- DSA (Domain-Specific Architecture): NPU as specialized tensor computation accelerator
Quantization Techniques:
- Post-training quantization (adopted in this paper)
- Quantization-aware training
- Mixed-precision strategies

Relative Advantages of This Work

Systematic Evaluation: First comprehensive evaluation of CPU/GPU/NPU on commercial Android devices
Empirical Guidance: Provides specific configuration recommendations for different tasks, not just theoretical analysis
Pareto Perspective: Multi-objective optimization approach revealing accuracy-speed trade-off space
Problem Discovery: Identifies practical deployment issues like NPU dynamic quantization incompatibility and CPU scalability limitations
Industrial Relevance: Uses MLPerf standard models with results directly applicable to production environments

Conclusions and Discussion

Main Conclusions

NPU is the Optimal Execution Device: Achieves more than 120× acceleration over the CPU single-thread baseline on ResNet (and up to 298× on YOLOv8x with INT8), confirming its critical role in low-latency edge AI
Optimal Quantization is Trade-off Problem:
- ResNet: INT8 optimal, speed gains on NPU outweigh accuracy loss
- YOLO: FP16 optimal, INT8 accuracy loss (6.5 mAP) unacceptable
- GPU: Quantization minimally affects speed, FP16 balances accuracy and speed
Model Performance and Scalability:
- YOLO11s excels on Pareto frontier, providing best speed/accuracy trade-off with FP16 quantization
- YOLO11 achieves higher accuracy on small models, with slightly increased complexity
System Limitation Identification:
- Dynamic quantization fails on NPU (lacks native support)
- CPU multi-threading scalability poor (maximum 3.4×), attributed to asymmetric core architecture
- Framework conversion introduces roughly 1-2% accuracy loss on ResNet (0.83-1.77%) and 0.2-0.4 mAP on YOLO11

Limitations

Single Hardware Platform: Testing only on Snapdragon 8 Gen 2, generalizability to other SoCs unverified
Limited Task Scope: Covers only computer vision (classification and detection), excludes NLP, speech, and other AI tasks
Missing Energy Analysis: No power consumption measurements, Pareto analysis lacks energy efficiency dimension
Software Version Dependency: NPU requires legacy LiteRT 1.4.0, potentially affecting performance
Static Workloads: Ignores dynamic batching, model switching, and other real-world application scenarios
Incomplete INT16 Evaluation: Early exclusion due to LiteRT kernel limitations prevents thorough analysis

Future Directions

Energy Integration: Complete three-dimensional Pareto analysis including power consumption (accuracy-latency-energy efficiency)
Software Optimization:
- Mitigate NPU dynamic quantization compatibility issues
- Eliminate framework conversion accuracy loss
Task Extension: Research other MLPerf benchmark tasks (NLP, image segmentation)
Hardware Generalization: Validate conclusions across multiple mobile SoCs
Quantization-Aware Training: Explore training-time quantization to reduce INT8 accuracy loss
Real-time Applications: Evaluate video streaming, multi-model concurrency, and other practical scenarios

In-Depth Evaluation

Strengths

Rigorous Experimental Design:
- Systematic configuration space exploration (3 hardware × 7 quantization × 15 model variants)
- Clear baselines and comparison dimensions
- Multiple measurements with averaging for reliability
High Practical Value:
- Targets commercial devices and industry-standard models
- Provides actionable configuration recommendations
- Identifies practical deployment issues (e.g., dynamic quantization failure)
In-Depth Analysis:
- Pareto frontier supports multi-objective decision-making
- Quantifies framework conversion loss
- Reveals hardware characteristics (e.g., CPU asymmetric architecture effects)
Comprehensive Results:
- Extensive quantitative data (multiple tables)
- Clear visualizations (Pareto graphs, speed comparison charts)
- Trend analysis across model sizes
Transparent Methodology:
- Detailed hardware specifications
- Software versions and conversion pipeline documentation
- Acknowledges limitations (e.g., software compatibility issues)

Weaknesses

Limited Generalizability:
- Single hardware platform (Snapdragon 8 Gen 2)
- Applicability to other mobile chips (Apple A-series, Huawei Kirin) unknown
Missing Energy Analysis:
- Title emphasizes "optimization" but lacks power measurements
- Energy efficiency equally important for mobile devices
- Incomplete Pareto analysis
Statistical Significance:
- No confidence intervals or standard deviations reported
- Lacks significance testing
- Sample size for multiple runs unclear
Insufficient Comparisons:
- No comparison with other quantization methods (quantization-aware training)
- No comparison with other mobile AI frameworks (NCNN, MNN)
- Missing latency comparison with cloud inference
Simplified Real-World Scenarios:
- Single image inference, ignores batch processing
- Doesn't test model warm-up, cache effects
- Ignores Android system interference
Weak Theoretical Explanation:
- Lacks architectural-level explanation for NPU INT8 superiority
- Insufficient analysis of CPU multi-threading scalability limitations
- No latency prediction model established

Impact

Contribution to Field:

Fills empirical research gap in mobile AI inference optimization
Provides configuration selection guide for mobile developers
Reveals actual performance characteristics of commercial hardware

Practical Value:

Directly applicable to Android application development
Supports model deployment strategy decisions
Identifies framework improvement directions

Reproducibility:

Uses commercial devices and public models
Detailed conversion pipeline description
Code open-sourcing not mentioned

Expected Impact:

Moderate impact: Platform-specific empirical research
Valuable for mobile AI community
May drive LiteRT framework improvements for NPU support

Applicable Scenarios

Best Suited For:

Android Application Development: Developers needing to deploy ResNet or YOLO on devices
Model Selection: Decision support for accuracy-latency trade-offs
Hardware Evaluation: Assessing Snapdragon 8 Gen 2 AI performance
Quantization Strategy Selection: Choosing quantization schemes based on task type

Not Suitable For:

Other Mobile Platforms: iOS, other Android SoCs require re-evaluation
Non-Vision Tasks: NLP, speech require additional research
Cloud Deployment: Completely different hardware characteristics
Real-Time Video: Doesn't address continuous frame processing

Extension Directions:

Combine findings with energy optimization
Input for AutoML hardware-aware search
Guide edge AI chip design

References

Key Citations:

MLPerf Benchmark: Reddi et al. (2020) - "MLPerf inference benchmark," defines evaluation principles adopted in this paper
Energy Research:
- Google Environmental Report (2023): AI accounts for 10-15% of energy consumption
- Meta Sustainability Report (2023): Inference accounts for 70% of AI energy consumption
ResNet: He et al. (2016) - "Deep Residual Learning for Image Recognition," ILSVRC 2015 champion
YOLO: Ramos & Sappa (2025) - "A decade of you only look once (yolo) for object detection: A review"
Edge-to-Cloud: Moreschini et al. (2024) - "Edge to cloud tools: A multivocal literature review"

Overall Assessment: This is a solid empirical research paper providing valuable configuration guidance for mobile AI inference optimization. Its main strengths lie in systematic experimental design and comprehensive quantitative results, which clearly reveal NPU advantages and task-specific quantization strategies. Its primary limitations are restricted generalizability beyond a single hardware platform and the absence of energy analysis. It offers high reference value for Android mobile developers and edge AI researchers, though its conclusions require validation across broader hardware and task domains. Future work should add energy measurements, extend to other platforms and tasks, and open-source the experimental code for improved reproducibility.