Hardware optimization on Android for inference of AI models

Gherasim, Sánchez
The pervasive integration of Artificial Intelligence models into contemporary mobile computing is notable across numerous use cases, from virtual assistants to advanced image processing. Optimizing the mobile user experience involves minimal latency and high responsiveness from deployed AI models with challenges from execution strategies that fully leverage real time constraints to the exploitation of heterogeneous hardware architecture. In this paper, we research and propose the optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations evaluate various model quantization schemes and the utilization of on device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.

Basic Information

  • Paper ID: 2511.13453
  • Title: Hardware optimization on Android for inference of AI models
  • Authors: Iulius Gherasim, Carlos García Sánchez (Complutense University of Madrid)
  • Classification: cs.LG (Machine Learning), cs.PF (Performance)
  • Publication Date: November 17, 2025 (arXiv submission)
  • Paper Link: https://arxiv.org/abs/2511.13453

Abstract

This paper investigates hardware optimization for AI model inference on Android systems. Given the widespread integration of AI models in mobile computing, from virtual assistants to advanced image processing, the authors focus on two critical tasks: object detection (YOLO series) and image classification (ResNet). By evaluating different model quantization schemes and on-device accelerators (GPU and NPU), the paper aims to determine empirically which configurations achieve the best trade-off between minimal accuracy loss and maximal inference speed-up.

Research Background and Motivation

1. Problem Statement

As AI models become increasingly prevalent on mobile devices, achieving low-latency, high-responsiveness inference while maintaining model accuracy presents a key challenge. Specific concerns include:

  • How to fully leverage heterogeneous hardware architectures (CPU, GPU, NPU) on mobile devices
  • How to select appropriate model quantization schemes that balance accuracy and speed
  • How to optimize execution configurations for different AI tasks (classification vs. detection)

2. Problem Significance

  • Energy Consumption: Google estimates that AI-related tasks accounted for 10-15% of its total energy consumption between 2019 and 2021, with inference consuming 60% of that energy; Meta reports that inference accounts for 70% of its AI energy consumption
  • Growth Trends: Google's energy consumption grows 21% annually, while Meta's grows 32%
  • User Experience: Mobile AI performance has become a core differentiator, imposing strict real-time and accuracy requirements

3. Limitations of Existing Approaches

  • Early solutions primarily relied on GPU offloading without fully leveraging specialized NPU accelerators
  • Lack of systematic optimization research targeting mobile heterogeneous architectures
  • Quantization scheme selection lacks empirical guidance for different tasks and hardware

4. Research Motivation

  • Adopt MLPerf benchmark principles for systematic performance evaluation on commercial Android devices
  • Select industry-standard models (ResNet for classification, YOLO for detection) as representative benchmarks
  • Fill the gap in empirical research on mobile AI inference optimization

Core Contributions

  1. Systematic Hardware Evaluation: First comprehensive evaluation of CPU, GPU, and NPU performance on commercial Android devices (Samsung Galaxy Tab S9) for AI inference tasks
  2. Quantization Scheme Analysis: Comprehensive comparison of 7 quantization schemes (FP32, FP16, INT8, INT16, FINT8, FINT16, Dynamic) across different hardware platforms for accuracy-speed trade-offs
  3. Task-Specific Optimization Recommendations:
    • For ResNet classification: NPU + INT8 quantization achieves 130× acceleration with <3% accuracy loss
    • For YOLO detection: NPU + FP16 quantization is optimal, avoiding 6.5 mAP accuracy loss from INT8
  4. Pareto Frontier Analysis: Provides multi-objective optimization perspective, identifying optimal trade-off points in accuracy-latency space for different configurations
  5. Practical Findings:
    • NPU demonstrates superior performance across all configurations, achieving up to 298× acceleration (YOLOv8x)
    • Dynamic quantization fails on NPU, revealing hardware compatibility issues
    • CPU multi-threading scalability is limited (maximum 3.4×), attributed to asymmetric core architecture

Methodology Details

Task Definition

This research focuses on two core computer vision tasks:

  1. Image Classification: Input single image, output class label and confidence score (using ResNet series)
  2. Object Detection: Input single image, output multiple bounding boxes, classes, and confidence scores (using YOLO series)

The objective is to identify optimal hardware configuration and quantization scheme combinations for Android mobile devices.

Experimental Architecture

Hardware Platform

Device: Samsung Galaxy Tab S9

SoC: Qualcomm Snapdragon 8 Gen 2 (SM8550-AC)

CPU (Kryo): 8-core big.LITTLE configuration

  • 3 small cores: ARM Cortex-A510 @ 2.0 GHz
  • 4 medium cores: 2×Cortex-A710 + 2×Cortex-A715 @ 2.8 GHz
  • 1 large core: Cortex-X3 @ 3.36 GHz

GPU: Qualcomm Adreno 740

  • 12 shader processing units @ 719 MHz
  • Supports FP32 and FP16 precision execution

NPU (Hexagon Processor):

  • Dedicated tensor, scalar, and vector computation units
  • Shared internal memory architecture
  • Supports Micro Tile Inferencing technology (partitions and executes model layers in parallel)

Software Environment

Framework: LiteRT (the rebranded TensorFlow Lite)

  • CPU/GPU: LiteRT Next 2.0.2
  • NPU: LiteRT 1.4.0 (due to NPU pipeline issues in 2.0.2)

Model Conversion Pipeline:

PyTorch Model → ONNX Format → TFLite Format
  • PyTorch built-in export tools generate ONNX
  • Katsuya Hyodo's onnx2tf package converts to TFLite
  • Quantization completed during onnx2tf conversion

Quantization Schemes Detailed

This study evaluates 7 quantization configurations (see Table II):

Scheme Name   I/O Data Type   Operation Precision   Activations   Weights
FP32          FP32            FP32                  FP32          FP32
FP16          FP32            FP32                  FP32          FP16
INT8          FP32            INT8                  INT8          INT8
INT16         FP32            INT8                  INT16         INT16
FINT8         INT8            INT8                  INT8          INT8
FINT16        INT16           INT8                  INT16         INT16
DYN           FP32            Mixed                 FP32          Mixed

Key Technical Points:

  1. Static Quantization: Weights are converted offline to the target data type (e.g., INT8) and stored in that fixed format
  2. Dynamic Quantization (DYN): Weights are stored as 8-bit integers, while activations are quantized at runtime; this adds runtime overhead but preserves accuracy better than static INT8
  3. INT16 Limitation: LiteRT lacks optimized INT16 kernel implementations, resulting in poor performance
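
To make the static case concrete, the sketch below shows the affine (scale and zero-point) mapping commonly used for 8-bit post-training quantization, where a real value is approximated as scale × (q − zero_point). It is an illustration only, not the conversion code used in the paper: the function names and the sample weights are hypothetical, and a single per-tensor range is assumed.

```python
def quantize_params(rmin, rmax, qmin=-128, qmax=127):
    """Derive scale and zero point so that [rmin, rmax] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map real values to clamped 8-bit integers."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate real values from the integers."""
    return [scale * (q - zero_point) for q in q_values]


# Hypothetical weights, chosen only to exercise the functions.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quantize_params(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
for w, q, r in zip(weights, q_weights, restored):
    print(f"{w:+.4f} -> {q:4d} -> {r:+.4f}")
```

The round-trip error per value is bounded by roughly one quantization step (the scale), which is the source of the accuracy loss that the study measures for INT8.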

Technical Innovations

  1. Hybrid Framework Approach: To work around software compatibility constraints, the study combines LiteRT Next (CPU/GPU) with standard LiteRT (NPU) so that all three devices can be evaluated
  2. Systematic Configuration Space Exploration:
    • 3 hardware × 7 quantization schemes × multiple model sizes
    • Covers 5 ResNet variants (18/34/50/101/152)
    • Covers 5 YOLOv8 variants (n/s/m/l/x)
    • Covers 5 YOLO11 variants (n/s/m/l/x)
  3. Pareto Optimization Perspective: Rather than pursuing single optimality, provides Pareto frontier of accuracy-latency trade-offs supporting multi-objective decision-making
  4. Framework Conversion Loss Quantification: Explicitly measures accuracy loss introduced by PyTorch to LiteRT conversion (ResNet: 0.83-1.77%; YOLO11: 0.2-0.4 mAP)

Experimental Setup

Datasets

  • ResNet Classification: Standard ImageNet validation set
  • YOLO Detection: COCO validation set

Evaluation Metrics

  1. Inference Latency: Average inference time (milliseconds)
  2. Acceleration Ratio: Speed improvement relative to FP32 CPU single-thread baseline
  3. Classification Accuracy: Top-1 accuracy (ResNet)
  4. Detection Accuracy: mean Average Precision (mAP) @ IoU=0.5:0.95 (YOLO)
  5. Accuracy Loss: Accuracy degradation percentage relative to FP32 baseline
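
The detection metric relies on intersection over union (IoU): in the COCO convention, mAP @ IoU=0.5:0.95 averages the average precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU building block and the threshold list; the full mAP computation is what pycocotools provides. The box format (x_min, y_min, x_max, y_max) and the helper names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_satisfied(pred_box, truth_box):
    """Thresholds at which a prediction would count as a match for the ground truth."""
    overlap = iou(pred_box, truth_box)
    return [t for t in COCO_IOU_THRESHOLDS if overlap >= t]


# Hypothetical boxes: a slightly shifted prediction matches only the looser thresholds.
print(thresholds_satisfied((0, 0, 10, 10), (1, 1, 11, 11)))
```

A small localization shift, such as one introduced by quantization error, removes matches at the stricter thresholds first, which lowers the averaged metric even when the class prediction is unchanged.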

Comparison Configurations

Execution Devices:

  • CPU-SC: CPU single-thread
  • CPU-MC: CPU multi-thread (8 cores)
  • GPU32: GPU FP32 mode
  • GPU16: GPU FP16 mode
  • NPU: Neural Processing Unit

Quantization Schemes: FP32, FP16, INT8, INT16, FINT8, FINT16, DYN

Implementation Details

  • Develop custom Android application to execute models and record results
  • Execute multiple inferences per configuration and average results
  • Use pycocotools to compute mAP
  • Use standard top-1 method for classification accuracy evaluation
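
The "multiple inferences, then average" step can be sketched as a small timing harness. The paper's measurements come from a custom Android application, so the Python version below only illustrates the averaging logic; the warm-up count, the run count, and the `run_inference` callable are hypothetical, and the standard deviation is included as an extra that the summary does not say the paper reports.

```python
import statistics
import time


def benchmark(run_inference, warmup=5, runs=50):
    """Time a callable repeatedly and summarize latency in milliseconds."""
    # Warm-up iterations are discarded so one-time setup costs do not skew the mean.
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": runs,
        "mean_ms": statistics.mean(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms) if runs > 1 else 0.0,
    }


# Stand-in workload; a real harness would invoke the model interpreter here.
result = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=2, runs=20)
print(f"{result['mean_ms']:.3f} ms ± {result['stdev_ms']:.3f} ms over {result['runs']} runs")
```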

Experimental Results

Main Results

ResNet Performance

ResNet18 Inference Time (milliseconds):

Configuration   CPU-SC   CPU-MC   GPU32   GPU16   NPU
FP32            79.06    26.34    13.68   5.54    1.20
INT8            23.26    5.63     21.77   22.68   0.61

Key Findings:

  • NPU achieves 65.9× acceleration on FP32, reaching 129.6× acceleration on INT8
  • INT16 quantization performs extremely poorly (>800ms), excluded from subsequent analysis
  • FINT8 quantization shows catastrophic accuracy degradation to 0.08% Top-1, also excluded
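
The acceleration figures follow directly from the ResNet18 latencies listed above, using the FP32 CPU single-thread time as the baseline (the definition given under Evaluation Metrics). The short sketch below reproduces that arithmetic; the latency values are copied from the table, while the dictionary layout and function name are illustrative.

```python
# ResNet18 inference times in milliseconds, as listed in the table above.
RESNET18_LATENCY_MS = {
    "FP32": {"CPU-SC": 79.06, "CPU-MC": 26.34, "GPU32": 13.68, "GPU16": 5.54, "NPU": 1.20},
    "INT8": {"CPU-SC": 23.26, "CPU-MC": 5.63, "GPU32": 21.77, "GPU16": 22.68, "NPU": 0.61},
}

# Baseline: FP32 model on a single CPU thread.
BASELINE_MS = RESNET18_LATENCY_MS["FP32"]["CPU-SC"]


def speedup(scheme, device):
    """Acceleration ratio relative to the FP32 CPU single-thread baseline."""
    return BASELINE_MS / RESNET18_LATENCY_MS[scheme][device]


for scheme, row in RESNET18_LATENCY_MS.items():
    for device in row:
        print(f"{scheme:5s} {device:7s} {speedup(scheme, device):6.1f}x")
```

Computed this way, the NPU entries give 65.9× (FP32) and 129.6× (INT8), matching the key findings, and the table also shows that INT8 models run slower on the GPU than FP32 models do.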

ResNet50 Performance Analysis:

  • NPU + INT8: 121.5× acceleration, accuracy loss only 0.41%
  • GPU16 mode provides approximately 2× acceleration compared to GPU32
  • CPU multi-threading achieves maximum 3.4× acceleration (INT8), far below theoretical 8×

Quantization Impact (Table X):

Model       INT8 Accuracy Loss   DYN Accuracy Loss
ResNet18    2.94%                0.10%
ResNet50    0.41%                0.19%
ResNet152   0.20%                0.07%

Trend: Larger models are more robust to INT8 quantization, with accuracy loss decreasing from 2.94% to 0.20%

YOLO Performance

YOLOv8n Inference Time Comparison:

  • NPU demonstrates best performance
  • FP32: 29× acceleration
  • INT8: 46.8× acceleration
  • Latency higher than ResNet (higher task complexity)

YOLOv8 Accuracy Loss (Table XII):

Model     INT8 Loss (mAP)   DYN Loss (mAP)
YOLOv8n   6.5               0.1
YOLOv8s   6.2               0.0
YOLOv8x   6.1               0.1

Key Insights:

  • INT8 significantly harms detection accuracy (6.1 to 6.5 mAP loss for the models listed)
  • Dynamic quantization is nearly lossless (≤0.1 mAP)
  • Detection tasks must preserve more information (localization as well as classification), which makes them more sensitive to quantization

YOLO11 vs YOLOv8:

  • YOLO11 achieves higher accuracy on small models
  • NPU execution slightly slower (more complex architecture)
  • Dynamic quantization completely fails on NPU
  • INT8 loss increases slightly to average 7.2 mAP

Ablation Studies

CPU Multi-threading Scalability (Table XV)

Model      FP32   FP16   INT8    DYN
ResNet18   3.0×   3.0×   14.0×   10.6×
ResNet50   2.0×   2.0×   9.5×    7.2×
YOLOv8x    2.7×   2.1×   13.4×   10.1×

Analysis:

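  • INT8 yields the largest multi-threaded speed-up; the ResNet18 values are consistent with ratios taken against the FP32 CPU-SC baseline (79.06 ms / 5.63 ms ≈ 14.0×), so they include the gain from quantization as well as from threading
  • Floating-point schemes scale poorly (2-3×)
  • The asymmetric core architecture limits parallel efficiency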
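
A back-of-the-envelope bound helps explain why eight asymmetric cores cannot deliver an 8× speed-up. The sketch below uses the core counts and clock frequencies listed under Hardware Platform and makes two strong assumptions that are not from the paper: per-core throughput is proportional to clock frequency (ignoring differences in microarchitecture, memory bandwidth, and thermal throttling), and the single-thread baseline runs on the Cortex-X3 core. The result is a crude upper bound, not a measured or predicted value.

```python
# (core type, count, clock in GHz) from the hardware description above.
CORES = [
    ("Cortex-X3", 1, 3.36),
    ("Cortex-A710/A715", 4, 2.8),
    ("Cortex-A510", 3, 2.0),
]


def clock_weighted_speedup_bound(cores, baseline_ghz):
    """Ideal speed-up if throughput scaled linearly with clock (an assumption)."""
    total_ghz = sum(count * ghz for _, count, ghz in cores)
    return total_ghz / baseline_ghz


# Assumed baseline: one thread pinned to the fastest core.
bound = clock_weighted_speedup_bound(CORES, baseline_ghz=3.36)
print(f"Clock-weighted upper bound: {bound:.2f}x across {sum(c for _, c, _ in CORES)} cores")
```

Under these assumptions the bound is about 6.1× rather than 8×, and real scaling is lower still because the smaller cores also do less work per clock cycle.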

GPU Precision Mode Impact (Table VIII)

GPU32 vs GPU16 on ResNet50:

  • Quantization schemes have minimal impact on GPU speed
  • GPU16 mode provides stable 2× acceleration
  • Larger models show greater advantages in GPU16 mode

NPU Dynamic Quantization Failure Analysis

  • Dynamic quantization models contain mixed-precision layers
  • NPU lacks native runtime data type conversion support
  • Requires frequent NPU-CPU data transfers
  • Results in severe performance degradation (ResNet50: only 2.3× acceleration vs. 121.5× for INT8)

Pareto Frontier Analysis

ResNet Pareto Frontier (Figure 6):

  • INT8 configurations dominate frontier: significant latency reduction with acceptable accuracy loss
  • Optimal configuration: NPU + INT8, applicable to all ResNet sizes
  • FP16 on GPU provides accuracy-speed balance point

YOLO Pareto Frontier (Figure 7):

  • FP16 configurations dominate frontier: INT8 accuracy loss too large
  • Optimal configuration: NPU + FP16
  • YOLO11s excels among small models
  • Differences between YOLOv8 and YOLO11 diminish in large models (l/x)
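
The idea behind the Pareto frontier can be sketched in a few lines: a configuration stays on the frontier only if no other configuration is at least as fast and at least as accurate, and strictly better in one of the two. The function below is a generic illustration, not the authors' analysis code, and the sample points are hypothetical placeholders rather than values from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, latency_ms, accuracy) tuples, fastest first.

    Lower latency and higher accuracy are both preferred.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sweep from fastest to slowest; on equal latency, consider the more accurate first.
    for name, latency_ms, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        # Keep a point only if it beats every faster-or-equal point on accuracy.
        if accuracy > best_accuracy:
            frontier.append((name, latency_ms, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical configurations: (label, latency in ms, accuracy in %).
candidates = [
    ("A", 1.0, 70.0),
    ("B", 0.6, 68.0),
    ("C", 5.0, 70.5),
    ("D", 20.0, 69.0),  # slower and less accurate than A: dominated
    ("E", 0.6, 66.0),   # same latency as B but less accurate: dominated
]
for name, latency_ms, accuracy in pareto_frontier(candidates):
    print(f"{name}: {latency_ms} ms, {accuracy}%")
```

Choosing among the surviving points is then a matter of application requirements, for example the largest accuracy loss or the highest latency that a given use case can tolerate.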

Experimental Findings Summary

  1. NPU Absolute Advantage: NPU provides best performance across all scenarios, achieving up to 298× acceleration (YOLOv8x + INT8)
  2. Task-Specific Quantization Strategies:
    • Classification tasks (ResNet): INT8 optimal
    • Detection tasks (YOLO): FP16 optimal
  3. Hardware Characteristics:
    • GPU: Quantization has minimal impact, FP16 mode critical
    • CPU: Multi-threading scalability limited, INT8 provides best parallelism
    • NPU: Does not support dynamic quantization, requires static optimization
  4. Model Size Effects:
    • Larger models more robust to quantization
    • GPU achieves higher acceleration ratios on large models (YOLOv8x: 39×)
  5. Framework Conversion Loss: Non-negligible accuracy degradation (1-2%), must be considered in optimization

Major Research Directions

  1. MLPerf Benchmarking: This paper adopts MLPerf principles for evaluating ML inference systems, from embedded devices to data centers, achieving software framework and architecture-neutral evaluation
  2. Mobile AI Framework Evolution:
    • PyTorch, ONNX, TensorFlow: General-purpose AI development frameworks
    • TensorFlow Lite → LiteRT: Lightweight mobile runtime
    • LiteRT Next: Native accelerator offloading support
  3. Heterogeneous Computing Paradigm:
    • Edge-to-Cloud model: Local edge processing optimizes latency, complex tasks offloaded to cloud
    • DSA (Domain-Specific Architecture): NPU as specialized tensor computation accelerator
  4. Quantization Techniques:
    • Post-training quantization (adopted in this paper)
    • Quantization-aware training
    • Mixed-precision strategies

Relative Advantages of This Work

  1. Systematic Evaluation: First comprehensive evaluation of CPU/GPU/NPU on commercial Android devices
  2. Empirical Guidance: Provides specific configuration recommendations for different tasks, not just theoretical analysis
  3. Pareto Perspective: Multi-objective optimization approach revealing accuracy-speed trade-off space
  4. Problem Discovery: Identifies practical deployment issues like NPU dynamic quantization incompatibility and CPU scalability limitations
  5. Industrial Relevance: Uses MLPerf standard models with results directly applicable to production environments

Conclusions and Discussion

Main Conclusions

  1. NPU is Optimal Execution Device: Achieves over 120× acceleration on ResNet (and up to 298× on YOLOv8x) compared to the CPU single-thread baseline, confirming its critical role in low-latency edge AI
  2. Optimal Quantization is Trade-off Problem:
    • ResNet: INT8 optimal, speed gains on NPU outweigh accuracy loss
    • YOLO: FP16 optimal, INT8 accuracy loss (6.5 mAP) unacceptable
    • GPU: Quantization minimally affects speed, FP16 balances accuracy and speed
  3. Model Performance and Scalability:
    • YOLO11s excels on Pareto frontier, providing best speed/accuracy trade-off with FP16 quantization
    • YOLO11 achieves higher accuracy on small models, with slightly increased complexity
  4. System Limitation Identification:
    • Dynamic quantization fails on NPU (lacks native support)
    • CPU multi-threading scalability poor (maximum 3.4×), attributed to asymmetric core architecture
    • Framework conversion introduces roughly 1-2% accuracy loss for ResNet (0.2-0.4 mAP for YOLO11)

Limitations

  1. Single Hardware Platform: Testing only on Snapdragon 8 Gen 2, generalizability to other SoCs unverified
  2. Limited Task Scope: Covers only computer vision (classification and detection), excludes NLP, speech, and other AI tasks
  3. Missing Energy Analysis: No power consumption measurements, Pareto analysis lacks energy efficiency dimension
  4. Software Version Dependency: NPU requires legacy LiteRT 1.4.0, potentially affecting performance
  5. Static Workloads: Ignores dynamic batching, model switching, and other real-world application scenarios
  6. Incomplete INT16 Evaluation: Early exclusion due to LiteRT kernel limitations prevents thorough analysis

Future Directions

  1. Energy Integration: Complete three-dimensional Pareto analysis including power consumption (accuracy-latency-energy efficiency)
  2. Software Optimization:
    • Mitigate NPU dynamic quantization compatibility issues
    • Eliminate framework conversion accuracy loss
  3. Task Extension: Research other MLPerf benchmark tasks (NLP, image segmentation)
  4. Hardware Generalization: Validate conclusions across multiple mobile SoCs
  5. Quantization-Aware Training: Explore training-time quantization to reduce INT8 accuracy loss
  6. Real-time Applications: Evaluate video streaming, multi-model concurrency, and other practical scenarios

In-Depth Evaluation

Strengths

  1. Rigorous Experimental Design:
    • Systematic configuration space exploration (3 hardware × 7 quantization × 15 model variants)
    • Clear baselines and comparison dimensions
    • Multiple measurements with averaging for reliability
  2. High Practical Value:
    • Targets commercial devices and industry-standard models
    • Provides actionable configuration recommendations
    • Identifies practical deployment issues (e.g., dynamic quantization failure)
  3. In-Depth Analysis:
    • Pareto frontier supports multi-objective decision-making
    • Quantifies framework conversion loss
    • Reveals hardware characteristics (e.g., CPU asymmetric architecture effects)
  4. Comprehensive Results:
    • Extensive quantitative data (multiple tables)
    • Clear visualizations (Pareto graphs, speed comparison charts)
    • Trend analysis across model sizes
  5. Transparent Methodology:
    • Detailed hardware specifications
    • Software versions and conversion pipeline documentation
    • Acknowledges limitations (e.g., software compatibility issues)

Weaknesses

  1. Limited Generalizability:
    • Single hardware platform (Snapdragon 8 Gen 2)
    • Applicability to other mobile chips (Apple A-series, Huawei Kirin) unknown
  2. Missing Energy Analysis:
    • Title emphasizes "optimization" but lacks power measurements
    • Energy efficiency equally important for mobile devices
    • Incomplete Pareto analysis
  3. Statistical Significance:
    • No confidence intervals or standard deviations reported
    • Lacks significance testing
    • Sample size for multiple runs unclear
  4. Insufficient Comparisons:
    • No comparison with other quantization methods (quantization-aware training)
    • No comparison with other mobile AI frameworks (NCNN, MNN)
    • Missing latency comparison with cloud inference
  5. Simplified Real-World Scenarios:
    • Single image inference, ignores batch processing
    • Doesn't test model warm-up, cache effects
    • Ignores Android system interference
  6. Weak Theoretical Explanation:
    • Lacks architectural-level explanation for NPU INT8 superiority
    • Insufficient analysis of CPU multi-threading scalability limitations
    • No latency prediction model established

Impact

Contribution to Field:

  • Fills empirical research gap in mobile AI inference optimization
  • Provides configuration selection guide for mobile developers
  • Reveals actual performance characteristics of commercial hardware

Practical Value:

  • Directly applicable to Android application development
  • Supports model deployment strategy decisions
  • Identifies framework improvement directions

Reproducibility:

  • Uses commercial devices and public models
  • Detailed conversion pipeline description
  • Code open-sourcing not mentioned

Expected Impact:

  • Moderate impact: Platform-specific empirical research
  • Valuable for mobile AI community
  • May drive LiteRT framework improvements for NPU support

Applicable Scenarios

Best Suited For:

  1. Android Application Development: Developers needing to deploy ResNet or YOLO on devices
  2. Model Selection: Decision support for accuracy-latency trade-offs
  3. Hardware Evaluation: Assessing Snapdragon 8 Gen 2 AI performance
  4. Quantization Strategy Selection: Choosing quantization schemes based on task type

Not Suitable For:

  1. Other Mobile Platforms: iOS, other Android SoCs require re-evaluation
  2. Non-Vision Tasks: NLP, speech require additional research
  3. Cloud Deployment: Completely different hardware characteristics
  4. Real-Time Video: Doesn't address continuous frame processing

Extension Directions:

  • Combine findings with energy optimization
  • Input for AutoML hardware-aware search
  • Guide edge AI chip design

References

Key Citations:

  1. MLPerf Benchmark: Reddi et al. (2020) - "MLPerf inference benchmark," defines evaluation principles adopted in this paper
  2. Energy Research:
    • Google Environmental Report (2023): AI accounts for 10-15% of energy consumption
    • Meta Sustainability Report (2023): Inference accounts for 70% of AI energy consumption
  3. ResNet: He et al. (2016) - "Deep Residual Learning for Image Recognition," ILSVRC 2015 champion
  4. YOLO: Ramos & Sappa (2025) - "A decade of you only look once (yolo) for object detection: A review"
  5. Edge-to-Cloud: Moreschini et al. (2024) - "Edge to cloud tools: A multivocal literature review"

Overall Assessment: This is a solid empirical research paper that provides valuable configuration guidance for mobile AI inference optimization. Its main strengths are a systematic experimental design and comprehensive quantitative results, which clearly reveal the advantages of the NPU and the need for task-specific quantization strategies. Its primary limitations are the restriction to a single hardware platform and the absence of energy analysis. The paper offers high reference value for Android developers and edge AI researchers, though its conclusions require validation across broader hardware and task domains. Future work should add energy measurements, extend to other platforms and tasks, and open-source the experimental code to improve reproducibility.