Chiplet-Based RISC-V SoC with Modular AI Acceleration

Ramkumar, Bharadwaj

Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this balance, mainly because of low manufacturing yields (below 16%) for large 360 mm^2 dies at advanced process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system-level optimization. The proposed design integrates four key innovations on a 30 mm x 30 mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The architecture combines a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry-standard benchmarks such as MobileNetV2, ResNet-50, and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain, corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW / 244 images/s), while maintaining sub-5ms real-time capability across all tested workloads. The results indicate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability, and upgradeability, all of which are crucial for next-generation edge AI devices.

Basic Information

Paper ID: 2509.18355
Title: Chiplet-Based RISC-V SoC with Modular AI Acceleration
Authors: Suhas Suresh Bharadwaj (Birla Institute of Technology and Science, Pilani – Dubai), Prerana Ramkumar (American University of Sharjah)
Classification: cs.AR (Computer Architecture), cs.AI (Artificial Intelligence)
Publication Venue: Not explicitly specified; available as an arXiv preprint
Paper Link: https://arxiv.org/abs/2509.18355

Abstract

This paper proposes a novel chiplet-based RISC-V SoC architecture that addresses the performance-efficiency-cost trade-off challenges in edge AI devices through modular AI acceleration and intelligent system-level optimization. The design integrates four key innovations on a 30mm×30mm silicon interposer: adaptive cross-chiplet dynamic voltage-frequency scaling (DVFS), AI-aware UCIe protocol extensions, distributed cryptographic security, and intelligent sensor-driven workload migration. Experimental results demonstrate that compared to baseline chiplet implementations, the AI-optimized configuration achieves 14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction, with an overall efficiency gain of 40.1%.

Research Background and Motivation

Problem Definition

Edge AI platforms must meet stringent performance requirements, including sub-millisecond end-to-end latency and power envelopes below 2W, while executing increasingly complex deep networks such as MobileNetV2 and ResNet-50. However, traditional monolithic system-on-chip (SoC) approaches face manufacturing and yield challenges.

Problem Significance

Market Demand: Roughly 500 billion devices are projected by 2030, with edge AI platforms accounting for a significant share
Technical Challenges: At advanced process nodes, dies of several hundred square millimeters (360 mm^2 in the paper's example) yield below 16%
Application Requirements: Autonomous driving, industrial automation, and medical applications impose strict real-time inference requirements
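
The yield argument can be made concrete with a small back-of-the-envelope sketch. It assumes the simple Poisson yield model Y = exp(-A * D0), which the summary does not state and which may differ from the paper's own model; it also applies one defect density to every die, ignoring that the chiplets use different process nodes. The function names are illustrative only.

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Poisson yield model (an assumption): Y = exp(-A * D0), A in cm^2."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

def implied_defect_density(area_mm2: float, yield_fraction: float) -> float:
    """Invert the Poisson model to get D0 (defects per cm^2)."""
    return -math.log(yield_fraction) / (area_mm2 / 100.0)

# Defect density implied by the quoted ~16% yield for a 360 mm^2 die.
D0 = implied_defect_density(360.0, 0.16)

# Yield of chiplet-sized dies under the same (assumed) defect density.
cpu_yield = poisson_yield(5.0 * 5.0, D0)   # 25 mm^2 CPU chiplet
ai_yield = poisson_yield(6.0 * 4.0, D0)    # 24 mm^2 AI accelerator chiplet

print(f"implied D0: {D0:.3f} defects/cm^2")
print(f"25 mm^2 die yield: {cpu_yield:.1%}")
print(f"24 mm^2 die yield: {ai_yield:.1%}")
```

Under these assumptions, dies in the 24 to 25 mm^2 range yield far better than the 360 mm^2 monolithic die, which is the economic case for decomposition. Packaging and known-good-die test costs are not modeled here.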

Limitations of Existing Approaches

Monolithic SoCs: Poor manufacturing yield at advanced process nodes, and therefore poor economics
Traditional DVFS: Long voltage transition times (tens of microseconds) limit fine-grained adjustment
Security Integration: Multi-vendor chiplet integration introduces security risks, including counterfeiting, cloning, and supply chain tampering

Research Motivation

Chiplet-based 2.5D integration provides a practical alternative: it decomposes a large SoC into smaller heterogeneous chiplets interconnected through a high-density interposer.

Core Contributions

Proposed a chiplet-based RISC-V SoC architecture: Integrating a 7nm RISC-V CPU chiplet, dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory, and a dedicated power management controller
Implemented four key system innovations:
- Adaptive cross-chiplet DVFS system
- AI-aware UCIe protocol extensions
- Distributed cryptographic security framework
- Intelligent thermal management system
Verified significant performance improvements: Achieving 14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to baseline chiplet implementations
Demonstrated real-time processing capability: Maintaining sub-5ms real-time capability across all tested workloads

Methodology Details

System Architecture Design

Overall Architecture

The system employs a modular chiplet architecture on a 30mm×30mm silicon interposer, comprising:

RISC-V CPU Chiplet: 5mm×5mm, 7nm process, with embedded custom vector extensions
AI Accelerator Chiplets: Dual 6mm×4mm, 5nm process, each providing 15 TOPS INT8 compute
HBM3 Memory: 16GB capacity, 819 GB/s bandwidth
I/O and Power Management Chiplet: 7mm×3mm
Security Controller: 3mm×2mm
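
As a quick sanity check on the floorplan, the sketch below totals the listed chiplet footprints and compares them with the interposer area. The HBM3 stack footprint is not given in this summary, so it is left out; the dictionary keys are labels chosen for this example.

```python
# (width_mm, height_mm, count) for each chiplet listed above.
# HBM3 is omitted because its footprint is not stated.
CHIPLETS = {
    "riscv_cpu": (5.0, 5.0, 1),
    "ai_accelerator": (6.0, 4.0, 2),
    "io_power_mgmt": (7.0, 3.0, 1),
    "security_ctrl": (3.0, 2.0, 1),
}
INTERPOSER_MM = (30.0, 30.0)

def total_chiplet_area(chiplets: dict) -> float:
    """Sum of width * height * count over all chiplets, in mm^2."""
    return sum(w * h * n for w, h, n in chiplets.values())

logic_area = total_chiplet_area(CHIPLETS)
interposer_area = INTERPOSER_MM[0] * INTERPOSER_MM[1]
utilization = logic_area / interposer_area

print(f"chiplet area (excluding HBM3): {logic_area:.0f} mm^2")
print(f"interposer area: {interposer_area:.0f} mm^2")
print(f"utilization: {utilization:.1%}")
```

The listed chiplets cover only a small fraction of the interposer, leaving the remainder for the HBM3 stacks, routing, and spacing.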

UCIe Interconnect System

Employs UCIe 2.0 die-to-die links for chiplet communication:

Bandwidth: ~30 GB/s
Latency: <2ns
Protocol Support: Simultaneously handles CXL memory traffic and other streaming data protocols
Extended Features: Streaming flow control units (FLITs), predictive prefetching, and compression-aware transfers

Key Technical Innovations

1. Adaptive Cross-Chiplet DVFS

Technical Characteristics:

Employs on-chip regulators for nanosecond-scale voltage switching
Predicts workload phases and reallocates power through fine-grained voltage islands
Avoids the tens-of-microseconds voltage transition times that limit traditional DVFS

Performance Improvements:

12% energy reduction for memory-intensive workloads
Negligible performance impact
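
A minimal sketch of the idea, assuming the standard dynamic-power relation P = C_eff * V^2 * f and a two-point policy that drops to a lower operating point during memory-bound phases. The operating points, the capacitance, and the phase mix are all hypothetical values chosen for illustration; none come from the paper, and the sketch is not expected to reproduce the 12% figure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class OperatingPoint:
    voltage_v: float
    freq_ghz: float

# Hypothetical operating points and effective switched capacitance.
HIGH = OperatingPoint(voltage_v=0.80, freq_ghz=1.5)
LOW = OperatingPoint(voltage_v=0.70, freq_ghz=1.2)
C_EFF_NF = 1.0

def dynamic_power_w(op: OperatingPoint, c_eff_nf: float = C_EFF_NF) -> float:
    """P = C_eff * V^2 * f; nF * V^2 * GHz works out to watts."""
    return c_eff_nf * op.voltage_v ** 2 * op.freq_ghz

def pick_point(memory_bound: bool) -> OperatingPoint:
    """Memory-bound phases gain little from a fast clock, so scale down."""
    return LOW if memory_bound else HIGH

def average_power_w(phases: List[Tuple[float, bool]]) -> float:
    """phases: (fraction_of_time, memory_bound) pairs that sum to 1."""
    return sum(frac * dynamic_power_w(pick_point(mb)) for frac, mb in phases)

# Hypothetical workload: 60% compute-bound, 40% memory-bound.
phases = [(0.6, False), (0.4, True)]
adaptive = average_power_w(phases)
always_high = dynamic_power_w(HIGH)
print(f"adaptive: {adaptive:.3f} W, fixed: {always_high:.3f} W")
```

Fast on-chip regulators matter because a policy like this only pays off when the switch between points is much shorter than the phases themselves.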

2. AI-Aware UCIe Protocol Extensions

Design Highlights:

Complete die-to-die communication stack based on the UCIe 2.0 specification
Includes physical layer, die-to-die adapter layer, and protocol layer
Supports streaming flow control units (FLITs) and compression-aware transfers
Provides standardized architecture for system-level manageability, debugging, and testing

3. Distributed Security Framework (AuthenTree)

Security Strategy:

Employs tree-based multi-party computation (MPC) protocol
Decentralized security architecture avoiding single points of failure
Integrates cryptographic links and cryptographic identity for each chiplet
Scalable distributed security framework in zero-trust environments
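
The summary gives no protocol details for AuthenTree, so the sketch below does not implement its multi-party computation scheme. It swaps in a much simpler technique, a Merkle-style hash tree over per-chiplet HMAC tags, purely to illustrate how per-chiplet identities can be aggregated in a tree so that a single root value covers every chiplet. The key provisioning, chiplet names, and function names are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List

def leaf_tag(key: bytes, chiplet_id: str, nonce: bytes) -> bytes:
    """Per-chiplet response: HMAC-SHA256 over a fresh nonce and the ID."""
    return hmac.new(key, nonce + chiplet_id.encode(), hashlib.sha256).digest()

def merkle_root(leaves: List[bytes]) -> bytes:
    """Pairwise-hash leaves up to a single root (last leaf duplicated if odd)."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]

def attest(keys: Dict[str, bytes], nonce: bytes) -> bytes:
    """Aggregate all chiplet tags (in sorted ID order) into one root."""
    leaves = [leaf_tag(keys[cid], cid, nonce) for cid in sorted(keys)]
    return merkle_root(leaves)

# Hypothetical provisioned keys; real keys would never be hard-coded.
EXAMPLE_KEYS = {
    "cpu": b"key-cpu",
    "ai0": b"key-ai0",
    "ai1": b"key-ai1",
    "io": b"key-io",
}
nonce = b"fresh-nonce"
expected_root = attest(EXAMPLE_KEYS, nonce)

# A chiplet holding the wrong key (for example, a counterfeit) changes the root.
tampered = dict(EXAMPLE_KEYS, ai1=b"not-the-real-key")
print(hmac.compare_digest(attest(tampered, nonce), expected_root))
```

Unlike the decentralized design described above, this stand-in still needs a verifier that knows every key; it shows only the tree aggregation, not the removal of the single point of failure.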

4. Intelligent Thermal Management

Predictive Approach:

Sensor-driven workload migration
Goes beyond purely reactive thermal management (performance throttling only after reaching critical temperature)
Intelligent prediction and proactive load distribution
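
The difference between reactive throttling and predictive migration can be sketched as follows. The linear extrapolation, the temperature limit, the horizon, and the sensor readings are assumptions made for this example; the paper's actual predictor is not described in this summary.

```python
from typing import Dict, List

def predict_temp_c(history: List[float], horizon_steps: int) -> float:
    """Linear extrapolation from the last two sensor samples."""
    if len(history) < 2:
        return history[-1]
    slope = history[-1] - history[-2]
    return history[-1] + slope * horizon_steps

def choose_target(
    sensor_history: Dict[str, List[float]],
    current: str,
    limit_c: float,
    horizon_steps: int = 3,
) -> str:
    """Stay put unless the current chiplet is predicted to reach the limit
    and another chiplet is predicted to be cooler."""
    predicted = {
        name: predict_temp_c(hist, horizon_steps)
        for name, hist in sensor_history.items()
    }
    if predicted[current] < limit_c:
        return current
    coolest = min(predicted, key=predicted.get)
    return coolest if predicted[coolest] < predicted[current] else current

# Hypothetical readings (deg C): ai0 is heating quickly, ai1 is nearly idle.
history = {"ai0": [70.0, 74.0, 78.0], "ai1": [55.0, 55.0, 56.0]}
print(choose_target(history, current="ai0", limit_c=85.0))
```

In this example ai0 is still below the 85 degree limit at the latest sample, so a purely reactive scheme would not act yet, while the predictive policy already moves the work to ai1.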

Experimental Setup

Simulation Framework

Developed a Python-based simulator to evaluate the chiplet RISC-V SoC design:

Models interconnect latency, power consumption, and thermal throttling behavior
Applies power efficiency scaling through fixed voltage scaling factors
Parameters sourced from UCIe specifications, power scaling research, and literature-reported measurements

Test Scenarios

Defined four test scenarios:

Scenario	Latency (μs)	Bandwidth (Gbps)	Base Power (mW)	Communication Power (mW/ms)	Efficiency Factor
Monolithic SoC	0.0	∞	1500	0.0	1.0
Baseline Chiplet	1.5	16.0	1200	35	0.95
AI-Optimized Chiplet	0.8	24.0	1100	25	0.90
Poor Integration	8.0	8.0	1800	80	1.10

Workloads

Selected representative edge inference tasks from MLPerf Tiny benchmarks:

Workload	Base Computation (ms)	Input Size (MB)	Complexity Factor	Batch Efficiency
MobileNetV2	3.5	0.57	0.8	0.85
ResNet-50	12.0	0.57	1.2	0.90
Real-time Video	2.0	0.30	1.0	0.70
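
To show how the tabulated parameters translate into communication cost, the sketch below computes the time to move one input across the link in each scenario. It assumes that the input crosses the interconnect exactly once, that 1 MB equals 10^6 bytes, and that the overhead is simply link latency plus serialization time. How the paper's simulator combines these terms with the compute and efficiency factors is not described here, so this is not a reconstruction of its latency model.

```python
import math

# name: (link latency in microseconds, bandwidth in Gbps), from the scenario table.
SCENARIOS = {
    "Monolithic SoC": (0.0, math.inf),
    "Baseline Chiplet": (1.5, 16.0),
    "AI-Optimized Chiplet": (0.8, 24.0),
    "Poor Integration": (8.0, 8.0),
}

# name: input size in MB, from the workload table.
INPUT_MB = {"MobileNetV2": 0.57, "ResNet-50": 0.57, "Real-time Video": 0.30}

def comm_overhead_ms(input_mb: float, latency_us: float, bandwidth_gbps: float) -> float:
    """Link latency plus serialization time for one input, in milliseconds.

    MB * 8 gives megabits; megabits / Gbps gives milliseconds.
    """
    transfer_ms = 0.0 if math.isinf(bandwidth_gbps) else input_mb * 8.0 / bandwidth_gbps
    return latency_us / 1000.0 + transfer_ms

for name, (lat_us, bw) in SCENARIOS.items():
    ms = comm_overhead_ms(INPUT_MB["MobileNetV2"], lat_us, bw)
    print(f"{name}: {ms:.4f} ms per MobileNetV2 input")
```

Under these assumptions serialization time dominates the fixed link latency for every chiplet scenario, so the bandwidth column matters more than the latency column for inputs of this size.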

Evaluation Metrics

Inference Latency: Single inference completion time
Throughput: GFLOP/s or images/s
Power Consumption: mW
Energy Efficiency: TOPS/W
Scalability: Batch size effects

Experimental Results

Primary Results

MobileNetV2 Benchmark (Batch Size = 1)

Architecture	Latency (ms)	Throughput (imgs/s)	Power (mW)
Monolithic SoC	4.7 ± 0.2	213	1284
Baseline Chiplet	4.8 ± 0.2	208	1026
AI-Optimized	4.1 ± 0.3	244	860
Poor Integration	6.2 ± 0.3	163	1776

Performance Improvement Analysis

AI-optimized configuration compared to baseline chiplet implementation:

Latency Reduction: From 4.8ms to 4.1ms (≈14.7% reduction)
Throughput Improvement: From 208 images/s to 244 images/s (≈17.3% improvement)
Power Reduction: From 1026mW to 860mW (≈16.2% reduction)
Energy Efficiency Improvement: From 0.203 TOPS/W to 0.284 TOPS/W (≈40.1% improvement)
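
The percentages above can be recomputed from the rounded table entries. The quoted efficiency values (0.203 and 0.284) coincide numerically with throughput divided by power in images/s per mW, so the sketch computes efficiency that way; whether the paper derives its TOPS/W figures the same way is an assumption.

```python
# Values copied from the MobileNetV2 table (batch size 1).
BASELINE = {"latency_ms": 4.8, "throughput_ips": 208.0, "power_mw": 1026.0}
AI_OPT = {"latency_ms": 4.1, "throughput_ips": 244.0, "power_mw": 860.0}

def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0

def images_per_mj(row: dict) -> float:
    """Throughput over power: (images/s) / mW = images per millijoule."""
    return row["throughput_ips"] / row["power_mw"]

latency_reduction = -pct_change(AI_OPT["latency_ms"], BASELINE["latency_ms"])
throughput_gain = pct_change(AI_OPT["throughput_ips"], BASELINE["throughput_ips"])
power_reduction = -pct_change(AI_OPT["power_mw"], BASELINE["power_mw"])
efficiency_gain = pct_change(images_per_mj(AI_OPT), images_per_mj(BASELINE))

for label, value in [
    ("latency reduction", latency_reduction),
    ("throughput gain", throughput_gain),
    ("power reduction", power_reduction),
    ("efficiency gain", efficiency_gain),
]:
    print(f"{label}: {value:.1f}%")
```

With the rounded inputs, the latency reduction comes out nearer 14.6% than the reported 14.7%, and the efficiency gain nearer 40.0% than 40.1%. Gaps of this size are consistent with the paper computing from unrounded measurements, though that is a guess.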

Cross-Workload Performance

Energy Efficiency Metric: ≈3.5 mJ per MobileNetV2 inference (860 mW / 244 images/s)
Real-time Capability: All tested workloads meet sub-5ms requirements
Batch Scaling: AI-optimized configuration maintains highest throughput across batch sizes 1-32
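
The energy-per-inference figure follows directly from power divided by throughput, since mW divided by images/s gives mJ per image. The sketch below applies the same arithmetic to all four architectures in the MobileNetV2 table; only the AI-optimized value is quoted in the paper summary, and the others are derived here.

```python
# name: (throughput in images/s, power in mW), from the MobileNetV2 table.
ROWS = {
    "Monolithic SoC": (213.0, 1284.0),
    "Baseline Chiplet": (208.0, 1026.0),
    "AI-Optimized": (244.0, 860.0),
    "Poor Integration": (163.0, 1776.0),
}

def energy_per_inference_mj(throughput_ips: float, power_mw: float) -> float:
    """mW / (images/s) = mW * s per image = mJ per image."""
    return power_mw / throughput_ips

energy = {
    name: energy_per_inference_mj(ips, mw) for name, (ips, mw) in ROWS.items()
}
for name, mj in energy.items():
    print(f"{name}: {mj:.2f} mJ per inference")
```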

Experimental Findings

Architectural Advantages: Modular chiplet design achieves near-monolithic compute density
Cost-Effectiveness: Achieves cost efficiency, scalability, and upgradeability while maintaining performance
Real-time Guarantees: Consistent performance across all workloads
Power Optimization: Significant power reduction without sacrificing performance

Related Work

Primary Research Directions

Edge AI Platforms: Supporting real-time inference for autonomous systems, industrial automation, medical applications
Chiplet Technology: 2.5D integration technology enabling heterogeneous chip interconnection through silicon interposers
AI Accelerators: 5nm AI inference accelerators achieving up to 95.6 TOPS/W efficiency
Memory Technology: HBM3 providing up to 819 GB/s bandwidth alleviating external DRAM bottlenecks

Innovations in This Work

System-Level Optimization: Comprehensive solution combining DVFS, UCIe optimization, distributed security, and thermal management
Real-Time Performance: Focused on real-time inference requirements for edge AI
Modular Design: Chiplet architecture balancing performance, cost, and upgradeability

Conclusions and Discussion

Main Conclusions

Technical Feasibility: Chiplet-based RISC-V SoC architecture successfully addresses the performance-efficiency-cost trade-off for edge AI devices
Significant Performance Gains: Integration of four key innovations achieves comprehensive improvements in performance, power, and efficiency
Practical Value: Provides a viable solution for next-generation edge AI device applications

Limitations

Simulation Verification: Results come from a Python simulator and have not been validated on actual hardware
Workload Range: Testing limited to three specific AI workloads
Cost Analysis: Lacks detailed manufacturing cost comparison analysis
Long-term Reliability: Does not evaluate long-term operational reliability and stability

Future Directions

Hardware Prototype: Develop actual hardware prototype for validation
Extended Evaluation: Test performance across broader AI workloads
Manufacturing Optimization: Research further optimization of chiplet manufacturing and integration
Standardization: Promote development of chiplet interconnect and security standards

In-Depth Evaluation

Strengths

Systematic Innovation: Proposes comprehensive solution integrating four key technical innovations, systematically addressing multiple critical issues in chiplet design
Practical Orientation: Addresses actual edge AI requirements, focusing on real-time performance and power efficiency
Quantitative Evaluation: Provides detailed performance data and comparative analysis with convincing results
Technical Depth: Covers multiple levels from hardware architecture to system-level optimization

Weaknesses

Verification Limitations: Relies solely on simulation, with no hardware implementation or testing
Parameter Sources: Simulation parameters are taken from specifications and published measurements rather than measured directly, so their accuracy and representativeness are uncertain
Insufficient Cost Analysis: Lacks detailed economic analysis and manufacturing cost comparison
Security Verification: Actual effectiveness of distributed security framework insufficiently verified

Impact

Academic Contribution: Provides important reference for chiplet architecture design in edge AI applications
Technology Advancement: May promote development of UCIe protocol extensions and chiplet security standards
Industrial Value: Provides practical solutions for chiplet technology development in semiconductor industry
Research Direction: Provides foundational framework and evaluation methodology for subsequent related research

Applicable Scenarios

Edge AI Devices: Applications requiring real-time AI inference such as autonomous driving, industrial automation, intelligent surveillance
High-Performance Computing: Scenarios requiring modular, scalable computing capability
Cost-Sensitive Applications: Commercial applications requiring performance-cost balance
Prototype Development: Provides reference for further research and development of chiplet architectures

References

The paper cites 19 references covering edge AI, chiplet technology, DVFS, security architecture, and related areas.

Overall Assessment: This paper proposes a chiplet architecture for edge AI that combines DVFS, UCIe protocol extensions, distributed security, and thermal management in one design. Its results rest on simulation rather than hardware measurement, but the system-level integration and the quantitative comparison against baseline chiplet and monolithic configurations make it a useful contribution to the computer architecture field.