Chiplet-Based RISC-V SoC with Modular AI Acceleration

Ramkumar, Bharadwaj
Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this balance, mainly because manufacturing yields are low (below 16%) for large 360 mm^2 dies at advanced process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system-level optimization. The proposed design integrates four key innovations on a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry-standard benchmarks such as MobileNetV2, ResNet-50, and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain, corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW / 244 images/s), while maintaining sub-5ms real-time capability across all evaluated workloads. These results demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability, and upgradeability, all of which are crucial for next-generation edge AI devices.

Basic Information

  • Paper ID: 2509.18355
  • Title: Chiplet-Based RISC-V SoC with Modular AI Acceleration
  • Authors: Suhas Suresh Bharadwaj (Birla Institute of Technology and Science, Pilani – Dubai), Prerana Ramkumar (American University of Sharjah)
  • Classification: cs.AR (Computer Architecture), cs.AI (Artificial Intelligence)
  • Publication Time/Conference: arXiv preprint; no conference venue specified
  • Paper Link: https://arxiv.org/abs/2509.18355

Abstract

This paper proposes a novel chiplet-based RISC-V SoC architecture that addresses the performance-efficiency-cost trade-off challenges in edge AI devices through modular AI acceleration and intelligent system-level optimization. The design integrates four key innovations on a 30mm×30mm silicon interposer: adaptive cross-chiplet dynamic voltage-frequency scaling (DVFS), AI-aware UCIe protocol extensions, distributed cryptographic security, and intelligent sensor-driven workload migration. Experimental results demonstrate that compared to baseline chiplet implementations, the AI-optimized configuration achieves 14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction, with an overall efficiency gain of 40.1%.

Research Background and Motivation

Problem Definition

Edge AI platforms must meet stringent performance requirements, including sub-millisecond end-to-end latency and power envelopes below 2W, while executing increasingly complex deep networks such as MobileNetV2 and ResNet-50. However, traditional monolithic system-on-chip (SoC) approaches face manufacturing and yield challenges.

Problem Significance

  1. Market Demand: A projected 500 billion devices by 2030, with edge AI platforms accounting for a significant share
  2. Technical Challenges: At advanced process nodes, yields for dies spanning hundreds of square millimeters are extremely low (below 16%)
  3. Application Requirements: Autonomous driving, industrial automation, and medical applications impose strict real-time inference requirements

Limitations of Existing Approaches

  1. Monolithic SoCs: Poor manufacturing yield at advanced process nodes, and therefore poor economics
  2. Traditional DVFS: Long voltage transition times (tens of microseconds) limit fine-grained adjustment
  3. Security Integration: Multi-vendor chiplet integration introduces security risks, including counterfeiting, cloning, and supply chain tampering

Research Motivation

Chiplet-based 2.5D integration technology provides a practical alternative by decomposing large SoCs into smaller heterogeneous chips interconnected through high-density interposers.
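
To make the yield argument concrete, the sketch below uses the simple Poisson yield model Y = exp(-A * D0). This is a textbook approximation and not necessarily the model behind the paper's figure. The defect density `D0` is an assumption, back-solved so that a 360 mm^2 die yields 16%; the 25 mm^2 and 24 mm^2 areas follow from the 5mm×5mm and 6mm×4mm chiplet dimensions given in the architecture description. Using one defect density for every die also ignores that the chiplets are built on different nodes (7nm and 5nm).

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Poisson yield model: probability that a die of this area has zero defects."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

def defect_density_for(area_mm2: float, target_yield: float) -> float:
    """Back-solve the defect density (defects/cm^2) that produces target_yield."""
    return -math.log(target_yield) / (area_mm2 / 100.0)

# Assumption: calibrate D0 so a 360 mm^2 monolithic die yields 16%.
D0 = defect_density_for(360.0, 0.16)

if __name__ == "__main__":
    dies = [("monolithic", 360.0), ("cpu chiplet", 25.0), ("ai chiplet", 24.0)]
    for name, area in dies:
        print(f"{name:12s} {area:6.1f} mm^2  yield ~ {poisson_yield(area, D0):.2f}")
```

Under this calibration each small die yields roughly 0.88, which illustrates why splitting a large die into chiplets helps. Assembly yield and known-good-die testing are left out of the sketch.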

Core Contributions

  1. Proposed a chiplet-based RISC-V SoC architecture: Integrating a 7nm RISC-V CPU chiplet, dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory, and a dedicated power management controller
  2. Implemented four key system innovations:
    • Adaptive cross-chiplet DVFS system
    • AI-aware UCIe protocol extensions
    • Distributed cryptographic security framework
    • Intelligent thermal management system
  3. Verified significant performance improvements: Achieving 14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to baseline chiplet implementations
  4. Demonstrated real-time processing capability: Maintaining sub-5ms real-time capability across all tested workloads

Methodology Details

System Architecture Design

Overall Architecture

The system employs a modular chiplet architecture on a 30mm×30mm silicon interposer, comprising:

  • RISC-V CPU Chiplet: 5mm×5mm, 7nm process, with embedded custom vector extensions
  • AI Accelerator Chiplets: Dual 6mm×4mm, 5nm process, each providing 15 TOPS INT8 compute
  • HBM3 Memory: 16GB capacity, 819 GB/s bandwidth
  • I/O and Power Management Chiplet: 7mm×3mm
  • Security Controller: 3mm×2mm
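
As a quick sanity check on the floorplan, the sketch below sums the chiplet footprints listed above and compares them with the 30mm×30mm interposer. The `Chiplet` class and its field names are illustrative. The HBM3 footprint is not given in this summary, so it is excluded, and routing, spacing, and keep-out areas are ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chiplet:
    name: str
    width_mm: float
    height_mm: float
    count: int = 1

    @property
    def area_mm2(self) -> float:
        # Total silicon area for all instances of this chiplet.
        return self.width_mm * self.height_mm * self.count

INTERPOSER_MM2 = 30.0 * 30.0

CHIPLETS = [
    Chiplet("RISC-V CPU", 5, 5),
    Chiplet("AI accelerator", 6, 4, count=2),
    Chiplet("I/O and power management", 7, 3),
    Chiplet("Security controller", 3, 2),
]

def total_logic_area(chiplets) -> float:
    """Sum of listed chiplet footprints in mm^2 (HBM3 excluded: size not given)."""
    return sum(c.area_mm2 for c in chiplets)

if __name__ == "__main__":
    total = total_logic_area(CHIPLETS)
    print(f"chiplet area: {total:.0f} mm^2 of {INTERPOSER_MM2:.0f} mm^2 "
          f"({100 * total / INTERPOSER_MM2:.1f}%)")
```

The listed chiplets add up to 100 mm^2, about 11% of the interposer, leaving the remainder for the memory stacks, interconnect routing, and spacing.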

UCIe Interconnect System

Employs UCIe 2.0 chip-to-chip links for chiplet communication:

  • Bandwidth: ~30 GB/s
  • Latency: <2ns
  • Protocol Support: Simultaneously handles CXL memory traffic and other streaming data protocols
  • Extended Features: Streaming flow control units (FLITs), predictive prefetching, and compression-aware transfers

Key Technical Innovations

1. Adaptive Cross-Chiplet DVFS

Technical Characteristics:

  • Employs on-chip regulators for nanosecond-scale voltage switching
  • Predicts workload phases and reallocates power through fine-grained voltage islands
  • Overcomes the tens-of-microseconds voltage transition times that limit traditional DVFS

Performance Improvements:

  • 12% energy reduction for memory-intensive workloads
  • Negligible performance impact
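
Two pieces of arithmetic help explain why switching speed matters here. The sketch below uses the standard dynamic-power relation P ~ C * V^2 * f and a simple overhead ratio; neither is taken from the paper's simulator. The specific numbers in the usage example (a 20 µs transition, a 10 ns transition, a 100 µs workload phase, and 0.9 scaling factors) are assumptions chosen only to match the "tens of microseconds" and "nanosecond-scale" ranges mentioned above.

```python
def dynamic_power_ratio(v_scale: float, f_scale: float) -> float:
    """Relative dynamic power after scaling voltage and frequency (P ~ C * V^2 * f)."""
    return v_scale ** 2 * f_scale

def transition_overhead(transition_s: float, phase_s: float) -> float:
    """Fraction of one workload phase spent waiting for the voltage to settle."""
    return transition_s / phase_s

if __name__ == "__main__":
    # Assumed example: scale both voltage and frequency to 90%.
    print(f"dynamic power ratio: {dynamic_power_ratio(0.9, 0.9):.3f}")
    # Assumed example: 100 us workload phase, slow vs fast regulator.
    print(f"slow regulator overhead: {transition_overhead(20e-6, 100e-6):.1%}")
    print(f"fast regulator overhead: {transition_overhead(10e-9, 100e-6):.3%}")
```

With these assumed values, a slow regulator would spend a fifth of each short phase in transition, while a nanosecond-scale regulator makes per-phase adjustment nearly free.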

2. AI-Aware UCIe Protocol Extensions

Design Highlights:

  • Complete chip-to-chip communication stack based on UCIe 2.0 specification
  • Includes physical layer, adaptation layer, and protocol layer
  • Supports streaming flow control units (FLITs) and compression-aware transfers
  • Provides standardized architecture for system-level manageability, debugging, and testing
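
The summary does not describe how the protocol extensions are implemented. As a rough illustration of what "compression-aware" streaming could mean, the sketch below compresses a payload only when that shrinks it, then splits the result into fixed-size flits. Everything here is hypothetical: the function names, the 256-byte payload size, and the use of zlib are stand-ins, and real flits also carry headers and CRC fields that are omitted.

```python
import zlib

FLIT_BYTES = 256  # assumed payload bytes per flit; headers and CRC are omitted

def to_flits(payload: bytes, flit_bytes: int = FLIT_BYTES):
    """Compress only if it helps, then split the body into fixed-size flits."""
    compressed = zlib.compress(payload)
    use_compressed = len(compressed) < len(payload)
    body = compressed if use_compressed else payload
    flits = [body[i:i + flit_bytes] for i in range(0, len(body), flit_bytes)]
    return use_compressed, flits

def from_flits(use_compressed: bool, flits) -> bytes:
    """Reassemble the flits and undo compression if it was applied."""
    body = b"".join(flits)
    return zlib.decompress(body) if use_compressed else body

if __name__ == "__main__":
    sparse = bytes(4096)  # highly compressible, like a mostly-zero activation map
    flag, flits = to_flits(sparse)
    print(f"compressed={flag}, flits sent={len(flits)} instead of {4096 // FLIT_BYTES}")
    assert from_flits(flag, flits) == sparse
```

The point of the per-transfer decision is that sparse tensors cross the link in far fewer flits, while incompressible data is sent unchanged and pays no size penalty.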

3. Distributed Security Framework (AuthenTree)

Security Strategy:

  • Employs tree-based multi-party computation (MPC) protocol
  • Decentralized security architecture avoiding single points of failure
  • Integrates cryptographic links and cryptographic identity for each chiplet
  • Scalable distributed security framework in zero-trust environments
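
The tree-based MPC protocol itself is not specified in this summary, so the sketch below shows a deliberately simpler, related idea: a Merkle hash tree that folds per-chiplet identity records into one root value, so that any altered or substituted identity changes the root. This is not the AuthenTree protocol and involves no multi-party computation; the identity strings and function names are hypothetical.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves) -> bytes:
    """Fold a list of byte strings into one root hash, pairing nodes level by level."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

if __name__ == "__main__":
    # Hypothetical identity records, one per chiplet.
    ids = [b"cpu:7nm:serial-A", b"ai0:5nm:serial-B", b"ai1:5nm:serial-C", b"sec:serial-D"]
    good = merkle_root(ids)
    tampered = merkle_root(ids[:1] + [b"ai0:5nm:CLONE"] + ids[2:])
    print("roots match:", good == tampered)
```

A tree structure of this kind lets each chiplet check a logarithmic-size path instead of trusting a single central verifier, which is the general motivation for tree-based schemes.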

4. Intelligent Thermal Management

Predictive Approach:

  • Sensor-driven workload migration
  • Goes beyond purely reactive thermal management (performance throttling only after reaching critical temperature)
  • Intelligent prediction and proactive load distribution
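
The difference between reactive and predictive management can be sketched in a few lines. The example below extrapolates each chiplet's temperature linearly from its last two sensor samples and proposes a migration before the limit is actually reached. The predictor, the three-step horizon, the 85 °C limit, and all names are assumptions for illustration; the paper's actual prediction method is not described here.

```python
def predict_temperature(samples_c, horizon_steps: int) -> float:
    """Linear extrapolation from the last two samples (needs at least one sample)."""
    if len(samples_c) < 2:
        return samples_c[-1]
    slope = samples_c[-1] - samples_c[-2]
    return samples_c[-1] + slope * horizon_steps

def choose_migration(history_by_chiplet, limit_c: float, horizon_steps: int = 3):
    """Return (source, target) if a chiplet is predicted to exceed limit_c, else None."""
    predicted = {
        name: predict_temperature(samples, horizon_steps)
        for name, samples in history_by_chiplet.items()
    }
    hottest = max(predicted, key=predicted.get)
    coolest = min(predicted, key=predicted.get)
    if predicted[hottest] <= limit_c or hottest == coolest:
        return None
    return hottest, coolest

if __name__ == "__main__":
    history = {"ai0": [70.0, 74.0, 78.0], "ai1": [60.0, 60.0, 61.0]}
    # ai0 is at 78 C now (below the limit) but trending toward 90 C.
    print(choose_migration(history, limit_c=85.0))
```

A purely reactive policy would do nothing in this example because no sensor has crossed the limit yet; the predictive one shifts load from `ai0` to `ai1` in advance.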

Experimental Setup

Simulation Framework

The authors developed a Python-based simulator to evaluate the chiplet RISC-V SoC design. The simulator:

  • Models interconnect latency, power consumption, and thermal throttling behavior
  • Applies power efficiency scaling through fixed voltage scaling factors
  • Takes its parameters from UCIe specifications, power scaling research, and literature-reported measurements

Test Scenarios

Defined four test scenarios:

Scenario             | Latency (μs) | Bandwidth (Gbps) | Base Power (mW) | Communication Power (mW/ms) | Efficiency Factor
Monolithic SoC       | 0.0          | -                | 1500            | 0.0                         | 1.0
Baseline Chiplet     | 1.5          | 16.0             | 1200            | 35                          | 0.95
AI-Optimized Chiplet | 0.8          | 24.0             | 1100            | 25                          | 0.90
Poor Integration     | 8.0          | 8.0              | 1800            | 80                          | 1.10

Workloads

Selected representative edge inference tasks from MLPerf Tiny benchmarks:

Workload        | Base Computation (ms) | Input Size (MB) | Complexity Factor | Batch Efficiency
MobileNetV2     | 3.5                   | 0.57            | 0.8               | 0.85
ResNet-50       | 12.0                  | 0.57            | 1.2               | 0.90
Real-time Video | 2.0                   | 0.30            | 1.0               | 0.70
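
To give a feel for how the scenario and workload parameters interact, the sketch below estimates the time to move one input across the chiplet link as link latency plus size divided by bandwidth. This is plain arithmetic on the tabulated values, not the paper's simulator: how the simulator actually combines these parameters is not stated in this summary, and 1 MB = 10^6 bytes is an assumption.

```python
# (link latency in microseconds, bandwidth in Gbps), from the test scenario table
SCENARIOS = {
    "Baseline Chiplet": (1.5, 16.0),
    "AI-Optimized Chiplet": (0.8, 24.0),
    "Poor Integration": (8.0, 8.0),
}

# Input size in MB, from the workload table
INPUT_MB = {"MobileNetV2": 0.57, "ResNet-50": 0.57, "Real-time Video": 0.30}

def transfer_time_us(input_mb: float, latency_us: float, bandwidth_gbps: float) -> float:
    """Link latency plus serialization time for one input, in microseconds."""
    bits = input_mb * 1e6 * 8  # assumption: 1 MB = 10^6 bytes
    return latency_us + bits / (bandwidth_gbps * 1e9) * 1e6

if __name__ == "__main__":
    for scenario, (lat, bw) in SCENARIOS.items():
        t = transfer_time_us(INPUT_MB["MobileNetV2"], lat, bw)
        print(f"{scenario:22s} MobileNetV2 input transfer ~ {t:.1f} us")
```

For a 0.57 MB input this gives about 286.5 µs, 190.8 µs, and 578 µs for the three chiplet scenarios, so serialization time dominates the fixed link latency at these bandwidths.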

Evaluation Metrics

  • Inference Latency: Single inference completion time
  • Throughput: GFLOP/s or images/s
  • Power Consumption: mW
  • Energy Efficiency: TOPS/W
  • Scalability: Batch size effects

Experimental Results

Primary Results

MobileNetV2 Benchmark (Batch Size = 1)

Architecture     | Latency (ms) | Throughput (imgs/s) | Power (mW)
Monolithic SoC   | 4.7 ± 0.2    | 213                 | 1284
Baseline Chiplet | 4.8 ± 0.2    | 208                 | 1026
AI-Optimized     | 4.1 ± 0.3    | 244                 | 860
Poor Integration | 6.2 ± 0.3    | 163                 | 1776

Performance Improvement Analysis

AI-optimized configuration compared to baseline chiplet implementation:

  • Latency Reduction: From 4.8ms to 4.1ms (≈14.7% reduction)
  • Throughput Improvement: From 208 images/s to 244 images/s (≈17.3% improvement)
  • Power Reduction: From 1026mW to 860mW (≈16.2% reduction)
  • Energy Efficiency Improvement: From 0.203 TOPS/W to 0.284 TOPS/W (≈40.1% improvement)
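
The percentages above can be re-derived from the MobileNetV2 table with a few lines of arithmetic. The sketch below does so; the dictionary keys and helper names are illustrative. Two observations follow from the numbers: the rounded table values give about 14.6% and 40.0% rather than 14.7% and 40.1%, presumably because the reported figures come from unrounded data, and the quoted 0.203 and 0.284 efficiency values coincide numerically with throughput divided by power in images/s per mW.

```python
BASELINE = {"latency_ms": 4.8, "throughput_ips": 208, "power_mw": 1026}
AI_OPT = {"latency_ms": 4.1, "throughput_ips": 244, "power_mw": 860}

def pct_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0

def images_per_s_per_mw(cfg) -> float:
    """Throughput divided by power, in images/s per mW."""
    return cfg["throughput_ips"] / cfg["power_mw"]

if __name__ == "__main__":
    for key in ("latency_ms", "throughput_ips", "power_mw"):
        print(f"{key:15s} {pct_change(BASELINE[key], AI_OPT[key]):+.1f}%")
    base_eff = images_per_s_per_mw(BASELINE)
    opt_eff = images_per_s_per_mw(AI_OPT)
    print(f"efficiency      {base_eff:.3f} -> {opt_eff:.3f} "
          f"({pct_change(base_eff, opt_eff):+.1f}%)")
```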

Cross-Workload Performance

  • Energy Efficiency Metric: ≈3.5 mJ per MobileNetV2 inference (860 mW / 244 images/s)
  • Real-time Capability: All tested workloads meet sub-5ms requirements
  • Batch Scaling: AI-optimized configuration maintains highest throughput across batch sizes 1-32
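
The same energy-per-inference calculation can be applied to every row of the MobileNetV2 table, since power in mW divided by throughput in inferences per second gives millijoules per inference. The sketch below assumes the tabulated power is the average power at batch size 1; the names are illustrative.

```python
# (throughput in images/s, power in mW), from the MobileNetV2 results table
RESULTS = {
    "Monolithic SoC": (213, 1284),
    "Baseline Chiplet": (208, 1026),
    "AI-Optimized": (244, 860),
    "Poor Integration": (163, 1776),
}

def energy_per_inference_mj(throughput_ips: float, power_mw: float) -> float:
    """mW divided by inferences/s equals mJ per inference."""
    return power_mw / throughput_ips

if __name__ == "__main__":
    for name, (ips, mw) in RESULTS.items():
        print(f"{name:17s} {energy_per_inference_mj(ips, mw):5.2f} mJ/inference")
```

On these figures the AI-optimized configuration needs about 3.5 mJ per inference, against roughly 4.9 mJ for the baseline chiplet, 6.0 mJ for the monolithic SoC, and 10.9 mJ for the poorly integrated design.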

Experimental Findings

  1. Architectural Advantages: Modular chiplet design achieves near-monolithic compute density
  2. Cost-Effectiveness: Achieves cost efficiency, scalability, and upgradeability while maintaining performance
  3. Real-time Guarantees: Consistent performance across all workloads
  4. Power Optimization: Significant power reduction without sacrificing performance

Primary Research Directions

  1. Edge AI Platforms: Supporting real-time inference for autonomous systems, industrial automation, medical applications
  2. Chiplet Technology: 2.5D integration technology enabling heterogeneous chip interconnection through silicon interposers
  3. AI Accelerators: 5nm AI inference accelerators achieving up to 95.6 TOPS/W efficiency
  4. Memory Technology: HBM3 providing up to 819 GB/s bandwidth alleviating external DRAM bottlenecks

Innovations in This Work

  1. System-Level Optimization: Comprehensive solution combining DVFS, UCIe optimization, distributed security, and thermal management
  2. Real-Time Performance: Focused on real-time inference requirements for edge AI
  3. Modular Design: Chiplet architecture balancing performance, cost, and upgradeability

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Chiplet-based RISC-V SoC architecture successfully addresses the performance-efficiency-cost trade-off for edge AI devices
  2. Significant Performance Gains: Integration of four key innovations achieves comprehensive improvements in performance, power, and efficiency
  3. Practical Value: Provides a viable solution for next-generation edge AI device applications

Limitations

  1. Simulation Verification: Results come from a Python simulator and have not been validated on hardware
  2. Workload Range: Testing limited to three specific AI workloads
  3. Cost Analysis: Lacks detailed manufacturing cost comparison analysis
  4. Long-term Reliability: Does not evaluate long-term operational reliability and stability

Future Directions

  1. Hardware Prototype: Develop actual hardware prototype for validation
  2. Extended Evaluation: Test performance across broader AI workloads
  3. Manufacturing Optimization: Research further optimization of chiplet manufacturing and integration
  4. Standardization: Promote development of chiplet interconnect and security standards

In-Depth Evaluation

Strengths

  1. Systematic Innovation: Proposes comprehensive solution integrating four key technical innovations, systematically addressing multiple critical issues in chiplet design
  2. Practical Orientation: Addresses actual edge AI requirements, focusing on real-time performance and power efficiency
  3. Quantitative Evaluation: Provides detailed performance data and comparative analysis with convincing results
  4. Technical Depth: Covers multiple levels from hardware architecture to system-level optimization

Weaknesses

  1. Verification Limitations: Relies solely on simulation, with no hardware implementation or testing
  2. Parameter Sources: Accuracy and representativeness of some simulation parameters may be questionable
  3. Insufficient Cost Analysis: Lacks detailed economic analysis and manufacturing cost comparison
  4. Security Verification: Actual effectiveness of distributed security framework insufficiently verified

Impact

  1. Academic Contribution: Provides important reference for chiplet architecture design in edge AI applications
  2. Technology Advancement: May promote development of UCIe protocol extensions and chiplet security standards
  3. Industrial Value: Provides practical solutions for chiplet technology development in semiconductor industry
  4. Research Direction: Provides foundational framework and evaluation methodology for subsequent related research

Applicable Scenarios

  1. Edge AI Devices: Applications requiring real-time AI inference such as autonomous driving, industrial automation, intelligent surveillance
  2. High-Performance Computing: Scenarios requiring modular, scalable computing capability
  3. Cost-Sensitive Applications: Commercial applications requiring performance-cost balance
  4. Prototype Development: Provides reference for further research and development of chiplet architectures

References

The paper cites 19 related references covering important works in edge AI, chiplet technology, DVFS, security architecture, and other relevant domains, providing solid theoretical foundation for the research.


Overall Assessment: This is a valuable research paper in the computer architecture domain, proposing an innovative chiplet architecture for edge AI applications. Although its validation is limited to simulation, its systematic technical innovations and detailed performance analysis make an important contribution to the field.