
CAPSim: A Fast CPU Performance Simulator Using Attention-based Predictor

Xu, Zhu, Zhang et al.

CPU simulators are vital for computer architecture research, primarily for estimating performance under different programs. This poses challenges for fast and accurate simulation of modern CPUs, especially in multi-core systems. Modern CPU performance simulators such as gem5 adopt a cycle-accurate, event-driven approach, which makes it time-consuming to simulate the extensive microarchitectural behavior of a real benchmark running on out-of-order CPUs. Recently, machine learning based approaches have been proposed to improve simulation speed, but they are currently limited to estimating the cycles of basic blocks rather than the complete benchmark program. This paper introduces a novel ML-based CPU simulator named CAPSim, which uses an attention-based neural network performance predictor and an instruction trace sampling method annotated with context. The attention mechanism effectively captures long-range influence within the instruction trace, emphasizing critical context information. This allows the model to improve performance prediction accuracy by focusing on important instructions. CAPSim can predict the execution time of unseen benchmarks significantly faster than an accurate O3 simulator built with gem5. Our evaluation on a commercial Intel Xeon CPU demonstrates that CAPSim achieves a 2.2-8.3x speedup compared to the gem5-built simulator, which is superior to the cutting-edge deep learning approach.

Basic Information

Paper ID: 2510.10484
Title: CAPSim: A Fast CPU Performance Simulator Using Attention-based Predictor
Authors: Buqing Xu, Jianfeng Zhu, Yichi Zhang, Qinyi Cai, Guanhua Li, Shaojun Wei, Leibo Liu
Classification: cs.PF (Performance)
Publication Date: October 12, 2025
Institution: School of Integrated Circuits, Tsinghua University
Paper Link: https://arxiv.org/abs/2510.10484v1

Abstract

CPU simulators are critical for computer architecture research, primarily used to evaluate the performance of different programs. Modern CPU performance simulators such as gem5 employ cycle-accurate and event-driven approaches, but they are prohibitively time-consuming when simulating complex microarchitectural behaviors of out-of-order CPUs on real benchmarks. This paper proposes CAPSim, a novel machine learning-driven CPU simulator built on an attention-based neural network performance predictor and an instruction trace sampling method with contextual annotations. The attention mechanism effectively captures long-range dependencies in instruction traces, emphasizing critical contextual information. Experimental results demonstrate that CAPSim achieves a 2.2-8.3× speedup compared to the O3 simulator built with gem5.

Research Background and Motivation

Core Problems

Performance Bottleneck of Traditional Simulators: Modern cycle-level simulators (such as gem5) are too slow when simulating complete benchmark programs, with primary causes including:
- Cycle-accurate simulation is inherently a serial process, difficult to parallelize
- Simulating modern out-of-order CPUs requires modeling all microarchitectural details, resulting in enormous computational overhead
Limitations of Existing ML Methods: Existing machine learning approaches (such as Ithemal and Granite) are limited to predicting basic block throughput and cannot handle performance prediction for complete programs
Accuracy-Speed Trade-off: Simulation speed must improve significantly while prediction accuracy is maintained

Research Significance

CPU simulators are critical tools for computer architecture research
With increasing CPU microarchitectural complexity and widespread adoption of multi-core systems, traditional simulation methods face severe efficiency challenges
Fast and accurate performance prediction is essential for hardware-software co-design and optimization

Core Contributions

Proposed Attention Mechanism-based CPU Performance Prediction Method: First application of attention mechanisms to instruction-level performance prediction, capable of capturing long-range dependencies between instructions, extending prediction capability from basic block level to complete program level
Designed Complete CAPSim Simulator Framework: Integrating a fast functional simulator with fine-grained code block performance predictor, achieving balance between speed and accuracy
Developed Accelerated Training Methods: Through clustering and sampling techniques, training datasets are partitioned into categories such as compute-intensive, memory-intensive, and control-intensive, significantly reducing training time and preventing overfitting
Achieved Significant Performance Improvements: Achieving maximum 8.3× speedup on SPEC2017 benchmarks with average 4.9× speedup, while maintaining acceptable prediction accuracy

Methodology Details

Task Definition

Input: Instruction trace sequence and CPU context information (register state)
Output: Execution time prediction for code fragments
Objective: Significantly improve the speed of performance evaluation for complete benchmark programs while ensuring prediction accuracy

Model Architecture

1. Overall Architecture Design

CAPSim employs an end-to-end architecture containing the following components:

AtomicSimple CPU Simulator: Rapidly generates instruction traces
Instruction Sequence Slicer: Partitions long instruction sequences into manageable code fragments
Sampler: Reduces training data volume and accelerates training process
Attention-based Performance Predictor: Core prediction module

2. Theoretical Foundation

The paper models total execution time as:

$T_{total} = \sum_{i=1}^{N} t_i \cdot \alpha_i$

where $t_i$ is the ideal execution time of the i-th instruction, $\alpha_i$ is its impact factor, and the sum runs over the $N$ instructions. By introducing vector representations and attention mechanisms, the final formulation becomes:

$T_{total} = \sum_{i=1}^{M} MLP(Attention(context_{M \times E}, T_E^T, T_E^T))$
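
The first formulation above, a sum of ideal instruction times scaled by impact factors, can be illustrated with a small worked example. This is only a sketch: the function name `total_time` and the numeric values are hypothetical and do not come from the paper.

```python
def total_time(ideal_times, impact_factors):
    """Sum each instruction's ideal time scaled by its impact factor."""
    if len(ideal_times) != len(impact_factors):
        raise ValueError("need exactly one impact factor per instruction")
    return sum(t * a for t, a in zip(ideal_times, impact_factors))


# Hypothetical three-instruction fragment (values chosen for illustration).
ideal = [1.0, 1.0, 4.0]   # ideal execution time of each instruction
alpha = [1.0, 2.0, 1.5]   # impact factor of each instruction
fragment_time = total_time(ideal, alpha)  # 1*1 + 1*2 + 4*1.5
```

In the attention-based formulation, the impact factors are not supplied by hand as they are here; the summary indicates they are learned from the instruction trace and its context.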

3. Performance Predictor Detailed Design

Normalization Transformation Layer: Converts raw assembly instructions into normalized token sequences, containing four segments:

<OPCODE>: Operation code
<DSTS>: Destination operands
<SRCS>: Source operands
<MEM>: Memory access information
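
A minimal sketch of how one instruction might be laid out in these four segments follows. The `Instruction` class, the `normalize` function, the exact token order, and the register names are assumptions made for illustration; the summary does not specify the paper's precise token format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instruction:
    """Pre-parsed instruction fields (hypothetical container)."""
    opcode: str
    dsts: List[str] = field(default_factory=list)
    srcs: List[str] = field(default_factory=list)
    mem: List[str] = field(default_factory=list)


def normalize(inst: Instruction) -> List[str]:
    """Flatten an instruction into a token list with segment markers."""
    tokens = ["<OPCODE>", inst.opcode]
    tokens += ["<DSTS>"] + inst.dsts
    tokens += ["<SRCS>"] + inst.srcs
    tokens += ["<MEM>"] + inst.mem
    return tokens


# Register-to-register add: the <MEM> segment is present but empty.
add_tokens = normalize(Instruction("add", dsts=["r3"], srcs=["r4", "r5"]))
```

Keeping every segment marker even when a segment is empty gives each instruction the same skeleton, which is one simple way to let a model tell operand roles apart.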

Context Information Construction: Constructs context matrix containing CPU state information, such as various registers shown in Table I:

Register Type	Quantity	Bit Width	Description
General Purpose Registers (GPR)	32	64	Primary storage registers
Vector Scalar Registers (VSR)	64	128	Floating-point operation registers
Condition Registers (CR)	1	32	Reflect operation results
Program Counter (CIA/NIA)	2	64	Instruction addresses

Multi-layer Attention Network:

Instruction Encoder: Applies self-attention mechanism to each instruction
Block Encoder: Processes dependencies between instruction sequences
MLP Layer: Final output of execution time prediction
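
The building block shared by both encoders is scaled dot-product attention as defined in "Attention is all you need". The sketch below shows that operation in plain Python for a single head without learned projections; it is not the paper's network, and the embedding values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        outputs.append([
            sum(w * v[col] for w, v in zip(weights, values))
            for col in range(len(values[0]))
        ])
    return outputs


# Hypothetical 2-dimensional embeddings for three instructions.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(embeddings, embeddings, embeddings)  # self-attention
```

Because every position attends to every other position directly, an instruction late in a fragment can draw on an early one without the information passing through each intermediate step, which is the property the summary contrasts with LSTM models.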

Technical Innovations

Long-range Dependency Modeling: Compared to sequence models such as LSTM, attention mechanisms better capture long-range dependencies between instructions
Context-aware Prediction: Incorporates CPU register state as contextual information, improving prediction accuracy
Hierarchical Attention Design: Dual-level attention mechanisms at instruction and block levels, considering both intra-instruction token relationships and inter-instruction dependencies
Parallelized Processing: Partitions long instruction sequences into small fragments, supporting GPU parallel processing and significantly improving inference speed

Experimental Setup

Dataset

Benchmark Suite: SPEC2017, containing 24 benchmark programs
Instruction Set Architecture: Power ISA
Interval Size: 5,000,000 instructions, warmup size 1,000,000 instructions
Code Fragment Length: 100-200 instructions
Total Checkpoints: 623
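
To give a sense of scale for these parameters, the sketch below cuts a trace into fixed-length fragments. The function `slice_trace` and the fixed-length rule are assumptions; the summary does not describe how the slicer actually chooses fragment boundaries within the 100-200 instruction range.

```python
def slice_trace(trace, max_len=200):
    """Cut a trace into consecutive fragments of at most max_len items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [trace[i:i + max_len] for i in range(0, len(trace), max_len)]


# Toy trace of 450 placeholder instructions.
fragments = slice_trace(list(range(450)), max_len=200)
lengths = [len(f) for f in fragments]  # two full fragments and a remainder

# Fragment counts for one 5,000,000-instruction interval.
interval = 5_000_000
fragments_at_200 = interval // 200
fragments_at_100 = interval // 100
```

Under this fixed-length assumption, one interval yields between 25,000 fragments (at 200 instructions each) and 50,000 fragments (at 100 each). Note that a naive cut can leave a trailing fragment shorter than 100 instructions, as in the toy trace here.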

Evaluation Metrics

Speed Metric: Speedup ratio relative to gem5 simulator
Accuracy Metric: Mean Absolute Percentage Error (MAPE)
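
For reference, MAPE is the mean of the absolute relative errors, expressed as a percentage. A minimal sketch, with a hypothetical function name and made-up cycle counts:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical ground-truth and predicted cycle counts for two fragments.
# Each prediction is off by 10% of its ground truth.
error_pct = mape([100.0, 200.0], [110.0, 180.0])
```

The metric is undefined when a ground-truth value is zero, so this sketch assumes every fragment has a nonzero measured time.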

Comparison Methods

Traditional Method: gem5 O3 superscalar processor simulator
ML Baseline: LSTM-based Ithemal model
Ablation Study: CAPSim variant without context information

Implementation Details

Hardware Platform: NVIDIA GeForce RTX 4090 (24GB), Intel Xeon CPU E5-2623 v4
Model Parameters: Embedding vector dimension 128, attention heads 4, encoder layers 4
Training Settings: SGD optimizer, learning rate 0.001, momentum 0.9
Sampling Parameters: Threshold 200, sampling coefficient 0.02
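
The training settings above name SGD with a learning rate of 0.001 and momentum of 0.9. The sketch below shows one common form of that update rule (velocity accumulates the gradient, then parameters step against the velocity). The summary does not state which framework or momentum variant the authors used, so this form is an assumption.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """Return updated (params, velocity) after one SGD-with-momentum step."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# One hypothetical parameter with a constant gradient of 2.0 for two steps.
params, velocity = [1.0], [0.0]
for _ in range(2):
    params, velocity = sgd_momentum_step(params, [2.0], velocity)
```

With a constant gradient the velocity grows from 2.0 to 3.8 across the two steps, so the second step moves the parameter further than the first.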

Experimental Results

Main Results

Speed Improvement:

Maximum speedup: 8.3× (510.parest benchmark)
Average speedup: 4.9×
Speedup effect correlates with checkpoint quantity, demonstrating GPU parallelization advantages

Accuracy Performance:

Improvement of 9.5%-21.2% compared to LSTM baseline, average improvement of 15.8%
Accuracy improvement of 1.3%-9.6% after incorporating context information, average improvement of 6.2%
Average MAPE of 12.0% on mixed training set

Ablation Studies

Attention Mechanism vs LSTM: Attention mechanism significantly outperforms LSTM when handling long code fragments
Context Information Impact: Context information plays a crucial role in improving prediction accuracy
Categorical Training Effect: Categorical training improves accuracy by 0.5% compared to mixed training

Generalization Ability Testing

Cross-benchmark Testing:

6×6 cross-validation experiments, 36 training-test combinations
Training set accuracy 91.3%, overall average accuracy 88.3%
Demonstrates good generalization capability to unseen benchmarks

Cross-architecture Parameter Testing: Accuracy performance under different microarchitecture parameter configurations:

Parameter Configuration	FetchWidth	IssueWidth	CommitWidth	ROBEntry	Error
Baseline Configuration	8	8	8	192	12.0%
Variant 1	4	8	8	192	12.2%
Variant 2	8	4	8	192	12.9%

Experimental Findings

Significant Parallelization Effect: GPU parallel processing shows obvious advantages over CPU serial simulation
Long-range Dependencies Important: Attention mechanisms effectively capture complex dependencies between instructions
Context Information Critical: CPU state information is essential for accurate execution time prediction
Categorical Training Effective: Categorical training based on program characteristics improves model generalization

Related Work

Traditional Simulators

Cycle-level Simulators: gem5, SimpleScalar, Sniper, etc., high accuracy but slow speed
Basic Block Level Tools: llvm-mca, uiCA, IACA, etc., fast speed but limited functionality

Machine Learning Methods

Regression Models: Using linear/nonlinear regression to predict CPI and power consumption
Deep Learning Methods:
- Ithemal: LSTM predicting basic block throughput
- Difftune: Optimizing llvm-mca parameters
- Granite: Graph neural network predicting basic block performance

Sampling Techniques

Statistical Sampling: SMARTS periodic sampling
Targeted Sampling: SimPoint sampling based on program behavior

Main advantages of this work compared to existing research:

First to achieve complete program-level performance prediction (rather than basic block level only)
Uses cycle-level simulator as ground truth (rather than simple compiler tools)
Attention mechanisms better model long-range dependencies

Conclusions and Discussion

Main Conclusions

Technical Feasibility: Attention mechanism-based methods can effectively predict CPU performance for complete programs
Performance Advantages: Significant speedup compared to traditional gem5 simulator (2.2-8.3×)
Accuracy Assurance: Maintains acceptable prediction accuracy while significantly improving speed
Generalization Capability: Model demonstrates good adaptability to unseen benchmarks and different architecture parameters

Limitations

Accuracy Trade-off: While speed is significantly improved, prediction accuracy still lags behind specialized cycle-level simulators (12% average error)
Architecture Dependency: Current implementation is based on Power ISA; extension to other instruction sets requires re-adaptation
Training Data Requirements: Requires substantial annotated data for training, with high initial cost
Complex Scenario Handling: Prediction capability may be limited for extremely complex program behaviors and microarchitectural characteristics

Future Directions

Multi-architecture Support: Extension to mainstream instruction set architectures such as x86 and ARM
Accuracy Improvement: Exploring more advanced attention mechanisms and context modeling methods
Multi-core Support: Extension to multi-core and heterogeneous system performance prediction
Online Learning: Support for runtime adaptive learning and model updates

In-depth Evaluation

Strengths

Technical Innovation:

First application of Transformer attention mechanisms to CPU performance prediction domain
Innovative combination of context information and instruction sequence modeling
Designed a complete end-to-end prediction framework

Experimental Comprehensiveness:

Comprehensive evaluation on standard SPEC2017 benchmarks
Includes detailed ablation studies and generalization ability testing
Comparison with multiple baseline methods

Result Convincingness:

Significant speed improvement (maximum 8.3× speedup)
Accuracy improvement compared to existing ML methods
Good cross-benchmark generalization capability

Writing Clarity:

Clear problem motivation articulation
Detailed method description including mathematical formulas
Complete experimental setup and result presentation

Weaknesses

Method Limitations:

Prediction accuracy still has room for improvement (12% average error)
Verified only on Power ISA, lacking multi-architecture verification
Handling capability for extremely complex scenarios not fully verified

Experimental Setup Defects:

Hardware platform comparison may not be entirely fair (GPU vs CPU)
Lack of comparison with more recent ML methods
Insufficient analysis of prediction effectiveness differences across different program types

Analysis Insufficiency:

Insufficient in-depth analysis of attention mechanism interpretability
Limited error case analysis
Insufficient computational resource consumption analysis

Impact

Contribution to the Field:

Provides new technical pathway for CPU performance prediction
Advances ML applications in computer architecture domain
Provides tools for rapid architecture design space exploration

Practical Value:

Significantly improves efficiency of large-scale benchmark evaluation
Provides rapid performance feedback for compiler optimization and hardware design
Reduces time cost of computer architecture research

Reproducibility:

Relatively detailed method description
Uses standard benchmark test suites
However, some implementation details and code are not publicly available

Applicable Scenarios

Architecture Design Space Exploration: Rapidly evaluate performance impact of different design parameters
Compiler Optimization: Provide rapid performance feedback for code optimization
Benchmark Test Acceleration: Significantly reduce runtime of standard benchmark tests
Teaching and Research: Provide efficient simulation tools for architecture courses and research

References

The paper cites 61 related references, primarily including:

Classical Simulators:

gem5: The gem5 simulator (Binkert et al.)
SimpleScalar, Sniper, Zesto and other traditional simulators

Machine Learning Methods:

Ithemal: Accurate, portable and fast basic block throughput estimation (Mendis et al.)
Granite: A graph neural network model for basic block throughput estimation (Sýkora et al.)

Attention Mechanisms:

Attention is all you need (Vaswani et al.)
Transformer-related research

Benchmarks:

SPEC CPU2017 benchmark suite

Overall Assessment: This is an innovative and practically valuable paper in the CPU performance prediction domain. The authors successfully introduce attention mechanisms into CPU performance prediction, achieving a breakthrough from basic block level to complete program-level prediction, and obtaining significant speed improvements. Although there is room for improvement in prediction accuracy and method generalization, this work provides valuable tools and insights for computer architecture research with good application prospects.