CPU simulators are vital for computer architecture research, primarily for estimating performance under different programs. Fast and accurate simulation of modern CPUs remains challenging, especially for multi-core systems. Modern CPU performance simulators such as gem5 adopt a cycle-accurate, event-driven approach, which makes it time-consuming to simulate the extensive microarchitectural behavior of a real benchmark running on out-of-order CPUs. Recently, machine learning based approaches have been proposed to improve simulation speed, but they are currently limited to estimating the cycles of basic blocks rather than complete benchmark programs. This paper introduces a novel ML-based CPU simulator named CAPSim, which uses an attention-based neural network performance predictor and an instruction trace sampling method annotated with context. The attention mechanism effectively captures long-range influence within the instruction trace, emphasizing critical context information. This allows the model to improve performance prediction accuracy by focusing on important instructions. CAPSim can predict the execution time of unseen benchmarks significantly faster than an accurate O3 simulator built with gem5. Our evaluation on a commercial Intel Xeon CPU demonstrates that CAPSim achieves a 2.2-8.3x speedup compared to the gem5-built simulator, which is superior to the cutting-edge deep learning approach.
- Paper ID: 2510.10484
- Title: CAPSim: A Fast CPU Performance Simulator Using Attention-based Predictor
- Authors: Buqing Xu, Jianfeng Zhu, Yichi Zhang, Qinyi Cai, Guanhua Li, Shaojun Wei, Leibo Liu
- Classification: cs.PF (Performance)
- Publication Date: October 12, 2025
- Institution: School of Integrated Circuits, Tsinghua University
- Paper Link: https://arxiv.org/abs/2510.10484v1
CPU simulators are critical for computer architecture research, primarily used to evaluate the performance of different programs. Modern CPU performance simulators such as gem5 employ cycle-accurate and event-driven approaches, but they are prohibitively time-consuming when simulating complex microarchitectural behaviors of out-of-order CPUs on real benchmarks. This paper proposes CAPSim, a novel machine learning-driven CPU simulator built on an attention-based neural network performance predictor and an instruction trace sampling method with contextual annotations. The attention mechanism effectively captures long-range dependencies in instruction traces, emphasizing critical contextual information. Experimental results demonstrate that CAPSim achieves 2.2-8.3× speedup compared to the O3 simulator built with gem5.
- Performance Bottleneck of Traditional Simulators: Modern cycle-level simulators (such as gem5) are too slow when simulating complete benchmark programs, with primary causes including:
  - Cycle-accurate simulation is inherently a serial process, difficult to parallelize
  - Simulating modern out-of-order CPUs requires modeling all microarchitectural details, resulting in enormous computational overhead
- Limitations of Existing ML Methods: Existing machine learning approaches (such as Ithemal and Granite) are limited to predicting basic block throughput and cannot handle performance prediction for complete programs
- Accuracy-Speed Trade-off: Need to significantly improve simulation speed while maintaining prediction accuracy
- CPU simulators are critical tools for computer architecture research
- With increasing CPU microarchitectural complexity and widespread adoption of multi-core systems, traditional simulation methods face severe efficiency challenges
- Fast and accurate performance prediction is essential for hardware-software co-design and optimization
- Proposed Attention Mechanism-based CPU Performance Prediction Method: First application of attention mechanisms to instruction-level performance prediction, capable of capturing long-range dependencies between instructions, extending prediction capability from basic block level to complete program level
- Designed Complete CAPSim Simulator Framework: Integrating a fast functional simulator with fine-grained code block performance predictor, achieving balance between speed and accuracy
- Developed Accelerated Training Methods: Through clustering and sampling techniques, training datasets are partitioned into categories such as compute-intensive, memory-intensive, and control-intensive, significantly reducing training time and preventing overfitting
- Achieved Significant Performance Improvements: Achieving maximum 8.3× speedup on SPEC2017 benchmarks with average 4.9× speedup, while maintaining acceptable prediction accuracy
Input: Instruction trace sequence and CPU context information (register state)
Output: Execution time prediction for code fragments
Objective: Significantly improve the speed of performance evaluation for complete benchmark programs while ensuring prediction accuracy
CAPSim employs an end-to-end architecture containing the following components:
- AtomicSimple CPU Simulator: Rapidly generates instruction traces
- Instruction Sequence Slicer: Partitions long instruction sequences into manageable code fragments
- Sampler: Reduces training data volume and accelerates training process
- Attention-based Performance Predictor: Core prediction module
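As a rough illustration of the slicing step listed above, the sketch below splits an instruction trace into consecutive fragments. The helper name `slice_trace` and the fixed fragment length of 128 are assumptions made for this example only; the summary reports fragment lengths of 100-200 instructions but does not describe the exact slicing rule CAPSim uses.

```python
from typing import List, Sequence


def slice_trace(trace: Sequence[str], fragment_len: int = 128) -> List[List[str]]:
    """Split an instruction trace into consecutive fragments.

    Hypothetical helper: a fixed-length split is assumed here, and the
    last fragment may be shorter than fragment_len.
    """
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    return [list(trace[i:i + fragment_len])
            for i in range(0, len(trace), fragment_len)]


# Toy trace of 300 placeholder instructions (not real Power ISA assembly).
trace = [f"insn_{i}" for i in range(300)]
fragments = slice_trace(trace, fragment_len=128)
print([len(f) for f in fragments])  # prints [128, 128, 44]
```

Because each fragment is independent of the others once sliced, fragments of this kind can be batched and scored in parallel, which is the property the summary credits for the GPU speedup.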
The paper models total execution time as:
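$$T_{total} = \sum_{i=1}^{N} t_i \cdot \alpha_i$$
where $t_i$ is the ideal execution time of the $i$-th instruction and $\alpha_i$ is the impact factor. By introducing vector representations and attention mechanisms, the final formulation becomes:
$$T_{total} = \sum_{i=1}^{M} \mathrm{MLP}\left(\mathrm{Attention}\left(\mathrm{context}_{M \times E},\, T_E^{T},\, T_E^{T}\right)\right)$$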
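To make the attention-based formulation above more concrete, here is a minimal pure-Python sketch. It assumes the standard scaled dot-product attention, with the context matrix acting as the query and the token embeddings acting as both key and value; that reading of the three arguments is an assumption, and `toy_head` is an untrained stand-in for the MLP, not the paper's network.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(key[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out


def toy_head(row: List[float]) -> float:
    """Stand-in for the MLP: sums the attended features (assumption)."""
    return sum(row)


def total_time(context: Matrix, tokens: Matrix) -> float:
    """Sum the per-row predictions over all M context rows."""
    attended = attention(context, tokens, tokens)
    return sum(toy_head(row) for row in attended)


# Toy inputs: M = 2 context rows, embedding size E = 2.
context = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 3.0], [2.0, 3.0]]
print(total_time(context, tokens))
```

Each output row of `attention` is a weighted average of the value rows, so a context row can draw on any instruction in the fragment regardless of distance, which is the long-range property the summary emphasizes.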
Normalization Transformation Layer:
Converts raw assembly instructions into normalized token sequences, containing four segments:
- `<OPCODE>`: Operation code
- `<DSTS>`: Destination operands
- `<SRCS>`: Source operands
- `<MEM>`: Memory access information
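The four-segment layout above can be sketched as follows. This assumes the instruction has already been parsed into an opcode and operand lists; the function name `normalize`, its parameters, and the choice to emit each segment marker even when the segment is empty are all illustrative assumptions rather than the paper's actual tokenizer.

```python
from typing import List, Sequence


def normalize(opcode: str,
              dsts: Sequence[str] = (),
              srcs: Sequence[str] = (),
              mem: Sequence[str] = ()) -> List[str]:
    """Flatten one pre-parsed instruction into a four-segment token list.

    Hypothetical helper: every segment marker is always emitted so that
    the segment order is the same for all instructions.
    """
    tokens = ["<OPCODE>", opcode]
    tokens += ["<DSTS>", *dsts]
    tokens += ["<SRCS>", *srcs]
    tokens += ["<MEM>", *mem]
    return tokens


# A register-to-register add with one destination and two sources.
print(normalize("add", dsts=["r3"], srcs=["r4", "r5"]))
# prints ['<OPCODE>', 'add', '<DSTS>', 'r3', '<SRCS>', 'r4', 'r5', '<MEM>']
```

Keeping the markers in a fixed order gives the instruction encoder a consistent positional structure to attend over, even when operand counts differ between instructions.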
Context Information Construction:
Constructs context matrix containing CPU state information, such as various registers shown in Table I:
| Register Type | Quantity | Bit Width | Description |
|---|---|---|---|
| General Purpose Registers (GPR) | 32 | 64 | Primary storage registers |
| Vector Scalar Registers (VSR) | 64 | 128 | Floating-point operation registers |
| Condition Registers (CR) | 1 | 32 | Reflect operation results |
| Program Counter (CIA/NIA) | 2 | 64 | Instruction addresses |
Multi-layer Attention Network:
- Instruction Encoder: Applies self-attention mechanism to each instruction
- Block Encoder: Processes dependencies between instruction sequences
- MLP Layer: Final output of execution time prediction
- Long-range Dependency Modeling: Compared to sequence models such as LSTM, attention mechanisms better capture long-range dependencies between instructions
- Context-aware Prediction: Incorporates CPU register state as contextual information, improving prediction accuracy
- Hierarchical Attention Design: Dual-level attention mechanisms at instruction and block levels, considering both intra-instruction token relationships and inter-instruction dependencies
- Parallelized Processing: Partitions long instruction sequences into small fragments, supporting GPU parallel processing and significantly improving inference speed
- Benchmark Suite: SPEC2017, containing 24 benchmark programs
- Instruction Set Architecture: Power ISA
- Interval Size: 5,000,000 instructions, warmup size 1,000,000 instructions
- Code Fragment Length: 100-200 instructions
- Total Checkpoints: 623
- Speed Metric: Speedup ratio relative to gem5 simulator
- Accuracy Metric: Mean Absolute Percentage Error (MAPE)
- Traditional Method: gem5 O3 superscalar processor simulator
- ML Baseline: LSTM-based Ithemal model
- Ablation Study: CAPSim variant without context information
- Hardware Platform: NVIDIA GeForce RTX 4090 (24GB), Intel Xeon CPU E5-2623 v4
- Model Parameters: Embedding vector dimension 128, attention heads 4, encoder layers 4
- Training Settings: SGD optimizer, learning rate 0.001, momentum 0.9
- Sampling Parameters: Threshold 200, sampling coefficient 0.02
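The two evaluation metrics named above, speedup ratio and MAPE, follow their usual definitions and can be written out as a short sketch. The function names and the toy numbers are assumptions chosen for illustration; none of the values below come from the paper.

```python
from typing import Sequence


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup ratio: how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent.

    Every actual value must be non-zero, otherwise the ratio is undefined.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy cycle counts: reference simulator versus predictor.
reference_cycles = [100.0, 200.0, 400.0]
predicted_cycles = [110.0, 180.0, 400.0]
print(round(mape(reference_cycles, predicted_cycles), 2))  # prints 6.67
print(speedup(120.0, 30.0))  # prints 4.0
```

MAPE weights every fragment equally regardless of its length, so a few badly predicted short fragments can dominate the score even if their effect on total execution time is small.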
Speed Improvement:
- Maximum speedup: 8.3× (510.parest benchmark)
- Average speedup: 4.9×
- Speedup effect correlates with checkpoint quantity, demonstrating GPU parallelization advantages
Accuracy Performance:
- Improvement of 9.5%-21.2% compared to LSTM baseline, average improvement of 15.8%
- Accuracy improvement of 1.3%-9.6% after incorporating context information, average improvement of 6.2%
- Average MAPE of 12.0% on mixed training set
- Attention Mechanism vs LSTM: Attention mechanism significantly outperforms LSTM when handling long code fragments
- Context Information Impact: Context information plays a crucial role in improving prediction accuracy
- Categorical Training Effect: Categorical training improves accuracy by 0.5% compared to mixed training
Cross-benchmark Testing:
- 6×6 cross-validation experiments, 36 training-test combinations
- Training set accuracy 91.3%, overall average accuracy 88.3%
- Demonstrates good generalization capability to unseen benchmarks
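One way to read the cross-benchmark numbers above is as summaries of a square accuracy matrix whose rows are training sets and whose columns are test sets. Under that reading, which is an assumption and not something the summary states, the diagonal holds the cases where the test set equals the training set. The sketch below computes both averages for a made-up 3×3 matrix; the values are illustrative and are not results from the paper.

```python
from typing import List, Tuple


def summarize(acc: List[List[float]]) -> Tuple[float, float]:
    """Return (mean of diagonal entries, mean of all entries) of a square matrix."""
    n = len(acc)
    if n == 0 or any(len(row) != n for row in acc):
        raise ValueError("acc must be a non-empty square matrix")
    diagonal = sum(acc[i][i] for i in range(n)) / n
    overall = sum(sum(row) for row in acc) / (n * n)
    return diagonal, overall


# Made-up accuracy matrix: acc[i][j] = accuracy when training on set i
# and testing on set j. A 6x6 matrix would give 36 combinations.
toy = [
    [0.90, 0.80, 0.70],
    [0.80, 0.90, 0.70],
    [0.70, 0.80, 0.90],
]
diag_mean, overall_mean = summarize(toy)
print(len(toy) ** 2, round(diag_mean, 2), round(overall_mean, 2))
```

The gap between the two averages is a simple indicator of how much accuracy is lost when the model is applied to benchmarks it was not trained on.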
Cross-architecture Parameter Testing:
Accuracy performance under different microarchitecture parameter configurations:
| Parameter Configuration | FetchWidth | IssueWidth | CommitWidth | ROBEntry | Error |
|---|---|---|---|---|---|
| Baseline Configuration | 8 | 8 | 8 | 192 | 12.0% |
| Variant 1 | 4 | 8 | 8 | 192 | 12.2% |
| Variant 2 | 8 | 4 | 8 | 192 | 12.9% |
- Significant Parallelization Effect: GPU parallel processing shows obvious advantages over CPU serial simulation
- Long-range Dependencies Important: Attention mechanisms effectively capture complex dependencies between instructions
- Context Information Critical: CPU state information is essential for accurate execution time prediction
- Categorical Training Effective: Categorical training based on program characteristics improves model generalization
- Cycle-level Simulators: gem5, SimpleScalar, Sniper, etc., high accuracy but slow speed
- Basic Block Level Tools: llvm-mca, uiCA, IACA, etc., fast speed but limited functionality
- Regression Models: Using linear/nonlinear regression to predict CPI and power consumption
- Deep Learning Methods:
  - Ithemal: LSTM predicting basic block throughput
  - DiffTune: Optimizing llvm-mca parameters
  - Granite: Graph neural network predicting basic block performance
- Statistical Sampling: SMARTS periodic sampling
- Targeted Sampling: SimPoint sampling based on program behavior
Main advantages of this work compared to existing research:
- First to achieve complete program-level performance prediction (rather than basic block level only)
- Uses cycle-level simulator as ground truth (rather than simple compiler tools)
- Attention mechanisms better model long-range dependencies
- Technical Feasibility: Attention mechanism-based methods can effectively predict CPU performance for complete programs
- Performance Advantages: Significant speedup compared to traditional gem5 simulator (2.2-8.3×)
- Accuracy Assurance: Maintains acceptable prediction accuracy while significantly improving speed
- Generalization Capability: Model demonstrates good adaptability to unseen benchmarks and different architecture parameters
- Accuracy Trade-off: While speed is significantly improved, prediction accuracy still lags behind specialized cycle-level simulators (12% average error)
- Architecture Dependency: Current implementation is based on Power ISA; extension to other instruction sets requires re-adaptation
- Training Data Requirements: Requires substantial annotated data for training, with high initial cost
- Complex Scenario Handling: Prediction capability may be limited for extremely complex program behaviors and microarchitectural characteristics
- Multi-architecture Support: Extension to mainstream instruction set architectures such as x86 and ARM
- Accuracy Improvement: Exploring more advanced attention mechanisms and context modeling methods
- Multi-core Support: Extension to multi-core and heterogeneous system performance prediction
- Online Learning: Support for runtime adaptive learning and model updates
Technical Innovation:
- First application of Transformer attention mechanisms to the CPU performance prediction domain
- Innovative combination of context information and instruction sequence modeling
- Designed a complete end-to-end prediction framework
Experimental Comprehensiveness:
- Comprehensive evaluation on standard SPEC2017 benchmarks
- Includes detailed ablation studies and generalization ability testing
- Comparison with multiple baseline methods
Persuasiveness of Results:
- Significant speed improvement (maximum 8.3× speedup)
- Accuracy improvement compared to existing ML methods
- Good cross-benchmark generalization capability
Writing Clarity:
- Clear problem motivation articulation
- Detailed method description including mathematical formulas
- Complete experimental setup and result presentation
Method Limitations:
- Prediction accuracy still has room for improvement (12% average error)
- Verified only on Power ISA, lacking multi-architecture verification
- Handling capability for extremely complex scenarios not fully verified
Experimental Setup Weaknesses:
- Hardware platform comparison may not be entirely fair (GPU vs CPU)
- Lack of comparison with more recent ML methods
- Insufficient analysis of prediction effectiveness differences across different program types
Insufficient Analysis:
- Insufficient in-depth analysis of attention mechanism interpretability
- Limited error case analysis
- Insufficient computational resource consumption analysis
Contribution to the Field:
- Provides new technical pathway for CPU performance prediction
- Advances ML applications in computer architecture domain
- Provides tools for rapid architecture design space exploration
Practical Value:
- Significantly improves efficiency of large-scale benchmark evaluation
- Provides rapid performance feedback for compiler optimization and hardware design
- Reduces time cost of computer architecture research
Reproducibility:
- Relatively detailed method description
- Uses standard benchmark test suites
- However, some implementation details and code are not publicly available
- Architecture Design Space Exploration: Rapidly evaluate performance impact of different design parameters
- Compiler Optimization: Provide rapid performance feedback for code optimization
- Benchmark Test Acceleration: Significantly reduce runtime of standard benchmark tests
- Teaching and Research: Provide efficient simulation tools for architecture courses and research
The paper cites 61 related references, primarily including:
Classical Simulators:
- gem5: The gem5 simulator (Binkert et al.)
- SimpleScalar, Sniper, Zesto and other traditional simulators
Machine Learning Methods:
- Ithemal: Accurate, portable and fast basic block throughput estimation (Mendis et al.)
- Granite: A graph neural network model for basic block throughput estimation (Sýkora et al.)
Attention Mechanisms:
- Attention is all you need (Vaswani et al.)
- Transformer-related research
Benchmarks:
- SPEC CPU2017 benchmark suite
Overall Assessment: This is an innovative and practically valuable paper in the CPU performance prediction domain. The authors successfully introduce attention mechanisms into CPU performance prediction, achieving a breakthrough from basic block level to complete program-level prediction, and obtaining significant speed improvements. Although there is room for improvement in prediction accuracy and method generalization, this work provides valuable tools and insights for computer architecture research with good application prospects.