CAPSim: A Fast CPU Performance Simulator Using Attention-based Predictor

Xu, Zhu, Zhang et al.
CPU simulators are vital for computer architecture research, primarily for estimating performance under different programs. Simulating modern CPUs quickly and accurately is challenging, especially for multi-core systems. Modern CPU performance simulators such as gem5 adopt a cycle-accurate, event-driven approach, which makes it time-consuming to simulate the extensive microarchitectural behavior of a real benchmark running on out-of-order CPUs. Recently, machine learning based approaches have been proposed to improve simulation speed, but they are currently limited to estimating the cycles of basic blocks rather than complete benchmark programs. This paper introduces a novel ML-based CPU simulator named CAPSim, which uses an attention-based neural network performance predictor and an instruction trace sampling method annotated with context. The attention mechanism effectively captures long-range influence within the instruction trace, emphasizing critical context information. This allows the model to improve performance prediction accuracy by focusing on important instructions. CAPSim can predict the execution time of unseen benchmarks significantly faster than an accurate O3 simulator built with gem5. Our evaluation on a commercial Intel Xeon CPU demonstrates that CAPSim achieves a 2.2-8.3x speedup compared to the gem5-built simulator, which is superior to the cutting-edge deep learning approach.

Basic Information

  • Paper ID: 2510.10484
  • Title: CAPSim: A Fast CPU Performance Simulator Using Attention-based Predictor
  • Authors: Buqing Xu, Jianfeng Zhu, Yichi Zhang, Qinyi Cai, Guanhua Li, Shaojun Wei, Leibo Liu
  • Classification: cs.PF (Performance)
  • Publication Date: October 12, 2025
  • Institution: School of Integrated Circuits, Tsinghua University
  • Paper Link: https://arxiv.org/abs/2510.10484v1

Abstract

CPU simulators are critical for computer architecture research, where they are primarily used to evaluate the performance of different programs. Modern CPU performance simulators such as gem5 employ cycle-accurate, event-driven approaches, but they are prohibitively time-consuming when simulating the complex microarchitectural behavior of out-of-order CPUs on real benchmarks. This paper proposes CAPSim, a novel machine learning-driven CPU simulator built on an attention-based neural network performance predictor and an instruction trace sampling method with contextual annotations. The attention mechanism effectively captures long-range dependencies in instruction traces, emphasizing critical contextual information. Experimental results demonstrate that CAPSim achieves a 2.2-8.3× speedup compared to the O3 simulator built with gem5.

Research Background and Motivation

Core Problems

  1. Performance Bottleneck of Traditional Simulators: Modern cycle-level simulators (such as gem5) are too slow when simulating complete benchmark programs, with primary causes including:
    • Cycle-accurate simulation is inherently a serial process, difficult to parallelize
    • Simulating modern out-of-order CPUs requires modeling all microarchitectural details, resulting in enormous computational overhead
  2. Limitations of Existing ML Methods: Existing machine learning approaches (such as Ithemal and Granite) are limited to predicting basic block throughput and cannot handle performance prediction for complete programs
  3. Accuracy-Speed Trade-off: Simulation speed must improve significantly while prediction accuracy is maintained

Research Significance

  • CPU simulators are critical tools for computer architecture research
  • With increasing CPU microarchitectural complexity and widespread adoption of multi-core systems, traditional simulation methods face severe efficiency challenges
  • Fast and accurate performance prediction is essential for hardware-software co-design and optimization

Core Contributions

  1. Proposed Attention Mechanism-based CPU Performance Prediction Method: First application of attention mechanisms to instruction-level performance prediction, capable of capturing long-range dependencies between instructions, extending prediction capability from basic block level to complete program level
  2. Designed Complete CAPSim Simulator Framework: Integrating a fast functional simulator with fine-grained code block performance predictor, achieving balance between speed and accuracy
  3. Developed Accelerated Training Methods: Through clustering and sampling techniques, training datasets are partitioned into categories such as compute-intensive, memory-intensive, and control-intensive, significantly reducing training time and preventing overfitting
  4. Achieved Significant Performance Improvements: Achieving maximum 8.3× speedup on SPEC2017 benchmarks with average 4.9× speedup, while maintaining acceptable prediction accuracy

Methodology Details

Task Definition

  • Input: Instruction trace sequence and CPU context information (register state)
  • Output: Execution time prediction for code fragments
  • Objective: Significantly improve the speed of performance evaluation for complete benchmark programs while ensuring prediction accuracy

Model Architecture

1. Overall Architecture Design

CAPSim employs an end-to-end architecture containing the following components:

  • AtomicSimple CPU Simulator: Rapidly generates instruction traces
  • Instruction Sequence Slicer: Partitions long instruction sequences into manageable code fragments
  • Sampler: Reduces training data volume and accelerates training process
  • Attention-based Performance Predictor: Core prediction module

2. Theoretical Foundation

The paper models total execution time as:

$$T_{total} = \sum_{i=1}^{N} t_i \cdot \alpha_i$$

where $t_i$ is the ideal execution time of the i-th instruction and $\alpha_i$ is the impact factor. Through introducing vector representations and attention mechanisms, the final formulation becomes:

$$T_{total} = \sum_{i=1}^{M} \mathrm{MLP}\left(\mathrm{Attention}\left(\mathrm{context}_{M \times E},\ T_E^T,\ T_E^T\right)\right)$$
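
As a small worked example of the first formula (the weighted sum of ideal instruction times), the sketch below evaluates it on a toy trace. The latencies and impact factors are made-up numbers, and reading an impact factor above 1 as a slowdown and below 1 as overlap with neighbouring instructions is an illustrative assumption, not something stated in the summary.

```python
def total_time(ideal_times, impact_factors):
    """Sum t_i * alpha_i over all instructions in the trace."""
    if len(ideal_times) != len(impact_factors):
        raise ValueError("need one impact factor per instruction")
    return sum(t * a for t, a in zip(ideal_times, impact_factors))

# Hypothetical 4-instruction trace: ideal latencies in cycles (not from the paper).
ideal = [1.0, 1.0, 4.0, 1.0]
# Assumed reading: alpha > 1 is a slowdown (e.g. a cache miss),
# alpha < 1 is overlap with other instructions in an out-of-order core.
alpha = [1.0, 0.5, 3.0, 0.5]

print(total_time(ideal, alpha))  # 14.0
```

The attention-based formulation replaces the hand-picked impact factors with values learned from the instruction tokens and the CPU context.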

3. Performance Predictor Detailed Design

Normalization Transformation Layer: Converts raw assembly instructions into normalized token sequences, containing four segments:

  • <OPCODE>: Operation code
  • <DSTS>: Destination operands
  • <SRCS>: Source operands
  • <MEM>: Memory access information
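
The four-segment layout can be illustrated with a minimal sketch. It assumes the instruction has already been decoded into opcode and operand lists; the function name `normalize`, the register spellings, and the exact token order within each segment are hypothetical and may differ from the paper's vocabulary.

```python
def normalize(opcode, dsts=(), srcs=(), mem=()):
    """Build a token sequence with <OPCODE>, <DSTS>, <SRCS>, <MEM> segments.

    Every segment marker is always emitted, so the model sees a fixed
    layout even when a segment is empty (e.g. no memory access).
    """
    tokens = ["<OPCODE>", opcode]
    tokens += ["<DSTS>", *dsts]
    tokens += ["<SRCS>", *srcs]
    tokens += ["<MEM>", *mem]
    return tokens

# Register-register add: no memory segment content.
print(normalize("add", dsts=["r3"], srcs=["r4", "r5"]))
# Load with base register and displacement (hypothetical encoding of the access).
print(normalize("ld", dsts=["r3"], srcs=["r4"], mem=["r4", "16"]))
```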

Context Information Construction: Constructs context matrix containing CPU state information, such as various registers shown in Table I:

| Register Type | Quantity | Bit Width | Description |
|---|---|---|---|
| General Purpose Registers (GPR) | 32 | 64 | Primary storage registers |
| Vector Scalar Registers (VSR) | 64 | 128 | Floating-point operation registers |
| Condition Registers (CR) | 1 | 32 | Reflect operation results |
| Program Counter (CIA/NIA) | 2 | 64 | Instruction addresses |
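
To get a feel for how much raw state the context covers, the quantities and bit widths from Table I can be multiplied out. This is only arithmetic on the table; how the paper packs these registers into the context matrix is not described here, so the dictionary layout below is an assumption.

```python
# (quantity, bit width) per register class, taken from Table I.
REGISTER_FILE = {
    "GPR": (32, 64),
    "VSR": (64, 128),
    "CR": (1, 32),
    "CIA/NIA": (2, 64),
}

def context_bits(registers):
    """Total number of raw state bits across all register classes."""
    return sum(count * width for count, width in registers.values())

total_bits = context_bits(REGISTER_FILE)
print(total_bits)       # 10400
print(total_bits // 8)  # 1300 bytes
```

The vector scalar registers dominate: 64 × 128 = 8192 of the 10400 bits.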

Multi-layer Attention Network:

  • Instruction Encoder: Applies self-attention mechanism to each instruction
  • Block Encoder: Processes dependencies between instruction sequences
  • MLP Layer: Final output of execution time prediction
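
The three stages can be sketched in plain Python. This is a simplified illustration, not the paper's network: it uses single-head scaled dot-product attention without learned projections, mean pooling between the levels, and a single linear layer standing in for the MLP. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def encode_block(block):
    """block: list of instructions; each instruction is a list of token embeddings."""
    # Instruction encoder: self-attention over the tokens of one instruction,
    # pooled into a single vector per instruction.
    instr_vecs = [mean_pool(attention(tokens, tokens, tokens)) for tokens in block]
    # Block encoder: self-attention across the instruction vectors.
    return attention(instr_vecs, instr_vecs, instr_vecs)

def predict_time(block, weights, bias=0.0):
    """Stand-in for the MLP: one linear layer over the pooled block vector."""
    pooled = mean_pool(encode_block(block))
    return sum(w * x for w, x in zip(weights, pooled)) + bias

# Toy block: 3 instructions, 2 tokens each, embedding dimension 4.
toy_block = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
]
encoded = encode_block(toy_block)
print(len(encoded), len(encoded[0]))  # 3 4
```

Because every instruction attends to every other instruction in the block, a dependency between the first and last instruction costs no more to represent than one between neighbours, which is the property the summary credits for the long-range modeling.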

Technical Innovations

  1. Long-range Dependency Modeling: Compared to sequence models such as LSTM, attention mechanisms better capture long-range dependencies between instructions
  2. Context-aware Prediction: Incorporates CPU register state as contextual information, improving prediction accuracy
  3. Hierarchical Attention Design: Dual-level attention mechanisms at instruction and block levels, considering both intra-instruction token relationships and inter-instruction dependencies
  4. Parallelized Processing: Partitions long instruction sequences into small fragments, supporting GPU parallel processing and significantly improving inference speed

Experimental Setup

Dataset

  • Benchmark Suite: SPEC2017, containing 24 benchmark programs
  • Instruction Set Architecture: Power ISA
  • Interval Size: 5,000,000 instructions, warmup size 1,000,000 instructions
  • Code Fragment Length: 100-200 instructions
  • Total Checkpoints: 623
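
The interval and fragment sizes imply a fragment count per interval, which the sketch below works out. It uses plain fixed-length slicing as a stand-in; the paper's actual rule for choosing fragment boundaries within the 100-200 instruction range is not given in this summary, and the function names are hypothetical.

```python
def slice_trace(trace, fragment_len):
    """Split an instruction trace into consecutive fragments of at most fragment_len."""
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    return [trace[i:i + fragment_len] for i in range(0, len(trace), fragment_len)]

def fragments_per_interval(interval_size, fragment_len):
    """Number of fragments needed to cover one interval (ceiling division)."""
    return -(-interval_size // fragment_len)

INTERVAL = 5_000_000  # instructions per interval, from the setup above
print(fragments_per_interval(INTERVAL, 200))  # 25000
print(fragments_per_interval(INTERVAL, 100))  # 50000

# Toy trace to show the slicing itself.
toy_trace = [f"insn_{i}" for i in range(450)]
fragments = slice_trace(toy_trace, 200)
print([len(f) for f in fragments])  # [200, 200, 50]
```

Each interval therefore yields roughly 25,000 to 50,000 independent fragments, which is what makes batched GPU inference attractive. A real slicer would also need a policy for the short tail fragment shown in the toy example.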

Evaluation Metrics

  • Speed Metric: Speedup ratio relative to gem5 simulator
  • Accuracy Metric: Mean Absolute Percentage Error (MAPE)
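
Both metrics are simple to compute. The sketch below uses the standard MAPE definition and a plain time ratio for speedup; the cycle counts and wall-clock times are invented for illustration and are not results from the paper.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# Hypothetical cycle counts for three code fragments.
true_cycles = [100.0, 200.0, 400.0]
pred_cycles = [110.0, 180.0, 400.0]
print(round(mape(true_cycles, pred_cycles), 2))  # 6.67

# Hypothetical wall-clock times: cycle-level simulation vs. ML prediction.
print(speedup(830.0, 100.0))  # 8.3
```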

Comparison Methods

  • Traditional Method: gem5 O3 superscalar processor simulator
  • ML Baseline: LSTM-based Ithemal model
  • Ablation Study: CAPSim variant without context information

Implementation Details

  • Hardware Platform: NVIDIA GeForce RTX 4090 (24GB), Intel Xeon CPU E5-2623 v4
  • Model Parameters: Embedding vector dimension 128, attention heads 4, encoder layers 4
  • Training Settings: SGD optimizer, learning rate 0.001, momentum 0.9
  • Sampling Parameters: Threshold 200, sampling coefficient 0.02
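
For readers unfamiliar with the optimizer settings, the sketch below shows one common formulation of SGD with momentum using the listed learning rate and momentum. Which framework and exact update variant the authors used is not stated in this summary, so treat the update rule as an assumption.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update.

    velocity <- momentum * velocity + grad
    param    <- param - lr * velocity
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy parameter vector with a constant gradient over two steps.
params, velocity = [1.0, -2.0], [0.0, 0.0]
grads = [2.0, -4.0]
for _ in range(2):
    params, velocity = sgd_momentum_step(params, grads, velocity)
print(velocity)
print(params)
```

With a constant gradient the velocity grows from step to step (2.0, then 0.9 × 2.0 + 2.0 for the first component), so later steps move further than the first one.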

Experimental Results

Main Results

Speed Improvement:

  • Maximum speedup: 8.3× (510.parest benchmark)
  • Average speedup: 4.9×
  • Speedup correlates with the number of checkpoints per benchmark, reflecting the benefit of GPU parallelization

Accuracy Performance:

  • Accuracy improvement of 9.5%-21.2% over the LSTM baseline (Ithemal), 15.8% on average
  • Accuracy improvement of 1.3%-9.6% after incorporating context information, 6.2% on average
  • Average MAPE of 12.0% on mixed training set

Ablation Studies

  1. Attention Mechanism vs LSTM: Attention mechanism significantly outperforms LSTM when handling long code fragments
  2. Context Information Impact: Context information plays a crucial role in improving prediction accuracy
  3. Categorical Training Effect: Categorical training improves accuracy by 0.5% compared to mixed training

Generalization Ability Testing

Cross-benchmark Testing:

  • 6×6 cross-validation experiments, 36 training-test combinations
  • Training set accuracy 91.3%, overall average accuracy 88.3%
  • Demonstrates good generalization capability to unseen benchmarks
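
The 6×6 layout can be enumerated directly. The sketch below only counts the training-test combinations; the six set names are placeholders because the summary does not list them, and it makes no claim about which cells the reported accuracies average over.

```python
from itertools import product

# Placeholder names: the six benchmark sets are not listed in this summary.
benchmark_sets = [f"set_{i}" for i in range(1, 7)]

# Every ordered (train, test) pair in the 6x6 grid.
pairs = list(product(benchmark_sets, repeat=2))
same = [(train, test) for train, test in pairs if train == test]
cross = [(train, test) for train, test in pairs if train != test]

print(len(pairs), len(same), len(cross))  # 36 6 30
```

Of the 36 combinations, the 30 off-diagonal cells are the ones where the test set was not seen during training.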

Cross-architecture Parameter Testing: Accuracy performance under different microarchitecture parameter configurations:

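| Parameter Configuration | FetchWidth | IssueWidth | CommitWidth | ROBEntry | Error |
|---|---|---|---|---|---|
| Baseline Configuration | 8 | 8 | 8 | 192 | 12.0% |
| Variant 1 | 4 | 8 | 8 | 192 | 12.2% |
| Variant 2 | 8 | 4 | 8 | 192 | 12.9% |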

Experimental Findings

  1. Significant Parallelization Effect: GPU parallel processing shows obvious advantages over CPU serial simulation
  2. Long-range Dependencies Important: Attention mechanisms effectively capture complex dependencies between instructions
  3. Context Information Critical: CPU state information is essential for accurate execution time prediction
  4. Categorical Training Effective: Categorical training based on program characteristics improves model generalization

Related Work

Traditional Simulators

  • Cycle-level Simulators: gem5, SimpleScalar, Sniper, etc., high accuracy but slow speed
  • Basic Block Level Tools: llvm-mca, uiCA, IACA, etc., fast speed but limited functionality

Machine Learning Methods

  • Regression Models: Using linear/nonlinear regression to predict CPI and power consumption
  • Deep Learning Methods:
    • Ithemal: LSTM predicting basic block throughput
    • Difftune: Optimizing llvm-mca parameters
    • Granite: Graph neural network predicting basic block performance

Sampling Techniques

  • Statistical Sampling: SMARTS periodic sampling
  • Targeted Sampling: SimPoint sampling based on program behavior

Main advantages of this work compared to existing research:

  1. First to achieve complete program-level performance prediction (rather than basic block level only)
  2. Uses cycle-level simulator as ground truth (rather than simple compiler tools)
  3. Attention mechanisms better model long-range dependencies

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Attention mechanism-based methods can effectively predict CPU performance for complete programs
  2. Performance Advantages: Significant speedup compared to traditional gem5 simulator (2.2-8.3×)
  3. Accuracy Assurance: Maintains acceptable prediction accuracy while significantly improving speed
  4. Generalization Capability: Model demonstrates good adaptability to unseen benchmarks and different architecture parameters

Limitations

  1. Accuracy Trade-off: While speed is significantly improved, prediction accuracy still lags behind specialized cycle-level simulators (12% average error)
  2. Architecture Dependency: Current implementation is based on Power ISA; extension to other instruction sets requires re-adaptation
  3. Training Data Requirements: Requires substantial annotated data for training, with high initial cost
  4. Complex Scenario Handling: Prediction capability may be limited for extremely complex program behaviors and microarchitectural characteristics

Future Directions

  1. Multi-architecture Support: Extension to mainstream instruction set architectures such as x86 and ARM
  2. Accuracy Improvement: Exploring more advanced attention mechanisms and context modeling methods
  3. Multi-core Support: Extension to multi-core and heterogeneous system performance prediction
  4. Online Learning: Support for runtime adaptive learning and model updates

In-depth Evaluation

Strengths

Technical Innovation:

  1. First application of Transformer attention mechanisms to CPU performance prediction domain
  2. Innovative combination of context information and instruction sequence modeling
  3. Designed a complete end-to-end prediction framework

Experimental Comprehensiveness:

  1. Comprehensive evaluation on standard SPEC2017 benchmarks
  2. Includes detailed ablation studies and generalization ability testing
  3. Comparison with multiple baseline methods

Result Convincingness:

  1. Significant speed improvement (maximum 8.3× speedup)
  2. Accuracy improvement compared to existing ML methods
  3. Good cross-benchmark generalization capability

Writing Clarity:

  1. Clear problem motivation articulation
  2. Detailed method description including mathematical formulas
  3. Complete experimental setup and result presentation

Weaknesses

Method Limitations:

  1. Prediction accuracy still has room for improvement (12% average error)
  2. Verified only on Power ISA, lacking multi-architecture verification
  3. Handling capability for extremely complex scenarios not fully verified

Experimental Setup Defects:

  1. Hardware platform comparison may not be entirely fair (GPU vs CPU)
  2. Lack of comparison with more recent ML methods
  3. Insufficient analysis of prediction effectiveness differences across different program types

Analysis Insufficiency:

  1. Insufficient in-depth analysis of attention mechanism interpretability
  2. Limited error case analysis
  3. Insufficient computational resource consumption analysis

Impact

Contribution to the Field:

  1. Provides new technical pathway for CPU performance prediction
  2. Advances ML applications in computer architecture domain
  3. Provides tools for rapid architecture design space exploration

Practical Value:

  1. Significantly improves efficiency of large-scale benchmark evaluation
  2. Provides rapid performance feedback for compiler optimization and hardware design
  3. Reduces time cost of computer architecture research

Reproducibility:

  1. Relatively detailed method description
  2. Uses standard benchmark test suites
  3. However, some implementation details and code are not publicly available

Applicable Scenarios

  1. Architecture Design Space Exploration: Rapidly evaluate performance impact of different design parameters
  2. Compiler Optimization: Provide rapid performance feedback for code optimization
  3. Benchmark Test Acceleration: Significantly reduce runtime of standard benchmark tests
  4. Teaching and Research: Provide efficient simulation tools for architecture courses and research

References

The paper cites 61 related references, primarily including:

Classical Simulators:

  • gem5: The gem5 simulator (Binkert et al.)
  • SimpleScalar, Sniper, Zesto and other traditional simulators

Machine Learning Methods:

  • Ithemal: Accurate, portable and fast basic block throughput estimation (Mendis et al.)
  • Granite: A graph neural network model for basic block throughput estimation (Sýkora et al.)

Attention Mechanisms:

  • Attention is all you need (Vaswani et al.)
  • Transformer-related research

Benchmarks:

  • SPEC CPU2017 benchmark suite

Overall Assessment: This is an innovative and practically valuable paper in the CPU performance prediction domain. The authors successfully introduce attention mechanisms into CPU performance prediction, achieving a breakthrough from basic block level to complete program-level prediction, and obtaining significant speed improvements. Although there is room for improvement in prediction accuracy and method generalization, this work provides valuable tools and insights for computer architecture research with good application prospects.