Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators

Hong, Bhatia, Cheung et al.
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages, such as specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.

Basic Information

  • Paper ID: 2505.18574
  • Title: Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
  • Authors: Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao (UC Berkeley)
  • Classification: cs.PL cs.AI cs.AR cs.LG
  • Publication Status: Preprint. Under review.
  • Paper Link: https://arxiv.org/abs/2505.18574

Abstract

Hardware accelerators, particularly those designed specifically for tensor processing, have become ubiquitous in contemporary computing environments. However, despite substantial efforts in compiler development, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. This paper proposes Autocomp, a method for optimizing code through automated LLM-driven search, enabling accelerator programmers to leverage domain knowledge and hardware feedback. The approach is realized through three key techniques: 1) formulating each optimization process as a structured two-stage prompt, divided into planning and code generation phases; 2) injecting domain knowledge during the planning phase through a concise and adaptable optimization menu; 3) integrating correctness and performance metrics from hardware as feedback in each search iteration.

Research Background and Motivation

Core Problems

The primary challenges in tensor accelerator programming include:

  1. Programming Complexity: Unlike general-purpose CPU programming, tensor accelerators require explicit management of data movement, state configuration, and operation scheduling
  2. Compiler Adaptation Cost: Adapting traditional compilers for new hardware platforms requires substantial engineering effort, with software development costs accounting for 40-50% of new hardware development costs
  3. Optimization Scheduling Problem: Combinatorial explosion in determining which optimizations to apply and in what order
  4. Low-Resource Language Challenge: Instruction set architectures (ISAs) and domain-specific languages (DSLs) for specialized accelerators are underrepresented in LLM training corpora

Limitations of Existing Approaches

  1. Traditional Compilers: XLA, TVM, Triton, etc., support only a limited number of hardware backends, primarily CPUs and GPUs
  2. DSL Approaches: Halide, Exo, and similar tools provide primitives for expressing tensor computations, but optimization burden remains on programmers
  3. Data-Driven Methods: Require substantial performance data for training, which is extremely scarce for domain-specific hardware accelerators
  4. Direct LLM Application: Zero-shot code generation is highly unreliable for low-resource accelerator languages

Core Contributions

  1. First LLM-Driven Low-Resource Tensor Accelerator Code Optimization Method: Proposes the Autocomp framework specifically designed for specialized hardware accelerators
  2. Highly Portable Optimization Framework: Enables adaptation to new hardware platforms through prompt modification, significantly reducing engineering costs
  3. Superior Performance: Substantially outperforms existing methods across three different hardware platforms
  4. Schedule Reuse Mechanism: Demonstrates that optimization schedules can be reused across similar tensor operations, improving sample efficiency

Methodology Details

Task Definition

  • Input: Unoptimized tensor accelerator code
  • Output: Functionally equivalent but performance-optimized code
  • Constraints: Maintain semantic equivalence, with correctness verified on hardware
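
The equivalence constraint can be illustrated with a small check. This is only a sketch under stated assumptions: plain Python functions stand in for accelerator kernels, and the names `is_equivalent`, `scale_naive`, `scale_rewritten`, and `scale_buggy` are hypothetical. The paper validates candidates on hardware, not in Python.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: a "kernel" is any function from an input list to an output list.
Kernel = Callable[[Sequence[float]], list]


def is_equivalent(reference: Kernel, candidate: Kernel,
                  test_inputs: Sequence[Sequence[float]], tol: float = 1e-6) -> bool:
    """Return True only if the candidate matches the reference on every test input."""
    for x in test_inputs:
        expected, actual = reference(x), candidate(x)
        if len(expected) != len(actual):
            return False
        if any(abs(e - a) > tol for e, a in zip(expected, actual)):
            return False
    return True


def scale_naive(x):
    out = []
    for v in x:
        out.append(v * 2.0)
    return out


def scale_rewritten(x):
    # Different implementation, same results.
    return [v + v for v in x]


def scale_buggy(x):
    # Incorrect rewrite: silently drops the last element.
    return [v * 2.0 for v in x[:-1]]


tests = [[1.0, 2.0, 3.0], [0.5], []]
```

A candidate such as `scale_buggy` would be rejected before its performance is ever considered.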

Core Architecture: Two-Stage Optimization Framework

Stage 1: Plan Generation

Prompt structure includes:

  1. Accelerator ISA Description: Instruction semantics, memory addressing specifications, hardware architecture description
  2. Current Code: Code to be optimized
  3. Performance Feedback: Metrics such as latency (cycle count) and memory utilization
  4. Optimization Menu: Predefined high-level optimization options (e.g., loop tiling, reordering, fusion)
  5. Search Iteration Information: Current iteration count to guide optimization selection
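
The five planning-prompt components listed above could be assembled roughly as follows. This is a sketch only: the function name, the section labels, and the closing instruction are assumptions for illustration and are not the prompts used in the paper.

```python
from typing import Dict, List


def build_planning_prompt(isa_description: str, current_code: str,
                          feedback: Dict[str, object], menu: List[str],
                          iteration: int, total_iterations: int) -> str:
    """Concatenate the planning-phase inputs in the order listed above."""
    menu_text = "\n".join(f"{i}. {option}" for i, option in enumerate(menu, start=1))
    feedback_text = ", ".join(f"{name}: {value}" for name, value in feedback.items())
    sections = [
        "Accelerator ISA:\n" + isa_description,
        "Current code:\n" + current_code,
        "Performance feedback:\n" + feedback_text,
        "Optimization menu:\n" + menu_text,
        f"Search progress:\nIteration {iteration} of {total_iterations}",
        # Hypothetical wording: the planning phase asks for a plan, not code.
        "Choose an optimization from the menu and describe a plan. Do not write code yet.",
    ]
    return "\n\n".join(sections)


prompt = build_planning_prompt(
    isa_description="(ISA text goes here)",
    current_code="matmul()",
    feedback={"latency_cycles": 1200, "scratchpad_utilization": 0.5},
    menu=["loop tiling", "loop reordering", "fusion"],
    iteration=2,
    total_iterations=10,
)
```

The feedback keys `latency_cycles` and `scratchpad_utilization` are placeholder names for the latency and memory-utilization metrics mentioned above.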

Stage 2: Implementation

Prompt structure includes:

  1. Accelerator ISA Description: Same as Stage 1
  2. Current Code: Same as Stage 1
  3. Generated Plan: Specific optimization plan output from Stage 1
  4. In-Context Learning Examples: Code examples for complex optimizations (e.g., tiling)
  5. Implementation Instructions: Natural language instructions for applying the plan and outputting optimized code
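
To show how the two stages chain together, the sketch below feeds the plan produced by Stage 1 into a Stage 2 prompt. Everything here is hypothetical: `LLM` is an assumed "prompt in, text out" interface, `stub_llm` is a canned stand-in for a real model call, and the prompt wording is invented for illustration.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # assumed interface: prompt text in, response text out


def build_implementation_prompt(isa_description: str, current_code: str, plan: str,
                                examples: List[Tuple[str, str]]) -> str:
    """Concatenate the implementation-phase inputs in the order listed above."""
    parts = [
        "Accelerator ISA:\n" + isa_description,
        "Current code:\n" + current_code,
        "Plan to apply:\n" + plan,
    ]
    for before, after in examples:  # in-context examples for complex optimizations
        parts.append("Example before:\n" + before + "\nExample after:\n" + after)
    parts.append("Apply the plan to the current code and output only the optimized code.")
    return "\n\n".join(parts)


def run_optimization_pass(llm: LLM, planning_prompt: str, isa_description: str,
                          current_code: str, examples: List[Tuple[str, str]]) -> Tuple[str, str]:
    """One pass: ask for a plan first, then ask for code that implements that plan."""
    plan = llm(planning_prompt)
    new_code = llm(build_implementation_prompt(isa_description, current_code, plan, examples))
    return plan, new_code


def stub_llm(prompt: str) -> str:
    # Canned responses so the sketch runs without any model access.
    if "Plan to apply:" in prompt:
        return "tiled_matmul()"
    return "Tile the outer loop."


plan, code = run_optimization_pass(
    stub_llm, "Pick an optimization.", "(ISA text)", "matmul()", [("loop()", "tiled_loop()")]
)
```

Separating the calls means a plan can be inspected, or several implementations sampled per plan, before any code is evaluated.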

Beam Search Optimization Strategy

  • Beam width B=6 for parallel exploration of multiple optimization trajectories
  • Correctness Filtering: Candidate code validated through functional test suites
  • Performance Screening: Only candidates outperforming parent nodes are retained
  • Iterative Optimization: Fixed budget of T search iterations
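
The search loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `propose`, `is_correct`, and `measure_latency` are hypothetical callables standing in for the LLM passes, the functional test suite, and hardware measurement, and carrying parents forward into the next beam is an assumption made here so the beam never gets worse.

```python
from typing import Callable, Hashable, Iterable, Tuple


def beam_search(initial_code: Hashable,
                propose: Callable[[Hashable], Iterable[Hashable]],
                is_correct: Callable[[Hashable], bool],
                measure_latency: Callable[[Hashable], float],
                beam_width: int = 6,
                iterations: int = 3) -> Tuple[float, Hashable]:
    """Return (latency, code) for the fastest correct candidate found."""
    beam = [(measure_latency(initial_code), initial_code)]
    for _ in range(iterations):
        # Assumption: parents stay eligible for the next beam.
        candidates = {code: latency for latency, code in beam}
        for parent_latency, parent in beam:
            for child in propose(parent):
                if not is_correct(child):          # correctness filtering
                    continue
                latency = measure_latency(child)
                if latency < parent_latency:       # must outperform its parent
                    candidates[child] = latency
        ranked = sorted(((lat, code) for code, lat in candidates.items()),
                        key=lambda pair: pair[0])
        beam = ranked[:beam_width]
    return beam[0]


# Toy problem: "code" is an integer whose value is also its latency.
best = beam_search(
    initial_code=10,
    propose=lambda n: [n - 1, n - 2, n + 1],
    is_correct=lambda n: n >= 5,   # candidates below 5 fail the "tests"
    measure_latency=lambda n: float(n),
)
```

In the toy run the search improves from 10 to 8 to 6 to 5 and stops there, because the only faster proposal (4) fails the correctness check.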

Key Technical Innovations

1. Diversity Enhancement Techniques

  • Optimization Menu Dropout: Randomly drop menu options (70% probability) during each planning phase
  • LLM Ensembling: Distribute requests across multiple LLMs to increase response diversity
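
Both diversity techniques are simple to sketch. The code below rests on assumptions: it reads the 70% figure as an independent per-option drop probability, keeps at least one option as a guard, and spreads requests round-robin across models. None of these details is confirmed by the summary above, and the model names are placeholders.

```python
import random
from typing import List, Optional


def apply_menu_dropout(menu: List[str], drop_prob: float = 0.7,
                       rng: Optional[random.Random] = None) -> List[str]:
    """Drop each menu option independently with probability drop_prob."""
    rng = rng or random.Random()
    kept = [option for option in menu if rng.random() >= drop_prob]
    if not kept:
        # Guard (an assumption): never hand the planner an empty menu.
        kept = [rng.choice(menu)]
    return kept


def pick_model(models: List[str], request_index: int) -> str:
    """Round-robin assignment of requests to models (one possible policy)."""
    return models[request_index % len(models)]


menu = ["loop tiling", "loop reordering", "fusion"]
reduced_menu = apply_menu_dropout(menu, rng=random.Random(0))
assigned = [pick_model(["model-a", "model-b"], i) for i in range(4)]
```

Passing a seeded `random.Random` makes a search run reproducible, which helps when comparing ablations.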

2. Hardware Feedback Integration

  • Real-time performance metrics (latency, memory utilization) guide next optimization step selection
  • Cycle-accurate simulation or chip-level performance measurement

3. Schedule Reuse Mechanism

  • Record high-quality schedule sequences
  • Reuse known schedules for similar tensor operations (same aspect ratios or shared dimensions)
  • Follow the reused schedule with a lightweight search for further optimization
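
One way to picture schedule reuse is a lookup keyed by operation shape. The sketch below is hypothetical throughout: it assumes GEMM-like operations described by an (M, K, N) tuple, stores a schedule as an ordered list of menu option names, and treats "similar" as either sharing a dimension or having the same aspect ratios. The paper's actual matching criteria may differ.

```python
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, int, int]  # assumed (M, K, N) dimensions of a GEMM-like operation


def is_similar(a: Shape, b: Shape) -> bool:
    """Similar if any dimension is shared or the aspect ratios match."""
    shares_dimension = any(x == y for x, y in zip(a, b))
    # Cross-multiplication compares ratios without floating-point division.
    same_aspect_ratio = (a[0] * b[1] == a[1] * b[0]) and (a[1] * b[2] == a[2] * b[1])
    return shares_dimension or same_aspect_ratio


def find_reusable_schedule(library: Dict[Shape, List[str]],
                           shape: Shape) -> Optional[List[str]]:
    """Return the first recorded schedule whose shape is similar, else None."""
    for known_shape, schedule in library.items():
        if is_similar(known_shape, shape):
            return schedule
    return None


# A recorded schedule: the ordered optimizations that worked for a 512x512x512 GEMM.
library = {(512, 512, 512): ["loop tiling", "loop reordering", "fusion"]}

scaled = find_reusable_schedule(library, (1024, 1024, 1024))   # same aspect ratios
shared = find_reusable_schedule(library, (512, 64, 128))       # shares M
unrelated = find_reusable_schedule(library, (100, 200, 300))   # no match
```

When a match is found, the recorded steps would be replayed on the new operation before spending the remaining sample budget on a short search.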

Experimental Setup

Hardware Platforms

  1. Gemmini: Open-source accelerator generator supporting systolic arrays and vector-style tensor accelerators
  2. AWS Trainium: Commercial high-performance tensor accelerator using Neuron Kernel Interface (NKI)
  3. NVIDIA L40S GPU: Modern data center GPU with dedicated Tensor Cores

Benchmarks

  • Gemmini: GEMM and convolutions from ResNet-50, TinyMPC model predictive control
  • Trainium: Tutorial-level and advanced deep learning operators (RMSNorm, LayerNorm, GEMM, Mamba, etc.)
  • GPU: KernelBench Level 1 benchmarks

Comparison Methods

  1. High-Level Software Libraries: Gemmini software library, PyTorch NeuronX, PyTorch
  2. Unoptimized Low-Level Code: Unoptimized Exo, nki-samples tutorial code
  3. Hand-Optimized Code: Expert-level manual optimization implementations
  4. ML Cost Models: TVM MetaSchedule (GPU)
  5. Hardware FSM: Gemmini hardware finite state machine (reference upper bound)

Experimental Results

Main Performance Results

Gemmini Platform

  • GEMM Benchmark: 5.6× improvement over Gemmini software library, 1.4× better than expert hand-optimized code
  • Convolution Benchmark: 2.6× improvement over software library, 1.1× better than hand-optimized code
  • Fine-Grained Linear Algebra (TinyMPC): 2.7× improvement over unoptimized code, and 1.6× better than even the expert-optimized hardware FSM implementation (forward pass)

AWS Trainium Platform

  • Tutorial Workloads: 1.36× improvement over hand-optimized code (geometric mean), 13.52× better than PyTorch NeuronX compiled code
  • Advanced Workloads: 1.9× improvement over expert-optimized code (geometric mean), up to 17.37× improvement on 1D depthwise convolution

NVIDIA L40S GPU

  • KernelBench Benchmark: 2.05× improvement over PyTorch (geometric mean), 3.8× better than TVM MetaSchedule
  • Outperforms PyTorch on all benchmarks, while TVM outperforms PyTorch on only 2 benchmarks
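
Several of the results above are reported as geometric means. The sketch below shows how such a summary is computed; the per-benchmark numbers in the example are made up for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def geometric_mean(speedups: Sequence[float]) -> float:
    """Geometric mean of positive speedup ratios, computed in log space."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Illustrative values only: a 2x and an 8x speedup average to 4x geometrically,
# whereas the arithmetic mean would report 5x.
example = geometric_mean([2.0, 8.0])
```

The geometric mean is the usual choice for ratios because a single outlier, such as the 17.37× depthwise convolution result, inflates it far less than it would an arithmetic mean.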

Ablation Study Analysis

Detailed ablation experiments validate the importance of each component:

  • Accelerator ISA: Removing it significantly reduces performance, though some improvement is still achieved
  • Optimization Menu: Essential; removing it eliminates the performance gains
  • Menu Dropout: Significantly improves performance by preventing the model from fixating on a few menu options
  • LLM Ensembling: Provides important diversity; single-model performance is lower
  • Hardware Performance Feedback: Helpful but limited in effect, since the optimization menu already incorporates the relevant metrics

Schedule Reuse Effects

  • At 100-sample budget: Schedule reuse achieves 4.6× speedup, without reuse only 3.7×
  • At 200-sample budget: Schedule reuse achieves 5.0× speedup, without reuse only 4.2×
  • Demonstrates schedule generalizability, effectively reducing search costs for similar benchmarks
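
The relative benefit of reuse follows directly from the speedups above. The short calculation below, with a hypothetical helper name, shows that the 100-sample case corresponds to the "up to 24%" figure quoted in the abstract.

```python
def relative_gain_percent(with_reuse: float, without_reuse: float) -> float:
    """Percentage improvement in speedup from reusing a schedule."""
    return (with_reuse / without_reuse - 1.0) * 100.0


gain_at_100_samples = relative_gain_percent(4.6, 3.7)  # roughly 24%
gain_at_200_samples = relative_gain_percent(5.0, 4.2)  # roughly 19%
```

The gap narrows as the budget grows, which is consistent with reuse mainly saving search effort that a larger budget can partly make up on its own.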

Related Work

Tensor Accelerator Code Optimization

  • Performance Models: Timeloop, MAESTRO use high-level hardware architecture models
  • Automation Methods: Machine learning, linear programming, black-box optimization, reinforcement learning
  • Limitations: Existing abstractions overlook implementation-specific and instruction-level optimizations

LLM Code Optimization

  • Application Scope: Evolutionary search, retrieval-augmented generation, iterative optimization, model post-training
  • System-Level Optimization: CUDA, SIMD intrinsics
  • Research Gap: Lack of LLM code optimization work for specialized hardware (non-CPU/GPU)

Conclusions and Discussion

Main Conclusions

  1. Effectiveness of LLM-Driven Optimization: Autocomp significantly outperforms traditional methods across multiple hardware platforms
  2. Exceptional Portability: Adaptation to new hardware requires only prompt modification, with minimal engineering cost
  3. Value of Schedule Reuse: Optimization schedules demonstrate good generalizability, significantly improving sample efficiency

Technical Insights

  1. Necessity of Two-Stage Design: Separating planning and implementation phases improves success rates for complex optimization tasks
  2. Importance of Domain Knowledge: Domain expertise provided by the optimization menu is critical for performance
  3. Value of Hardware Feedback: Performance metrics help guide optimization selection, although the ablation shows their effect is limited because the optimization menu already reflects them

Limitations

  1. LLM Capability Dependency: Method performance is constrained by underlying LLM code generation and reasoning capabilities
  2. Search Cost: Requires multiple LLM calls and hardware simulation, resulting in high computational cost
  3. Domain Specificity: Optimization menus require manual design for different hardware platforms
  4. Evaluation Scope: Primarily focused on tensor computation workloads; applicability to other computation types remains unknown

Future Directions

  1. Automatic Menu Generation: Research methods for automatically constructing optimization menus
  2. Cross-Platform Schedule Transfer: Explore schedule knowledge transfer across different hardware platforms
  3. Cost Efficiency Optimization: Reduce LLM calls and hardware simulation during the search process
  4. Broader Applications: Extend to other specialized accelerators beyond tensor computation

In-Depth Evaluation

Strengths

  1. Strong Innovation: First application of LLMs to low-resource tensor accelerator code optimization, with a novel technical approach
  2. High Practical Value: Addresses real engineering pain points, significantly reducing software development costs for new hardware
  3. Comprehensive Evaluation: Full assessment across three different hardware platforms with convincing results
  4. General Framework Design: Framework exhibits good extensibility and portability
  5. Superior Performance: Significantly outperforms existing best methods across multiple benchmarks

Weaknesses

  1. Computational Cost: Requires extensive LLM calls and hardware simulation, potentially limiting practical application
  2. Manual Design Dependency: Optimization menus still require expert knowledge to design by hand, limiting the degree of automation
  3. Evaluation Limitations: Primarily focused on specific types of tensor computation; generalizability remains to be verified
  4. Insufficient Theoretical Analysis: Lacks theoretical guarantees on convergence and optimality

Impact Assessment

  1. Academic Value: Opens a new application area for LLMs in compilation and optimization for specialized hardware
  2. Industrial Impact: Likely to significantly reduce software stack development costs for new hardware
  3. Reproducibility: Authors commit to open-sourcing implementation and prompts, facilitating subsequent research
  4. Inspirational Value: Provides new technical pathways for compilation optimization of other specialized hardware

Applicable Scenarios

  1. New Hardware Prototype Development: Rapidly generate optimized code for newly designed tensor accelerators
  2. DSL Compiler Construction: Serve as a complement or alternative to traditional compilers
  3. Performance Tuning Tools: Help developers optimize existing accelerator code
  4. Research and Education: Provide automated tools for accelerator programming and optimization

References

The paper cites extensive related work, primarily including:

  • Hardware accelerator design (Gemmini, TPU, Trainium, etc.)
  • Compilers and DSLs (XLA, TVM, Halide, Exo, etc.)
  • LLM code generation (CodeGen, Codex, etc.)
  • Automated optimization methods (reinforcement learning, evolutionary algorithms, etc.)

Overall Assessment: This is a high-quality paper that makes a substantial contribution to the emerging area of LLM-driven code optimization for specialized hardware. The method is novel, the evaluation spans three hardware platforms, and the practical value is clear. Computational cost and the manually designed optimization menus leave room for improvement, but the work opens new research directions with both academic and industrial relevance.