Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators

Hong, Bhatia, Cheung et al.

Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages, such as specialized tensor accelerator code, still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.

Basic Information

Paper ID: 2505.18574
Title: Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
Authors: Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao (UC Berkeley)
Classification: cs.PL cs.AI cs.AR cs.LG
Publication Status: Preprint. Under review.
Paper Link: https://arxiv.org/abs/2505.18574

Abstract

Hardware accelerators, particularly those designed specifically for tensor processing, have become ubiquitous in contemporary computing environments. However, despite substantial efforts in compiler development, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. This paper proposes Autocomp, a method for optimizing code through automated LLM-driven search, enabling accelerator programmers to leverage domain knowledge and hardware feedback. The approach is realized through three key techniques: 1) formulating each optimization process as a structured two-stage prompt, divided into planning and code generation phases; 2) injecting domain knowledge during the planning phase through a concise and adaptable optimization menu; 3) integrating correctness and performance metrics from hardware as feedback in each search iteration.

Research Background and Motivation

Core Problems

The primary challenges in tensor accelerator programming include:

Programming Complexity: Unlike general-purpose CPU programming, tensor accelerators require explicit management of data movement, state configuration, and operation scheduling
Compiler Adaptation Cost: Adapting traditional compilers for new hardware platforms requires substantial engineering effort, with software development costs accounting for 40-50% of new hardware development costs
Optimization Scheduling Problem: Combinatorial explosion in determining which optimizations to apply and in what order
Low-Resource Language Challenge: Instruction set architectures (ISAs) and domain-specific languages (DSLs) for specialized accelerators are underrepresented in LLM training corpora

Limitations of Existing Approaches

Traditional Compilers: XLA, TVM, Triton, etc., support only a limited number of hardware backends, primarily CPUs and GPUs
DSL Approaches: Halide, Exo, and similar tools provide primitives for expressing tensor computations, but the optimization burden remains on the programmer
Data-Driven Methods: Require substantial performance data for training, which is extremely scarce for domain-specific hardware accelerators
Direct LLM Application: Zero-shot code generation is highly unreliable for low-resource accelerator languages

Core Contributions

First LLM-Driven Code Optimization Method for Low-Resource Tensor Accelerators: Proposes Autocomp, a framework designed specifically for specialized hardware accelerators
Highly Portable Optimization Framework: Enables adaptation to new hardware platforms through prompt modification, significantly reducing engineering costs
Superior Performance: Substantially outperforms existing methods across three different hardware platforms
Schedule Reuse Mechanism: Demonstrates that optimization schedules can be reused across similar tensor operations, improving sample efficiency

Methodology Details

Task Definition

Input: Unoptimized tensor accelerator code
Output: Functionally equivalent but performance-optimized code
Constraints: Maintain semantic equivalence, verified through hardware validation for correctness
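
The semantic-equivalence constraint can be pictured as a functional test that compares a candidate against the reference on a set of inputs. The sketch below is a minimal illustration under stated assumptions: the name passes_tests, the Kernel type, and the tolerance are hypothetical, and plain Python functions stand in for code that would actually run on a simulator or on the accelerator.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical stand-in: a "kernel" is any function from a vector to a vector.
Kernel = Callable[[Sequence[float]], List[float]]


def passes_tests(
    reference: Kernel,
    candidate: Kernel,
    inputs: Iterable[Sequence[float]],
    tol: float = 1e-6,
) -> bool:
    """Return True if the candidate matches the reference on every test input."""
    for x in inputs:
        expected = reference(x)
        actual = candidate(x)
        if len(expected) != len(actual):
            return False
        if any(abs(e - a) > tol for e, a in zip(expected, actual)):
            return False
    return True


def reference_kernel(x: Sequence[float]) -> List[float]:
    return [2.0 * v for v in x]


def rewritten_kernel(x: Sequence[float]) -> List[float]:
    # Same result computed differently (an "optimized" variant)
    return [v + v for v in x]


def broken_kernel(x: Sequence[float]) -> List[float]:
    return [3.0 * v for v in x]


test_inputs = [[1.0, 2.0, 3.0], [0.5, -4.0]]
ok = passes_tests(reference_kernel, rewritten_kernel, test_inputs)
rejected = not passes_tests(reference_kernel, broken_kernel, test_inputs)
```

A candidate that fails such a check would be discarded before its performance is considered.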

Core Architecture: Two-Stage Optimization Framework

Stage 1: Plan Generation

Prompt structure includes:

Accelerator ISA Description: Instruction semantics, memory addressing specifications, hardware architecture description
Current Code: Code to be optimized
Performance Feedback: Metrics such as latency (cycle count) and memory utilization
Optimization Menu: Predefined high-level optimization options (e.g., loop tiling, reordering, fusion)
Search Iteration Information: Current iteration count to guide optimization selection
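
As a rough illustration of how the planning prompt might be assembled from the components listed above, the following Python sketch concatenates them into a single string. The function name build_plan_prompt, the section labels, and the instruction wording are all hypothetical; they are not taken from the paper's actual prompts.

```python
from typing import Sequence


def build_plan_prompt(
    isa_description: str,
    current_code: str,
    latency_cycles: int,
    menu: Sequence[str],
    iteration: int,
    total_iterations: int,
) -> str:
    """Assemble a planning-phase prompt (hypothetical layout, not the paper's wording)."""
    menu_lines = "\n".join(f"{i + 1}. {option}" for i, option in enumerate(menu))
    return (
        f"Accelerator ISA:\n{isa_description}\n\n"
        f"Current code:\n{current_code}\n\n"
        f"Measured latency: {latency_cycles} cycles\n\n"
        f"Optimization menu:\n{menu_lines}\n\n"
        f"This is search iteration {iteration} of {total_iterations}.\n"
        "Pick one optimization from the menu and describe a plan for applying it. "
        "Do not write code yet."
    )


prompt = build_plan_prompt(
    isa_description="(ISA text goes here)",
    current_code="(code to optimize goes here)",
    latency_cycles=120000,
    menu=["loop tiling", "loop reordering", "fusion"],
    iteration=1,
    total_iterations=10,
)
```

Keeping the menu as a plain list makes it easy to edit per platform, which is the kind of prompt-level adaptation the summary describes.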

Stage 2: Implementation

Prompt structure includes:

Accelerator ISA Description: Same as Stage 1
Current Code: Same as Stage 1
Generated Plan: Specific optimization plan output from Stage 1
In-Context Learning Examples: Code examples for complex optimizations (e.g., tiling)
Implementation Instructions: Natural language instructions for applying the plan and outputting optimized code
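
The two stages can be chained as one optimization pass: the plan returned by the first call is pasted into the prompt for the second. The sketch below assumes a hypothetical LLM interface (any callable from prompt text to completion text) and uses a stub in place of a real model; the prompt wording is illustrative only.

```python
from typing import Callable, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def two_phase_pass(
    llm: LLM, isa: str, code: str, plan_prompt: str, examples: str = ""
) -> Tuple[str, str]:
    """Run one pass: ask for a plan, then ask for code that applies that plan."""
    plan = llm(plan_prompt)
    impl_prompt = (
        f"Accelerator ISA:\n{isa}\n\n"
        f"Current code:\n{code}\n\n"
        f"Plan to apply:\n{plan}\n\n"
        f"Examples:\n{examples}\n\n"
        "Apply the plan and output the full optimized code."
    )
    new_code = llm(impl_prompt)
    return plan, new_code


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs offline
    if "Plan to apply:" in prompt:
        return "// tiled version of the kernel"
    return "Tile the outer loop by 4."


plan, new_code = two_phase_pass(stub_llm, "(ISA)", "(code)", "Choose an optimization.")
```

Separating the calls means the in-context examples only need to appear in the second prompt, where the code is actually written.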

Beam Search Optimization Strategy

Beam width B=6 for parallel exploration of multiple optimization trajectories
Correctness Filtering: Candidate code validated through functional test suites
Performance Screening: Only candidates outperforming parent nodes are retained
Iterative Optimization: Fixed budget of T search iterations
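
The search loop described above can be sketched as a small beam search. This is a simplified illustration under stated assumptions, not the authors' implementation: propose and evaluate are hypothetical callbacks standing in for the LLM pass and the hardware measurement, and the beam here is replaced entirely by the surviving children at each iteration.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (code, measured latency)


def beam_search(
    initial_code: str,
    propose: Callable[[str], List[str]],
    evaluate: Callable[[str], Optional[float]],
    beam_width: int = 6,
    iterations: int = 3,
) -> Candidate:
    """Keep the `beam_width` fastest correct candidates at each iteration."""
    start_latency = evaluate(initial_code)
    if start_latency is None:
        raise ValueError("initial code must pass the correctness tests")
    beam: List[Candidate] = [(initial_code, start_latency)]
    best = beam[0]
    for _ in range(iterations):
        children: List[Candidate] = []
        for parent_code, parent_latency in beam:
            for child_code in propose(parent_code):
                latency = evaluate(child_code)  # None means a failed test
                # Correctness filter, then keep only children faster than the parent
                if latency is not None and latency < parent_latency:
                    children.append((child_code, latency))
        if not children:
            break  # no candidate improved on its parent
        children.sort(key=lambda c: c[1])
        beam = children[:beam_width]
        best = min(best, beam[0], key=lambda c: c[1])
    return best


def toy_propose(code: str) -> List[str]:
    # "t" stands for a helpful transformation, "x" for one that breaks the code
    return [code + "t", code + "x"]


def toy_evaluate(code: str) -> Optional[float]:
    if "x" in code:
        return None  # fails the functional tests
    return 100.0 / (1 + code.count("t"))


best_code, best_latency = beam_search("k", toy_propose, toy_evaluate)
```

In the toy run, every broken candidate is filtered out and each iteration keeps the one child that beats its parent.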

Key Technical Innovations

1. Diversity Enhancement Techniques

Optimization Menu Dropout: Randomly drop menu options (70% probability) in each planning phase, so the model does not keep choosing the same few options
LLM Ensembling: Distribute requests across multiple LLMs to increase response diversity
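
Both diversity techniques are simple to express in code. The sketch below makes two assumptions that the summary does not confirm: the 70% probability is applied to each menu option independently, and requests are spread over models in round-robin order. The function names and the fallback that keeps at least one option are hypothetical.

```python
import random
from itertools import cycle
from typing import List, Sequence


def dropout_menu(menu: Sequence[str], drop_prob: float, rng: random.Random) -> List[str]:
    """Drop each menu option independently with probability `drop_prob`."""
    kept = [option for option in menu if rng.random() >= drop_prob]
    # Assumption: keep at least one option so the planner always has a choice
    return kept if kept else [rng.choice(list(menu))]


def assign_models(num_requests: int, models: Sequence[str]) -> List[str]:
    """Spread planning requests over several models in round-robin order."""
    model_cycle = cycle(models)
    return [next(model_cycle) for _ in range(num_requests)]


menu = ["loop tiling", "loop reordering", "fusion"]
rng = random.Random(0)  # seeded so the sketch is repeatable
reduced_menu = dropout_menu(menu, drop_prob=0.7, rng=rng)
assignment = assign_models(5, ["model-a", "model-b"])  # placeholder model names
```

Because each planning prompt sees a different subset of the menu, parallel requests are less likely to converge on the same plan.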

2. Hardware Feedback Integration

Measured performance metrics (latency, memory utilization) guide the choice of the next optimization step
Metrics are obtained from cycle-accurate simulation or on-chip performance measurement

3. Schedule Reuse Mechanism

Record high-quality schedule sequences
Reuse known schedules for similar tensor operations (same aspect ratios or shared dimensions)
After a lightweight search guided by the reused schedule, optimize further
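
One way to picture the lookup step is a small library of recorded schedules keyed by operation shape. The sketch below is hypothetical: it assumes GEMM shapes (M, K, N), treats two shapes as similar when they have the same aspect ratios or share a dimension in the same position, and stores a schedule as a list of optimization names. None of these choices is confirmed by the paper.

```python
from fractions import Fraction
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, int, int]  # GEMM dimensions (M, K, N)


def aspect_ratio(shape: Shape) -> Tuple[Fraction, Fraction]:
    """Exact ratios M:K and K:N, so that scaled copies of a shape compare equal."""
    m, k, n = shape
    return Fraction(m, k), Fraction(k, n)


def is_similar(a: Shape, b: Shape) -> bool:
    """Hypothetical similarity test: same aspect ratios, or a shared dimension."""
    return aspect_ratio(a) == aspect_ratio(b) or any(x == y for x, y in zip(a, b))


def find_schedule(target: Shape, library: Dict[Shape, List[str]]) -> Optional[List[str]]:
    """Return the first recorded schedule whose shape is similar to the target."""
    for shape, schedule in library.items():
        if is_similar(target, shape):
            return schedule
    return None


library = {(512, 512, 512): ["loop tiling", "loop reordering"]}
reused = find_schedule((1024, 1024, 1024), library)  # same aspect ratios
missing = find_schedule((12, 7, 5), library)  # no similar shape recorded
```

A schedule found this way would seed the search for the new operation rather than replace it.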

Experimental Setup

Hardware Platforms

Gemmini: Open-source accelerator generator supporting systolic arrays and vector-style tensor accelerators
AWS Trainium: Commercial high-performance tensor accelerator using Neuron Kernel Interface (NKI)
NVIDIA L40S GPU: Modern data center GPU with dedicated Tensor Cores

Benchmarks

Gemmini: GEMM and convolutions from ResNet-50, TinyMPC model predictive control
Trainium: Tutorial-level and advanced deep learning operators (RMSNorm, LayerNorm, GEMM, Mamba, etc.)
GPU: KernelBench Level 1 benchmarks

Comparison Methods

High-Level Software Libraries: Gemmini software library, PyTorch NeuronX, PyTorch
Unoptimized Low-Level Code: Unoptimized Exo, nki-samples tutorial code
Hand-Optimized Code: Expert-level manual optimization implementations
ML Cost Models: TVM MetaSchedule (GPU)
Hardware FSM: Gemmini hardware finite state machine (reference upper bound)

Experimental Results

Main Performance Results

Gemmini Platform

GEMM Benchmark: 5.6× improvement over Gemmini software library, 1.4× better than expert hand-optimized code
Convolution Benchmark: 2.6× improvement over software library, 1.1× better than hand-optimized code
Fine-Grained Linear Algebra (TinyMPC): 2.7× improvement over unoptimized code, and 1.6× better than even the expert-optimized hardware FSM implementation on the forward pass

AWS Trainium Platform

Tutorial Workloads: 1.36× improvement over hand-optimized code (geometric mean), 13.52× better than PyTorch NeuronX compiled code
Advanced Workloads: 1.9× improvement over expert-optimized code (geometric mean), up to 17.37× improvement on 1D depthwise convolution

NVIDIA L40S GPU

KernelBench Benchmark: 2.05× improvement over PyTorch (geometric mean), 3.8× better than TVM MetaSchedule
Outperforms PyTorch on all benchmarks, while TVM outperforms PyTorch on only 2 benchmarks
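
Several of the figures above are geometric means of per-benchmark speedups. As a reminder of how such a summary number is computed, the sketch below uses made-up latencies chosen only so the result is easy to check by hand; they are not measurements from the paper, and the function name is hypothetical.

```python
from statistics import geometric_mean
from typing import List, Sequence


def geomean_speedup(baseline: Sequence[float], optimized: Sequence[float]) -> float:
    """Geometric mean of per-benchmark speedups (baseline latency / optimized latency)."""
    speedups: List[float] = [b / o for b, o in zip(baseline, optimized)]
    return geometric_mean(speedups)


# Made-up latencies: the two benchmarks speed up by 2x and 8x respectively
baseline_latencies = [200.0, 800.0]
optimized_latencies = [100.0, 100.0]
overall = geomean_speedup(baseline_latencies, optimized_latencies)
```

The geometric mean here is 4 (the square root of 2 × 8), lower than the arithmetic mean of 5, which is why it is the usual choice for averaging ratios: a single large speedup cannot dominate the summary.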

Ablation Study Analysis

Detailed ablation experiments validate the importance of each component:

Accelerator ISA: Removing it significantly reduces performance, though some improvement is still achieved
Optimization Menu: Essential; without it, the performance gains are lost
Menu Dropout: Significantly improves performance by preventing the model from fixating on a few menu options
LLM Ensembling: Provides important diversity; single-model performance is lower
Hardware Performance Feedback: Helpful but limited in effect, as optimization menu already incorporates relevant metrics

Schedule Reuse Effects

At 100-sample budget: Schedule reuse achieves 4.6× speedup, without reuse only 3.7×
At 200-sample budget: Schedule reuse achieves 5.0× speedup, without reuse only 4.2×
Demonstrates schedule generalizability, effectively reducing search costs for similar benchmarks
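
The relative gain from schedule reuse follows directly from the speedups above, and the 100-sample case matches the "up to 24%" figure quoted in the abstract. A short check (the function name is illustrative):

```python
def relative_gain(with_reuse: float, without_reuse: float) -> float:
    """Fractional improvement of the speedup with reuse over the speedup without it."""
    return with_reuse / without_reuse - 1.0


gain_100 = relative_gain(4.6, 3.7)  # 100-sample budget
gain_200 = relative_gain(5.0, 4.2)  # 200-sample budget

print(f"100 samples: {gain_100:.1%}")  # about 24%
print(f"200 samples: {gain_200:.1%}")  # about 19%
```

The benefit shrinks as the budget grows, consistent with reuse mainly saving search effort early on.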

Related Work

Tensor Accelerator Code Optimization

Performance Models: Timeloop, MAESTRO use high-level hardware architecture models
Automation Methods: Machine learning, linear programming, black-box optimization, reinforcement learning
Limitations: Existing abstractions overlook implementation-specific and instruction-level optimizations

LLM Code Optimization

Application Scope: Evolutionary search, retrieval-augmented generation, iterative optimization, model post-training
System-Level Optimization: CUDA, SIMD intrinsics
Research Gap: Lack of LLM code optimization work for specialized hardware (non-CPU/GPU)

Conclusions and Discussion

Main Conclusions

Effectiveness of LLM-Driven Optimization: Autocomp significantly outperforms traditional methods across multiple hardware platforms
Exceptional Portability: Adaptation to new hardware requires only prompt modification, with minimal engineering cost
Value of Schedule Reuse: Optimization schedules demonstrate good generalizability, significantly improving sample efficiency

Technical Insights

Necessity of Two-Stage Design: Separating planning and implementation phases improves success rates for complex optimization tasks
Importance of Domain Knowledge: Domain expertise provided by the optimization menu is critical for performance
Value of Hardware Feedback: Measured performance metrics help guide the choice of optimization direction, although the ablation shows their effect is limited

Limitations

LLM Capability Dependency: Method performance is constrained by underlying LLM code generation and reasoning capabilities
Search Cost: Requires multiple LLM calls and hardware simulation, resulting in high computational cost
Domain Specificity: Optimization menus require manual design for different hardware platforms
Evaluation Scope: Primarily focused on tensor computation workloads; applicability to other computation types remains unknown

Future Directions

Automatic Menu Generation: Research methods for automatically constructing optimization menus
Cross-Platform Schedule Transfer: Explore schedule knowledge transfer across different hardware platforms
Cost Efficiency Optimization: Reduce LLM calls and hardware simulation during the search process
Broader Applications: Extend to other specialized accelerators beyond tensor computation

In-Depth Evaluation

Strengths

Strong Innovation: First application of LLMs to low-resource tensor accelerator code optimization with novel technical approach
High Practical Value: Addresses real engineering pain points, significantly reducing software development costs for new hardware
Comprehensive Evaluation: Full assessment across three different hardware platforms with convincing results
General Framework Design: Framework exhibits good extensibility and portability
Superior Performance: Significantly outperforms existing best methods across multiple benchmarks

Weaknesses

Computational Cost: Requires extensive LLM calls and hardware simulation, potentially limiting practical application
Manual Design Dependency: Optimization menus still require expert knowledge for manual design, limiting automation degree
Evaluation Limitations: Primarily focused on specific types of tensor computation; generalizability remains to be verified
Insufficient Theoretical Analysis: Lacks theoretical guarantees on convergence and optimality

Impact Assessment

Academic Value: Opens a new application area for LLMs in compilation and optimization for specialized hardware
Industrial Impact: Likely to significantly reduce software stack development costs for new hardware
Reproducibility: Authors commit to open-sourcing implementation and prompts, facilitating subsequent research
Inspirational Value: Provides new technical pathways for compilation optimization of other specialized hardware

Applicable Scenarios

New Hardware Prototype Development: Rapidly generate optimized code for newly designed tensor accelerators
DSL Compiler Construction: Serve as a complement or alternative to traditional compilers
Performance Tuning Tools: Help developers optimize existing accelerator code
Research and Education: Provide automated tools for accelerator programming and optimization

References

The paper cites extensive related work, primarily including:

Hardware accelerator design (Gemmini, TPU, Trainium, etc.)
Compilers and DSLs (XLA, TVM, Halide, Exo, etc.)
LLM code generation (CodeGen, Codex, etc.)
Automated optimization methods (reinforcement learning, evolutionary algorithms, etc.)

Overall Assessment: This is a high-quality research paper that makes important contributions to the emerging interdisciplinary field of applying LLMs to compilation and optimization for specialized hardware. The method is innovative, the experimental evaluation is comprehensive, and the practical value is significant. While there is room for improvement in computational cost and degree of automation, the work opens new directions for the field, with both academic and industrial value.