2025-11-30T02:10:19.077243

Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration

Shahawy, de Castelnau, Ienne
Task-level parallelism (TLP) is a widely used approach in software where independent tasks are dynamically created and scheduled at runtime. Recent systems have explored architectural support for TLP on field-programmable gate arrays (FPGAs), often leveraging high-level synthesis (HLS) to create processing elements (PEs). In this paper, we present Bombyx, a compiler toolchain that lowers OpenCilk programs into a Cilk-1-inspired intermediate representation, enabling efficient mapping of CPU-oriented TLP applications to spatial architectures on FPGAs. Unlike OpenCilk's implicit task model, which requires costly context switching in hardware, Cilk-1 adopts explicit continuation-passing - a model that better aligns with the streaming nature of FPGAs. Bombyx supports multiple compilation targets: one is an OpenCilk-compatible runtime for executing Cilk-1-style code using the OpenCilk backend, and another is a synthesizable PE generator designed for HLS tools like Vitis HLS. Additionally, we introduce a decoupled access-execute optimization that enables automatic generation of high-performance PEs, improving memory-compute overlap and overall throughput.
academic

Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration

Basic Information

  • Paper ID: 2511.21346
  • Title: Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration
  • Authors: Mohamed Shahawy, Julien de Castelnau, Paolo Ienne (École Polytechnique Fédérale de Lausanne)
  • Classification: cs.AR (Computer Architecture)
  • Publication Date: November 26, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.21346

Abstract

This paper presents Bombyx, a toolchain that compiles OpenCilk programs into FPGA hardware accelerators. Bombyx transforms OpenCilk's implicit task parallelism model into an explicit continuation-passing intermediate representation in Cilk-1 style, which is better suited to FPGA streaming characteristics. The tool supports multiple compilation targets: an OpenCilk-compatible runtime for verification, and synthesizable processing unit generators for high-level synthesis tools such as Vitis HLS. Additionally, Bombyx introduces Decoupled Access-Execute (DAE) optimization, which automatically generates high-performance processing units, improving memory-compute overlap and overall throughput.

Research Background and Motivation

1. Core Problem to Address

Task-level parallelism (TLP) is a widely-used parallelization technique in software that enables dynamic creation and scheduling of independent tasks at runtime. While hardware frameworks (such as ParallelXL and HardCilk) support TLP on FPGAs, there is a lack of automated tools to automatically extract and compile processing unit (PE) code from software TLP frameworks. Existing frameworks typically require users to manually provide PE code, which is both tedious and error-prone.

2. Problem Significance

  • Automation Requirements: Porting CPU-oriented TLP applications to FPGAs requires substantial manual effort, including code restructuring, hardware interface adaptation, and configuration file generation
  • Performance Optimization: Manually written code is difficult to apply advanced hardware optimizations (such as Decoupled Access-Execute)
  • Development Efficiency: The absence of automated toolchains hinders widespread adoption of TLP in FPGA acceleration

3. Limitations of Existing Approaches

  • OpenCilk's Implicit Model: The fork-join model using cilk_spawn and cilk_sync requires context switching at synchronization points. Implementing context switching in hardware requires saving the entire circuit state, which is neither directly supported by current HLS tools nor practical without substantial RTL engineering
  • TAPIR Intermediate Representation: OpenCilk uses TAPIR, which employs low-level compiler constructs, making it difficult to generate readable C++ code close to the original for HLS
  • Manual PE Writing: Requires manual handling of closure alignment, write buffer interfaces, configuration file generation, and other tedious details

4. Research Motivation

Cilk-1's explicit continuation-passing model is more suitable for hardware implementation because it splits functions at synchronization points into terminating functions (executed atomically without context switching). Although this model is not intuitive for software programming (and was thus abandoned in Cilk's evolution), it is natural for hardware implementation. Bombyx aims to automate the transformation from OpenCilk to explicit TLP and generate optimized hardware PEs.

Core Contributions

  1. Automated Compilation Pipeline: Proposes a complete automated compilation toolchain Bombyx from OpenCilk to FPGA hardware accelerators
  2. Explicit Intermediate Representation: Designs implicit and explicit IRs based on control flow graphs, enabling automatic transformation from fork-join model to continuation-passing model
  3. Multi-target Code Generation:
    • HardCilk backend: Automatically generates synthesizable C++ HLS code and configuration files
    • Cilk-1 simulation layer: Verifies transformation correctness using OpenCilk runtime
  4. Decoupled Access-Execute Optimization: Supports DAE optimization through compiler pragmas, decoupling memory access and computation into separate tasks to improve hardware performance
  5. Experimental Validation: Achieves 26.5% runtime reduction in graph traversal benchmarks with DAE optimization

Methodology Details

Task Definition

Input: Task-parallel C/C++ programs written in OpenCilk (CPU-oriented)
Output:

  • Synthesizable C++ HLS code (for FPGA PE generation)
  • HardCilk system configuration files (JSON format)
  • Or Cilk-1 style executable code (for verification)

Constraints:

  • Programs must follow OpenCilk's fork-join model
  • Inter-function dependencies must be explicitly expressed through cilk_spawn and cilk_sync

Overall Architecture Design

Bombyx's compilation process consists of three main phases (see Figure 3):

OpenCilk Source → AST → Implicit IR → Explicit IR → Code Generation
                   ↓           ↓            ↓            ↓
                Clang      Control Flow   DAE      HardCilk/Cilk-1
               Frontend      Graph      Optimization

1. AST to Implicit IR Transformation

  • Uses OpenCilk Clang frontend to generate abstract syntax tree
  • Converts AST to control flow graph (CFG) representation of implicit IR
  • Each function corresponds to one CFG containing:
    • Unique entry block (no incoming edges)
    • One or more exit blocks (no outgoing edges)
    • Basic blocks composed of sequential C statements, terminated by control flow statements

Why Not Use TAPIR: TAPIR uses low-level constructs (such as φ nodes, alloca, etc.), making it difficult to generate readable C++ code close to the original. Bombyx's IR preserves the original code structure.

2. Implicit IR to Explicit IR Transformation

This is the core transformation step of Bombyx, converting OpenCilk's implicit synchronization model to Cilk-1's explicit continuation-passing model.

Key Concepts:

  • Closure: A data structure representing a task, containing:
    • Ready parameters
    • Placeholders awaiting dependencies
    • Return pointer
  • spawn_next: Creates a continuation task awaiting dependencies
  • send_argument: Explicitly writes arguments to awaiting closure and notifies scheduler

Transformation Algorithm:

  1. Path Partitioning: Traverses CFG, starting new paths when encountering function termination blocks (return) or sync operations
    • Each path becomes a self-contained terminating function
    • Gray regions in Figure 4(c) represent two paths
  2. Dependency Identification: Analyzes dependencies across sync boundaries
    • Identifies variables needed after sync (such as x and y in Figure 1)
    • These variables must be explicitly stored in closures
  3. Keyword Replacement:
    • Inserts closure declarations at spawn call sites
    • Replaces sync with spawn_next calls to successor functions
    • Changes spawn return values to explicit closure field writes
    • Preserves return statements for later conversion to send_argument

Transformation Example (Figure 1 to Figure 2):

// Implicit (OpenCilk)
int x = cilk_spawn fib(n-1);
int y = cilk_spawn fib(n-2);
cilk_sync;
return x + y;

// Explicit (Cilk-1)
cont int x, y;
spawn_next sum(k, ?x, ?y);  // Create continuation task
spawn fib(x, n-1);           // Write to x placeholder
spawn fib(y, n-2);           // Write to y placeholder
// Function terminates, no sync needed

HardCilk Backend Generation

HardCilk is an open-source FPGA TLP architecture generator providing hardware work-stealing scheduler. Bombyx automatically generates all components required by HardCilk:

1. Closure Alignment

  • Automatically adds padding to align closure sizes to 128/256 bits
  • Facilitates hardware implementation

2. Write Buffer Interface

  • HardCilk uses write buffer modules to handle spawn_next and send_argument
  • Bombyx automatically inserts required metadata (task type, target address, etc.)

3. JSON Configuration Generation

Automatically generates system description files through static analysis, containing:

  • Closure sizes
  • Spawn/spawn_next/send_argument relationships between tasks
  • Other system configuration parameters

Decoupled Access-Execute (DAE) Optimization

Motivation

In naive PE implementations, a single unit handles both memory requests and computation, resulting in:

  • Memory stalls leaving PE idle
  • Reduced overall throughput

Traditional pipelining in HLS tools (such as Vitis) is limited:

  • Cannot handle data-dependent loop boundaries
  • Static schedulers cannot determine when to schedule memory access

DAE Solution

Under TLP, express memory access and execution as separate tasks:

  • Memory Access Task: Specifically handles memory requests
  • Execution Task: Handles computation logic
  • Task Scheduler: Coordinates execution, implementing elastic pipelining

Implementation Method

Through compiler pragmas:

#PRAGMA BOMBYX DAE
struct node_t nd = *n;  // Memory access operation
// Subsequent code becomes execution task

Compiler automatically:

  1. Extracts annotated lines into independent function (memory access task)
  2. Replaces with spawn call
  3. Inserts sync
  4. After conversion to explicit form, memory access task spawns continuation of execution task

Experimental Setup

Benchmarks

Graph Traversal Program (Figure 5):

  • Parallel breadth-first graph traversal
  • Accesses node adjacency lists (memory-intensive)
  • Recursively traverses adjacent nodes in parallel

Datasets

Synthetically generated tree-shaped graphs:

  • Graph 1: Depth D=7, branching factor B=4, total 5,461 nodes
  • Graph 2: Depth D=9, branching factor B=4, total 87,381 nodes
  • Node count calculation: BD1B1\frac{B^D - 1}{B - 1}

Experimental Configuration

  • FPGA Platform: Xilinx Alveo U55C (xcu55c-fsvh2892-2L-e)
  • Synthesis Tool: Vivado 2024.1
  • Target Frequency: 300 MHz
  • PE Count:
    • Non-DAE: 1 PE
    • DAE: 3 PEs (spawner, executor, access each 1)

Evaluation Metrics

  1. Runtime: Time required to traverse entire graph
  2. Resource Utilization:
    • LUT (Look-Up Tables)
    • FF (Flip-Flops)
    • BRAM (Block RAM)

Experimental Results

Main Performance Results

  • Runtime Reduction: DAE optimization achieves 26.5% performance improvement
  • Improved memory-compute overlap through decoupling access and execution

Resource Utilization Analysis

ComponentLUTFFBRAM
Non-DAE2,6572,3052
DAE (Total)3,8963,4644
- Spawner1333870
- Executor1,9991,9132
- Access1,7641,1642

Resource Overhead:

  • LUT increase of 47% (2,657 → 3,896)
  • FF increase of 50% (2,305 → 3,464)
  • BRAM doubled (2 → 4)

Experimental Findings

  1. Performance-Resource Tradeoff: 26.5% performance improvement at approximately 50% resource overhead is a reasonable tradeoff for memory-intensive applications
  2. PE Size Analysis:
    • Spawner + Executor ≈ Non-DAE single PE size
    • Access task consumes additional resources
    • Recommendation: RTL-implemented data-parallel PEs can amortize memory task cost
  3. Optimization Potential: Paper indicates future work could integrate memory tasks as black-box primitives rather than HLS-generated

TLP Software Frameworks

  1. Intel Thread Building Blocks (TBB): Widely used in industry for task parallelism
  2. Cilk Plus: Intel's Cilk extension
  3. OpenCilk: Latest Cilk framework, outperforming previous frameworks

TLP Hardware Frameworks

  1. ParallelXL: Early FPGA TLP architecture framework
  2. HardCilk: Bombyx's target platform, open-source FPGA TLP system providing hardware work-stealing scheduler

Cilk Evolution

  1. Cilk-1: Earliest version adopting explicit continuation-passing
  2. Subsequent Versions: Shifted to implicit fork-join model (more suitable for software programming)
  3. Theoretical Foundation: Proven that any implicit program can be transformed to explicit form

Advantages of This Work

  • First Automated Tool: Complete automated pipeline from OpenCilk to FPGA PE
  • Explicit Transformation: Leverages Cilk-1 style to avoid expensive hardware context switching
  • Optimization Support: Integrates DAE and other hardware-specific optimizations

Conclusions and Discussion

Main Conclusions

  1. Bombyx successfully implements automated compilation from OpenCilk to FPGA hardware
  2. Explicit continuation-passing model suits FPGA streaming characteristics, avoiding expensive context switching
  3. DAE optimization achieves 26.5% performance improvement in graph traversal benchmarks
  4. Tool is extensible to other hardware TLP frameworks

Limitations

  1. DAE Optimization Requires Manual Guidance: Currently requires programmers to insert pragmas; full automation not achieved
  2. Limited Evaluation:
    • Evaluated on only single benchmark (graph traversal)
    • Lacks comparison with other methods (hand-written code, other compilers)
    • No testing on diverse application scenarios
  3. Resource Overhead: DAE optimization introduces ~50% resource increase, potentially limiting large-scale systems
  4. Memory Task Implementation: HLS-generated memory PE is inefficient; RTL implementation recommended but not implemented
  5. Tool Maturity: As research prototype, lacks comprehensive error handling and edge case support

Future Directions

  1. Automatic DAE Identification: Develop compiler analysis to automatically identify code patterns suitable for DAE optimization
  2. RTL Memory Units: Integrate efficient RTL-implemented data-parallel memory units
  3. Additional Optimizations: Explore other hardware-specific optimizations (data reuse, locality optimization)
  4. Extended Targets: Support more hardware TLP frameworks and HLS tools
  5. Comprehensive Evaluation: Evaluate on more benchmarks, including diverse TLP application types

In-Depth Evaluation

Strengths

1. Methodological Innovation

  • Unique Transformation Strategy: Cleverly solves hardware context switching challenge by leveraging Cilk-1's explicit continuation-passing model
  • Automation Value: First complete automated toolchain from OpenCilk to FPGA hardware acceleration, filling important gap
  • Reasonable IR Design: IR design preserving original code structure generates more readable HLS code

2. Engineering Contributions

  • Strong Practicality: Automates tedious details like closure alignment, write buffer interfaces, configuration file generation
  • Verifiability: Provides Cilk-1 simulation layer for transformation verification, enhancing tool credibility
  • Open-Source Friendly: Targets HardCilk, an open-source system, facilitating tool adoption

3. Technical Insights

  • Hardware-Software Co-design: Deep understanding of adaptation between software TLP models and hardware implementation
  • DAE Optimization: Combines classical hardware optimization with TLP, demonstrating potential of compiler-guided optimization

4. Writing Clarity

  • Clearly demonstrates implicit-to-explicit transformation through Fibonacci example
  • Figure 4 IR comparison is intuitive and easy to understand
  • Compilation pipeline diagram is clear

Weaknesses

1. Insufficient Experiments

  • Single Benchmark: Only graph traversal evaluated, cannot fully verify tool generality
  • Lacks Comparisons: No comparison with hand-written code or other compilation methods
  • Limited Scale: Test graphs relatively small (maximum 87K nodes)
  • Shallow Performance Analysis: Lacks analysis of performance improvement sources (bandwidth utilization, task scheduling efficiency, etc.)

2. Method Limitations

  • Semi-Automated DAE: Requires manual pragmas, limiting usability
  • Limited Optimization Space: Only demonstrates one optimization (DAE), unexplored other hardware optimizations
  • Large Resource Overhead: 50% resource increase may limit practical applications

3. Missing Technical Details

  • Incomplete Transformation Algorithm: Lacks detailed explanation of dependency analysis, closure field selection, and other key algorithms
  • Correctness Guarantees: No formal proof or systematic verification method provided
  • Edge Cases: Lacks discussion of handling recursion, pointers, complex data structures

4. Evaluation Depth

  • Scalability Unknown: Not tested on large-scale applications or complex control flow
  • Generality Questionable: Does graph traversal represent typical TLP applications?
  • Gap from Theory: Is 26.5% improvement close to theoretical limit?

Impact

1. Contribution to Field

  • Fills Toolchain Gap: Provides important infrastructure for TLP application on FPGAs
  • Inspires Future Research: Explicit transformation approach applicable to other parallel models
  • Promotes Hardware TLP: Lowers adoption barriers, facilitating TLP in FPGA acceleration

2. Practical Value

  • Moderate Practicality: Direct value for HardCilk users, but requires further maturation
  • Learning Cost: Still requires understanding TLP concepts and hardware constraints
  • Production Readiness: As research prototype, distance from production use remains

3. Reproducibility

  • Moderate: Depends on OpenCilk, HardCilk, Vitis HLS toolchain
  • Code Not Open-Sourced: Paper does not mention code release plans
  • Sufficient Implementation Details: Provides adequate implementation and experimental details

Applicable Scenarios

Suitable Scenarios

  1. Dynamic Task Parallel Applications: Graph algorithms, tree traversal, dynamic programming
  2. Memory-Intensive Computing: DAE optimization particularly suited for memory-bound applications
  3. HardCilk Users: Developers already using or planning to use HardCilk
  4. Rapid Prototyping: Quick migration of CPU TLP code to FPGA

Unsuitable Scenarios

  1. Regular Parallelism: Data parallelism, pipelining not requiring dynamic scheduling
  2. Resource-Constrained: 50% resource overhead may be unacceptable
  3. Extreme Performance: Hand-optimized RTL may still be superior
  4. Complex Memory Patterns: Tool support for complex pointers and dynamic memory unknown

Improvement Recommendations

  1. Extended Evaluation: Add more benchmarks (sorting, searching, scientific computing)
  2. Performance Comparison: Compare with hand-written HLS, CPU, GPU implementations
  3. Automatic DAE: Develop compiler analysis for automatic DAE opportunity identification
  4. Optimization Diversity: Explore more hardware optimizations (data reuse, local caching)
  5. Formal Verification: Provide formal guarantees of transformation correctness
  6. Open-Source Release: Release tool to promote community contribution and verification

Key References

Critical references cited in this work:

  1. 2 OpenCilk (PPoPP'23): Latest Cilk framework, Bombyx's input language
  2. 4 HardCilk (FCCM'24): Bombyx's target platform, authors' prior work
  3. 5 Cilk-1 (SIGPLAN'95): Classic explicit continuation-passing TLP system, theoretical foundation of Bombyx
  4. 6 Joerg Dissertation (1996): Proves theoretical feasibility of implicit-to-explicit transformation

Overall Assessment: Bombyx is a valuable research contribution that fills the important gap of automated toolchain from OpenCilk to FPGA hardware acceleration. Its core innovation lies in leveraging Cilk-1's explicit continuation-passing model to avoid expensive hardware context switching, providing a complete compilation pipeline. However, as preliminary work, the paper shows obvious deficiencies in breadth and depth of experimental evaluation, and the semi-automated nature of DAE optimization limits usability. The tool has direct value for HardCilk users and TLP researchers but requires further maturation for widespread adoption. Recommended future work should focus on automatic optimization identification, extended benchmark evaluation, and open-source release to promote community verification and improvement.