Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration
Shahawy, de Castelnau, Ienne
Task-level parallelism (TLP) is a widely used approach in software where independent tasks are dynamically created and scheduled at runtime. Recent systems have explored architectural support for TLP on field-programmable gate arrays (FPGAs), often leveraging high-level synthesis (HLS) to create processing elements (PEs). In this paper, we present Bombyx, a compiler toolchain that lowers OpenCilk programs into a Cilk-1-inspired intermediate representation, enabling efficient mapping of CPU-oriented TLP applications to spatial architectures on FPGAs. Unlike OpenCilk's implicit task model, which requires costly context switching in hardware, Cilk-1 adopts explicit continuation-passing - a model that better aligns with the streaming nature of FPGAs. Bombyx supports multiple compilation targets: one is an OpenCilk-compatible runtime for executing Cilk-1-style code using the OpenCilk backend, and another is a synthesizable PE generator designed for HLS tools like Vitis HLS. Additionally, we introduce a decoupled access-execute optimization that enables automatic generation of high-performance PEs, improving memory-compute overlap and overall throughput.
This paper presents Bombyx, a toolchain that compiles OpenCilk programs into FPGA hardware accelerators. Bombyx transforms OpenCilk's implicit task-parallel model into an explicit continuation-passing intermediate representation in the style of Cilk-1, which is better suited to the streaming nature of FPGAs. The tool supports multiple compilation targets: an OpenCilk-compatible runtime for verification, and a synthesizable processing element (PE) generator for high-level synthesis tools such as Vitis HLS. Additionally, Bombyx introduces a decoupled access-execute (DAE) optimization that enables automatic generation of high-performance PEs, improving memory-compute overlap and overall throughput.
Task-level parallelism (TLP) is a widely used parallelization technique in software that enables dynamic creation and scheduling of independent tasks at runtime. Hardware frameworks such as ParallelXL and HardCilk support TLP on FPGAs, but tools that automatically extract and compile processing element (PE) code from software TLP frameworks are lacking. Existing frameworks typically require users to write PE code by hand, which is both tedious and error-prone.
OpenCilk's Implicit Model: The fork-join model using cilk_spawn and cilk_sync requires context switching at synchronization points. Implementing context switching in hardware requires saving the entire circuit state, which current HLS tools do not directly support and which is impractical without substantial RTL engineering.
TAPIR Intermediate Representation: OpenCilk uses TAPIR, which employs low-level compiler constructs, making it difficult to generate readable C++ code for HLS that stays close to the original source.
Manual PE Writing: Requires manual handling of closure alignment, write buffer interfaces, configuration file generation, and other tedious details
Cilk-1's explicit continuation-passing model is more suitable for hardware implementation because it splits functions at synchronization points into terminating functions (executed atomically without context switching). Although this model is not intuitive for software programming (and was thus abandoned in Cilk's evolution), it is natural for hardware implementation. Bombyx aims to automate the transformation from OpenCilk to explicit TLP and generate optimized hardware PEs.
Automated Compilation Pipeline: Proposes Bombyx, a complete automated compilation toolchain from OpenCilk to FPGA hardware accelerators
Explicit Intermediate Representation: Designs implicit and explicit IRs based on control flow graphs, enabling automatic transformation from the fork-join model to the continuation-passing model
Multi-target Code Generation:
HardCilk backend: Automatically generates synthesizable C++ HLS code and configuration files
Cilk-1 simulation layer: Verifies transformation correctness using the OpenCilk runtime
Decoupled Access-Execute Optimization: Supports DAE optimization through compiler pragmas, decoupling memory access and computation into separate tasks to improve hardware performance
Experimental Validation: Achieves a 26.5% runtime reduction in graph traversal benchmarks with DAE optimization
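The decoupled access-execute idea in the contributions above can be illustrated with a small software sketch. It is not Bombyx's generated code and does not use its pragma syntax: the `Tree`, `Traversal`, `access`, and `execute` names are hypothetical, and a simple queue stands in for the hardware scheduler. The point is only the split itself: one task type performs the memory read, and a separate task type computes on data handed to it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

using Task = std::function<void()>;

// Hypothetical tree stored as child lists; stands in for graph data in memory.
struct Tree {
    std::vector<std::vector<int>> children;
};

struct Traversal {
    const Tree& tree;
    std::deque<Task> ready;  // software stand-in for the task scheduler
    int visited = 0;

    explicit Traversal(const Tree& t) : tree(t) {}

    // Access task: the only place that reads the tree from memory.
    void access(int node) {
        assert(node >= 0 && static_cast<std::size_t>(node) < tree.children.size());
        std::vector<int> kids = tree.children[node];       // memory read
        ready.push_back([this, kids] { execute(kids); });  // hand data to compute
    }

    // Execute task: works only on the data passed in by the access task.
    void execute(const std::vector<int>& kids) {
        ++visited;
        for (int k : kids) {
            ready.push_back([this, k] { access(k); });
        }
    }

    // Drains the queue and returns the number of nodes visited.
    int run(int root) {
        ready.push_back([this, root] { access(root); });
        while (!ready.empty()) {
            Task t = std::move(ready.front());
            ready.pop_front();
            t();
        }
        return visited;
    }
};

// Small five-node tree used for demonstration.
Tree makeDemoTree() {
    Tree t;
    t.children = {{1, 2}, {3, 4}, {}, {}, {}};
    return t;
}
```

Calling `Traversal(tree).run(0)` on the demo tree visits every node once. In hardware, the benefit would come from access and execute tasks running on separate PEs so that memory latency overlaps with computation; this sequential sketch shows only the task structure.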
Uses the OpenCilk Clang frontend to generate an abstract syntax tree (AST)
Converts the AST into the implicit IR, a control flow graph (CFG) representation
Each function corresponds to one CFG containing:
Unique entry block (no incoming edges)
One or more exit blocks (no outgoing edges)
Basic blocks composed of sequential C statements, terminated by control flow statements
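The CFG properties listed above can be made concrete with a minimal data structure sketch. The type and function names (`BasicBlock`, `FunctionCfg`, `isWellFormed`, `makeFibCfg`) are assumptions for illustration, not Bombyx's internal classes, and statements are kept as plain strings for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; these are not Bombyx's actual classes.
struct BasicBlock {
    std::vector<std::string> statements;  // sequential C statements
    std::string terminator;               // control flow statement ending the block
    std::vector<std::size_t> successors;  // indices of successor blocks
};

struct FunctionCfg {
    std::string name;
    std::vector<BasicBlock> blocks;
    std::size_t entry = 0;
};

// Checks the listed properties: exactly one block without incoming edges
// (the entry) and at least one block without outgoing edges (an exit).
bool isWellFormed(const FunctionCfg& cfg) {
    const std::size_t n = cfg.blocks.size();
    if (n == 0 || cfg.entry >= n) return false;
    std::vector<std::size_t> inDegree(n, 0);
    bool hasExit = false;
    for (const BasicBlock& b : cfg.blocks) {
        if (b.successors.empty()) hasExit = true;
        for (std::size_t s : b.successors) {
            if (s >= n) return false;  // dangling edge
            ++inDegree[s];
        }
    }
    for (std::size_t i = 0; i < n; ++i) {
        const bool noPreds = (inDegree[i] == 0);
        if (noPreds != (i == cfg.entry)) return false;
    }
    return hasExit;
}

// CFG for the recursive fib function used as the running example.
FunctionCfg makeFibCfg() {
    FunctionCfg cfg;
    cfg.name = "fib";
    cfg.blocks.push_back(BasicBlock{{}, "if (n < 2)", {1, 2}});
    cfg.blocks.push_back(BasicBlock{{}, "return n", {}});
    cfg.blocks.push_back(BasicBlock{
        {"x = cilk_spawn fib(n-1)", "y = cilk_spawn fib(n-2)"}, "cilk_sync", {3}});
    cfg.blocks.push_back(BasicBlock{{}, "return x + y", {}});
    assert(isWellFormed(cfg));
    return cfg;
}
```

Treating "unique entry" as "the only block with no predecessors" is one reading of the description; a real implementation might also handle unreachable blocks differently.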
Why Not Use TAPIR: TAPIR uses low-level constructs (such as φ nodes, alloca, etc.), making it difficult to generate readable C++ code close to the original. Bombyx's IR preserves the original code structure.
This is the core transformation step of Bombyx, converting OpenCilk's implicit synchronization model to Cilk-1's explicit continuation-passing model.
Key Concepts:
Closure: A data structure representing a task, containing:
Ready parameters
Placeholders awaiting dependencies
Return pointer
spawn_next: Creates a continuation task awaiting dependencies
send_argument: Explicitly writes arguments to awaiting closure and notifies scheduler
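A minimal software model of these three concepts, assuming a simple counter of unfilled placeholders per closure. The layout and the names `Continuation`, `Scheduler`, `spawnNext`, and `sendArgument` are hypothetical and do not reflect HardCilk's actual closure format.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Closure;

// A continuation names one placeholder slot in a waiting closure.
struct Continuation {
    Closure* target;
    std::size_t slot;
};

// Hypothetical closure layout: argument slots, a count of slots still
// awaiting a value, and the return pointer for the task's own result.
struct Closure {
    std::vector<int> args;
    int pending;
    Continuation ret;
};

// Stand-in for the scheduler: closures whose dependencies are all satisfied.
struct Scheduler {
    std::deque<Closure*> ready;
};

// spawn_next analogue: create a closure whose slots are all placeholders.
Closure spawnNext(std::size_t slots, Continuation ret) {
    return Closure{std::vector<int>(slots, 0), static_cast<int>(slots), ret};
}

// send_argument analogue: write the value into the placeholder and notify
// the scheduler once the last dependency has arrived.
void sendArgument(Scheduler& s, Continuation k, int value) {
    assert(k.target != nullptr && k.slot < k.target->args.size());
    assert(k.target->pending > 0);
    k.target->args[k.slot] = value;
    if (--k.target->pending == 0) {
        s.ready.push_back(k.target);
    }
}
```

For example, a closure created with `spawnNext(2, ...)` stays out of the ready queue after the first `sendArgument` and becomes ready after the second.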
Transformation Algorithm:
Path Partitioning: Traverses CFG, starting new paths when encountering function termination blocks (return) or sync operations
Each path becomes a self-contained terminating function
Gray regions in Figure 4(c) represent two paths
Dependency Identification: Analyzes dependencies across sync boundaries
Identifies variables needed after sync (such as x and y in Figure 1)
These variables must be explicitly stored in closures
Keyword Replacement:
Inserts closure declarations at spawn call sites
Replaces sync with spawn_next calls to successor functions
Changes spawn return values to explicit closure field writes
Preserves return statements for later conversion to send_argument
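The path partitioning step can be sketched as a graph traversal that stops at sync blocks and starts a new path at each sync successor. This is a simplified illustration under assumed names (`Term`, `Block`, `partitionPaths`), not the algorithm as implemented in Bombyx; dependency identification and keyword replacement are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block terminator kinds for this sketch.
enum class Term { Jump, Sync, Return };

struct Block {
    Term term;
    std::vector<std::size_t> succ;  // successor block indices
};

// Returns one list of block indices per path. Path 0 starts at the entry;
// every successor of a sync block starts a further path.
std::vector<std::vector<std::size_t>> partitionPaths(
        const std::vector<Block>& cfg, std::size_t entry) {
    assert(entry < cfg.size());
    std::vector<std::vector<std::size_t>> paths;
    std::vector<std::size_t> starts{entry};
    std::vector<bool> isStart(cfg.size(), false);
    isStart[entry] = true;

    for (std::size_t i = 0; i < starts.size(); ++i) {
        std::vector<bool> seen(cfg.size(), false);
        std::vector<std::size_t> stack{starts[i]};
        std::vector<std::size_t> path;
        seen[starts[i]] = true;
        while (!stack.empty()) {
            const std::size_t b = stack.back();
            stack.pop_back();
            path.push_back(b);
            for (std::size_t s : cfg[b].succ) {
                if (cfg[b].term == Term::Sync) {
                    // The path ends at the sync; its successor begins a new one.
                    if (!isStart[s]) {
                        isStart[s] = true;
                        starts.push_back(s);
                    }
                } else if (!seen[s]) {
                    seen[s] = true;
                    stack.push_back(s);
                }
            }
        }
        paths.push_back(path);
    }
    return paths;
}

// fib CFG: branch, base-case return, spawn block ending in sync, final return.
std::vector<Block> makeFibBlocks() {
    return {
        Block{Term::Jump, {1, 2}},
        Block{Term::Return, {}},
        Block{Term::Sync, {3}},
        Block{Term::Return, {}},
    };
}
```

On the fib CFG this yields two paths: one containing the branch, the base case, and the spawning block, and one containing only the block after the sync, which would become the successor function.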
Transformation Example (Figure 1 to Figure 2):
// Implicit (OpenCilk)
int x = cilk_spawn fib(n-1);
int y = cilk_spawn fib(n-2);
cilk_sync;
return x + y;
// Explicit (Cilk-1)
cont int x, y;
spawn_next sum(k, ?x, ?y); // Create continuation task
spawn fib(x, n-1); // Write to x placeholder
spawn fib(y, n-2); // Write to y placeholder
// Function terminates, no sync needed
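To see that the explicit form computes the same result without any blocking sync, the following sketch emulates it in plain C++. The `Runtime`, `Closure`, and `runFib` names are hypothetical, and a sequential queue stands in for both the OpenCilk runtime and HardCilk's scheduler; each task runs to completion and never waits.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Continuation closure for sum(k, ?x, ?y): two placeholders plus the
// return continuation k. Names are hypothetical.
struct Closure {
    int x = 0;
    int y = 0;
    int pending = 2;
    std::function<void(int)> k;
};

struct Runtime {
    std::deque<std::function<void()>> ready;
    void spawn(std::function<void()> t) { ready.push_back(std::move(t)); }
    void run() {
        while (!ready.empty()) {
            std::function<void()> t = std::move(ready.front());
            ready.pop_front();
            t();  // every task terminates without waiting
        }
    }
};

// Successor function: runs once both placeholders are filled.
void sum(const std::function<void(int)>& k, int x, int y) { k(x + y); }

// send_argument analogue: fill one placeholder, release sum when complete.
void sendArgument(Runtime& rt, const std::shared_ptr<Closure>& c,
                  int Closure::*slot, int value) {
    assert(c->pending > 0);
    (*c).*slot = value;
    if (--c->pending == 0) {
        rt.spawn([c] { sum(c->k, c->x, c->y); });
    }
}

void fib(Runtime& rt, std::function<void(int)> k, int n) {
    if (n < 2) {
        k(n);  // return n becomes a send to the continuation
        return;
    }
    auto c = std::make_shared<Closure>();  // spawn_next sum(k, ?x, ?y)
    c->k = std::move(k);
    rt.spawn([&rt, c, n] {                 // spawn fib(x, n-1)
        fib(rt, [&rt, c](int v) { sendArgument(rt, c, &Closure::x, v); }, n - 1);
    });
    rt.spawn([&rt, c, n] {                 // spawn fib(y, n-2)
        fib(rt, [&rt, c](int v) { sendArgument(rt, c, &Closure::y, v); }, n - 2);
    });
}

int runFib(int n) {
    Runtime rt;
    int result = -1;
    rt.spawn([&rt, &result, n] { fib(rt, [&result](int v) { result = v; }, n); });
    rt.run();
    return result;
}
```

Here `runFib(10)` evaluates to 55. Heap-allocated closures and `std::function` are conveniences of the software emulation; synthesizable PE code would need a fixed closure layout instead.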
HardCilk is an open-source FPGA TLP architecture generator that provides a hardware work-stealing scheduler. Bombyx automatically generates the components HardCilk requires, including synthesizable C++ HLS code for the PEs and the configuration files.
Performance-Resource Tradeoff: A 26.5% runtime reduction at approximately 50% resource overhead is a reasonable tradeoff for memory-intensive applications
PE Size Analysis:
Spawner + Executor ≈ Non-DAE single PE size
Access task consumes additional resources
Recommendation: RTL-implemented data-parallel PEs can amortize memory task cost
Optimization Potential: The paper indicates that future work could integrate memory tasks as black-box primitives rather than generating them with HLS
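To put the performance and resource figures on a common footing, the arithmetic below converts the runtime reduction into a speedup $S$ and relates it to the resource ratio $R$. It assumes the 26.5% figure is a reduction in runtime, as the contributions list states, and treats "approximately 50%" overhead as exactly $R = 1.5$, which is a simplification.

```latex
\[
S = \frac{T_{\text{base}}}{T_{\text{DAE}}} = \frac{1}{1 - 0.265} \approx 1.36,
\qquad
\frac{S}{R} \approx \frac{1.36}{1.5} \approx 0.91
\]
```

Under these assumptions, DAE yields roughly a 1.36x speedup while throughput per unit of resource drops by about 9%, so the tradeoff favors designs where runtime matters more than area.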
[2] OpenCilk (PPoPP'23): The latest Cilk framework and Bombyx's input language
[4] HardCilk (FCCM'24): Bombyx's target platform and the authors' prior work
[5] Cilk-1 (SIGPLAN'95): The classic explicit continuation-passing TLP system and the theoretical foundation of Bombyx
[6] Joerg Dissertation (1996): Proves the theoretical feasibility of the implicit-to-explicit transformation
Overall Assessment: Bombyx is a valuable research contribution that fills an important gap: an automated toolchain from OpenCilk to FPGA hardware acceleration. Its core innovation lies in using Cilk-1's explicit continuation-passing model to avoid expensive hardware context switching while providing a complete compilation pipeline. However, as preliminary work, the paper's experimental evaluation is limited in both breadth and depth, and the semi-automated nature of the DAE optimization limits usability. The tool has direct value for HardCilk users and TLP researchers but requires further maturation for widespread adoption. Future work should focus on automatic identification of optimization opportunities, an extended benchmark evaluation, and an open-source release to promote community verification and improvement.