2025-11-30T02:10:19.077243

Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration

Shahawy, de Castelnau, Ienne

Task-level parallelism (TLP) is a widely used approach in software where independent tasks are dynamically created and scheduled at runtime. Recent systems have explored architectural support for TLP on field-programmable gate arrays (FPGAs), often leveraging high-level synthesis (HLS) to create processing elements (PEs). In this paper, we present Bombyx, a compiler toolchain that lowers OpenCilk programs into a Cilk-1-inspired intermediate representation, enabling efficient mapping of CPU-oriented TLP applications to spatial architectures on FPGAs. Unlike OpenCilk's implicit task model, which requires costly context switching in hardware, Cilk-1 adopts explicit continuation-passing - a model that better aligns with the streaming nature of FPGAs. Bombyx supports multiple compilation targets: one is an OpenCilk-compatible runtime for executing Cilk-1-style code using the OpenCilk backend, and another is a synthesizable PE generator designed for HLS tools like Vitis HLS. Additionally, we introduce a decoupled access-execute optimization that enables automatic generation of high-performance PEs, improving memory-compute overlap and overall throughput.

academic

Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration

Basic Information

Paper ID: 2511.21346
Title: Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration
Authors: Mohamed Shahawy, Julien de Castelnau, Paolo Ienne (École Polytechnique Fédérale de Lausanne)
Classification: cs.AR (Computer Architecture)
Publication Date: November 26, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.21346

Abstract

This paper presents Bombyx, a toolchain that compiles OpenCilk programs into FPGA hardware accelerators. Bombyx transforms OpenCilk's implicit task parallelism model into an explicit continuation-passing intermediate representation in Cilk-1 style, which is better suited to FPGA streaming characteristics. The tool supports multiple compilation targets: an OpenCilk-compatible runtime for verification, and synthesizable processing unit generators for high-level synthesis tools such as Vitis HLS. Additionally, Bombyx introduces Decoupled Access-Execute (DAE) optimization, which automatically generates high-performance processing units, improving memory-compute overlap and overall throughput.

Research Background and Motivation

1. Core Problem to Address

Task-level parallelism (TLP) is a widely-used parallelization technique in software that enables dynamic creation and scheduling of independent tasks at runtime. While hardware frameworks (such as ParallelXL and HardCilk) support TLP on FPGAs, there is a lack of automated tools to automatically extract and compile processing unit (PE) code from software TLP frameworks. Existing frameworks typically require users to manually provide PE code, which is both tedious and error-prone.

2. Problem Significance

Automation Requirements: Porting CPU-oriented TLP applications to FPGAs requires substantial manual effort, including code restructuring, hardware interface adaptation, and configuration file generation
Performance Optimization: Manually written code is difficult to apply advanced hardware optimizations (such as Decoupled Access-Execute)
Development Efficiency: The absence of automated toolchains hinders widespread adoption of TLP in FPGA acceleration

3. Limitations of Existing Approaches

OpenCilk's Implicit Model: The fork-join model using cilk_spawn and cilk_sync requires context switching at synchronization points. Implementing context switching in hardware requires saving the entire circuit state, which is neither directly supported by current HLS tools nor practical without substantial RTL engineering
TAPIR Intermediate Representation: OpenCilk uses TAPIR, which employs low-level compiler constructs, making it difficult to generate readable C++ code close to the original for HLS
Manual PE Writing: Requires manual handling of closure alignment, write buffer interfaces, configuration file generation, and other tedious details

4. Research Motivation

Cilk-1's explicit continuation-passing model is more suitable for hardware implementation because it splits functions at synchronization points into terminating functions (executed atomically without context switching). Although this model is not intuitive for software programming (and was thus abandoned in Cilk's evolution), it is natural for hardware implementation. Bombyx aims to automate the transformation from OpenCilk to explicit TLP and generate optimized hardware PEs.

Core Contributions

Automated Compilation Pipeline: Proposes a complete automated compilation toolchain Bombyx from OpenCilk to FPGA hardware accelerators
Explicit Intermediate Representation: Designs implicit and explicit IRs based on control flow graphs, enabling automatic transformation from fork-join model to continuation-passing model
Multi-target Code Generation:
- HardCilk backend: Automatically generates synthesizable C++ HLS code and configuration files
- Cilk-1 simulation layer: Verifies transformation correctness using OpenCilk runtime
Decoupled Access-Execute Optimization: Supports DAE optimization through compiler pragmas, decoupling memory access and computation into separate tasks to improve hardware performance
Experimental Validation: Achieves 26.5% runtime reduction in graph traversal benchmarks with DAE optimization

Methodology Details

Task Definition

Input: Task-parallel C/C++ programs written in OpenCilk (CPU-oriented)
Output:

Synthesizable C++ HLS code (for FPGA PE generation)
HardCilk system configuration files (JSON format)
Or Cilk-1 style executable code (for verification)

Constraints:

Programs must follow OpenCilk's fork-join model
Inter-function dependencies must be explicitly expressed through cilk_spawn and cilk_sync

Overall Architecture Design

Bombyx's compilation process consists of three main phases (see Figure 3):

OpenCilk Source → AST → Implicit IR → Explicit IR → Code Generation
                   ↓           ↓            ↓            ↓
                Clang      Control Flow   DAE      HardCilk/Cilk-1
               Frontend      Graph      Optimization

1. AST to Implicit IR Transformation

Uses OpenCilk Clang frontend to generate abstract syntax tree
Converts AST to control flow graph (CFG) representation of implicit IR
Each function corresponds to one CFG containing:
- Unique entry block (no incoming edges)
- One or more exit blocks (no outgoing edges)
- Basic blocks composed of sequential C statements, terminated by control flow statements

Why Not Use TAPIR: TAPIR uses low-level constructs (such as φ nodes, alloca, etc.), making it difficult to generate readable C++ code close to the original. Bombyx's IR preserves the original code structure.

2. Implicit IR to Explicit IR Transformation

This is the core transformation step of Bombyx, converting OpenCilk's implicit synchronization model to Cilk-1's explicit continuation-passing model.

Key Concepts:

Closure: A data structure representing a task, containing:
- Ready parameters
- Placeholders awaiting dependencies
- Return pointer
spawn_next: Creates a continuation task awaiting dependencies
send_argument: Explicitly writes arguments to awaiting closure and notifies scheduler

Transformation Algorithm:

Path Partitioning: Traverses CFG, starting new paths when encountering function termination blocks (return) or sync operations
- Each path becomes a self-contained terminating function
- Gray regions in Figure 4(c) represent two paths
Dependency Identification: Analyzes dependencies across sync boundaries
- Identifies variables needed after sync (such as x and y in Figure 1)
- These variables must be explicitly stored in closures
Keyword Replacement:
- Inserts closure declarations at spawn call sites
- Replaces sync with spawn_next calls to successor functions
- Changes spawn return values to explicit closure field writes
- Preserves return statements for later conversion to send_argument

Transformation Example (Figure 1 to Figure 2):

// Implicit (OpenCilk)
int x = cilk_spawn fib(n-1);
int y = cilk_spawn fib(n-2);
cilk_sync;
return x + y;

// Explicit (Cilk-1)
cont int x, y;
spawn_next sum(k, ?x, ?y);  // Create continuation task
spawn fib(x, n-1);           // Write to x placeholder
spawn fib(y, n-2);           // Write to y placeholder
// Function terminates, no sync needed

HardCilk Backend Generation

HardCilk is an open-source FPGA TLP architecture generator providing hardware work-stealing scheduler. Bombyx automatically generates all components required by HardCilk:

1. Closure Alignment

Automatically adds padding to align closure sizes to 128/256 bits
Facilitates hardware implementation

2. Write Buffer Interface

HardCilk uses write buffer modules to handle spawn_next and send_argument
Bombyx automatically inserts required metadata (task type, target address, etc.)

3. JSON Configuration Generation

Automatically generates system description files through static analysis, containing:

Closure sizes
Spawn/spawn_next/send_argument relationships between tasks
Other system configuration parameters

Decoupled Access-Execute (DAE) Optimization

Motivation

In naive PE implementations, a single unit handles both memory requests and computation, resulting in:

Memory stalls leaving PE idle
Reduced overall throughput

Traditional pipelining in HLS tools (such as Vitis) is limited:

Cannot handle data-dependent loop boundaries
Static schedulers cannot determine when to schedule memory access

DAE Solution

Under TLP, express memory access and execution as separate tasks:

Memory Access Task: Specifically handles memory requests
Execution Task: Handles computation logic
Task Scheduler: Coordinates execution, implementing elastic pipelining

Implementation Method

Through compiler pragmas:

#PRAGMA BOMBYX DAE
struct node_t nd = *n;  // Memory access operation
// Subsequent code becomes execution task

Compiler automatically:

Extracts annotated lines into independent function (memory access task)
Replaces with spawn call
Inserts sync
After conversion to explicit form, memory access task spawns continuation of execution task

Experimental Setup

Benchmarks

Graph Traversal Program (Figure 5):

Parallel breadth-first graph traversal
Accesses node adjacency lists (memory-intensive)
Recursively traverses adjacent nodes in parallel

Datasets

Synthetically generated tree-shaped graphs:

Graph 1: Depth D=7, branching factor B=4, total 5,461 nodes
Graph 2: Depth D=9, branching factor B=4, total 87,381 nodes
Node count calculation: $\frac{B^D - 1}{B - 1}$

Experimental Configuration

FPGA Platform: Xilinx Alveo U55C (xcu55c-fsvh2892-2L-e)
Synthesis Tool: Vivado 2024.1
Target Frequency: 300 MHz
PE Count:
- Non-DAE: 1 PE
- DAE: 3 PEs (spawner, executor, access each 1)

Evaluation Metrics

Runtime: Time required to traverse entire graph
Resource Utilization:
- LUT (Look-Up Tables)
- FF (Flip-Flops)
- BRAM (Block RAM)

Experimental Results

Main Performance Results

Runtime Reduction: DAE optimization achieves 26.5% performance improvement
Improved memory-compute overlap through decoupling access and execution

Resource Utilization Analysis

Component	LUT	FF	BRAM
Non-DAE	2,657	2,305	2
DAE (Total)	3,896	3,464	4
- Spawner	133	387	0
- Executor	1,999	1,913	2
- Access	1,764	1,164	2

Resource Overhead:

LUT increase of 47% (2,657 → 3,896)
FF increase of 50% (2,305 → 3,464)
BRAM doubled (2 → 4)

Experimental Findings

Performance-Resource Tradeoff: 26.5% performance improvement at approximately 50% resource overhead is a reasonable tradeoff for memory-intensive applications
PE Size Analysis:
- Spawner + Executor ≈ Non-DAE single PE size
- Access task consumes additional resources
- Recommendation: RTL-implemented data-parallel PEs can amortize memory task cost
Optimization Potential: Paper indicates future work could integrate memory tasks as black-box primitives rather than HLS-generated

TLP Software Frameworks

Intel Thread Building Blocks (TBB): Widely used in industry for task parallelism
Cilk Plus: Intel's Cilk extension
OpenCilk: Latest Cilk framework, outperforming previous frameworks

TLP Hardware Frameworks

ParallelXL: Early FPGA TLP architecture framework
HardCilk: Bombyx's target platform, open-source FPGA TLP system providing hardware work-stealing scheduler

Cilk Evolution

Cilk-1: Earliest version adopting explicit continuation-passing
Subsequent Versions: Shifted to implicit fork-join model (more suitable for software programming)
Theoretical Foundation: Proven that any implicit program can be transformed to explicit form

Advantages of This Work

First Automated Tool: Complete automated pipeline from OpenCilk to FPGA PE
Explicit Transformation: Leverages Cilk-1 style to avoid expensive hardware context switching
Optimization Support: Integrates DAE and other hardware-specific optimizations

Conclusions and Discussion

Main Conclusions

Bombyx successfully implements automated compilation from OpenCilk to FPGA hardware
Explicit continuation-passing model suits FPGA streaming characteristics, avoiding expensive context switching
DAE optimization achieves 26.5% performance improvement in graph traversal benchmarks
Tool is extensible to other hardware TLP frameworks

Limitations

DAE Optimization Requires Manual Guidance: Currently requires programmers to insert pragmas; full automation not achieved
Limited Evaluation:
- Evaluated on only single benchmark (graph traversal)
- Lacks comparison with other methods (hand-written code, other compilers)
- No testing on diverse application scenarios
Resource Overhead: DAE optimization introduces ~50% resource increase, potentially limiting large-scale systems
Memory Task Implementation: HLS-generated memory PE is inefficient; RTL implementation recommended but not implemented
Tool Maturity: As research prototype, lacks comprehensive error handling and edge case support

Future Directions

Automatic DAE Identification: Develop compiler analysis to automatically identify code patterns suitable for DAE optimization
RTL Memory Units: Integrate efficient RTL-implemented data-parallel memory units
Additional Optimizations: Explore other hardware-specific optimizations (data reuse, locality optimization)
Extended Targets: Support more hardware TLP frameworks and HLS tools
Comprehensive Evaluation: Evaluate on more benchmarks, including diverse TLP application types

In-Depth Evaluation

Strengths

1. Methodological Innovation

Unique Transformation Strategy: Cleverly solves hardware context switching challenge by leveraging Cilk-1's explicit continuation-passing model
Automation Value: First complete automated toolchain from OpenCilk to FPGA hardware acceleration, filling important gap
Reasonable IR Design: IR design preserving original code structure generates more readable HLS code

2. Engineering Contributions

Strong Practicality: Automates tedious details like closure alignment, write buffer interfaces, configuration file generation
Verifiability: Provides Cilk-1 simulation layer for transformation verification, enhancing tool credibility
Open-Source Friendly: Targets HardCilk, an open-source system, facilitating tool adoption

3. Technical Insights

Hardware-Software Co-design: Deep understanding of adaptation between software TLP models and hardware implementation
DAE Optimization: Combines classical hardware optimization with TLP, demonstrating potential of compiler-guided optimization

4. Writing Clarity

Clearly demonstrates implicit-to-explicit transformation through Fibonacci example
Figure 4 IR comparison is intuitive and easy to understand
Compilation pipeline diagram is clear

Weaknesses

1. Insufficient Experiments

Single Benchmark: Only graph traversal evaluated, cannot fully verify tool generality
Lacks Comparisons: No comparison with hand-written code or other compilation methods
Limited Scale: Test graphs relatively small (maximum 87K nodes)
Shallow Performance Analysis: Lacks analysis of performance improvement sources (bandwidth utilization, task scheduling efficiency, etc.)

2. Method Limitations

Semi-Automated DAE: Requires manual pragmas, limiting usability
Limited Optimization Space: Only demonstrates one optimization (DAE), unexplored other hardware optimizations
Large Resource Overhead: 50% resource increase may limit practical applications

3. Missing Technical Details

Incomplete Transformation Algorithm: Lacks detailed explanation of dependency analysis, closure field selection, and other key algorithms
Correctness Guarantees: No formal proof or systematic verification method provided
Edge Cases: Lacks discussion of handling recursion, pointers, complex data structures

4. Evaluation Depth

Scalability Unknown: Not tested on large-scale applications or complex control flow
Generality Questionable: Does graph traversal represent typical TLP applications?
Gap from Theory: Is 26.5% improvement close to theoretical limit?

Impact

1. Contribution to Field

Fills Toolchain Gap: Provides important infrastructure for TLP application on FPGAs
Inspires Future Research: Explicit transformation approach applicable to other parallel models
Promotes Hardware TLP: Lowers adoption barriers, facilitating TLP in FPGA acceleration

2. Practical Value

Moderate Practicality: Direct value for HardCilk users, but requires further maturation
Learning Cost: Still requires understanding TLP concepts and hardware constraints
Production Readiness: As research prototype, distance from production use remains

3. Reproducibility

Moderate: Depends on OpenCilk, HardCilk, Vitis HLS toolchain
Code Not Open-Sourced: Paper does not mention code release plans
Sufficient Implementation Details: Provides adequate implementation and experimental details

Applicable Scenarios

Suitable Scenarios

Dynamic Task Parallel Applications: Graph algorithms, tree traversal, dynamic programming
Memory-Intensive Computing: DAE optimization particularly suited for memory-bound applications
HardCilk Users: Developers already using or planning to use HardCilk
Rapid Prototyping: Quick migration of CPU TLP code to FPGA

Unsuitable Scenarios

Regular Parallelism: Data parallelism, pipelining not requiring dynamic scheduling
Resource-Constrained: 50% resource overhead may be unacceptable
Extreme Performance: Hand-optimized RTL may still be superior
Complex Memory Patterns: Tool support for complex pointers and dynamic memory unknown

Improvement Recommendations

Extended Evaluation: Add more benchmarks (sorting, searching, scientific computing)
Performance Comparison: Compare with hand-written HLS, CPU, GPU implementations
Automatic DAE: Develop compiler analysis for automatic DAE opportunity identification
Optimization Diversity: Explore more hardware optimizations (data reuse, local caching)
Formal Verification: Provide formal guarantees of transformation correctness
Open-Source Release: Release tool to promote community contribution and verification

Key References

Critical references cited in this work:

2 OpenCilk (PPoPP'23): Latest Cilk framework, Bombyx's input language
4 HardCilk (FCCM'24): Bombyx's target platform, authors' prior work
5 Cilk-1 (SIGPLAN'95): Classic explicit continuation-passing TLP system, theoretical foundation of Bombyx
6 Joerg Dissertation (1996): Proves theoretical feasibility of implicit-to-explicit transformation

Overall Assessment: Bombyx is a valuable research contribution that fills the important gap of automated toolchain from OpenCilk to FPGA hardware acceleration. Its core innovation lies in leveraging Cilk-1's explicit continuation-passing model to avoid expensive hardware context switching, providing a complete compilation pipeline. However, as preliminary work, the paper shows obvious deficiencies in breadth and depth of experimental evaluation, and the semi-automated nature of DAE optimization limits usability. The tool has direct value for HardCilk users and TLP researchers but requires further maturation for widespread adoption. Recommended future work should focus on automatic optimization identification, extended benchmark evaluation, and open-source release to promote community verification and improvement.