Building a Frontend for Calyx

In this tutorial, we're going to build a compiler for a small language called MrXL.

MrXL Overview

MrXL provides constructs to create arrays, and perform map and reduce operations on those arrays. Here's an example of a dot product implementation in MrXL:

input avec: int[4]
input bvec: int[4]
output dot: int[4]
prodvec := map 1 (a <- avec, b <- bvec) { a * b }
dot := reduce 1 (a, b <- prodvec) 0 { a + b }

We define the interface of program by specifying input and output arrays. Input arrays have their values populated by an external harness while the output arrays must be computed using the program.

A map expression iterates over multiple arrays of the same element and produces a new vector using the function provided in the body. In the above example, the map expression multiplies the values of avec and bvec. map 1 states that the operation has a parallelism factor of 1 which means that the loop iterations are performed sequentially.

reduce expressions walk over memories and accumulate a result into a register. In the above code snippet, we add together all the elements of prodvec and place them in a register named dot. Since the reduce parallelism factor is also 1, the reduction is performed sequentially.

Run a MrXL Program

Once we have installed the mrxl command line tool, we can run MrXL programs using fud.

To provide MrXL program with input values, we use fud's JSON-based data format. Let's try to run this program, which has a parallelism factor of two:

input foo: int[4]
output baz: int[4]
baz := map 2 (a <- foo) { a + 5 }

In order to take advantage of the parallelism in the program, the MrXL compiler automatically partitions the input memory foo into two different physical banks: foo_b0 and foo_b1. Therefore, we split up our logical foo input of [1, 2, 3, 4] into [1,2] and [3,4]:

  "foo_b0": {
    "data": [
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
  "foo_b1": {
    "data": [
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32

Our complete data file similarly splits up the input for baz.

Run the program with the complete data by typing:

fud exec frontends/mrxl/test/add.mrxl \
    --from mrxl \
    --to vcd -s frontends/mrxl/test/

Compiling MrXL to Calyx

This guide will walk you through the steps to build a Python program that compiles MrXL programs to Calyx code. The guide assumes some basic familiarity with Calyx. Take a look at Calyx tutorial if you need a refresher.

To simplify things, we'll make a few assumptions about MrXL programs:

  • Every array in a MrXL program has the same length.
  • Every integer in our generated hardware will be 32 bits.
  • Every map and reduce body will be either a multiplication or addition of either an array element or an integer.

The following sections will outline these two high level tasks:

  1. Parse MrXL into a representation we can process with Python
  2. Generate Calyx code

You can find our complete implementation in the Calyx repository.

Parse MrXL into an AST

To start, we'll parse this MrXL program into a Python AST representation. We chose to represent AST nodes with Python dataclass. A program is a sequence of array declarations followed by computation statements:

class Prog:
    decls: List[Decl] # Memory declarations
    stmts: List[Stmt] # Map and reduce statements

Decl nodes correspond to array declarations like input avec: int[1024], and carry data about whether they're an input or output array, their name, and their type:

class Decl:
    input: bool  # Otherwise, output.
    name: str
    type: Type

Stmt nodes represent statements such as dot := reduce 4 (a, b <- prodvec) 0 { a + b }, and contain more nested nodes representing their function header and body, and type of operation.

class Stmt:
    dest: str
    op: Union[Map, Reduce]

The complete AST defines the remaining AST nodes required to represent a MrXL program.

Generate Calyx Code

The skeleton of a Calyx program has three sections, and looks like this:

component main() -> {
  cells {}
  wires {}
  control {}

The cells section instantiates hardware units like adders, memories and registers. The wires section contains groups that connect together hardware instances to perform some logical task such as incrementing a specific register. Finally, the control section schedules the execution of groups using control operators such as seq, par, and while.

We perform syntax-directed compilation by walking over nodes in the above AST and generating cells, wires, and control operations.

Calyx Embedded DSL

To make it easy to generate the hardware, we'll use Calyx's builder module in Python:

import calyx.builder as cb

prog = cb.Builder() # A Calyx program
main = prog.component("main") # Create a component named "main"

Decl nodes

Decl nodes instantiate new memories and registers. We need these to be instantiated in the cells section of our Calyx output. We use Calyx's std_reg and std_mem_d1 primitives to represent registers and memories:

import "primitives/core.futil"; // Import standard library

component main() -> () {
  cells {
    // A memory with 4 32-bit elements. Indexed using a 6-bit value.
    foo = std_mem_d1(32, 4, 6);
    // A register that contains a 32-bit value
    r = std_reg(32);

For each Decl node, we need to determine if we're instantiating a memory or a register, and then translate that to a corresponding Calyx declaration and place that inside the cells section of our generated program.

If a memory is used in a parallel map or reduce, we might need to create different physical banks for it. We define a function to walk over the AST and compute the parallelism factor for each memory:

    # Collect banking factors.
    par_factor = compute_par_factors(prog.stmts)

Using this information, we can instantiate registers and memories for our inputs and outputs:

            for i in range(par):
                main.mem_d1(f"{name}_b{i}", 32, arr_size // par, 32, is_external=True)

The main.mem_d1 call is a function defined by the Calyx builder module to instantiate memories for a component. By setting is_external=True, we're indicating that a memory declaration is a part of the program's input-output interface.

Compiling Map Operations

For every map or reduce node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation. For map operations, we'll perform a computation on an element of an input array, and then store the result in a result array. We can use Calyx's while loops to iterate over an input array, perform the map's computation, and store the final value. At a high level, we want to generate the following pieces of hardware:

  1. A register to store the current value of the loop index.
  2. A comparator to check of the loop index is less than the array size.
  3. An adder to increment the value of the index.
  4. Hardware needed to implement the loop body computation.

Loop Condition

We define a combinational group to perform the comparison idx < arr_size that uses an lt cell.

    group_name = f"cond_{suffix}"
    cell = f"lt_{suffix}"
    lt = comp.cell(cell, Stdlib.op("lt", 32, signed=False))
    with comp.comb_group(group_name):
        lt.left = idx.out
        lt.right = cb.const(32, arr_size)

Index Increment

    group_name = f"incr_idx_{suffix}"
    adder = comp.add(f"incr_{suffix}", 32)
    with as incr:
        adder.left = idx.out
        adder.right = 1
        idx.in_ = adder.out
        idx.write_en = 1
        incr.done = idx.done

The loop index increment is implemented using a [group][lf-group] and an adder (adder). We provide the index's previous value and the constant 1 to the adder and write the adder's output into the register. Because we're performing a stateful update of the register, we must wait for the register to state that it's committed the write by setting the group's done condition to the register's done signal.

Body Computation

The final piece of the puzzle is the body's computation. The corresponding group indexes into the input memories:

        with"eval_body_{suffix}") as ev:
            # Index each array
            for bind in stmt.bind:
                # Map bindings have exactly one dest
                mem = comp.get_cell(f"{name2arr[bind.dest[0]]}")
                mem.addr0 = idx.out

Because the builder module is an embedded DSL, we can simply use Python's for loop to generate all the required assignments for indexing.

This code instantiates an adder or a multiplier depending on the computation needed using the expr_to_port helper function:

            # Provide inputs to the op
            op.left = expr_to_port(body.lhs)
            op.right = expr_to_port(body.rhs)

And writes the value from the operation into the output memory:

            out_mem = comp.get_cell(f"{dest}_b{bank}")
            out_mem.addr0 = idx.out
            out_mem.write_data = op.out
            # Multipliers are sequential so we need to manipulate go/done signals
            if body.op == "mul":
                op.go = 1
                out_mem.write_en = op.done
                out_mem.write_en = 1
            ev.done = out_mem.done

This final operation is complex because we must account for whether we're using an adder or a multiplier. Adders are combinational–they produce their output immediately–while multipliers are sequential and require multiple cycles to produce its output.

When using a mutliplier, we need to explicitly set its go signal to one and only write the output from the multiplier into the memory when its done signal is asserted. We do this by assigning the memory's write_en (write enable) signal to the multiplier's done signal. Finally, the group's computation is done when the memory write is committed.

Generating Control

Once we have generated the hardware needed for our computation, we can schedule its computation using control operators:

                CompPort(CompVar(port), "out"),

We generate a while loop that checks that the index is less than the array size. Then, it sequentially executes the computation for the body and increments the loop index.

Add Parallelization

MrXL allows you to parallelize your map and reduce operations. Let's revisit the map example from earlier:

input foo: int[4]
output baz: int[4]
baz := map 4 (a <- foo) { a + 5 }

The number 4 specifies that four copies of the loop bodies should be executed in parallel. Our implementation already creates memory banks to allow for parallel accesses. At a high-level, we can change the compilation for the map operation to produce n copies of the hardware we generate above and generate a control program that looks like this:

par {
  while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
  while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
  while le_b2.out with cond_b2 { seq { eval_body_b2; incr_idx_b2; } }
  while le_b3.out with cond_b3 { seq { eval_body_b3; incr_idx_b3; } }

The par operator executes all the loops in parallel. The full implementation shows the necessary code to accomplish this which simply creates an outer loop to generate distinct hardware for each copy of the loop.


Hopefully this should be enough to get you started with writing your own MrXL compiler. Some more follow-up tasks you could try if you're interested:

  • Read the code for compiling reduce statements and extend to support parallel reductions using reduction trees.
  • Implement code generation that allows memories that differ from one another in size.
  • Implement complex function body expressions. We only support binary operations with two operands, like a + 5.
  • Add a new filter operation to MrXL.