Building a Frontend for Calyx

This tutorial assumes that you have already worked through the Calyx tutorial. You won't get very much out of it by itself!

In the Calyx tutorial you wrote Calyx code by hand. This is (probably) a good way to build character, but it's no way to live. Indeed, Calyx was designed to be a compiler IL, and not a human-facing language.

In this tutorial, we're going to learn all about this by building a DSL-to-hardware compiler for a toy language that we wish to accelerate. We will compile the DSL to Calyx, and then let Calyx take us to hardware.

MrXL Overview

Meet MrXL, our toy DSL. MrXL lets you define arrays and registers and then perform map and reduce operations.

Example: sum of squares

Here's a MrXL program that squares and then sums the values of an input array:

input avec: int[4]
output sos: int
squares := map 1 (a <- avec) { a * a }
sos := reduce 1 (acc, i <- squares) 0 { acc + i }

This short program shows off all of MrXL's features, so let's pick it apart line by line:

We specify an array, avec, which will have four integers. The input keyword means that an external harness will populate the array.
We specify sos, a register. The output keyword means that we will populate sos in our program.
The map operation gets the values of avec and raises each to the second power. We stash the result in a new array, squares. The number 1 denotes a parallelism factor of 1, meaning that the operation is performed sequentially. We will improve this shortly.
The reduce operation walks over squares and accumulates the result into a register. The parallelism factor is again 1.

Running our example

Installing MrXL

If you are going through this tutorial in the Docker container, you can skip these installation steps and jump to Running MrXL just below.

First, install the builder library by typing the following command from the repository root:

cd calyx-py && flit install -s && cd -

Next, install the mrxl binary:

cd frontends/mrxl && flit install -s && cd -

Running MrXL

Run:

mrxl frontends/mrxl/test/sos.mrxl --data frontends/mrxl/test/sos.mrxl.data --interpret

Why 42? Because we populated avec with:

{
  "avec": [
    0,
    1,
    4,
    5
  ]
}

and \( 0^2 + 1^2 + 4^2 + 5^2 = 42 \).

Compiling our example into Calyx

Above, we merely interpreted MrXL code in software using a simple, pre-written interpreter implemented in Python. Our goal in this tutorial is to build a compiler from MrXL to hardware by translating it to the Calyx IL. The Calyx IL code generated by compiling this program looks more like:

Click to expand 103 lines.

import "primitives/core.futil";
import "primitives/binary_operators.futil";
component main() -> () {
  cells {
    @external avec_b0 = comb_mem_d1(32, 4, 32);
    @external sos = comb_mem_d1(32, 1, 32);
    sos_reg = std_reg(32);
    squares_b0 = comb_mem_d1(32, 4, 32);
    idx_b0_0 = std_reg(32);
    incr_b0_0 = std_add(32);
    lt_b0_0 = std_lt(32);
    mul_b0_0 = std_mult_pipe(32);
    idx1 = std_reg(32);
    incr_1 = std_add(32);
    lt_1 = std_lt(32);
    add_1 = std_add(32);
  }
  wires {
    group sos_reg2mem {
      sos.addr0 = 32'd0;
      sos.write_data = sos_reg.out;
      sos.write_en = 1'd1;
      sos_reg2mem[done] = sos.done;
    }
    group incr_idx_b0_0 {
      incr_b0_0.left = idx_b0_0.out;
      incr_b0_0.right = 32'd1;
      idx_b0_0.in = incr_b0_0.out;
      idx_b0_0.write_en = 1'd1;
      incr_idx_b0_0[done] = idx_b0_0.done;
    }
    comb group cond_b0_0 {
      lt_b0_0.left = idx_b0_0.out;
      lt_b0_0.right = 32'd4;
    }
    group eval_body_b0_0 {
      avec_b0.addr0 = idx_b0_0.out;
      mul_b0_0.left = avec_b0.read_data;
      mul_b0_0.right = avec_b0.read_data;
      squares_b0.addr0 = idx_b0_0.out;
      squares_b0.write_data = mul_b0_0.out;
      mul_b0_0.go = 1'd1;
      squares_b0.write_en = mul_b0_0.done;
      eval_body_b0_0[done] = squares_b0.done;
    }
    group init_idx_1 {
      idx1.in = 32'd0;
      idx1.write_en = 1'd1;
      init_idx_1[done] = idx1.done;
    }
    group incr_idx_1 {
      incr_1.left = idx1.out;
      incr_1.right = 32'd1;
      idx1.in = incr_1.out;
      idx1.write_en = 1'd1;
      incr_idx_1[done] = idx1.done;
    }
    comb group cond_1 {
      lt_1.left = idx1.out;
      lt_1.right = 32'd4;
    }
    group init_1 {
      sos_reg.in = 32'd0;
      sos_reg.write_en = 1'd1;
      init_1[done] = sos_reg.done;
    }
    group reduce1 {
      squares_b0.addr0 = idx1.out;
      add_1.left = sos_reg.out;
      add_1.right = squares_b0.read_data;
      sos_reg.in = add_1.out;
      sos_reg.write_en = 1'd1;
      reduce1[done] = sos_reg.done;
    }
  }
  control {
    seq {
      par {
        while lt_b0_0.out with cond_b0_0 {
          seq {
            eval_body_b0_0;
            incr_idx_b0_0;
          }
        }
      }
      seq {
        par {
          init_1;
          init_idx_1;
        }
        while lt_1.out with cond_1 {
          seq {
            reduce1;
            incr_idx_1;
          }
        }
      }
      par {
        sos_reg2mem;
      }
    }
  }
}

Generate it for yourself! Run:

mrxl frontends/mrxl/test/sos.mrxl

Simulating our example with Verilog

Finally, let us go the whole hog: we compile our MrXL program to Calyx, compile it to Verilog, then simulate it using Verilator.

Run:

mrxl --convert --data frontends/mrxl/test/sos.mrxl.data frontends/mrxl/test/sos.mrxl > sos.data
fud2 --from mrxl frontends/mrxl/test/sos.mrxl --to dat --through verilator -s sim.data=sos.data

The above commands take a MrXL program, sos.mrxl, and generates results with Verilator using the MrXL data file sos.mrxl.data.

Compiling MrXL into Calyx

Calyx is an infrastructure for designing domain-specific languages (DSL) which can generate efficient hardware. The rest of the tutorial will show you how to implement such a DSL by studying the MrXL-to-Calyx compiler, written in Python.

We have placed a few simplifying restrictions on MrXL programs:

Every array in a MrXL program has the same length.
Every integer in our generated hardware is 32 bits long.
The bodies of map and reduce operations must be binary + or * operations involving array elements or integers.
If repeated map/reduce operations are performed on the same array, each of those operations must have the same parallelism factor.
All reduce operations must be performed sequentially, i.e., with parallelism factor 1.

These restrictions can be lifted or relaxed via commensurate changes to the compiler.

The compilation process breaks into two steps:

Parsing MrXL into a representation we can process in Python.
Generating Calyx code.

Parsing MrXL into an AST

To start, we'll parse the MrXL program into a Python AST representation. We choose to represent AST nodes with Python dataclasses. A program is a sequence of array/register declarations followed by computation statements:

@dataclass
class Prog:
    """A MrXL program."""
    decls: List[Decl]  # Memory declarations
    stmts: List[Stmt]  # Map and reduce statements

Decl nodes correspond to array declarations such as input avec: int[4]. They carry information about whether the array is an input or output, the array's name, and the type of the array's elements:

@dataclass
class Decl:
    """Declaration of a memory."""
    input: bool  # If `False`, this is an `output`.
    name: str
    type: Type

Stmt nodes represent statements such as sos := reduce 1 (acc, i <- squares) 0 { acc + i }. They contain further nested nodes representing the function-header and -body, and the type of operation:

@dataclass
class Stmt:
    """A statement in the program."""
    dst: str
    operation: Union[Map, Reduce]

We elide further details, but point you to the AST, which defines all the nodes we need to represent a MrXL program.

Generating Calyx code

As you know, the skeleton of a Calyx program has three sections:

component main() -> {
  cells {}
  wires {}
  control {}
}

The cells section instantiates hardware units like adders, memories and registers. The wires section contains groups that connect hardware instances to perform some logical task (e.g, incrementing a register). Finally, the control section schedules the execution of groups using control operators such as seq, par, and while.

We perform syntax-directed compilation by walking over the nodes of the AST and generating cells, wires, and control operations.

An Embedded DSL that Generates Calyx

To make it easy to generate hardware, we'll use Calyx's builder module written in Python:

import calyx.builder as cb

prog = cb.Builder() # A Calyx program
main = prog.component("main") # Create a component named "main"

`Decl` nodes

Decl nodes instantiate new memories and registers. We need these to be instantiated in the cells section of our Calyx output. We use Calyx's std_reg and comb_mem_d1 primitives to represent registers and memories:

import "primitives/core.futil"; // Import standard library

component main() -> () {
  cells {
    // A memory with 4 32-bit elements. Indexed using a 6-bit value.
    avec = comb_mem_d1(32, 4, 6);
    // A register that contains a 32-bit value
    sos = std_reg(32);
  }
  ...
}

For each Decl node, we need to determine if we're instantiating a memory or a register, translate the node into a corresponding Calyx declaration, and place the declaration inside the cells section of our generated program.

If a memory is used in a parallel map or reduce, we might need to create different physical banks for it. We explain why below. We define a function to walk over the AST and compute the parallelism factor for each memory:

    # Collect banking factors.
    par_factor = compute_par_factors(prog.stmts)

Using this information, we can instantiate registers and memories for our inputs and outputs:

            for i in range(par):
                main.comb_mem_d1(
                    f"{name}_b{i}", 32, arr_size // par, 32, is_external=True
                )

The main.mem_d1 call is a function defined by the Calyx builder module to instantiate memories for a component. By setting is_external=True, we're indicating that a memory declaration is a part of the component's input-output interface.

Compiling `map` operations

For every map or reduce node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation. For map operations, we'll perform a computation on every element of an input array, and then store the answers in a result array. We can use Calyx's while loops to do this. At a high level, we want to generate the following pieces of hardware:

A register to store the current value of the loop index.
A comparator to check of the loop index is less than the array size.
An adder to increment the value of the index.
Whatever hardware is needed to implement the loop body computation.

We have implemented exactly this, and you have been using it thus far with the fud2 invocations that we have provided you.

However, it is time to get your hands dirty. We provide a stub implementation of map in gen_calyx.py:

def my_map_impl(
    comp: cb.ComponentBuilder,
    dest: str,
    stmt: ast.Map,
    arr_size: int,
    bank_factor: int,
    s_idx: int,
):
    """
    Returns a dictionary containing Calyx cells, wires and
    control needed to implement a `map` statement.
    See gen_stmt_impl for format of the dictionary.

    Generates these groups:
      - a group that implements the body of the `map` statement
      - a group that increments an index to access the `map` input array
      - a group that implements the loop condition, checking if the index
        has reached the end of the input array
    """
    # TODO: Implement map!
    return Empty()

You are invited to try implementing map yourself according to the outline given in the description by filling in the body of this function.

To run mrxl with my_map_impl instead of our implementation, pass the --my-map flag:

fud2 --from mrxl test/sos.mrxl \
     --to dat --through verilator \
     -s mrxl.flags="--my-map "  \
     -s sim.data=sos.data

If you are feeling good about your implementation, skip to the next section! If you'd like to read through the details of our implementation – or build yours in tandem – continue on with the rest of this section.

Loop condition

We define a combinational group to perform the comparison idx < arr_size that uses an lt cell:

    group_name = f"cond_{suffix}"
    cell = f"lt_{suffix}"
    less_than = comp.cell(cell, Stdlib.op("lt", 32, signed=False))
    with comp.comb_group(group_name):
        less_than.left = idx.out
        less_than.right = arr_size

Index increment

The loop index increment is implemented using a group and an adder:

    group_name = f"incr_idx_{suffix}"
    adder = comp.add(32)
    with comp.group(group_name) as incr:
        adder.left = idx.out
        adder.right = 1
        idx.in_ = adder.out
        idx.write_en = 1
        incr.done = idx.done

We provide the index's previous value and the constant 1 to adder, and write the adder's output into the register (idx). Because we're performing a stateful update of the register, we must wait for the register to state that it has committed the write. We do this by setting the group's done condition to track the register's done signal.

Body computation

The final piece of the puzzle is the body's computation. The corresponding group indexes into the input memories:

        with comp.group(f"eval_body_{suffix}") as evl:
            # Index each array
            for bind in stmt.binds:
                # Map bindings have exactly one dest
                mem = comp.get_cell(f"{name2arr[bind.dst[0]]}")
                mem.addr0 = idx.out

Because the builder module is an embedded DSL, we can simply use Python's for loop to generate all the required assignments for indexing.

This code instantiates an adder or a multiplier depending on the computation needed using the expr_to_port helper function:

        if body.operation == "mul":
            operation = comp.cell(
                f"mul_{suffix}", Stdlib.op("mult_pipe", 32, signed=False)
            )
        else:
            operation = comp.add(32, f"add_{suffix}")

and writes the value from the operation into the output memory:

            out_mem = comp.get_cell(f"{dest}_b{bank}")
            out_mem.addr0 = idx.out
            out_mem.write_data = operation.out
            # Multipliers are sequential so we need to manipulate go/done signals
            if body.operation == "mul":
                operation.go = 1
                out_mem.write_en = operation.done
            else:
                out_mem.write_en = 1
            evl.done = out_mem.done

This final operation is complicated because we must account for whether we're using an adder or a multiplier. Adders are combinational–they produce their output immediately–while multipliers are sequential and require multiple cycles to produce its output.

When using a mutliplier, we need to explicitly set its go signal to one and only write the output from the multiplier into the memory when its done signal is asserted. We do this by assigning the memory's write_en (write enable) signal to the multiplier's done signal. Finally, the group's computation is done when the memory write is committed.

Generating control

Once we have generated the hardware needed for our computation, we can schedule its computation using control operators:

            While(
                CompPort(CompVar(port), "out"),
                CompVar(cond),
                SeqComp(
                    [
                        Enable(f"eval_body_{suffix}"),
                        Enable(incr),
                    ]
                ),
            )

We generate a while loop that checks that the index is less than the array size. Then, it sequentially executes the computation for the body and increments the loop index.

Take a breather! You now understand the basics of compiling DSLs to Calyx. However, we're not done yet; we build hardware because we want things to be fast and efficient, so the next part will teach you how to parallelize MrXL programs.

Adding parallelization

MrXL allows us to parallelize our map operations.

Consider a variation on our prior example:

input avec: int[4]
output squares: int[4]
squares := map 2 (a <- avec) { a * a }

The noteworthy change is the parallelism factor of the map operation. The parallelism factor 2 specifies that two copies of the loop bodies should be executed in parallel.

Translating this into a hardware implementation has a couple of associated challenges:

Our memories (comb_mem_d1) are extremely primitive and do not support parallel accesses; a program may only read or write to one memory location every cycle. In order to support parallel accesses, we need to create multiple physical banks that represent one logical memory and contain distinct elements.
We need to generate multiple copies of the hardware we generated above because, again, adders and multipliers are physical resources and can only support one computation at a time.

To produce the full Calyx program for the above example, run the following from the root of the Calyx repository:

fud2 --from mrxl --to calyx frontends/mrxl/test/squares.mrxl

Memory banking

There's a lot going on in the Calyx code, but the thing to focus on is this. To take advantage of the parallelism in the program, the MrXL compiler assumes that the input memory avec is split into two banks, avec_b0 and avec_b1. Look for these in the Calyx code.

Why do we do this? Memories in Calyx can only support one read/write per cycle of the clock, so if we keep avec around as one memory, our attempts at parallelization will be thwarted simply because we will be bottlenecked when accessing our data. Splitting avec into two banks allows us to read/write into two logical spots of avec in parallel.

Banking comes with an additional responsibility. When driving data to the memories, we can't just supply values for a memory avec. The memory avec exists logically in our minds, but Calyx now only knows about avec_b0 and avec_b1. We must drive data to these. Though nontrivial, this data-banking can also be handled automatically; all the necessary information is in the MrXL source program.

Parallel Control

Next, our compiler duplicates the computational resources, hooks them up to the right memories using groups, and generates a control program that looks like this:

par {
  while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
  while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
}

The par operator executes all the loops in parallel which use distinct computational resources. As specified by the language specification, conflicting resource usage is undefined behavior.

You can use fud2 to compile the MrXL program and run it with some data:

fud2 --from mrxl --to dat \
     --through verilator \
     -s sim.data=squares.data \
     frontends/mrxl/test/squares.mrxl

The complete implementation shows the necessary code to create physical memory banks and create an outer loop to generate distinct hardware for each copy of the loop.

Further Steps

Congratulations, you know about as much about MrXL as we do! The small size of the language makes it a nice sandbox for you to play in. We mentioned that the restrictions placed on the language can be lifted by beefing up the compiler, and here's your chance to give it a whirl!

Here are some of those restrictions again, along with pointers about how to lift them.

The bodies of map and reduce operations must be binary + or * operations involving array elements or integers.

Say you wanted to add subtraction and division to the mix. We have set you up for success: the MrXL parser already parses - and / into sub and div respectively. Now, in gen_calyx.py, you need to check for "sub" and "div" as possible binary operations, and then invoke the appropriate cell-builders of the builder library. For reference, see how the + and * operations are handled at present. For "fun", take a look at how Calyx implements multiplication, and how that maps to the existing invocation to create a 32-bit multiplication cell using the builder!
All reduce operations must be performed sequentially, i.e., with parallelism factor 1.

This is a big gap! One way to perform reductions in parallel is using reduction trees. To get you started, we provide a toy implementation using the builder here. That example is rather brittle: it takes exactly 16 inputs, banked into four arrays, and adds their values together. Try incorporating this brittle version into your MrXL-to-Calyx compiler at first, and you can later think about generalizing it to any (commutative) operation, memories of any length, and any parallelism factor.

If you'd like to read more about reduction trees, you can check out these slides from David Kirk and Wen-mei W. Hwu.
If repeated map/reduce operations are performed on the same array, each of those operations must have the same parallelism factor.

The heart of the issue is figuring out how to bank the underlying array. How do we bank it two different ways at the same time? The answer is to bank the array a little finer than you think, and to then use arbitration logic to provide two fictional banking setups at the same time. We provide a toy implementation using the builder here. There, a 24-cell array has been split into six banks, but then we allow the user to pretend, simultaneously, that it is split into two banks or three banks.

Aside: supplying data to MrXL programs

You may have noticed that the data files that we pass to MrXL programs are lighter-weight than those we pass to Calyx programs. They are lighter in two ways.

Boilerplate

Calyx requires cells to be allocated for the output cells. Instead of asking the user to supply zeroed-out arrays and registers for output cells, we can infer the need for these output cells from the source code. We can then add these on to the MrXL-native data files to make them amenable to Calyx.

Additionally, because MrXL supports only a few kinds of data, some interesting parameters of Calyx-native data files turn into "just boilerplate" in MrXL-native data. We can keep MrXL-native data files relatively light and add this format information automatically.

Example: `squares.mrxl.data`

Let us look at these changes in practice.

We write avec's values as a straightforward array:

{
	"avec": [
			0,
			1,
			4,
			5
		]
}

but under the hood, the Calyx IL version of squares.mrxl comes to expect something of the form:

{
    "avec_b0": {
        "data": [
            0,
            1
        ],
        "format": {
            "is_signed": false,
            "numeric_type": "bitnum",
            "width": 32
        }
    },
    "avec_b1": {
        "data": [
            4,
            5
        ],
        "format": {
            "is_signed": false,
            "numeric_type": "bitnum",
            "width": 32
        }
    },
    "squares_b0": {
        "data": [
            0,
            0
        ],
        "format": {
            "is_signed": false,
            "numeric_type": "bitnum",
            "width": 32
        }
    },
    "squares_b1": {
        "data": [
            0,
            0
        ],
        "format": {
            "is_signed": false,
            "numeric_type": "bitnum",
            "width": 32
        }
    }
}

And we can generate this automatically. Try it yourself:

mrxl frontends/mrxl/test/squares.mrxl --data frontends/mrxl/test/squares.mrxl.data --convert

This transformation is achieved using a fud2 pass that converts MrXL-native data files into Calyx-native data files.

Keyboard shortcuts

Calyx Documentation