# Building a Frontend for Calyx
This tutorial assumes that you have already worked through the Calyx tutorial. You won't get very much out of it by itself!
In the Calyx tutorial you wrote Calyx code by hand. This is (probably) a good way to build character, but it's no way to live. Indeed, Calyx was designed to be a compiler IL, and not a human-facing language.
In this tutorial, we're going to learn all about this by building a DSL-to-hardware compiler for a toy language that we wish to accelerate. We will compile the DSL to Calyx, and then let Calyx take us to hardware.
## MrXL Overview
Meet MrXL, our toy DSL.
MrXL lets you define arrays and registers and then perform `map` and `reduce` operations.
### Example: sum of squares
Here's a MrXL program that squares and then sums the values of an input array:
```
input avec: int[4]
output sos: int
squares := map 1 (a <- avec) { a * a }
sos := reduce 1 (acc, i <- squares) 0 { acc + i }
```
This short program shows off all of MrXL's features, so let's pick it apart line by line:
- We specify an array, `avec`, which will have four integers. The `input` keyword means that an external harness will populate the array.
- We specify `sos`, a register. The `output` keyword means that we will populate `sos` in our program.
- The `map` operation gets the values of `avec` and raises each to the second power. We stash the result in a new array, `squares`. The number `1` denotes a parallelism factor of 1, meaning that the operation is performed sequentially. We will improve this shortly.
- The `reduce` operation walks over `squares` and accumulates the result into a register. The parallelism factor is again `1`.
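Before we build hardware for programs like this, it helps to pin down what they compute. Here is the same program as a minimal sketch in plain Python (our own illustration, not MrXL's actual interpreter):

```python
from functools import reduce

avec = [0, 1, 4, 5]  # the `input` array; an external harness populates it

# `squares := map 1 (a <- avec) { a * a }`
squares = [a * a for a in avec]

# `sos := reduce 1 (acc, i <- squares) 0 { acc + i }`
sos = reduce(lambda acc, i: acc + i, squares, 0)

print(sos)  # 42
```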
## Running our example
### Installing MrXL
If you are going through this tutorial in the Docker container, you can skip these installation steps and jump to Running MrXL just below.
First, install the `builder` library by typing the following command from the repository root:
```
cd calyx-py && flit install -s && cd -
```
Next, install the `mrxl` binary:
```
cd frontends/mrxl && flit install -s && cd -
```
Register the `mrxl` binary with `fud`:
```
fud register mrxl -p frontends/mrxl/fud/mrxl.py
```
Now, running `fud check` should report that the `mrxl` binary is correctly installed.
### Running MrXL
Run:
```
mrxl frontends/mrxl/test/sos.mrxl --data frontends/mrxl/test/sos.mrxl.data --interpret
```
This prints `42`. Why `42`? Because we populated `avec` with:
```json
{
  "avec": [
    0,
    1,
    4,
    5
  ]
}
```
and \( 0^2 + 1^2 + 4^2 + 5^2 = 42 \).
## Compiling our example into Calyx
Above, we merely interpreted MrXL code in software, using a simple, pre-written interpreter implemented in Python. Our goal in this tutorial is to build a compiler from MrXL to hardware by translating it to the Calyx IL. The Calyx IL code generated by compiling this program looks like this:
```
import "primitives/core.futil";
import "primitives/binary_operators.futil";
component main() -> () {
  cells {
    @external avec_b0 = comb_mem_d1(32, 4, 32);
    @external sos = comb_mem_d1(32, 1, 32);
    sos_reg = std_reg(32);
    squares_b0 = comb_mem_d1(32, 4, 32);
    idx_b0_0 = std_reg(32);
    incr_b0_0 = std_add(32);
    lt_b0_0 = std_lt(32);
    mul_b0_0 = std_mult_pipe(32);
    idx1 = std_reg(32);
    incr_1 = std_add(32);
    lt_1 = std_lt(32);
    add_1 = std_add(32);
  }
  wires {
    group sos_reg2mem {
      sos.addr0 = 32'd0;
      sos.write_data = sos_reg.out;
      sos.write_en = 1'd1;
      sos_reg2mem[done] = sos.done;
    }
    group incr_idx_b0_0 {
      incr_b0_0.left = idx_b0_0.out;
      incr_b0_0.right = 32'd1;
      idx_b0_0.in = incr_b0_0.out;
      idx_b0_0.write_en = 1'd1;
      incr_idx_b0_0[done] = idx_b0_0.done;
    }
    comb group cond_b0_0 {
      lt_b0_0.left = idx_b0_0.out;
      lt_b0_0.right = 32'd4;
    }
    group eval_body_b0_0 {
      avec_b0.addr0 = idx_b0_0.out;
      mul_b0_0.left = avec_b0.read_data;
      mul_b0_0.right = avec_b0.read_data;
      squares_b0.addr0 = idx_b0_0.out;
      squares_b0.write_data = mul_b0_0.out;
      mul_b0_0.go = 1'd1;
      squares_b0.write_en = mul_b0_0.done;
      eval_body_b0_0[done] = squares_b0.done;
    }
    group init_idx_1 {
      idx1.in = 32'd0;
      idx1.write_en = 1'd1;
      init_idx_1[done] = idx1.done;
    }
    group incr_idx_1 {
      incr_1.left = idx1.out;
      incr_1.right = 32'd1;
      idx1.in = incr_1.out;
      idx1.write_en = 1'd1;
      incr_idx_1[done] = idx1.done;
    }
    comb group cond_1 {
      lt_1.left = idx1.out;
      lt_1.right = 32'd4;
    }
    group init_1 {
      sos_reg.in = 32'd0;
      sos_reg.write_en = 1'd1;
      init_1[done] = sos_reg.done;
    }
    group reduce1 {
      squares_b0.addr0 = idx1.out;
      add_1.left = sos_reg.out;
      add_1.right = squares_b0.read_data;
      sos_reg.in = add_1.out;
      sos_reg.write_en = 1'd1;
      reduce1[done] = sos_reg.done;
    }
  }
  control {
    seq {
      par {
        while lt_b0_0.out with cond_b0_0 {
          seq {
            eval_body_b0_0;
            incr_idx_b0_0;
          }
        }
      }
      seq {
        par {
          init_1;
          init_idx_1;
        }
        while lt_1.out with cond_1 {
          seq {
            reduce1;
            incr_idx_1;
          }
        }
      }
      par {
        sos_reg2mem;
      }
    }
  }
}
```
Generate it for yourself! Run:
```
mrxl frontends/mrxl/test/sos.mrxl
```
### Simulating our example with Verilog
Finally, let us go the whole hog: we compile our MrXL program to Calyx, compile the Calyx to Verilog, and then simulate the Verilog using Verilator.
Run:
```
fud e --from mrxl frontends/mrxl/test/sos.mrxl --to dat --through verilog -s mrxl.data frontends/mrxl/test/sos.mrxl.data
```
The above command takes a MrXL program, `sos.mrxl`, and generates results with Verilator using the MrXL data file `sos.mrxl.data`.
## Compiling MrXL into Calyx
Calyx is an infrastructure for designing domain-specific languages (DSLs) that generate efficient hardware. The rest of the tutorial will show you how to implement such a DSL by studying the MrXL-to-Calyx compiler, which is written in Python.
We have placed a few simplifying restrictions on MrXL programs:
- Every array in a MrXL program has the same length.
- Every integer in our generated hardware is 32 bits long.
- The bodies of `map` and `reduce` operations must be binary `+` or `*` operations involving array elements or integers.
- If repeated `map`/`reduce` operations are performed on the same array, each of those operations must have the same parallelism factor.
- All `reduce` operations must be performed sequentially, i.e., with parallelism factor `1`.
These restrictions can be lifted or relaxed via commensurate changes to the compiler.
The compilation process breaks into two steps:
- Parsing MrXL into a representation we can process in Python.
- Generating Calyx code.
### Parsing MrXL into an AST
To start, we'll parse the MrXL program into a Python AST representation. We choose to represent AST nodes with Python dataclasses. A program is a sequence of array/register declarations followed by computation statements:
```python
@dataclass
class Prog:
    """A MrXL program."""

    decls: List[Decl]  # Memory declarations
    stmts: List[Stmt]  # Map and reduce statements
```
`Decl` nodes correspond to array declarations such as `input avec: int[4]`. They carry information about whether the array is an `input` or an `output`, the array's name, and the type of the array's elements:
```python
@dataclass
class Decl:
    """Declaration of a memory."""

    input: bool  # If `False`, this is an `output`.
    name: str
    type: Type
```
`Stmt` nodes represent statements such as `sos := reduce 1 (acc, i <- squares) 0 { acc + i }`. They contain further nested nodes representing the function header, the function body, and the type of operation:
```python
@dataclass
class Stmt:
    """A statement in the program."""

    dst: str
    operation: Union[Map, Reduce]
```
We elide further details, but point you to the AST, which defines all the nodes we need to represent a MrXL program.
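To make this concrete, here is how the sum-of-squares program might look once parsed. The `Type`, `Map`, and `Reduce` node shapes below are illustrative guesses to keep the sketch self-contained; only `Prog`, `Decl`, and `Stmt` are quoted from the real AST, which nests expression bodies further:

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Type:
    """Illustrative stand-in for MrXL's type node."""
    base: str            # e.g. "int"
    size: Optional[int]  # array length; None for a register


@dataclass
class Decl:
    input: bool  # If `False`, this is an `output`.
    name: str
    type: Type


@dataclass
class Map:
    par: int   # parallelism factor
    bind: str  # bound variable, as in `a <- avec`
    src: str   # source array
    body: str  # body expression (the real AST nests this further)


@dataclass
class Reduce:
    par: int
    bind: str  # accumulator and element, as in `acc, i`
    src: str
    init: int  # initial accumulator value
    body: str


@dataclass
class Stmt:
    dst: str
    operation: Union[Map, Reduce]


@dataclass
class Prog:
    decls: List[Decl]
    stmts: List[Stmt]


# The sum-of-squares program from earlier, as an AST:
sos_prog = Prog(
    decls=[
        Decl(input=True, name="avec", type=Type("int", 4)),
        Decl(input=False, name="sos", type=Type("int", None)),
    ],
    stmts=[
        Stmt("squares", Map(1, "a", "avec", "a * a")),
        Stmt("sos", Reduce(1, "acc, i", "squares", 0, "acc + i")),
    ],
)
```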
### Generating Calyx code
As you know, the skeleton of a Calyx program has three sections:
```
component main() -> () {
  cells {}
  wires {}
  control {}
}
```
The `cells` section instantiates hardware units like adders, memories, and registers. The `wires` section contains groups that connect hardware instances to perform some logical task (e.g., incrementing a register). Finally, the `control` section schedules the execution of groups using control operators such as `seq`, `par`, and `while`.
We perform syntax-directed compilation by walking over the nodes of the AST and generating `cells`, `wires`, and `control` operations.
### An Embedded DSL that Generates Calyx
To make it easy to generate hardware, we'll use Calyx's `builder` module, which is written in Python:
```python
import calyx.builder as cb

prog = cb.Builder()  # A Calyx program
main = prog.component("main")  # Create a component named "main"
```
#### `Decl` nodes
`Decl` nodes instantiate new memories and registers. These must be instantiated in the `cells` section of our Calyx output. We use Calyx's `std_reg` and `comb_mem_d1` primitives to represent registers and memories:
```
import "primitives/core.futil"; // Import standard library
component main() -> () {
  cells {
    // A memory with 4 32-bit elements. Indexed using a 6-bit value.
    avec = comb_mem_d1(32, 4, 6);
    // A register that contains a 32-bit value
    sos = std_reg(32);
  }
  ...
}
```
For each `Decl` node, we need to determine whether we're instantiating a memory or a register, translate the node into a corresponding Calyx declaration, and place the declaration inside the `cells` section of our generated program.
If a memory is used in a parallel `map` or `reduce`, we might need to create different physical banks for it. We explain why below.
We define a function to walk over the AST and compute the parallelism factor for each memory:
```python
# Collect banking factors.
par_factor = compute_par_factors(prog.stmts)
```
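The idea behind such a function can be sketched in plain Python. The statement shape below is simplified to `(dst, src, par)` tuples rather than the real AST nodes, so this is an illustration of the approach, not the compiler's actual code:

```python
def compute_par_factors(stmts):
    """Map each memory name to its banking (parallelism) factor.

    `stmts` is simplified here to (dst, src, par) tuples; the real
    compiler walks `Stmt` AST nodes instead.
    """
    factors = {}
    for dst, src, par in stmts:
        for mem in (dst, src):
            if factors.setdefault(mem, par) != par:
                # MrXL's restriction: repeated operations on the same
                # array must agree on their parallelism factor.
                raise ValueError(f"conflicting banking factors for {mem}")
    return factors


# For `squares := map 2 (a <- avec) { a * a }`:
print(compute_par_factors([("squares", "avec", 2)]))  # {'squares': 2, 'avec': 2}
```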
Using this information, we can instantiate registers and memories for our inputs and outputs:
```python
for i in range(par):
    main.comb_mem_d1(
        f"{name}_b{i}", 32, arr_size // par, 32, is_external=True
    )
```
The `main.comb_mem_d1` call is a function defined by the Calyx builder module to instantiate memories for a component. By setting `is_external=True`, we indicate that a memory declaration is part of the component's input-output interface.
### Compiling `map` operations
For every `map` or `reduce` node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation.
For `map` operations, we'll perform a computation on every element of an input array, and then store the answers in a result array.
We can use Calyx's while loops to do this.
At a high level, we want to generate the following pieces of hardware:
- A register to store the current value of the loop index.
- A comparator to check if the loop index is less than the array size.
- An adder to increment the value of the index.
- Whatever hardware is needed to implement the loop body computation.
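Before wiring these up in Calyx, it can help to model them in software. The sketch below is a behavioral model of our own (plain Python, not builder code), in which the four pieces of hardware are played by an index variable, a comparison, an increment, and a body function:

```python
def map_loop_model(src, body):
    """Model the while-loop hardware: an index register, an `lt`
    comparator, an adder for the increment, and the loop body."""
    dest = [0] * len(src)
    idx = 0                         # the index register
    while idx < len(src):           # the `lt` comparator drives the loop
        dest[idx] = body(src[idx])  # eval_body: read, compute, write
        idx = idx + 1               # the adder increments the index
    return dest


print(map_loop_model([0, 1, 4, 5], lambda a: a * a))  # [0, 1, 16, 25]
```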
We have implemented exactly this, and you have been using it thus far via the `fud` invocations that we have provided. However, it is time to get your hands dirty.
We provide a stub implementation of `map` in `gen_calyx.py`:
```python
def my_map_impl(
    comp: cb.ComponentBuilder,
    dest: str,
    stmt: ast.Map,
    arr_size: int,
    bank_factor: int,
    s_idx: int,
):
    """
    Returns a dictionary containing Calyx cells, wires and
    control needed to implement a `map` statement.
    See gen_stmt_impl for format of the dictionary.

    Generates these groups:
    - a group that implements the body of the `map` statement
    - a group that increments an index to access the `map` input array
    - a group that implements the loop condition, checking if the index
      has reached the end of the input array
    """
    # TODO: Implement map!
    return Empty()
```
You are invited to try implementing `map` yourself, following the outline given in the docstring, by filling in the body of this function.
To run `mrxl` with `my_map_impl` instead of our implementation, pass the `--my-map` flag:
```
fud e --from mrxl test/sos.mrxl \
    --to dat --through verilog \
    -s mrxl.flags "--my-map " \
    -s mrxl.data test/sos.mrxl.data
```
If you are feeling good about your implementation, skip to the next section! If you'd like to read through the details of our implementation, or build yours alongside it, continue on with the rest of this section.
#### Loop condition
We define a combinational group that performs the comparison `idx < arr_size` using an `lt` cell:
```python
group_name = f"cond_{suffix}"
cell = f"lt_{suffix}"
less_than = comp.cell(cell, Stdlib.op("lt", 32, signed=False))
with comp.comb_group(group_name):
    less_than.left = idx.out
    less_than.right = arr_size
```
#### Index increment
The loop index increment is implemented using a group and an adder:
```python
group_name = f"incr_idx_{suffix}"
adder = comp.add(32)
with comp.group(group_name) as incr:
    adder.left = idx.out
    adder.right = 1
    idx.in_ = adder.out
    idx.write_en = 1
    incr.done = idx.done
```
We provide the index's previous value and the constant `1` to `adder`, and write the adder's output into the register (`idx`). Because we're performing a stateful update of the register, we must wait for the register to state that it has committed the write. We do this by setting the group's `done` condition to track the register's `done` signal.
#### Body computation
The final piece of the puzzle is the body's computation. The corresponding group indexes into the input memories:
```python
with comp.group(f"eval_body_{suffix}") as evl:
    # Index each array
    for bind in stmt.binds:
        # Map bindings have exactly one dest
        mem = comp.get_cell(f"{name2arr[bind.dst[0]]}")
        mem.addr0 = idx.out
```
Because the builder module is an embedded DSL, we can simply use Python's `for` loop to generate all the required assignments for indexing.
This code instantiates an adder or a multiplier, depending on the computation needed, using the `expr_to_port` helper function:
```python
if body.operation == "mul":
    operation = comp.cell(
        f"mul_{suffix}", Stdlib.op("mult_pipe", 32, signed=False)
    )
else:
    operation = comp.add(32, f"add_{suffix}")
```
and writes the value from the operation into the output memory:
```python
out_mem = comp.get_cell(f"{dest}_b{bank}")
out_mem.addr0 = idx.out
out_mem.write_data = operation.out
# Multipliers are sequential, so we need to manipulate go/done signals
if body.operation == "mul":
    operation.go = 1
    out_mem.write_en = operation.done
else:
    out_mem.write_en = 1
evl.done = out_mem.done
```
This final operation is complicated because we must account for whether we're using an adder or a multiplier. Adders are combinational: they produce their output immediately. Multipliers are sequential and require multiple cycles to produce their output.
When using a multiplier, we need to explicitly set its `go` signal to one, and we must only write the multiplier's output into the memory once its `done` signal is asserted. We do this by assigning the memory's `write_en` (write enable) signal to the multiplier's `done` signal. Finally, the group's computation is done when the memory write is committed.
#### Generating control
Once we have generated the hardware needed for our computation, we can schedule it using control operators:
```python
While(
    CompPort(CompVar(port), "out"),
    CompVar(cond),
    SeqComp(
        [
            Enable(f"eval_body_{suffix}"),
            Enable(incr),
        ]
    ),
)
```
We generate a while loop that checks that the index is less than the array size. Then, it sequentially executes the computation for the body and increments the loop index.
Take a breather! You now understand the basics of compiling DSLs to Calyx. However, we're not done yet; we build hardware because we want things to be fast and efficient, so the next part will teach you how to parallelize MrXL programs.
## Adding parallelization
MrXL allows us to parallelize our `map` operations. Consider a variation on our prior example:
```
input avec: int[4]
output squares: int[4]
squares := map 2 (a <- avec) { a * a }
```
The noteworthy change is the parallelism factor of the `map` operation. The parallelism factor `2` specifies that two copies of the loop body should be executed in parallel.
Translating this into a hardware implementation has a couple of associated challenges:
- Our memories (`comb_mem_d1`) are extremely primitive and do not support parallel accesses; a program may only read or write to one memory location every cycle. In order to support parallel accesses, we need to create multiple physical banks that represent one logical memory and contain distinct elements.
- We need to generate multiple copies of the hardware we generated above because, again, adders and multipliers are physical resources and can only support one computation at a time.
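Concretely, with a parallelism factor of 2 and a 4-element array, each bank holds a contiguous chunk and each copy of the loop hardware walks its own bank. Here is a software sketch of that split; the contiguous-chunk banking policy is an assumption on our part that matches the banked data files shown later:

```python
def bank(data, factor):
    """Split one logical array into `factor` physical banks of
    contiguous chunks (avec -> avec_b0, avec_b1, ...)."""
    chunk = len(data) // factor
    return [data[b * chunk : (b + 1) * chunk] for b in range(factor)]


banks = bank([0, 1, 4, 5], 2)
print(banks)  # [[0, 1], [4, 5]]

# Each bank's loop runs independently; concatenating the per-bank
# results recovers the sequential map.
squares_banks = [[a * a for a in b] for b in banks]
assert [x for b in squares_banks for x in b] == [a * a for a in [0, 1, 4, 5]]
```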
To produce the full Calyx program for the above example, run the following from the root of the Calyx repository:
```
fud e --from mrxl --to calyx frontends/mrxl/test/squares.mrxl
```
### Memory banking
There's a lot going on in the Calyx code, but the thing to focus on is this: to take advantage of the parallelism in the program, the MrXL compiler assumes that the input memory `avec` is split into two banks, `avec_b0` and `avec_b1`. Look for these in the Calyx code.
Why do we do this? Memories in Calyx can only support one read/write per clock cycle, so if we keep `avec` around as one memory, our attempts at parallelization will be thwarted: we will be bottlenecked when accessing our data. Splitting `avec` into two banks allows us to read/write two logical spots of `avec` in parallel.
Banking comes with an additional responsibility. When driving data to the memories, we can't just supply values for a memory `avec`. The memory `avec` exists only logically, in our minds; Calyx now knows only about `avec_b0` and `avec_b1`, and we must drive data to these.
Though nontrivial, this data-banking can also be handled automatically; all the necessary information is in the MrXL source program.
### Parallel Control
Next, our compiler duplicates the computational resources, hooks them up to the right memories using groups, and generates a control program that looks like this:
```
par {
  while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
  while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
}
```
The `par` operator executes all the loops in parallel; the loops use distinct computational resources. Per the language specification, conflicting resource usage is undefined behavior.
You can use `fud` to compile the MrXL program and run it with some data:
```
fud e --from mrxl --to dat \
    --through verilog \
    -s mrxl.data frontends/mrxl/test/squares.mrxl.data \
    frontends/mrxl/test/squares.mrxl
```
The complete implementation shows the necessary code to create physical memory banks, plus an outer loop that generates distinct hardware for each copy of the loop.
## Further Steps
Congratulations, you know about as much about MrXL as we do! The small size of the language makes it a nice sandbox for you to play in. We mentioned that the restrictions placed on the language can be lifted by beefing up the compiler, and here's your chance to give it a whirl!
Here are some of those restrictions again, along with pointers about how to lift them.
- The bodies of `map` and `reduce` operations must be binary `+` or `*` operations involving array elements or integers.

  Say you wanted to add subtraction and division to the mix. We have set you up for success: the MrXL parser already parses `-` and `/` into `sub` and `div` respectively. Now, in `gen_calyx.py`, you need to check for `"sub"` and `"div"` as possible binary operations, and then invoke the appropriate cell-builders of the `builder` library. For reference, see how the `+` and `*` operations are handled at present. For "fun", take a look at how Calyx implements multiplication, and how that maps to the existing invocation to create a 32-bit multiplication cell using the `builder`!

- All `reduce` operations must be performed sequentially, i.e., with parallelism factor `1`.

  This is a big gap! One way to perform reductions in parallel is using reduction trees. To get you started, we provide a toy implementation using the `builder` here. That example is rather brittle: it takes exactly 16 inputs, banked into four arrays, and adds their values together. Try incorporating this brittle version into your MrXL-to-Calyx compiler at first; you can later think about generalizing it to any (commutative) operation, memories of any length, and any parallelism factor. If you'd like to read more about reduction trees, you can check out these slides from David Kirk and Wen-mei W. Hwu.

- If repeated `map`/`reduce` operations are performed on the same array, each of those operations must have the same parallelism factor.

  The heart of the issue is figuring out how to bank the underlying array. How do we bank it two different ways at the same time? The answer is to bank the array a little finer than you think, and to then use arbitration logic to provide two fictional banking setups at the same time. We provide a toy implementation using the `builder` here. There, a 24-cell array has been split into six banks, but then we allow the user to pretend, simultaneously, that it is split into two banks or three banks.
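As a taste of the reduction-tree idea mentioned above, here is a plain-Python sketch of our own (not the linked `builder` implementation): each level adds pairs of values, halving the number of partial sums until one remains. It assumes a power-of-two input count and an associative, commutative operator.

```python
def tree_reduce(values, op):
    """Reduce pairwise, level by level: a 16-input tree needs only
    log2(16) = 4 levels instead of 16 sequential additions."""
    while len(values) > 1:
        # All pairs at one level could be computed in parallel hardware.
        values = [op(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]


inputs = list(range(16))  # e.g. 16 values, banked into four arrays
print(tree_reduce(inputs, lambda x, y: x + y))  # same result as sum(inputs)
```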
## Aside: supplying data to MrXL programs
You may have noticed that the data files that we pass to MrXL programs are lighter-weight than those we pass to Calyx programs. They are lighter in two ways.
### Boilerplate
Calyx requires cells to be allocated for outputs. Instead of asking the user to supply zeroed-out arrays and registers for output cells, we can infer the need for them from the source code and then add them to the MrXL-native data files to make those files amenable to Calyx.
Additionally, because MrXL supports only a few kinds of data, some interesting parameters of Calyx-native data files turn into "just boilerplate" in MrXL-native data. We can keep MrXL-native data files relatively light and add this `format` information automatically.
### Example: `squares.mrxl.data`
Let us look at these changes in practice. We write `avec`'s values as a straightforward array:
```json
{
  "avec": [
    0,
    1,
    4,
    5
  ]
}
```
but under the hood, the Calyx IL version of `squares.mrxl` comes to expect something of the form:
```json
{
  "avec_b0": {
    "data": [
      0,
      1
    ],
    "format": {
      "is_signed": false,
      "numeric_type": "bitnum",
      "width": 32
    }
  },
  "avec_b1": {
    "data": [
      4,
      5
    ],
    "format": {
      "is_signed": false,
      "numeric_type": "bitnum",
      "width": 32
    }
  },
  "squares_b0": {
    "data": [
      0,
      0
    ],
    "format": {
      "is_signed": false,
      "numeric_type": "bitnum",
      "width": 32
    }
  },
  "squares_b1": {
    "data": [
      0,
      0
    ],
    "format": {
      "is_signed": false,
      "numeric_type": "bitnum",
      "width": 32
    }
  }
}
```
And we can generate this automatically. Try it yourself:
```
mrxl frontends/mrxl/test/squares.mrxl --data frontends/mrxl/test/squares.mrxl.data --convert
```
This transformation is achieved using a `fud` pass that converts MrXL-native data files into Calyx-native data files.