Building a Frontend for Calyx
In this tutorial, we're going to build a compiler for a small language called MrXL.
MrXL Overview
MrXL provides constructs to create arrays, and perform map
and reduce
operations on those arrays. Here's an example of a dot product implementation in MrXL:
input avec: int[4]
input bvec: int[4]
output dot: int[4]
prodvec := map 1 (a <- avec, b <- bvec) { a * b }
dot := reduce 1 (a, b <- prodvec) 0 { a + b }
We define the interface of program by specifying input
and output
arrays.
Input arrays have their values populated by an external harness while the output arrays must be computed using the program.
A map
expression iterates over multiple arrays of the same element and produces a new vector using the function provided in the body.
In the above example, the map
expression multiplies the values of avec
and bvec
.
map 1
states that the operation has a parallelism factor of 1 which means that the loop iterations are performed sequentially.
reduce
expressions walk over memories and accumulate a result into a register.
In the above code snippet, we add together all the elements of prodvec
and place them in a register named dot
.
Since the reduce
parallelism factor is also 1, the reduction is performed sequentially.
Run a MrXL Program
Once we have installed the mrxl command line tool, we can run MrXL programs using fud
.
To provide MrXL program with input values, we use fud's JSON-based data format. Let's try to run this program, which has a parallelism factor of two:
input foo: int[4]
output baz: int[4]
baz := map 2 (a <- foo) { a + 5 }
In order to take advantage of the parallelism in the program, the MrXL compiler automatically partitions the input memory foo
into two different physical banks: foo_b0
and foo_b1
.
Therefore, we split up our logical foo
input of [1, 2, 3, 4]
into [1,2]
and [3,4]
:
"foo_b0": {
"data": [
1,
2
],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
},
"foo_b1": {
"data": [
3,
4
],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
},
Our complete data file similarly splits up the input for baz
.
Run the program with the complete data by typing:
fud exec frontends/mrxl/test/add.mrxl \
--from mrxl \
--to vcd -s verilog.data frontends/mrxl/test/add.mrxl.data
Compiling MrXL to Calyx
This guide will walk you through the steps to build a Python program that compiles MrXL programs to Calyx code. The guide assumes some basic familiarity with Calyx. Take a look at Calyx tutorial if you need a refresher.
To simplify things, we'll make a few assumptions about MrXL programs:
- Every array in a MrXL program has the same length.
- Every integer in our generated hardware will be 32 bits.
- Every
map
andreduce
body will be either a multiplication or addition of either an array element or an integer.
The following sections will outline these two high level tasks:
- Parse MrXL into a representation we can process with Python
- Generate Calyx code
You can find our complete implementation in the Calyx repository.
Parse MrXL into an AST
To start, we'll parse this MrXL program into a Python AST representation. We chose to represent AST nodes with Python dataclass
.
A program is a sequence of array declarations followed by computation statements:
@dataclass
class Prog:
decls: List[Decl] # Memory declarations
stmts: List[Stmt] # Map and reduce statements
Decl
nodes correspond to array declarations like input avec: int[1024]
, and carry data about whether they're an input
or output
array, their name, and their type:
@dataclass
class Decl:
input: bool # Otherwise, output.
name: str
type: Type
Stmt
nodes represent statements such as dot := reduce 4 (a, b <- prodvec) 0 { a + b }
, and contain more nested nodes representing their function header and body, and type of operation.
@dataclass
class Stmt:
dest: str
op: Union[Map, Reduce]
The complete AST defines the remaining AST nodes required to represent a MrXL program.
Generate Calyx Code
The skeleton of a Calyx program has three sections, and looks like this:
component main() -> {
cells {}
wires {}
control {}
}
The cells section instantiates hardware units like adders, memories and registers.
The wires section contains groups that connect
together hardware instances to perform some logical task such as incrementing a specific register.
Finally, the control section schedules the execution of groups using control operators such as seq
, par
, and while
.
We perform syntax-directed compilation by walking over nodes in the above AST and generating cells
, wires
, and control
operations.
Calyx Embedded DSL
To make it easy to generate the hardware, we'll use Calyx's builder
module in Python:
import calyx.builder as cb
prog = cb.Builder() # A Calyx program
main = prog.component("main") # Create a component named "main"
Decl
nodes
Decl
nodes instantiate new memories and registers.
We need these to be instantiated in the cells
section of our Calyx output.
We use Calyx's std_reg
and std_mem_d1
primitives to represent registers and memories:
import "primitives/core.futil"; // Import standard library
component main() -> () {
cells {
// A memory with 4 32-bit elements. Indexed using a 6-bit value.
foo = std_mem_d1(32, 4, 6);
// A register that contains a 32-bit value
r = std_reg(32);
}
...
}
For each Decl
node, we need to determine if we're instantiating a memory or a register, and then translate that to a corresponding Calyx declaration and place that inside the cells
section of our generated program.
If a memory is used in a parallel map
or reduce
, we might need to create different physical banks for it.
We define a function to walk over the AST and compute the parallelism factor for each memory:
# Collect banking factors.
par_factor = compute_par_factors(prog.stmts)
Using this information, we can instantiate registers and memories for our inputs and outputs:
for i in range(par):
main.mem_d1(f"{name}_b{i}", 32, arr_size // par, 32, is_external=True)
The main.mem_d1
call is a function defined by the Calyx builder module to instantiate memories for a component.
By setting is_external=True
, we're indicating that a memory declaration is a part of the program's input-output interface.
Compiling Map
Operations
For every map or reduce node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation.
For map
operations, we'll perform a computation on an element of an input array, and then store the result in a result array.
We can use Calyx's while loops to iterate over an input array, perform the map's computation, and store the final value.
At a high level, we want to generate the following pieces of hardware:
- A register to store the current value of the loop index.
- A comparator to check of the loop index is less than the array size.
- An adder to increment the value of the index.
- Hardware needed to implement the loop body computation.
Loop Condition
We define a combinational group to perform the comparison idx < arr_size
that uses an lt
cell.
group_name = f"cond_{suffix}"
cell = f"lt_{suffix}"
lt = comp.cell(cell, Stdlib.op("lt", 32, signed=False))
with comp.comb_group(group_name):
lt.left = idx.out
lt.right = cb.const(32, arr_size)
Index Increment
group_name = f"incr_idx_{suffix}"
adder = comp.add(f"incr_{suffix}", 32)
with comp.group(group_name) as incr:
adder.left = idx.out
adder.right = 1
idx.in_ = adder.out
idx.write_en = 1
incr.done = idx.done
The loop index increment is implemented using a [group][lf-group] and an adder (adder
).
We provide the index's previous value and the constant 1 to the adder and write the adder's output into the register.
Because we're performing a stateful update of the register, we must wait for the register to state that it's committed the write by setting the group's done condition to the register's done
signal.
Body Computation
The final piece of the puzzle is the body's computation. The corresponding group indexes into the input memories:
with comp.group(f"eval_body_{suffix}") as ev:
# Index each array
for bind in stmt.bind:
# Map bindings have exactly one dest
mem = comp.get_cell(f"{name2arr[bind.dest[0]]}")
mem.addr0 = idx.out
Because the builder module is an embedded DSL, we can simply use Python's for
loop to generate all the required assignments for indexing.
This code instantiates an adder or a multiplier depending on the computation needed using the expr_to_port
helper function:
# Provide inputs to the op
op.left = expr_to_port(body.lhs)
op.right = expr_to_port(body.rhs)
And writes the value from the operation into the output memory:
out_mem = comp.get_cell(f"{dest}_b{bank}")
out_mem.addr0 = idx.out
out_mem.write_data = op.out
# Multipliers are sequential so we need to manipulate go/done signals
if body.op == "mul":
op.go = 1
out_mem.write_en = op.done
else:
out_mem.write_en = 1
ev.done = out_mem.done
This final operation is complex because we must account for whether we're using an adder or a multiplier. Adders are combinational–they produce their output immediately–while multipliers are sequential and require multiple cycles to produce its output.
When using a mutliplier, we need to explicitly set its go
signal to one and only write the output from the multiplier into the memory when its done
signal is asserted.
We do this by assigning the memory's write_en
(write enable) signal to the multiplier's done signal.
Finally, the group's computation is done when the memory write is committed.
Generating Control
Once we have generated the hardware needed for our computation, we can schedule its computation using control operators:
While(
CompPort(CompVar(port), "out"),
CompVar(cond),
SeqComp(
[
Enable(f"eval_body_{suffix}"),
Enable(incr),
]
),
)
We generate a while loop that checks that the index is less than the array size. Then, it sequentially executes the computation for the body and increments the loop index.
Add Parallelization
MrXL allows you to parallelize your map
and reduce
operations. Let's revisit the map
example from earlier:
input foo: int[4]
output baz: int[4]
baz := map 4 (a <- foo) { a + 5 }
The number 4 specifies that four copies of the loop bodies should be executed in parallel.
Our implementation already creates memory banks to allow for parallel accesses.
At a high-level, we can change the compilation for the map
operation to produce n
copies of the hardware we generate above and generate a control program that looks like this:
par {
while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
while le_b2.out with cond_b2 { seq { eval_body_b2; incr_idx_b2; } }
while le_b3.out with cond_b3 { seq { eval_body_b3; incr_idx_b3; } }
}
The par
operator executes all the loops in parallel.
The full implementation shows the necessary code to accomplish this which simply creates an outer loop to generate distinct hardware for each copy of the loop.
Conclusion
Hopefully this should be enough to get you started with writing your own MrXL compiler. Some more follow-up tasks you could try if you're interested:
- Read the code for compiling
reduce
statements and extend to support parallel reductions using reduction trees. - Implement code generation that allows memories that differ from one another in size.
- Implement complex function body expressions. We only support binary operations with two operands, like
a + 5
. - Add a new
filter
operation to MrXL.