Getting Started
Calyx is an intermediate language and infrastructure for building compilers that generate custom hardware accelerators. These instructions will help you set up the Calyx compiler and associated tools. By the end, you should be able to compile and simulate hardware designs generated by Calyx.
Compiler Installation
Install Rust (it should automatically install cargo).
Clone the repository:
git clone https://github.com/cucapra/calyx.git
Then build the compiler:
cargo build
You can invoke the compiler in one of two ways:
cargo run -- --help # Rebuilds the compiler if the sources changed
./target/debug/futil --help # Default debug build of the compiler
Running Core Tests
The core test suite tests the Calyx compiler passes. Install the following tools for running the core tests:
- runt hosts our testing infrastructure. Install with:
cargo install runt
- jq is a command-line JSON processor:
- Ubuntu:
sudo apt install jq
- Mac:
brew install jq
- Other platforms: JQ installation
Build the compiler:
cargo build
Then run the core tests with:
runt -i core
If everything has been installed correctly, this should not produce any failing tests.
Installing the Command-Line Driver
The Calyx driver wraps the various compiler frontends and backends to simplify running Calyx programs.
Install Flit:
pip3 install flit
Install fud
(from the root of the repository):
flit -f fud/pyproject.toml install -s --deps production
Configure fud:
fud config global.futil_directory <full path to Calyx repository>
Check the fud
configuration:
fud check
fud will report that certain tools are not available. This is expected.
Simulation
There are three ways to run Calyx programs: Verilator, Icarus Verilog, and Calyx's native interpreter. You'll want to set up at least one of these options so you can try out your code.
Icarus is an easy way to get started on most platforms.
On a Mac, you can install Icarus Verilog with Homebrew by typing brew install icarus-verilog.
On Ubuntu, install from source.
Then install the relevant fud support by running:
fud register icarus-verilog -p fud/icarus/icarus.py
Type fud check to make sure the new stage is working.
(Some missing tools are expected; just pay attention to the report for stages.icarus-verilog.exec.)
You can instead consider setting up Verilator for faster long-running simulations or using the interpreter to avoid RTL simulation altogether.
Running a Hardware Design
We're all set to run a Calyx hardware design now. Run the following command:
fud e examples/tutorial/language-tutorial-iterate.futil \
-s verilog.data examples/tutorial/data.json \
--to dat --through icarus-verilog -v
(Change the last bit to --through verilog
to use Verilator instead.)
This command will compile examples/tutorial/language-tutorial-iterate.futil to Verilog using the Calyx compiler, simulate the design using the data in examples/tutorial/data.json, and generate a JSON representation of the final memory state.
Congratulations! You've simulated your first hardware design with Calyx.
Where to go next?
- How can I setup syntax highlighting in my editor?
- How does the language work?
- How do I install Calyx frontends?
- Examples with fud
- How do I write a frontend for Calyx?
Calyx Language Tutorial
This tutorial will familiarize you with the Calyx language by writing a minimal program by hand. Usually, Calyx code will be generated by a frontend. However, by writing the program by hand, you can get familiar with all the basic constructs you need to generate Calyx yourself!
Complete code for each example can be found in the tutorial directory in the Calyx repository.
Get Started
The basic building block of Calyx programs is a component, which corresponds to a hardware module (or a function definition, for the software-minded).
Here is an empty component definition for main
along with an import
statement to import the standard library:
import "primitives/core.futil";
component main(go: 1) -> (done: 1) {
cells {
}
wires {
}
control {
}
}
Put this in a file—you can call it language-tutorial-mem.futil, for example.
(The futil file extension comes from an old name for Calyx.)
You can think of a component as a unit of Calyx code roughly analogous to a function: it encapsulates a logical unit of hardware structures along with their control. Every component definition has three sections:
- cells: The hardware subcomponents that make up this component.
- wires: A set of guarded connections between components, possibly organized into groups.
- control: The imperative program that defines the component's execution schedule, i.e., when each group executes.
We'll fill these sections up minimally in the next sections.
A Memory Cell
Let's turn our skeleton into a tiny, nearly no-op Calyx program. We'll start by adding a memory component to the cells:
cells {
@external(1) mem = std_mem_d1(32, 1, 1);
}
This new line declares a new cell called mem; the primitive component std_mem_d1 represents a 1D memory.
You can see the definition of std_mem_d1, and all the other standard components, in the primitives/core.futil library we imported.
This one has three parameters: the data width (here, 32 bits), the number of elements (just one), and the width of the address port (one bit).
The @external(1) syntax is an extra bit of magic that lets the simulation harness supply the memory's initial contents and read out its final state.
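To see how the parameters interact, consider a hypothetical second declaration (not part of this tutorial's program): a memory holding four 16-bit elements needs a 2-bit address port, since two bits suffice to address four locations.
scratch = std_mem_d1(16, 4, 2);  // 16-bit data, 4 elements, 2-bit address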
Next, we'll add some assignments to the wires
section to update the value in
the memory.
Insert these lines to put a constant value into the memory:
wires {
mem.addr0 = 1'b0;
mem.write_data = 32'd42;
mem.write_en = 1'b1;
done = mem.done;
}
These assignments refer to four ports on the memory component:
addr0
(the address port),
write_data
(the value we're putting into the memory),
write_en
(the write enable signal, telling the memory that it's time to do a write), and
done
(signals that the write was committed).
Constants like 32'd42
are Verilog-like literals that include the bit width (32), the base (d
for decimal), and the value (42).
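For instance, these two spellings drive the same value; only the base differs:
mem.write_data = 32'd42;      // decimal
mem.write_data = 32'b101010;  // the same value, written in binary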
Assignments at the top level in the wires
section, like these, are "continuous".
They always happen, without any need for control
statements to orchestrate them.
We'll see later how to organize assignments into groups.
The complete program for this section is available under examples/tutorial/language-tutorial-mem.futil.
Compile & Run
We can almost run this program!
But first, we need to provide it with data.
The Calyx infrastructure can provide data to programs from JSON files.
So make a file called something like data.json
containing something along these lines:
{
"mem": {
"data": [10],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
}
}
The mem key means we're providing the initial value for our memory called mem.
We have one (unsigned integer) data element, and we indicate the bit width (32 bits).
If you want to see how this Calyx program compiles to Verilog, here's the fud incantation you need:
fud exec language-tutorial-mem.futil --to verilog
Not terribly interesting! However, one nice thing you can do with programs is execute them.
To run our program using Icarus Verilog, do this:
fud exec language-tutorial-mem.futil --to dat --through icarus-verilog \
-s verilog.data data.json
Using --to dat
asks fud to run the program, and the extra -s verilog.data <filename>
argument tells it where to find the input data.
The --through icarus-verilog
option tells fud which Verilog simulator to use (see the chapter about fud for alternatives such as Verilator).
Executing this program should print:
{
"cycles": 1,
"memories": {
"mem": [
42
]
}
}
This means that, after the program finished, the final value in our memory was 42.
Note: Icarus Verilog may report a different cycle count compared to the one above. As long as the final value in memory is correct, this does not matter.
Add Control
Let's change our program to use an execution schedule.
First, we're going to wrap all the assignments in the wires section into a named group:
wires {
group the_answer {
mem.addr0 = 1'b0;
mem.write_data = 32'd42;
mem.write_en = 1'b1;
the_answer[done] = mem.done;
}
}
We also need one extra line in the group: that assignment to the_answer[done].
Here, we say that the_answer's work is done once the update to mem has finished.
Calyx groups have compilation holes called go
and done
that the control program will use to orchestrate their execution.
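To build intuition, here is a rough conceptual sketch (not actual compiler output) of how the holes get used: the control logic drives the group's go hole, the assignments inside fire only while it is high, and the done hole reports completion. The start_signal cell below is hypothetical.
the_answer[go] = start_signal.out;     // control decides when to run
mem.write_en = the_answer[go] ? 1'b1;  // guarded assignment
the_answer[done] = mem.done;           // completion flows back out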
The last thing we need is a control program.
Add one line to activate the_answer
and then finish:
control {
the_answer;
}
If you execute this program, it should do the same thing as the original group-free version: mem
ends up with 42 in it.
But now we're controlling things with an execution schedule.
If you're curious to see how the Calyx compiler lowers this program to a Verilog-like structural form of Calyx, you can do this:
fud exec language-tutorial-mem.futil --to futil-lowered
Notably, you'll see control {} in the output, meaning that the compiler has eliminated all the control statements and replaced them with continuous assignments in wires.
The complete program for this section is available under examples/tutorial/language-tutorial-control.futil.
Add an Adder
The next step is to actually do some computation. In this version of the program, we'll read a value from the memory, increment it, and store the updated value back to the memory.
First, we will add two components to the cells
section:
cells {
@external(1) mem = std_mem_d1(32, 1, 1);
val = std_reg(32);
add = std_add(32);
}
We make a register val and an integer adder add, both configured to work on 32-bit values.
Next, we'll create three groups in the wires
section for the three steps we want to run: read, increment, and write back to the memory.
Let's start with the last step, which looks pretty similar to our the_answer
group from above, except that the value comes from the val
register instead of a constant:
group write {
mem.addr0 = 1'b0;
mem.write_en = 1'b1;
mem.write_data = val.out;
write[done] = mem.done;
}
Next, let's create a group read that moves the value from the memory to our register val:
group read {
mem.addr0 = 1'b0;
val.in = mem.read_data;
val.write_en = 1'b1;
read[done] = val.done;
}
Here, we use the memory's read_data
port to get the initial value out.
Finally, we need a third group to add and update the value in the register:
group upd {
add.left = val.out;
add.right = 32'd4;
val.in = add.out;
val.write_en = 1'b1;
upd[done] = val.done;
}
The std_add component from the standard library has two input ports, left and right, and a single output port, out, which we hook up to the register's in port.
This group adds a constant 4 to the register's value, updating it in place.
We can enable the val
register with a constant 1 because the std_add
component is combinational, meaning its results are ready "instantly" without the need to wait for a done signal.
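For contrast, here is a hypothetical variant of the group built around the sequential multiplier std_mult_pipe (assuming a cell mul = std_mult_pipe(32);). Because a sequential component takes multiple cycles, the group must drive its go port and gate the register's write enable on the multiplier's done signal instead of using a constant 1:
group upd_mul {
  mul.left = val.out;
  mul.right = 32'd4;
  mul.go = 1'd1;
  val.in = mul.out;
  val.write_en = mul.done;  // wait for the multiplier to finish
  upd_mul[done] = val.done;
}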
We need to extend our control program to orchestrate the execution of the three groups.
We will need a seq statement to say we want to run the three steps sequentially:
control {
seq {
read;
upd;
write;
}
}
Try running this program again. The memory's initial value was 10, and its final value after execution should be 14.
The complete program for this section is available under examples/tutorial/language-tutorial-compute.futil.
Iterate
Next, we'd like to run our little computation in a loop.
The idea is to use Calyx's while
control construct, which works like this:
while <value> with <group> {
<body>
}
A while loop runs the control statements in the body until <value>, which is some port on some component, becomes zero.
The with <group>
bit means that we activate a given group in order to compute the condition value that determines whether the loop continues executing.
Let's run our memory-updating seq
block in a while loop.
Change the control program to look like this:
control {
seq {
init;
while lt.out with cond {
par {
seq {
read;
upd;
write;
}
incr;
}
}
}
}
This version uses while, the parallel composition construct par, and a few new groups we will need to define.
The idea is that we'll use a counter register to make this loop run a fixed number of times, like a for
loop.
First, we have an outer seq that invokes an init group that we will write to set the counter to zero.
The while loop then uses a new group cond, and it will run while a signal lt.out remains nonzero: this signal will compute counter < 8.
The body of the loop runs our old seq block in parallel with a new incr group to increment the counter.
Let's add some cells to our component:
counter = std_reg(32);
add2 = std_add(32);
lt = std_lt(32);
We'll need a new register, an adder to do the incrementing, and a less-than comparator.
We can use these raw materials to build the new groups we need: init, incr, and cond.
First, the init
group is pretty simple:
group init {
counter.in = 32'd0;
counter.write_en = 1'b1;
init[done] = counter.done;
}
This group just writes a zero into the counter and signals that it's done.
Next, the incr group adds one to the value in counter using add2:
group incr {
add2.left = counter.out;
add2.right = 32'd1;
counter.in = add2.out;
counter.write_en = 1'b1;
incr[done] = counter.done;
}
And finally, cond
uses our comparator lt
to compute the signal we need for
our while
loop. We use a comb group
to denote that the assignments inside
the condition can be run combinationally:
comb group cond {
lt.left = counter.out;
lt.right = 32'd8;
}
By comparing with 8, we should now be running our loop body 8 times.
Try running this program again. The output should be the result of adding 4 to the initial value 8 times: 10 + 8 × 4 = 42.
The complete program for this section is available under examples/tutorial/language-tutorial-iterate.futil.
Multi-Component Designs
Calyx designs can define and instantiate other Calyx components that themselves
encode complex control
programs.
As an example, we'll build a Calyx design that uses a simple Calyx component to save a value in a register and use it in a different component.
We define a new component called identity that has an input port in and an output port out.
component identity(in: 32) -> (out: 32) {
cells {
r = std_reg(32);
}
wires {
group save {
r.in = in;
r.write_en = 1'd1;
save[done] = r.done;
}
// This component always outputs the current value in r
out = r.out;
}
control {
save;
}
}
The following line defines a continuous assignment, i.e., an assignment that is always kept active, regardless of whether the component's control program is active.
// This component always outputs the current value in r
out = r.out;
By defining this continuous assignment, we can execute our component and later observe any relevant values.
Next, we can instantiate this component in any other Calyx component.
The following Calyx program instantiates the identity component as a cell named id and uses it to save a value and observe it.
component main() -> () {
cells {
// Instantiate the identity element
id = identity();
current_value = std_reg(32);
}
wires {
group run_id {
// We want to "save" the value 10 inside the identity group.
id.in = 32'd10;
// All components have a magic "go" and "done" port added to them.
// Execute the component.
id.go = 1'd1;
run_id[done] = id.done;
}
group use_id {
// We want to "observe" the current value saved in id.
// The out port on the `id` component always shows the last saved
// element. We don't need to set the `go` because we're not executing
// any control.
current_value.in = id.out;
current_value.write_en = 1'd1;
use_id[done] = current_value.done;
}
}
control {
seq { run_id; use_id; }
}
}
Our first group executes the component by setting the go
signal for the
component to high and placing the value 10
on the input port.
The second group simply saves the value on the output port. Importantly,
we don't have to set the go
signal of the component to high because we
don't need to save a new value into it.
The component executes the two groups in order.
To see the output from running this component, run the command:
fud e examples/futil/multi-component.futil --to vcd_json
Passing Memories by Reference
One question that may arise when using Calyx as a backend is how to pass a memory "by reference" between components. In C++, this might look like:
#include <array>
#include <cstdint>
// Adds one to the first element in `v`.
void add_one(std::array<uint32_t, 1>& v) {
v[0] = v[0] + 1;
}
int main() {
std::array<uint32_t, 1> x = { 0 };
add_one(x); // The value at x[0] is now 1.
}
In the code above, we've constructed an "l-value reference" to the array, which essentially means we can both read and write from x in the function add_one.
Now, let's allow similar functionality at the Calyx IR level.
We define a new component named add_one which represents the function above. However, we also need to include the correct ports to both read and write to x:
| Read from x | Write to x |
| --- | --- |
| read_data | done |
| address ports | write_data |
| | write_en |
| | address ports |
Since we're both reading and writing from x, we'll include the union of the columns above:
component add_one(x_done: 1, x_read_data: 32) ->
(x_write_data: 32, x_write_en: 1, x_addr0: 1) {
One tricky thing to note is where the ports belong, i.e., should a given port be an input or an output of the component? The way to reason about this is to ask whether we want to receive a signal from or send a signal to the given wire. For example, with read_data, we will always be receiving its signal, so it should be an input port. Conversely, address ports are used to mark where in memory we want to access, so those are used as output ports.
We then simply use the given ports to both read and write to the memory passed by reference. Note that we've split up the read and write to memory x into separate groups, to ensure we can schedule them sequentially in the execution flow. We're also using the exposed ports of the memory through the component interface rather than, say, x.write_data.
group read_from_x {
x_addr0 = 1'd0; // Set address port to zero.
tmp_reg.in = x_read_data; // Read the value at address zero.
tmp_reg.write_en = 1'd1;
read_from_x[done] = tmp_reg.done;
}
group write_to_x {
x_addr0 = 1'd0; // Set address port to zero.
add.left = one.out;
add.right = tmp_reg.out; // Saved value from previous read.
x_write_data = add.out; // Write value to address zero.
x_write_en = 1'd1; // Set write enable signal to high.
write_to_x[done] = x_done; // The group is done when the write is complete.
}
Bringing everything back together, the add_one
component is written accordingly:
component add_one(x_done: 1, x_read_data: 32) ->
(x_write_data: 32, x_write_en: 1, x_addr0: 1) {
cells {
one = std_const(32, 1);
add = std_add(32);
tmp_reg = std_reg(32);
}
wires {
group read_from_x {
x_addr0 = 1'd0; // Set address port to zero.
tmp_reg.in = x_read_data; // Read the value at address zero.
tmp_reg.write_en = 1'd1;
read_from_x[done] = tmp_reg.done;
}
group write_to_x {
x_addr0 = 1'd0; // Set address port to zero.
add.left = one.out;
add.right = tmp_reg.out; // Saved value from previous read.
x_write_data = add.out; // Write value to address zero.
x_write_en = 1'd1; // Set write enable signal to high.
write_to_x[done] = x_done; // The group is done when the write is complete.
}
}
control {
seq { read_from_x; write_to_x; }
}
}
The final step is creating a main
component from which the original component
will be invoked. In this step, it is important to hook up the proper wires in the
call to invoke
to the corresponding memory you'd like to read and/or write to:
control {
invoke add_one0(x_done = x.done, x_read_data = x.read_data)
(x_write_data = x.write_data, x_write_en = x.write_en, x_addr0 = x.addr0);
}
This gives us the main
component:
component main() -> () {
cells {
add_one0 = add_one();
@external(1) x = std_mem_d1(32, 1, 1);
}
wires {
}
control {
invoke add_one0(x_done = x.done, x_read_data = x.read_data)
(x_write_data = x.write_data, x_write_en = x.write_en, x_addr0 = x.addr0);
}
}
To see this example simulated, run the command:
fud e examples/futil/memory-by-reference/memory-by-reference.futil --to dat \
-s verilog.data examples/futil/memory-by-reference/memory-by-reference.futil.data
Multi-dimensional Memories
Not much changes for multi-dimensional arrays. The only additional step is adding the corresponding address ports. For example, a 2-dimensional memory will require address ports addr0 and addr1. More generally, an N-dimensional memory will require address ports addr0, ..., addr(N-1).
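For instance, a sketch of a hypothetical component that takes a 4×4 memory of 32-bit values by reference would expose both address ports; the 2-bit widths below assume four rows and four columns:
component add_one_2d(x_done: 1, x_read_data: 32) ->
    (x_write_data: 32, x_write_en: 1, x_addr0: 2, x_addr1: 2) {
  ...
}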
Multiple Memories
Similarly, multiple memories will just require the ports to be passed for each of the given memories.
Here is an example of a memory copy (known as memcpy in the C language), with 1-dimensional memories of size 5:
import "primitives/core.futil";
component copy(dest_done: 1, src_read_data: 32, length: 3) ->
(dest_write_data: 32, dest_write_en: 1, dest_addr0: 3, src_addr0: 3) {
cells {
lt = std_lt(3);
N = std_reg(3);
add = std_add(3);
}
wires {
comb group cond {
lt.left = N.out;
lt.right = length;
}
group upd_index<"static"=1> {
add.left = N.out;
add.right = 3'd1;
N.in = add.out;
N.write_en = 1'd1;
upd_index[done] = N.done;
}
group copy_index_N<"static"=1> {
src_addr0 = N.out;
dest_addr0 = N.out;
dest_write_en = 1'd1;
dest_write_data = src_read_data;
copy_index_N[done] = dest_done;
}
}
control {
while lt.out with cond {
seq {
copy_index_N;
upd_index;
}
}
}
}
component main() -> () {
cells {
@external(1) d = std_mem_d1(32,5,3);
@external(1) s = std_mem_d1(32,5,3);
length = std_const(3, 5);
copy0 = copy();
}
wires {
}
control {
seq {
invoke copy0(dest_done=d.done, src_read_data=s.read_data, length=length.out)
(dest_write_data=d.write_data, dest_write_en=d.write_en, dest_addr0=d.addr0, src_addr0=s.addr0);
}
}
}
Experimental: Synchronization
Calyx's default semantics do not admit any predictable form of language-level synchronization in the presence of parallelism. We're currently experimenting with a suite of new primitives that add synchronization to the language.
std_sync_reg
The std_sync_reg primitive, defined by primitives/sync.futil, provides a synchronizing register that acts as an M-structure with the following interface:
On the reader side:
- If the register is "empty", block the read till the register is written into.
- If the register is "full", provide the value to the reader, provide a done signal, and mark the register as "empty".
On the writer side:
- If the register is "empty", write the value into the register, mark it as "full", and provide a done signal.
- If the register is "full", block the write till the register is read from.
One way to think of this interface is as a size-1 concurrent FIFO.
Using std_sync_reg
The following example is a part of the Calyx compiler test suite and can be executed using:
runt -i examples/sync
The synchronizing register interface is non-standard: it provides two go signals and two done signals to initiate parallel reads and writes.
primitive std_sync_reg[WIDTH](
@write_together(1) in: WIDTH,
read_en: 1,
@write_together(1) write_en: 1,
@clk clk: 1,
@reset reset: 1
) -> (
out: WIDTH,
write_done: 1,
read_done: 1,
blocked: 1
);
The signal read_en
is used by a program to initiate a read operation while
the write_en
signal initiates a write operation.
We need to explicitly initiate a read operation because reading a value marks
the register as "empty" which causes any future reads to block.
Similarly, the output interface specifies the read_done and write_done signals, which the user program needs to read to know when the operations are completed. The read_done signal is similar to a valid signal while the write_done signal is similar to a ready signal.
The following group initiates a write operation into the synchronizing register imm from the memory in:
// Write value from `in[idx]` to sync intermediate.
group write_imm {
imm.write_en = 1'd1;
imm.in = in.read_data;
in.addr0 = idx.out;
write_imm[done] = imm.write_done;
}
The group waits for the imm.write_done
signal to be high before continuing
execution.
If the synchronizing register was "full" in this cycle, the execution would
stall and cause the group to take another cycle.
The following group initiates a read from the synchronizing register imm and saves the value into the out memory:
// Read value from sync intermediate and write to temp.
group read_imm {
imm.read_en = 1'd1;
out.write_en = imm.read_done;
out.addr0 = idx.out;
out.write_data = imm.out;
read_imm[done] = out.done;
}
The group waits till the imm.read_done signal is high to write the value into the memory. Note that in case the register is empty, imm.read_done will be low and cause the group to take another cycle.
Finally, we can describe the control program as:
while lt.out with cmp {
seq {
par {
read_imm;
write_imm;
}
incr_idx;
}
}
Note that the two groups execute in parallel, which means there is no guarantee about their order of execution. However, the synchronization ensures that the reads see a consistent set of writes in the order we expect.
Limitations
The example above implements a standard producer-consumer pattern. However, as implemented, the std_sync_reg primitive does not support multiple producers or consumers.
To do so, it would need to provide an interface that allows several read and
write ports and ensure that only one read or write operation succeeds.
This capability would be useful in implementing synchronizing barriers in Calyx.
Attributes
Calyx has an attribute system that allows information to be associated with every basic Calyx construct. This information can then be used to optimize the program or change how the program is compiled.
Attributes can decorate lots of things in Calyx: components, groups, cells, ports, and control statements.
The syntax looks like name<"attr"=value>
for components and groups or @attr(value)
for other constructs.
Attributes always map keys to values. Because it's common to have a "Boolean" attribute that always maps to the value 1, the syntax @attr is a shorthand for @attr(1).
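For example, these two cell declarations are equivalent:
@external(1) mem = std_mem_d1(32, 1, 1);  // explicit value
@external mem = std_mem_d1(32, 1, 1);     // shorthand for @external(1)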
Here is the syntax for attributes in different parts of the AST:
Component and Port Attributes
component main<"static"=10>(@go go: 1) -> (@done done: 1) {
...
}
Cell Attributes
cells {
@external mem = std_mem_d1(32, 8, 4);
reg = std_reg(32);
...
}
Group Attributes
group cond<"static"=1> {
...
}
Control Attributes
control {
@static(3) seq {
@static(1) A;
@static(2) B;
}
}
Meaning of Attributes
toplevel
The entrypoint for the Calyx program. If no component has this attribute, then the compiler looks for a component named main. If neither is found, the compiler errors out.
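Using the component attribute syntax shown above, a hypothetical entrypoint with a name other than main could be marked like this:
component my_entry<"toplevel"=1>() -> () {
  ...
}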
external
The external attribute has meaning when it is attached to a cell. It has two meanings:
- If the externalize pass is enabled, the cell is turned into an "external" cell by exposing all its ports through the current component and rewriting assignments to use the ports. See the documentation on externalize for more information.
- If the cell is a memory and has an external attribute on it, the Verilog backend (-b verilog) generates code to read <cell_name>.dat to initialize the memory state and dumps out its final value after execution.
static(n)
Can be attached to components, groups, and control statements. They indicate how
many cycles a component, group, or control statement will take to run and are used
by -p static-timing
to generate more efficient control FSMs.
go, done, and reset
These three ports are part of the interface to Calyx components. They are the mechanism for how an "outer" component invokes an "inner" cell that it contains.
The go and done attributes are, in particular, used by the infer-static-timing pass to configure which ports are used like go and done signals. Along with the static(n) attribute, this allows the pass to calculate when a particular done signal of a primitive will be high.
inline
Used by the inline pass on cell definitions. Instructs the pass to completely inline the instance into the parent component and replace all invokes of the instance with the control program of the instance.
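As a hypothetical sketch, the attribute goes on the cell definition:
cells {
  // Ask the inline pass to splice this instance's structure and
  // control directly into the parent component.
  @inline id = identity();
}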
stable
Used by the canonicalize pass. It is only meaningful on output ports and states that their value is provided by a sequential element and is therefore available outside combinational time. For example, after invoking a multiplier, the value on its out port remains latched till the next invocation.
For example:
cells {
  m = std_mult_pipe(32);
}
wires {
  group use_m_out {
    // uses m.out
  }
}
control {
  invoke m(left = 32'd10, right = 32'd4)();
  use_m_out;
}
The value of m.out in use_m_out will be 32'd40.
This annotation is currently used by the primitives library and the Dahlia frontend.
share
Can be attached to a component and indicates that the component can be shared across groups. This is used by the -p resource-sharing pass to decide which components can be shared.
bound(n)
Used in infer-static-timing and static-timing when the number of iterations of a while control is known statically, as indicated by n.
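Following the control attribute syntax above, the language tutorial's loop could hypothetically be annotated with its trip count of 8 like this:
@bound(8) while lt.out with cond {
  par {
    seq { read; upd; write; }
    incr;
  }
}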
generated
Added by ir::Builder
to denote that the cell was added by a pass.
clk
Marks the special clock signal inserted by the clk-insertion
pass, which helps with lowering to RTL languages that require an explicit clock.
write_together(n)
Used by the papercut
pass.
Defines a group n
of signals that all must be driven together:
primitive std_mem_d2<"static"=1>[WIDTH, D0_SIZE, D1_SIZE, D0_IDX_SIZE, D1_IDX_SIZE](
@write_together(2) addr0: D0_IDX_SIZE,
@write_together(2) addr1: D1_IDX_SIZE,
@write_together(1) write_data: WIDTH,
@write_together(1) @go write_en: 1,
...
) -> (...);
This defines two groups. The first group requires that the write_en and write_data signals are driven together, while the second requires that addr0 and addr1 are driven together.
Note that @write_together specifications cannot encode implications of the form "if port x is driven then y should be driven".
read_together(n)
Used by papercut and canonicalize. Defines a combinational path n between a set of input ports and an output port.
primitive std_mem_d1<"static"=1>[WIDTH, SIZE, IDX_SIZE](
@read_together(1) addr0: IDX_SIZE, ...
) -> (
@read_together(1) read_data: WIDTH, ...
);
This requires that when read_data is used, addr0 must be driven. Note that each group must have exactly one output port in it.
Undefined Behaviors
Undefined behavior in Calyx is either intentional or unintentional. This page tracks the various issues discussing the undefined behaviors that are known to currently exist in Calyx.
Interface Signals
The go
and done
signals form the two core interface signals in Calyx. Their semantics are baked into the compiler and pervasively used to define the meaning of programs.
Undriven Ports
Calyx's continuous assignments do not make any static guarantees about which ports need to be driven when. Current efforts attempt to codify when reading from an undriven port is incorrect.
Semantics of par
par blocks in Calyx represent parallel execution of groups. Currently, there is no clear semantics for interactions between groups executing in parallel. The interpreter implements a form of lockstep semantics that disallows certain forms of concurrent reads and writes, while the code generated by the compiler allows for arbitrary communication.
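As a concrete sketch of the gap (assuming a hypothetical cell r = std_reg(32);), consider two groups that write different values to the same register and run under par; the language currently does not define what value r holds afterward:
group write_one {
  r.in = 32'd1;
  r.write_en = 1'd1;
  write_one[done] = r.done;
}
group write_two {
  r.in = 32'd2;
  r.write_en = 1'd1;
  write_two[done] = r.done;
}
// In the control program:
// par { write_one; write_two; }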
Isolation Guarantees
Calyx groups have a strong isolation guarantee---they must execute for at least one cycle and guarantee that signals inside are not visible after they are done executing.
However, the isolation guarantees for combinational groups, continuous assignments, and with blocks are not clear.
fud: The Calyx Driver
Working with Calyx involves a lot of command-line tools. For example, an incomplete yet daunting list of CLI tools used by Calyx is:
- All the Calyx frontends.
- The Calyx compiler and its various command-line tools.
- Verilator, the Verilog simulation framework used to test Calyx-generated designs.
- Waveform viewers to see the results of simulation.
fud
aims to provide a simple interface for using these toolchains and
executing them in a pipeline. The source for fud is
here.
Installation
You need Flit to install fud. Install it with pip3 install flit.
You can then install fud with:
flit install
(If using this method to install fud, pip3 should be version >= 20.)
If you are working on fud itself, you can install it with a symlink:
flit install --symlink
You can also install fud with:
flit build
pip3 install dist/fud-0.1.0-py3-none-any.whl
Finally, point fud
to the root of the repository:
fud config global.futil_directory <full path to Calyx repository>
Configuration
Fud uses a global configuration file to locate tool paths and default values.
To view the configuration, use fud c or fud config.
Check. Fud can automatically check if your configuration is valid and can help you set certain variables. Perform this check with:
fud check
Viewing keys.
To view the current value of a key, use fud config key. For example, the following shows the path to the Calyx compiler:
fud config stages.futil.exec
Updating keys.
Keys can be updated using fud config key value. For example, the following command updates the path to the Calyx compiler:
fud config stages.futil.exec ./target/debug/futil
Adding Backends
fud
wraps both frontends and backends for Calyx.
For a useful fud
installation, you need to configure the Verilator
backend and accompanying tools.
Verilator
We use the open source Verilator tool to simulate Verilog programs
generated by the Calyx compiler.
Install Verilator by following the official instructions.
On macOS, you can use brew install verilator
or, otherwise, compile from source:
git clone https://github.com/verilator/verilator
cd verilator
git pull
git checkout master
autoconf
./configure
make
sudo make install
Run fud check
to make sure you have the right version.
By default, fud
will use the verilator
executable to run Verilator.
To use a different binary, configure the path by running:
fud config stages.verilog.exec <binary>
Vcdump.
Vcdump is a tool for converting vcd
(Value Change Dump) files to JSON for
easier analysis with the command line.
Install it with:
cargo install vcdump
Icarus Verilog
fud
also supports Icarus Verilog as a simulation backend.
Since Icarus Verilog interprets programs, it can often be faster than verilator.
First install Icarus Verilog and ensure that iverilog and vvp are on your path.
Next, register the icarus-verilog external stage with fud:
fud register icarus-verilog -p fud/icarus/icarus.py
To test the simulation backend, run:
fud e --to dat examples/tutorial/language-tutorial-iterate.futil -s verilog.data examples/tutorial/data.json --through icarus-verilog
Registering this stage will define multiple paths in the fud transformation
graph between futil
and the vcd
and dat
stages.
If you'd like fud to default to the verilog stage, give it a high priority:
fud c stages.verilog.priority 1
Dahlia Frontend
In order to use the Dahlia frontend with Fud, first install
Dahlia.
Once Dahlia is compiled, point fud
to the Dahlia compiler binary:
fud config stages.dahlia.exec <full path to dahlia repo>/fuse
Python frontends (Systolic array, NTT, MrXL, TVM Relay)
You need flit to install our Python frontends.
pip3 install flit
Our Python frontends use a Calyx AST library written in Python. Install it with:
cd calyx-py && flit install -s
Frontend specific instructions:
- Systolic array: Nothing else needed.
- NTT:
  - Install dependencies: pip3 install prettytable
  - Install the external fud stage: fud register ntt -p frontends/ntt-pipeline/fud/ntt.py
- MrXL:
  - Install the mrxl binary: cd frontends/mrxl && flit install -s
  - Install the mrxl external stage for fud: fud register mrxl -p frontends/mrxl/fud/mrxl.py
- TVM Relay: See instructions.
Adding Synthesis Backends
fud wraps the Vivado (synth-verilog) and Vivado HLS (vivado-hls) tools to generate area and resource estimates for Calyx designs.
See the instructions to configure them.
Working with Stages
Fud is structured as a sequence of stages that transform inputs of one form to outputs.
Stages.
fud transforms a file in one stage into a file in a later stage. The --from and --to options specify the input and output stages to the fud exec subcommand. Use fud info to view all possible stages.
Guessing stages.
fud will try to guess the starting stage by looking at the extension of the input file and output file (if specified using the -o flag). If it fails to guess correctly or doesn't know about the extension, you can manually set the stages using --to and --from.
Profiling
Fud provides some very basic profiling tools through the use of the --dump_prof (or -pr) flag.
You can get overall stage durations for a fud run by simply using -pr.
For example,
fud e examples/dahlia/dot-product.fuse --to dat \
-s verilog.data examples/dahlia/dot-product.fuse.data \
-pr
will output:
stage elapsed time (s)
dahlia 1.231
futil 0.029
verilog 4.915
If you want time elapsed for each step in a stage, you can also provide one or more stages after the flag. For example,
fud e examples/dahlia/dot-product.fuse --to dat \
-s verilog.data examples/dahlia/dot-product.fuse.data \
-pr verilog
will output:
verilog elapsed time (s)
mktmp 0.0
json_to_dat 0.004
compile_with_verilator 5.584
simulate 0.161
output_json 0.003
cleanup 0.003
If you want time elapsed for a single step, you can specify the given step, e.g.
fud e examples/dahlia/dot-product.fuse --to dat \
-s verilog.data examples/dahlia/dot-product.fuse.data \
-pr verilog.simulate
will output:
verilog elapsed time (s)
simulate 0.161
Lastly, the -csv
flag will provide the profiling information in CSV format.
Examples
These commands will assume you're in the root directory of the Calyx repository.
Compiling Calyx. Fud wraps the Calyx compiler and provides a set of default compiler options to compile Calyx programs to Verilog.
# Compile Calyx source in the test simple.expect
# to Verilog. We must explicitly specify the input
# file type because it can not be guessed from
# the file extension.
fud exec examples/futil/simple.expect --from futil --to verilog
Fud can explain its execution plan when running a complex sequence of
steps using the --dry-run
option.
# Dry run of compiling the Dahlia dot product file
# to Calyx. As expected, this will *only* print
# the stages that will be run.
fud exec examples/dahlia/dot-product.fuse --to futil --dry-run
Simulating Calyx. Fud can compile a Calyx program to Verilog and simulate it using Verilator.
# Compile and simulate a vectorized add implementation
# in Calyx using the data provided,
# then dump the vcd into a new file for debugging.
# === Calyx: examples/futil/vectorized-add.futil
# === data: examples/dahlia/vectorized-add.fuse.data
# === output: v-add.vcd
fud exec \
examples/futil/vectorized-add.futil \
-o v-add.vcd \
-s verilog.data examples/dahlia/vectorized-add.fuse.data
Simulating Dahlia.
The following command prints out the final state of all memories by specifying --to dat.
# Compile a Dahlia dot product implementation and
# simulate in verilog using the data provided.
# === Dahlia: examples/dahlia/dot-product.fuse
# === data: examples/dahlia/dot-product.fuse.data
# (`.data` is used as an extension alias for `.json`)
fud exec \
examples/dahlia/dot-product.fuse \
--to dat \
-s verilog.data examples/dahlia/dot-product.fuse.data
Interpreting Calyx. In addition to Verilator, fud can execute Calyx programs using the experimental interpreter.
# Execute a Calyx program without compiling it,
# producing a JSON snapshot of the final state.
# === Calyx: tests/correctness/while.futil
# === data: tests/correctness/while.futil.data
fud exec \
tests/correctness/while.futil \
-s verilog.data tests/correctness/while.futil.data \
--to interpreter-out
Xilinx Toolchain
Working with vendor EDA toolchains is never a fun experience. Something will almost certainly go wrong. If you're at Cornell, you can at least avoid installing the tools yourself by using our lab servers, Gorgonzola or Havarti.
fud
can interact with the Xilinx tools (Vivado, Vivado HLS, and Vitis). There are three main things it can do:
- Synthesize Calyx-generated RTL designs to collect area and resource estimates.
- Compile Dahlia programs via C++ and Vivado HLS for comparison with the Calyx backend.
- Compile Calyx programs for actual execution in Xilinx emulation modes or on real FPGA hardware.
You can set fud
up to use either a local installation of the Xilinx tools or one on a remote server, via SSH.
Synthesis Only
The simplest way to use the Xilinx tools is to synthesize RTL or HLS designs to collect statistics about them. This route will not produce actual, runnable executables; see the next section for that.
Installing Dependencies
fud
uses extra dependencies to invoke the Xilinx toolchains.
Run the following command to install all required dependencies:
cd fud && flit install -s --deps all
Set Up
To set up to invoke the Xilinx tools over SSH, first tell fud
your username and hostname for the server:
# Vivado
fud config stages.synth-verilog.ssh_host <hostname>
fud config stages.synth-verilog.ssh_username <username>
fud config stages.synth-verilog.remote true
# Vivado HLS
fud config stages.vivado-hls.ssh_host <hostname>
fud config stages.vivado-hls.ssh_username <username>
fud config stages.vivado-hls.remote true
The server must have vivado
and vivado_hls
available on the remote machine's path. (If you need the executable names to be something else, please file an issue.)
To instead invoke the Xilinx tools locally, just let fud
run the vivado
and vivado_hls
commands.
You can optionally tell fud
where these commands exist on your machine:
fud config stages.synth-verilog.exec <path> # update vivado path
fud config stages.vivado-hls.exec <path> # update vivado_hls path
Leave the remote
option unset to use local execution and these exec
paths (which are ignored for remote execution).
Run
To run the entire toolchain and extract statistics from RTL synthesis, use the resource-estimate
target state.
For example:
fud e --to resource-estimate examples/futil/dot-product.futil
To instead obtain the raw synthesis results, use synth-files.
To run the analogous toolchain for Dahlia programs via HLS, use the hls-estimate
target state:
fud e --to hls-estimate examples/dahlia/dot-product.fuse
There is also an hls-files
state for the raw results of Vivado HLS.
Emulation and Execution
fud
can also compile Calyx programs for actual execution, either in the Xilinx toolchain's emulation modes or for running on a physical FPGA.
This route involves generating an AXI interface wrapper for the Calyx program and invoking it using XRT's OpenCL interface.
Set Up
As above, you can invoke the Xilinx toolchain locally or remotely, via SSH.
To set up SSH execution, you can edit your config.toml
to add settings like this:
[stages.xclbin]
ssh_host = "havarti"
ssh_username = "als485"
remote = true
To use local execution, just leave off the remote = true
line.
You can also set the Xilinx mode and target device:
[stages.xclbin]
mode = "hw_emu"
device = "xilinx_u50_gen3x16_xdma_201920_3"
The options for mode are hw_emu (simulation) and hw (on-FPGA execution). The device string above is for the Alveo U50 card, which we have at Cornell. Installed Xilinx platforms can typically be found under the directory /opt/xilinx/platforms, where one can find a device name of interest.
To use hardware emulation, you will also need to configure the wdb stage. It has ssh_host, ssh_username, and remote options similar to the xclbin stage.
You will also need to configure the stage to point to your installations of Vitis and XRT, like this:
[stages.wdb]
xilinx_location = "/scratch/opt/Xilinx/Vitis/2020.2"
xrt_location = "/opt/xilinx/xrt"
Compile
The first step in the Xilinx toolchain is to generate an xclbin
executable file.
Here's an example of going all the way from a Calyx program to that:
fud e examples/futil/dot-product.futil -o foo.xclbin --to xclbin
On our machines, compiling even a simple example like the above for simulation takes about 5 minutes, end to end. A failed run takes about 2 minutes to produce an error.
By default, the Xilinx tools run in a temporary directory that is deleted when fud finishes. To instead keep the sandbox directory, use -s xclbin.save_temps true. You can then find the results in a directory named fud-out-N for some number N.
Emulate
You can also execute compiled designs through Xilinx hardware emulation.
Use the wdb
state as your fud
target:
fud e -vv foo.xclbin -s wdb.save_temps true -o out.wdb
This stage produces a Vivado waveform database (WDB) file. Through the magic of fud, you can also go all the way from a Calyx program to a wdb file in the same way.
There is also a wdb.save_temps
option, as with the xclbin
stage.
You also need to provide a host C++ program via the wdb.host
parameter, but I don't know much about that yet, so documentation about that will have to wait.
Similarly, I don't yet know what you're supposed to do with a WDB file; maybe we should figure out how to produce a VCD instead.
How it Works
The first step is to generate input files. We need to generate:
- The RTL for the design itself, using the compile command-line flags -b verilog --synthesis -p external. We name this file main.sv.
- A Verilog interface wrapper, using XilinxInterfaceBackend, via -b xilinx. We call this toplevel.v.
- An XML document describing the interface, using XilinxXmlBackend, via -b xilinx-xml. This file gets named kernel.xml.
The fud
driver gathers these files together in a sandbox directory.
The next step is to run the Xilinx tools.
The rest of this section describes how this workflow works under the hood. If you want to follow along by typing commands manually, you can start by invoking the setup scripts for Vitis and XRT:
source <Vitis_install_path>/Vitis/2020.1/settings64.sh
source /opt/xilinx/xrt/setup.sh
On some Ubuntu setups, you may need to update LIBRARY_PATH:
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
You can check that everything is working by typing vitis -version or vivado -version.
Background: .xo and .xclbin
In the Xilinx toolchain, compilation to an executable bitstream (or simulation blob) appears to require two steps: taking your Verilog sources and creating an .xo file, and then taking that and producing an .xclbin “executable” file.
The idea appears to be a kind of metaphor for a standard C compilation workflow in software-land: .xo
is like a .o
object file, and .xclbin
contains actual executable code (bitstream or emulation equivalent), like a software executable binary.
Going from Verilog to .xo
is like “compilation” and going from .xo
to .xclbin
is like “linking.”
However, this analogy is kind of a lie.
Generating an .xo
file actually does very little work:
it just packages up the Verilog source code and some auxiliary files.
An .xo
is literally a zip file with that stuff packed up inside.
All the actual work happens during “linking,” i.e., going from .xo
to .xclbin
using the v++
tool.
This situation is a poignant reminder of how impossible separate compilation is in the EDA world.
A proper analogy would involve separately compiling the Verilog into some kind of low-level representation, and then linking would properly smash together those separately-compiled objects.
Instead, in Xilinx-land, “compilation” is just simple bundling and “linking” does all the compilation in one monolithic step.
It’s kind of cute that the Xilinx toolchain is pretending the world is otherwise, but it’s also kind of sad.
Anyway, the only way to produce a .xo
file from RTL code appears to be to use Vivado (i.e., the actual vivado
program).
Nothing from the newer Vitis package currently appears capable of producing .xo
files from Verilog (although v++
can produce these files during HLS compilation, presumably by invoking Vivado).
The main components in an .xo
file, aside from the Verilog source code itself, are two XML files:
kernel.xml
, a short file describing the argument interfaces to the hardware design,
and component.xml
, a much longer and more complicated IP-XACT file that also has to do with the interface to the RTL.
We currently generate kernel.xml
ourselves (with the xilinx-xml
backend described above) and then use Vivado, via a Tcl script, to generate the IP-XACT file.
In the future, we could consider trying to route around using Vivado by generating the IP-XACT file ourselves, using a tool such as DUH.
Our Workflow
The first step is to produce an .xo
file.
We also use a static Tcl script, gen_xo.tcl, which is a simplified version of a script from Xilinx's Vitis tutorials.
The gist of this script is that it creates a Vivado project, adds the source files, twiddles some settings, and then uses the package_xo
command to read stuff from this project as an "IP directory" and produce an .xo
file.
The Vivado command line looks roughly like this:
vivado -mode batch -source gen_xo.tcl -tclargs xclbin/kernel.xo
That output filename after -tclargs, unsurprisingly, gets passed to gen_xo.tcl.
Then, we take this .xo and turn it into an .xclbin.
This step uses the v++
tool, with a command line that looks like this:
v++ -g -t hw_emu --platform xilinx_u50_gen3x16_xdma_201920_3 --save-temps --profile.data all:all:all --profile.exec all:all:all -lo xclbin/kernel.xclbin xclbin/kernel.xo
Fortunately, the v++
tool doesn't need any Tcl to drive it; all the action happens on the command line.
External Stages
fud supports using stages that aren't defined in its main source tree. These are known as 'external stages' and they provide a mechanism for projects using Calyx to take advantage of fud. You can register an external stage with:
fud register stage_name -p /path/to/stage.py
Once an external stage is registered, it behaves exactly like any other stage.
You can remove an external stage with:
fud register stage_name --delete
The following defines a stage that transforms MrXL programs to Calyx programs.
from fud.stages import Stage, SourceType
from fud.utils import shell

class MrXLStage(Stage):
    """
    Stage that invokes the MrXL frontend.
    """

    name = "mrxl"

    def __init__(self):
        """
        Initialize this stage. Initializing a stage *does not* construct its
        computation graph.
        """
        super().__init__(
            src_state="mrxl",
            target_state="futil",
            input_type=SourceType.Path,
            output_type=SourceType.Stream,
            description="Compiles MrXL to Calyx.",
        )

    @staticmethod
    def defaults():
        """
        Specify defaults that should be added to fud's configuration file when
        this stage is registered.
        """
        return {"exec": "mrxl"}

    def _define_steps(self, input, builder, config):
        """
        Define the steps that will execute in this stage. Each step represents
        a delayed computation that will occur when the stage is executed.
        """
        # Commands at the top-level are evaluated when the computation is being
        # staged
        cmd = config["stages", self.name, "exec"]

        # Computations within a step are delayed from being executed until
        # the full execution pipeline is generated.
        @builder.step(description=cmd)
        def run_mrxl(mrxl_prog: SourceType.Path) -> SourceType.Stream:
            return shell(f"{cmd} {str(mrxl_prog)}")

        # Define a schedule using the steps.
        # A schedule *looks* like an imperative program but actually represents
        # a computation graph that is executed later on.
        return run_mrxl(input)

# Export the defined stages to fud
__STAGES__ = [MrXLStage]
External stages must define default values for configuration keys using the Stage.defaults() static method and the name of the stage using the static name field.
Stage Configuration
Like normal stages, external stages can have persistent configuration information saved using fud config.
To add persistent stage configuration, run:
fud config stages.<stage-name>.<key> <value>
To dynamically override the value of a field during execution, use the -s flag:
fud e -s <stage-name>.<key> <value> ...
The override order for stage configuration is:
- Dynamic values provided by -s.
- Configuration values in the fud config.
- Default values provided by Stage.defaults().
Multiple Paths
fud
can define a stage graph that can have multiple paths between the source
and target.
For example, if you register the Icarus Verilog simulator stage, then multiple
paths can be used to generate VCD files from Dahlia programs:
% fud e --from dahlia --to vcd
[fud] ERROR: Multiple stage pipelines can transform dahlia to vcd:
dahlia → futil → verilog → vcd
dahlia → futil → icarus-verilog → vcd
Use the --through flag to select an intermediate stage
fud
says that both the verilog
and icarus-verilog
stages can be used to
generate the VCD file and you need to provide the --through
flag to decide
which stage to select.
The following command will simulate the program using the icarus-verilog
stage:
% fud e --from dahlia --to vcd --through icarus-verilog
In general, the --through
flag can be repeated as many times as needed to
get a unique fud
transformation pipeline.
Using Stage Priority
If the common workflow uses the same stage every time, it can be annoying to
specify the stage name using the --through
flag.
You can specify a priority field in the configuration of a stage to ensure that fud automatically selects it when multiple paths exist.
For example, to always select the verilog
stage, add the priority 1
to the
stage:
fud c stages.verilog.priority 1
Now, the command fud e --from dahlia --to vcd
is no longer ambiguous; fud
will always choose the verilog
stage to transform programs from Dahlia
sources to VCD.
In case multiple paths have the same cost, fud
will again require the
--through
flag to disambiguate paths.
CIRCT
An ongoing effort is under way to establish Calyx as a dialect in the LLVM umbrella project CIRCT. There is documentation about the Calyx dialect on the MLIR site. While the two representations are semantically equivalent, they are syntactically different. Because the Calyx dialect is still in progress and does not include all the optimizations that the native Rust compiler supports, we have crafted an emitter from the Calyx dialect (MLIR) to the native compiler representation (used by the Rust compiler). This means you can lower from your favorite frontend in MLIR to the Calyx dialect, and continue all the way to SystemVerilog (with spunky optimizations) using the native compiler.
The native compiler also supports round-tripping back into the MLIR representation. We'll assume you've already built the Rust compiler and installed fud. Here are the steps to round-trip:
MLIR to Native Representation
1. Set up the CIRCT project with these instructions.
2. There should be a circt-translate binary in <root-directory>/build/bin. To emit the native compiler representation, use the command:
path/to/circt-translate --export-calyx /path/to/file
For example, you can use the expected output of the test tests/backend/mlir/simple.expect:
calyx.program "main" {
calyx.component @main(%go: i1 {go=1}, %clk: i1 {clk=1}, %reset: i1 {reset=1}) -> (%out: i1, %done: i1 {done=1}) {
%r1.in, %r1.write_en, %r1.clk, %r1.reset, %r1.out, %r1.done = calyx.register @r1 : i1, i1, i1, i1, i1, i1
%_1_1.out = hw.constant 1 : i1
calyx.wires {
calyx.group @Group1 {
calyx.assign %r1.in = %_1_1.out : i1
calyx.assign %r1.write_en = %_1_1.out : i1
calyx.group_done %r1.done : i1
}
}
calyx.control {
calyx.seq {
calyx.enable @Group1
}
}
}
}
Using the command:
# Don't worry too much about the file alias; this is used for testing purposes.
path/to/circt-translate --export-calyx tests/backend/mlir/simple.expect
This should output:
// -p well-formed -b mlir
import "primitives/core.futil";
component main(@go go: 1, @clk clk: 1, @reset reset: 1) -> (out: 1, @done done: 1) {
cells {
r1 = std_reg(1);
}
wires {
group Group1 {
r1.in = 1'd1;
r1.write_en = 1'd1;
Group1[done] = r1.done;
}
}
control {
seq { Group1; }
}
}
Native Representation to MLIR
To round-trip back to the Calyx dialect, we can use fud:
fud exec path/to/file --to mlir
For example,
fud exec tests/backend/mlir/simple.futil --to mlir
This should emit the Calyx dialect once again.
The Calyx Interpreter
The experimental Calyx interpreter resides in the interp/
directory of the
repository.
The interpreter supports all Calyx programs---from high-level programs that
make heavy use of control operators, to fully lowered Calyx programs.
(RTL simulation, in contrast, only supports execution of fully lowered programs.)
There are two ways to use the interpreter: you can directly invoke it, or you can use fud.
Basic Use
To run an example program, try:
cd interp && cargo run tests/control/if.futil
You can see the available command-line options by typing cargo run -- --help.
Interpreting via fud
The interpreter is available as a stage in fud, which lets you provide standard JSON data files as input and easily execute passes on the input Calyx program before interpretation.
You'll want to build the interpreter first:
cd interp && cargo build
Here's how to run a Calyx program:
fud e --to interpreter-out interp/tests/control/if.futil
To provide input data, set the verilog.data
variable, like so:
fud e --to interpreter-out \
-s verilog.data tests/correctness/while.futil.data \
tests/correctness/while.futil
By default, fud will not transform the Calyx code before feeding it to the interpreter.
To run passes before the interpreter, use the futil.flags
variable in conjunction with the -p
flag.
For example, to fully lower the Calyx program before interpreting it:
fud e --to interpreter-out \
-s futil.flags '-p all' \
interp/tests/control/if.futil
The Calyx Interactive Debugger
The Calyx Interactive Debugger is a prototype debugging tool built on top of the Calyx interpreter which exposes a gdb-like interface for debugging Calyx programs.
Getting Started
If you are using fud, getting started with the debugger is easy.
Assuming you are trying to debug a program called my_program.futil
with data
file my_program.futil.data
, invoke the debugger with the following command:
fud e --to debugger -q my_program.futil -s verilog.data my_program.futil.data
This will open the target program in the interactive debugger. Note that fud
uses the quiet flag, -q
, here. This prevents the printing from the fud
tool
from conflicting with the debugger, as both tools interact with standard out.
Advancing Program Execution
step
The simplest way to advance the program is via the step
command which causes
time to advance by a clock tick. It also has a shortcode: s
.
> step
> s
The above snippet advances the program by two steps.
step-over
Another way to advance the program is via the step-over
command. Unlike the
step
command, this command requires a second argument which is the name of the
group to advance over. The step-over
command then advances the program until
the given group is no longer running.
If you want to use the command to advance the program past a group group_1
, do
the following:
> step-over group_1
Note that the step-over
command will do nothing if the given group is not
running.
> step-over other_group
Group is not running
>
continue
Finally, the continue command will run the program until either a breakpoint is
hit or the program terminates. This command is used in conjunction with
breakpoints and watchpoints to provide more targeted inspection. It may also be
accessed with the shortcode c
.
> continue
Main component has finished executing. Debugger is now in inspection mode.
Breakpoints
CIDR supports breakpoints on group definitions. This helps focus attention on suspect portions of the code.
Setting a breakpoint
Breakpoints may be set on the main component by simply specifying the group of interest.
> break group_1
This is identical to
> break main::group_1
For sub-components, the name of the sub-component must be included, with a double colon separating it from the group name. To break on the do_mul group inside the pow sub-component:
> break pow::do_mul
Managing breakpoints
To see a list of breakpoints:
> info break
or
> ib
This produces output like this:
> ib
Current breakpoints:
1. main::group_1 enabled
2. pow::do_mul enabled
All breakpoints have a number associated with them and they may be managed with this number or the group name.
To enable or disable a breakpoint:
> disable group_1 2
> enable 1 pow::do_mul
Note that this is equivalent to:
> disable group_1
> disable 2
> enable 1
> enable pow::do_mul
To delete a breakpoint:
> delete 1
> del pow::do_mul
Deleted breakpoints will be entirely removed while disabled breakpoints will
remain until they are either enabled again or subsequently deleted. Disabled
breakpoints will not cause program execution to halt when continue
-ing.
Inspecting State
display
The display command dumps the full state of the main component without
formatting. Use the print
and print-state
commands for targeted inspection
with formatting.
Formatting codes
CIDR supports several different formatting codes which do the hard work of interpreting the data in human-readable ways.

name | code | description
---|---|---
binary | (default) | A bit vector with the MSB on the left
unsigned | \u | Unsigned bit-num formatting
signed | \s | Two's complement formatting
unsigned fixed-point | \u.N | For N >= 1. Unsigned fixed-point with N fractional bits; the remaining bits are the integral component.
signed fixed-point | \s.N | For N >= 1. Signed fixed-point with N fractional bits; the remaining bits are the integral component.
print
and print-state
These commands allow inspecting instance state with optional formatting. Note that this is different from breakpoints, which operate on definitions. For example, to print the ports of the std_mul instance named mul in the pow instance pow_1 attached to the main component:
> print main.pow_1.mul
As with breakpoints, the leading main may be elided:
> print pow_1.mul
This will print all the ports attached to this multiplier instance with binary formatting.
Formatting codes may be supplied as the first argument.
> print \u pow_1.mul
The print
may also target specific ports on cells, rather than just the cell
itself. To see only the output of the multiplier (with unsigned formatting):
> print \u pow_1.mul.out
The print-state
command works in the same way as the print
command, except
it displays the internal state of a cell, rather than port values. As such, it
can only target cells and only those with some internal state, such as registers
or memories. For example, if the main component has a memory named out_mem
its
contents may be viewed via:
> print-state main.out_mem
or just
> print-state out_mem
As with print
, print-state
supports formatting codes as an optional first
argument. So to view the contents of out_mem
with a signed interpretation:
> print-state \s out_mem
Watchpoints
Watchpoints are like breakpoints, but rather than stopping execution when they are passed, they print out some information. Like breakpoints, they are set on group definitions, such as main::group_1 or pow::do_mul.
Setting watchpoints
The general form of watchpoints looks like
watch [POSITION] GROUP with PRINT-COMMAND
where:
- GROUP is the group definition to be watched
- PRINT-COMMAND is a full print or print-state command to be run by the watchpoint
The optional POSITION
argument may either be before
or after
. This
specifies whether the watchpoint should run when the group first becomes active
(before
) or when the group finishes running (after
). This defaults to
before
if not set.
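For instance, reusing the group and memory names from earlier in this chapter, a watchpoint that dumps out_mem with unsigned formatting every time main::group_1 finishes running would be:
> watch after main::group_1 with print-state \u out_mem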
Managing watchpoints
Watchpoint management is similar to breakpoints. However, there may be multiple watchpoints for a single group definition, so deleting watchpoints via the group name will delete all the watchpoints associated with that group. Watchpoints do not currently have an enable/disable state.
To view all the watchpoint definitions:
> info watch
...
> iw
To delete watchpoints:
> delete-watch 1
> del-watch main::group_1
Viewing the program counter
The where command (alias pc
) displays the currently running portion of the
) displays the currently running portion of the
control tree including active subcomponents. This can be used to more easily
determine the currently active portion of the design as well as visualize how
much of the execution is occurring in parallel at any given point.
Exiting the debugger
Use help
to see all commands. Use exit
to exit the debugger.
The Calyx Compiler
The source code documentation for the compiler can be found here.
The Calyx compiler has several command-line options to control the execution of various passes and backends.
Specifying Primitives Library
The compiler implementation uses a standard library of components to compile programs. The standard library for the compiler is located in:
<path to Calyx repository>/primitives
Specify the location of the library using the -l
flag:
cargo run -- -l ./primitives
Primitive Libraries Format
The primitive libraries consist of a .futil
file paired with a .sv
file. The
.futil
file defines a series of Calyx shim bindings in extern
blocks which
match up with SystemVerilog definitions of those primitives. These libraries may
also expose components written in Calyx, usually defined using primitives
exposed by the file.
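As a rough sketch (the file name and exact contents here are illustrative, mirroring the std_add primitive documented below), an extern block pairs Calyx primitive signatures with the SystemVerilog file that implements them:
extern "core.sv" {
  primitive std_add[WIDTH](left: WIDTH, right: WIDTH) -> (out: WIDTH);
}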
No Calyx program can work without the primitives defined in the Core Library.
Controlling Passes
The compiler is organized as a sequence of passes that are run when the compiler executes.
To get a complete list of all passes, run the following from the repository root:
cargo run -- --list-passes
This generates results of the form:
Passes:
- collapse-control
- compile-control
...
Aliases:
- all: well-formed, papercut, remove-external-memories, ...
...
The first section lists all the passes implemented in the compiler.
The second section lists aliases for combinations of passes that are commonly
run together.
For example, the alias all
is an ordered sequence of default passes executed
when the compiler is run from the command-line.
The command-line provides two options to control the execution of passes:
- -p, --pass: Execute this pass or alias. Overrides the default alias.
- -d, --disable-pass: Disable this pass or alias. Takes priority over -p.
For example, we can run the following to disable the static-timing
pass from
the default execution alias all
:
cargo run -- examples/futil/simple.futil -p all -d static-timing
Core Library
This library defines a standard set of components used in most Calyx programs such as registers and basic bitwise operations.
Contents
Numerical Operators
std_reg<WIDTH>
A WIDTH
-wide register.
Inputs:
- in: WIDTH - An input value to the register, WIDTH bits wide.
- write_en: 1 - The one-bit write enable signal. Indicates that the register should store the value on the in wire.

Outputs:
- out: WIDTH - The value contained in the register.
- done: 1 - The register's done signal. Set high for one cycle after writing a new value.
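For example, a group that writes the constant 1 into a one-bit register r (mirroring the register write from the MLIR round-trip example earlier) looks like:
group write_r {
  r.in = 1'd1;
  r.write_en = 1'd1;
  write_r[done] = r.done;
}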
std_const<WIDTH,VAL>
A constant WIDTH-bit value with value VAL.
Inputs: None
Outputs:
- out: WIDTH - The value of the constant (i.e. VAL)
std_lsh<WIDTH>
A left bit shift. Performs LEFT << RIGHT
. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit value to be shifted
- right: WIDTH - A WIDTH-bit value representing the shift amount

Outputs:
- out: WIDTH - A WIDTH-bit value equivalent to LEFT << RIGHT
std_rsh<WIDTH>
A right bit shift. Performs LEFT >> RIGHT
. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit value to be shifted
- right: WIDTH - A WIDTH-bit value representing the shift amount

Outputs:
- out: WIDTH - A WIDTH-bit value equivalent to LEFT >> RIGHT
std_add<WIDTH>
Bitwise addition without a carry flag. Performs LEFT + RIGHT
. This component
is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit value
- right: WIDTH - A WIDTH-bit value

Outputs:
- out: WIDTH - A WIDTH-bit value equivalent to LEFT + RIGHT
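As a usage sketch (the cell names are assumed: add = std_add(32) and a 32-bit register r), the sum is available combinationally in the same group that drives the adder's inputs:
group incr_r {
  add.left = r.out;
  add.right = 32'd1;
  r.in = add.out;
  r.write_en = 1'd1;
  incr_r[done] = r.done;
}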
std_sub<WIDTH>
Bitwise subtraction. Performs LEFT - RIGHT
. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit value
- right: WIDTH - A WIDTH-bit value

Outputs:
- out: WIDTH - A WIDTH-bit value equivalent to LEFT - RIGHT
std_slice<IN_WIDTH, OUT_WIDTH>
Slice out the lower OUT_WIDTH bits of an IN_WIDTH-bit value. Computes
in[out_width - 1 : 0]
. This component is combinational.
Inputs:
- in: IN_WIDTH - An IN_WIDTH-bit value

Outputs:
- out: OUT_WIDTH - The lower OUT_WIDTH bits of in
std_pad<IN_WIDTH, OUT_WIDTH>
Given an IN_WIDTH-bit input, zero pad from the left to an output of OUT_WIDTH-bits. This component is combinational.
Inputs:
- in: IN_WIDTH - An IN_WIDTH-bit value to be padded

Outputs:
- out: OUT_WIDTH - The padded value
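For instance (all names assumed), with sl = std_slice(32, 8), a group can truncate a 32-bit register r32 into an 8-bit register r8:
group truncate {
  sl.in = r32.out;
  r8.in = sl.out;
  r8.write_en = 1'd1;
  truncate[done] = r8.done;
}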
Logical Operators
std_not<WIDTH>
Bitwise NOT. This component is combinational.
Inputs:
- in: WIDTH - A WIDTH-bit input

Outputs:
- out: WIDTH - The bitwise NOT of the input (~in)
std_and<WIDTH>
Bitwise AND. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: WIDTH - The bitwise AND of the arguments (left & right)
std_or<WIDTH>
Bitwise OR. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: WIDTH - The bitwise OR of the arguments (left | right)
std_xor<WIDTH>
Bitwise XOR. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: WIDTH - The bitwise XOR of the arguments (left ^ right)
Comparison Operators
std_gt<WIDTH>
Greater than. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: 1 - A single-bit output. 1 if left > right else 0.
std_lt<WIDTH>
Less than. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: 1 - A single-bit output. 1 if left < right else 0.
std_eq<WIDTH>
Equality comparison. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: 1 - A single-bit output. 1 if left = right else 0.
std_neq<WIDTH>
Not equal. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: 1 - A single-bit output. 1 if left != right else 0.
std_ge<WIDTH>
Greater than or equal. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: 1 - A single-bit output. 1 if left >= right else 0.
std_le<WIDTH>
Less than or equal. This component is combinational.
Inputs:
- left: WIDTH - A WIDTH-bit argument
- right: WIDTH - A WIDTH-bit argument

Outputs:
- out: 1 - A single-bit output. 1 if left <= right else 0.
Memories
std_mem_d1
A one-dimensional memory.
Parameters:
- WIDTH - Size of an individual memory slot.
- SIZE - Number of slots in the memory.
- IDX_SIZE - The width of the index given to the memory.

Inputs:
- addr0: IDX_SIZE - The index to be accessed or updated
- write_data: WIDTH - Data to be written to the selected memory slot
- write_en: 1 - One-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0

Outputs:
- read_data: WIDTH - The value stored at addr0. This value is combinational with respect to addr0.
- done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.
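For example (a sketch with assumed names), given a cell foo = std_mem_d1(32, 4, 32), a group that stores 42 into slot 0 looks like:
group write_foo {
  foo.addr0 = 32'd0;
  foo.write_data = 32'd42;
  foo.write_en = 1'd1;
  write_foo[done] = foo.done;
}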
std_mem_d2
A two-dimensional memory.
Parameters:
- WIDTH - Size of an individual memory slot.
- D0_SIZE - Number of memory slots for the first index.
- D1_SIZE - Number of memory slots for the second index.
- D0_IDX_SIZE - The width of the first index.
- D1_IDX_SIZE - The width of the second index.

Inputs:
- addr0: D0_IDX_SIZE - The first index into the memory
- addr1: D1_IDX_SIZE - The second index into the memory
- write_data: WIDTH - Data to be written to the selected memory slot
- write_en: 1 - One-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0 and addr1

Outputs:
- read_data: WIDTH - The value stored at mem[addr0][addr1]. This value is combinational with respect to addr0 and addr1.
- done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.
std_mem_d3
A three-dimensional memory.
Parameters:
- WIDTH - Size of an individual memory slot.
- D0_SIZE - Number of memory slots for the first index.
- D1_SIZE - Number of memory slots for the second index.
- D2_SIZE - Number of memory slots for the third index.
- D0_IDX_SIZE - The width of the first index.
- D1_IDX_SIZE - The width of the second index.
- D2_IDX_SIZE - The width of the third index.

Inputs:
- addr0: D0_IDX_SIZE - The first index into the memory
- addr1: D1_IDX_SIZE - The second index into the memory
- addr2: D2_IDX_SIZE - The third index into the memory
- write_data: WIDTH - Data to be written to the selected memory slot
- write_en: 1 - One-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0, addr1, and addr2

Outputs:
- read_data: WIDTH - The value stored at mem[addr0][addr1][addr2]. This value is combinational with respect to addr0, addr1, and addr2.
- done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.
std_mem_d4
A four-dimensional memory.
Parameters:
- WIDTH - Size of an individual memory slot.
- D0_SIZE - Number of memory slots for the first index.
- D1_SIZE - Number of memory slots for the second index.
- D2_SIZE - Number of memory slots for the third index.
- D3_SIZE - Number of memory slots for the fourth index.
- D0_IDX_SIZE - The width of the first index.
- D1_IDX_SIZE - The width of the second index.
- D2_IDX_SIZE - The width of the third index.
- D3_IDX_SIZE - The width of the fourth index.

Inputs:
- addr0: D0_IDX_SIZE - The first index into the memory
- addr1: D1_IDX_SIZE - The second index into the memory
- addr2: D2_IDX_SIZE - The third index into the memory
- addr3: D3_IDX_SIZE - The fourth index into the memory
- write_data: WIDTH - Data to be written to the selected memory slot
- write_en: 1 - One-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0, addr1, addr2, and addr3

Outputs:
- read_data: WIDTH - The value stored at mem[addr0][addr1][addr2][addr3]. This value is combinational with respect to addr0, addr1, addr2, and addr3.
- done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.
The calyx Library
The implementation of the compiler separates the frontend from the internal data structures.
This allows developers to use the compiler as a library.
The calyx library exports all the data structures used by the compiler and can be used to design new tools that make use of Calyx.
See the library documentation for an example of how to use the calyx
library.
Dataflow Optimizations
In general, dataflow analysis uses the control and data flow of a program to compute various properties (liveness, reaching definitions, ...) at each point in a program.
For Calyx, dataflow analyses use the explicit control program and knowledge about the dataflow of each group to compute properties about each group.
Basic blocks vs. Groups
Normally, dataflow analyses compute a property at each basic block of a control flow graph (CFG). Calyx doesn't have a notion of basic blocks, so analyses instead compute a property at each group in a program.
Because Calyx separates the control flow of a program from the specification of groups, it's possible for a group to appear multiple times in the control program. For this reason we compute a property at each group enable rather than each group definition. The property at each group definition can easily be computed as the meet over all group enables.
Dataflow on an AST
Dataflow analyses are typically performed by finding the fixed point of a set of equations defined at each node of a control flow graph (CFG) using the worklist algorithm.
Because our control AST is little more than the edges of a reducible CFG, we don't bother to build an explicit CFG and instead perform the dataflow analysis directly on the AST using Calyx's visitor infrastructure.
Abstract Algorithm
We model each control statement s as a function f: p -> p, where p is the type of the property. Control statements that have children define how information flows between their children.
Enable
f
for enable A
is similar to the transfer function in standard dataflow analysis. It
uses information from the definition of group A
to modify the input in some way. For example,
if p
is the set of live variables, the enable f
is defined as:
f(enable A, inputs) = (inputs - kill(A)) | gen(A)
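To make this concrete, suppose (as a hypothetical) that group A reads register y and writes register r, so gen(A) = {y} and kill(A) = {r}. Applying the transfer function to the live set {r, z} gives:
f(enable A, {r, z}) = ({r, z} - {r}) | {y} = {y, z}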
Seq
seq
defines sequential control flow edges between its children.
It is implemented by threading its input through all of its children to produce an output.
f(seq { A; B; C; ...; Z; }, inputs) =
f(A, inputs)
|> f(B, _)
|> f(C, _)
|> ...
|> f(Z, _)
To implement a backwards dataflow analysis, all you need to do is reverse the
order that seq
pipes inputs to its children:
// reverse
f(seq { A; B; C; ...; Z; }, inputs) =
f(Z, inputs)
|> ...
|> f(C, _)
|> f(B, _)
|> f(A, _)
If
if
passes its inputs to its condition group and then feeds the result of this
to both of its children. The output is the union of the outputs of both of its
children. This is standard.
f(if some.port with G { True; } else { False; }, inputs) =
f(True, f(G, inputs)) | f(False, f(G, inputs))
While
while
statements are interesting because the outputs of the body may affect the
input to the body. For this reason, we need to find a fixed point:
f(while some.port with G { body; }, inputs) =
P = inputs;
loop until P stops changing {
P = f(body, f(G, P))
}
Par
Par is the only statement that differs substantially from traditional dataflow because
control flow graphs don't support nodes running in parallel. In other words, there is only
ever one thing executing. However, par
changes this and allows multiple things to
execute at the same time. Consider the following example where we are computing
the liveness of x
to see why this is weird:
F; // wr x
...
par {
A; // wr x
B;
}
G; // rd x
Is x alive between F and the beginning of par? The answer is no, because we know that both A and B will run. Therefore the write to x in F cannot be seen by any group.
At first glance this doesn't seem like a problem. Great, we say, we can just take
the union of the outputs of the children of par
and call it a day.

This is wrong because B
doesn't kill x
and so x
is alive coming into B
.
The union preserves this information and results in x
being alive above par
.
Taking the set intersection is not quite right here either. Consider adding another group
C
that reads from x
.

We have no information about how this read is ordered with the write
to x
in A
so we have to assume that x
is alive above par
. If we take the intersection
here:
live(A) & live(B) & live(C)
= {} & {x} & {x}
= {}
We get the wrong answer. More generally, we can see that union clobbers any writes and intersection clobbers any reads that happen in the par.
This problem is solved by passing the gen and kill sets along with the live sets. Then par can set its output to be
(union(live(children)) - union(kill(children))) | union(gen(children))
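Applying this to the example above, where A writes x (kill(A) = {x}), B touches nothing, and C reads x (gen(C) = {x}):
(union(live(children)) - union(kill(children))) | union(gen(children))
  = (({} | {x} | {x}) - ({x} | {} | {})) | ({} | {} | {x})
  = ({x} - {x}) | {x}
  = {x}
so x is correctly reported as live above the par.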
The final tricky bit is thinking about how information flows between siblings in a par statement. Consider again the picture above with three nodes: A, B, and C. Should x be live at B? For liveness it turns out to be yes, but bear with me for a second for a thought experiment, and consider the case where we have the guarantee that statements running in parallel cannot interact with each other. This lets us reorder statements in some cases.
Then there seems to be an information trade-off for how to define the liveness of x
at B
:
- You could say that x is dead at B because B doesn't see any previous writes to x and doesn't read from x. This implies that you could potentially replace writes to other registers with x. However, this by itself would cause x to be written to twice in parallel; you would have to reorder B to run before A. The takeaway here is that calling x dead at B gives the register-reuse pass more information to work with. To compute this information, B needs the gens and kills from all of its siblings (for the same reason that par needed them). This is not particularly hard to implement, but it's worth noting.
- If you say that x is live at B, then you can never rewrite B to use x instead of some other register, but you also don't have to worry about reordering statements in a par.
Leaving thought experiment land, in our case we can never reorder statements in a par
because
siblings may interact with each other in arbitrary ways. For this reason, we must say that x
is live at B
. However, it's not immediately clear to me that this will be true of all dataflow
analyses. That's why I set this thought experiment down in writing.
Equivalence to worklist algorithm
In the normal worklist algorithm, we add a statement back to the worklist when its predecessor has changed.
The intuition for why our algorithm is equivalent is that the only entry point into the children of each parent control statement is the parent statement itself. The only way a child statement would need to be recomputed is if the inputs to the parent need to be recomputed, and anything above the parent in the tree will take care of that re-computation.
Debugging
This section contains tips and tools for debugging Calyx programs.
Debugging Tips
Debugging Calyx programs that run to completion and generate the wrong value can be challenging. We can first try to eliminate some common causes of problems.
Disabling Optimizations
The first step is disabling optimization passes.
The Calyx compiler refers to bundles of optimizations using two aliases:
pre-opt
and post-opt
.
pre-opt
passes run before the main compilation passes that remove the control
program while post-opt
passes run after.
To disable the passes, add the flag -d pre-opt -d post-opt
to compiler invocation:
- For the compiler: futil <filename> -d pre-opt -d post-opt
- For fud: fud ... -s futil.flags "-d pre-opt -d post-opt"
If the execution generates the right result, then one of the optimizations
passes is incorrect.
To identify which optimization pass is wrong, add back individual passes and see
when the execution fails.
To do so, first run futil --list-passes
to see the names of the passes that make
up the pre-opt
and post-opt
aliases.
Next, re-enable each pass by doing: -d pre-opt -d post-opt -p <pass1> -p <pass2>
and so on.
Disabling Static Timing
static-timing
is one of the two control compilation passes.
It uses the latency information on groups to generate faster hardware.
Disable it using the flag -d static-timing
.
Reducing Test Files
It is often possible to reduce the size of the example program that is generating incorrect results. To perform a reduction, we run the program twice: once with a "golden workflow" that we trust to generate the right result, and once with the buggy workflow.
For example, if we've identified the problem to be in one of the Calyx passes, the "golden workflow" runs the program without the pass while the buggy workflow runs it with the pass enabled. This case is so common that we've written a script that can run programs with different sets of flags to the Calyx compiler and show the difference in the outputs after simulation.
The script is invoked as:
tools/flag-compare.sh <calyx program> <data>
By default, the script will try to run the programs by simulating them through
Verilator by providing fud
with the target --to dat
.
If you'd like to use the Calyx Interpreter instead, run the following command:
tools/flag-compare.sh <calyx program> <data> interpreter-out
Reducing Calyx Programs
The best way to reduce a Calyx program is to delete group enables from the control program and see if the generated program still produces the wrong output. While doing this, make sure you're not deleting an update to a loop variable, which might cause infinite loops.
By default, the compiler will complain if the program contains a group that is not used in the control program, which can get in the way of minimizing programs.
To get around this, run the dead-group-removal
pass before the validation
passes:
futil -p dead-group-removal -p validate ...
Reducing Dahlia Programs
If you're working with Dahlia programs, it is also possible to reduce the
program with the script since it simply uses fud
to run the program with the
simulator.
As with Calyx reduction, try deleting parts of the program and seeing if the
flag configurations for the Calyx program still generate different outputs.
Waveform Debugging
Waveform debugging is the final way of debugging Calyx programs. A waveform captures the value of every port at every clock cycle and can be viewed using a waveform viewer such as GTKWave or WaveTrace. Because of this level of granularity, it generates a lot of information. To make the information more digestible, we can use information generated by Calyx during compilation.
For waveform debugging, we recommend disabling the optimization passes and static timing compilation (unless you're debugging these passes). In this debugging strategy, we'll do the following:
- Dump out the control FSM for the program we're debugging.
- Find the FSM states that enable the particular groups that might be misbehaving.
- Open the waveform viewer and find clock cycles where the FSM takes the corresponding values and identify other signals that we care about.
Consider the control section from examples/futil/dot-product.futil:
seq {
let0;
while le0.out with cond0 {
seq {
par {
upd0;
upd1;
}
let1;
let2;
upd2;
upd3;
}
}
}
Suppose that we want to make sure that let0
is correctly performing its
computation.
We can generate the control FSM for the program using:
futil <filename> -p top-down-cc
This generates a Calyx program with several new groups.
We want to look for groups with the prefix tdcc
which look something like
this:
group tdcc {
let0[go] = !let0[done] & fsm.out == 4'd0 ? 1'd1;
cs_wh.in = fsm.out == 4'd1 ? le0.out;
cs_wh.write_en = fsm.out == 4'd1 ? 1'd1;
cond0[go] = fsm.out == 4'd1 ? 1'd1;
par[go] = !par[done] & cs_wh.out & fsm.out == 4'd2 ? 1'd1;
let1[go] = !let1[done] & cs_wh.out & fsm.out == 4'd3 ? 1'd1;
let2[go] = !let2[done] & cs_wh.out & fsm.out == 4'd4 ? 1'd1;
upd2[go] = !upd2[done] & cs_wh.out & fsm.out == 4'd5 ? 1'd1;
upd3[go] = !upd3[done] & cs_wh.out & fsm.out == 4'd6 ? 1'd1;
...
}
The assignments to let0[go]
indicate what conditions make the let0
group
execute.
In this program, we have:
let0[go] = !let0[done] & fsm.out == 4'd0 ? 1'd1;
This states that let0 will be active when the state of the fsm register is 0, along with some other conditions.
The remainder of the group defines how the state in the fsm
variable changes:
...
fsm.in = fsm.out == 4'd0 & let0[done] ? 4'd1;
fsm.write_en = fsm.out == 4'd0 & let0[done] ? 1'd1;
fsm.in = fsm.out == 4'd1 & cond0[done] ? 4'd2;
fsm.write_en = fsm.out == 4'd1 & cond0[done] ? 1'd1;
...
For example, we can see that when the value of the FSM is 0 and let0[done]
becomes high, the FSM will take the value 1.
Once we have this information, we can open the VCD file and look at points when
the fsm
register has the value 1 and check to see if the assignments in
let0
activated in the way we expected.
Emitting Calyx from Python
Our frontends are written in Python3 and make use of the calyx
library to
generate their code.
To install the library, run the following from the repository root (requires flit installation):
cd calyx-py && flit install -s
The library provides an example:
python calyx-py/test/example.py
Building a Frontend for Calyx
In this tutorial, we're going to build a compiler for a small language called MrXL.
MrXL Overview
MrXL provides constructs to create arrays, and perform map
and reduce
operations on those arrays. Here's an example of a dot product implementation in MrXL:
input avec: int[4]
input bvec: int[4]
output dot: int
prodvec := map 1 (a <- avec, b <- bvec) { a * b }
dot := reduce 1 (a, b <- prodvec) 0 { a + b }
A map expression produces a new vector, each element an evaluated expression that can use elements of other vectors. In the above example, the map expression multiplies the values of avec and bvec. These expressions also have parallelism factors: the map expression above has a parallelism factor of 1, while a factor of 16 would mean we stamp out 16 multipliers to speed up the computation.
reduce
expressions walk over memories and accumulate a result into a register. In the above code snippet, we add together all of the elements of prodvec
and place them in a register named dot
.
Arrays can be input
arrays, which we populate with some input data, or output
arrays, which the program will populate for us.
Run a MrXL Program
First, you'll need to have the MrXL stage installed for fud
. See the MrXL docs.
MrXL programs need to be fed data in the form of JSON files. Let's try to run this program, which has a parallelism factor of two:
input foo: int[4]
output baz: int[4]
baz := map 2 (a <- foo) { a + 5 }
with this data file, containing banked memories to allow for the parallelism:
{
"foo_b0": {
"data": [
1,
2
],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
},
"foo_b1": {
"data": [
3,
4
],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
},
"baz_b0": {
"data": [
0,
0
],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
},
"baz_b1": {
"data": [
0,
0
],
"format": {
"numeric_type": "bitnum",
"is_signed": false,
"width": 32
}
}
}
Run the program with the supplied data by typing:
fud exec frontends/mrxl/test/add.mrxl --from mrxl --to vcd -s verilog.data frontends/mrxl/test/add.mrxl.data
Build a Compiler for MrXL
This guide will walk you through the steps to build a Python program that compiles MrXL programs to Calyx code. To simplify things, we'll make a few assumptions:
- Every array in a MrXL program has the same length.
- Every integer in our generated hardware will be 32 bits.
- Every map and reduce body will be either a multiplication or an addition of an array element or an integer.
The following sections will outline these two high level tasks:
- Parse MrXL into a representation we can process with Python
- Generate Calyx code
Parse MrXL into an AST
To start, we'll parse this MrXL program into a Python AST representation. We chose to represent AST nodes with Python dataclass
s. Our toplevel AST node looks like this:
@dataclass
class Prog:
decls: List[Decl]
stmts: List[Stmt]
Decl
nodes correspond to array declarations like input avec: int[1024]
, and carry data about whether they're an input
or output
array, their name, and their type:
@dataclass
class Decl:
input: bool # Otherwise, output.
name: str
type: Type
Stmt
nodes represent statements such as dot := reduce 4 (a, b <- prodvec) 0 { a + b }
, and contain more nested nodes representing their function header and body, and type of operation.
Now we can decide on rules for generating code for each kind of AST node. Depending on the node, we might need to add code to cells, wires, or control.
Generate Calyx Code
The skeleton of a Calyx program has three sections, and looks like this:
component main() -> () {
cells { }
wires { }
control { }
}
cells
contains declarations for logical hardware units like adders, memories and registers. wires
contains group
s that connect together the units declared in cell
s and form the structure of the hardware. control
contains the logic specifying when the group
s will perform their computation. Walking the nodes of the AST we defined earlier, we'll generate strings that we'll insert into each of these sections. The next few sections will discuss the different node types.
Decl
nodes
Decl
nodes instantiate new memories and registers. We need these to be instantiated in the cells
section of our Calyx output. Here's Calyx code that creates a new memory foo
, with 4 32-bit elements and a 32-bit index port:
foo = std_mem_d1(32, 4, 32);
For each Decl
node, we need to determine if we're instantiating a memory or a register, and then translate that to a corresponding Calyx declaration and place that inside the cells
section of our generated program. Here's some code from our compiler that walks through each register and memory declaration, and generates a Calyx program with those registers:
def emit(prog):
    """
    Returns a string containing a Calyx program, compiled from `prog`, a MrXL
    program.
    """
    cells, wires, control = [], [], []
    # The compiler walks prog.decls here, calling emit_mem_decl for each
    # declaration and appending the result to `cells`.
(emit_mem_decl
emits a string of the form "mem_name = std_mem_d1(<element_width>, <num_elements>, <index_width>)"
.)
Map
and Reduce
nodes
For every map or reduce node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation. For map operations, we perform a computation on an element of an input array and store the result in a result array. For reduce operations, we combine each element of an input array with an accumulator register, which also receives the result. For example, a reduce that sums the elements of an input array would initialize the accumulator to 0 and add each array element to it in turn.
We can implement these behaviors using Calyx groups:
- incr_idx: Increments an idx register using an adder. This group is done when the idx register is written to.
- cond: Applies a "less than" operator to idx and the length of our input arrays, using the le hardware unit.
- eval_body: Reads from an array, performs some kind of computation, and writes the result of the computation to an accumulator register or another array.
We'll make these groups for each Map and Reduce node, so to avoid naming collisions, we'll suffix each group with an integer starting at 0, incrementing each time we need to add a new set of groups. These groups will be added to the wires section. We'll also need logic in the control section that uses these groups to process arrays:
while le0.out with cond0 {
seq { eval_body0; incr_idx0; }
}
This logic orchestrates our groups, basically representing iterating over our array and evaluating some computation on each element of the array. On each iteration we signal for the eval_body0
group to do its work, followed sequentially by incr_idx0
to advance our index register so that we can work on the next element of the array.
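To make this concrete, here's a sketch (all cell names assumed) of what eval_body0 might look like for a map that adds 5 to each element, with foo and baz as std_mem_d1 cells, idx0 as the index register, and add0 as a std_add instance:
group eval_body0 {
  foo.addr0 = idx0.out;
  add0.left = foo.read_data;
  add0.right = 32'd5;
  baz.addr0 = idx0.out;
  baz.write_data = add0.out;
  baz.write_en = 1'd1;
  eval_body0[done] = baz.done;
}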
Add Parallelization
MrXL allows you to parallelize your map and reduce operations. Let's revisit the map example from earlier, now with a parallelism factor of 4:
input foo: int[4]
output baz: int[4]
baz := map 4 (a <- foo) { a + 5 }
The number 4 after the map
specifies the number of adders we can use at once to parallelize this computation. There are a few ways we could parallelize this program, and one of them is to split the memories used in the map
operation into 4 separate memory banks, and then we can read from each bank of foo
and write into each bank of baz
simultaneously. In general, we can break memories of size m
into b
banks (each with size m/b
), and then simultaneously process those b
banks. Realizing this in Calyx means creating separate memories for each bank, and creating group
s to process each bank. Here are the calyx-py AST classes the compiler imports to build the banked memories and the control that drives them:
from calyx.py_ast import (
    Group, CompVar, Stdlib, Cell, Program, Component, Import,
    SeqComp, ConstantPort, HolePort, CompPort, Enable, While,
    ParComp, CombGroup,
)
In the Map
and Reduce
code generation section we described group
s that could be orchestrated to iterate over a memory and process it. We'll now have to do that for each memory bank, and then parallelize these operations in the generated Calyx's control
section. We can accomplish this with Calyx's par
keyword, signalling to execute groups in parallel. Here's an example of executing four while loops in parallel:
par {
while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
while le_b2.out with cond_b2 { seq { eval_body_b2; incr_idx_b2; } }
while le_b3.out with cond_b3 { seq { eval_body_b3; incr_idx_b3; } }
}
Conclusion
Hopefully this should be enough to get you started with writing your own MrXL compiler. Some more follow up tasks you could try if you're interested:
- Implement code generation for reduce statements, which our compiler does not include.
- Implement code generation that allows memories to differ from one another in size.
- Implement complex function body expressions. We only support binary operations with simple operands, like
a + 5
. Different hardware components take multiple cycles to execute: for example, a register takes 1 cycle to write data to, but a memory might take more. This complicates hardware design, as you need to account for differing latencies among hardware components.
Frontend Compilers
Several compilers generate Calyx programs from other high-level languages.
- Dahlia: Dahlia is an imperative, loop-based programming language.
- Systolic Array Generator: Generates systolic arrays using parameters.
- TVM Relay: Relay is an IR for the TVM framework to replace old computation graph based IRs.
- NTT Pipeline Generator: Generates a pipeline for the number theoretic transform.
- MrXL: A simple example frontend developed in the frontend tutorial.
Dahlia
Dahlia is an imperative, loop-based programming language for designing hardware accelerators.
Installation
Clone the repository and build the Dahlia compiler (the build uses sbt):
git clone https://github.com/cucapra/dahlia.git
cd dahlia
sbt install
sbt assembly
chmod +x ./fuse
The Dahlia compiler can be run using the ./fuse
binary:
./fuse --help
Finally, configure fud
to use the Dahlia compiler:
fud c stages.dahlia.exec <path to Dahlia repository>/fuse
Use fud
to check if the compiler was installed correctly:
fud check
fud
should report that the Dahlia compiler is available and has the right
version.
If something went wrong, try following the instructions to build the Dahlia compiler from its repository.
Compiling Dahlia to Calyx
Dahlia programs can be compiled to Calyx using:
fud e --from dahlia <input file> --to futil
The Dahlia backend for Calyx is neither complete nor stable. If you find a confusing error or a wrong program, please open an issue.
Systolic Array
Systolic arrays are commonly used to implement fast linear-algebra computations. See this paper for an overview on systolic arrays.
The systolic array frontend lives in the systolic-lang folder in the Calyx repository and generates systolic arrays that can perform matrix multiplies.
The gen-systolic.py file contains the entire program required to generate systolic arrays. In order to generate an 8x8 systolic array, run:
./frontends/systolic-lang/gen-systolic.py -tl 8 -td 8 -ll 8 -ld 8
Installation
Install the calyx-py library.
Command Line Options
The command line options configure the dimensions of the generated systolic array. There are no other properties of the systolic array that can be configured.
- --top-length, --left-length: The length of the top and left sides of the systolic array.
- --top-depth, --left-depth: The length of the input streams from the top and left sides of the array.
TVM Relay
TVM is a compiler for machine learning frameworks that can optimize and target kernels to several different backends. Relay is a high level intermediate representation for the TVM framework. The goal of Relay is to replace old computation graph based IRs with a more expressive IR. More information can be found in this paper.
The TVM Relay frontend lives in the relay-lang folder in the Calyx repository and generates Calyx components from the Relay intermediate representation.
Installation
- Clone the TVM repository with commit hash ccacb1ec1:
git clone --recursive git@github.com:apache/incubator-tvm.git
cd incubator-tvm && git checkout ccacb1ec1
- Set up to build (the default configuration is fine because we don't need any fancy backends like LLVM or CUDA):
mkdir build && cd build
cp ../cmake/config.cmake .
- Build TVM:
cmake -G Ninja .. && ninja
- Install the tvm Python package by building a wheel:
cd ../python && python3 setup.py bdist_wheel
pip3 install --user dist/tvm-*.whl
- Install the accompanying topi Python package:
cd ../topi/python && python3 setup.py bdist_wheel
pip3 install --user dist/topi-*.whl
- Install ANTLR v4.7.2 (required for the Relay text format parser):
pip3 install -Iv antlr4-python3-runtime==4.7.2
- To run the MLP net and VGG net examples, install pytest:
pip3 install pytest
- Install Dahlia, which is used when lowering Relay call nodes to Calyx.
- Install the calyx-py library.
Run an Example
Try this to run a simple example:
cd calyx/frontends/relay
python3 example.py tensor_add
- -h: Help option; shows available examples.
- -r: Dumps the Relay IR. Otherwise, it dumps the Calyx output.
Simulate an ONNX Model
A simple script is provided to run an Open Neural Network Exchange (ONNX) model. In addition to installing TVM Relay above, you'll need the following PIP installations for ONNX simulation and image pre-processing:
pip3 install opencv-python Pillow mxnet onnx simplejson
For example, we can simulate the LeNet ONNX model found here using the following command:
python3 frontends/relay/onnx_to_calyx.py \
-n "lenet" \
-d "MNIST" \
-i "/path/to/image.png" \
-onnx "/path/to/model.onnx" \
-o calyx
- -n: The name of the input net. This is mostly used for naming the output files.
- -d: The dataset against which the input will be classified. This is necessary to determine what preprocessing should be done on the image, e.g. "mnist" or "imagenet".
- -i: The file path to the input image which you want classified.
- -onnx: The file path to the ONNX model.
- -o: The type of output:
  - tvm: Executes the ONNX model using the TVM executor. Prints the final softmax value to the console. No postprocessing is conducted.
  - relay: Outputs a file with the corresponding Relay program: <net_name>.relay
  - calyx: Outputs a .data file and Calyx program for simulation: <net_name>.futil, <net_name>.data
  - all: All of the above.
Number Theoretic Transform (NTT)
The number theoretic transform is a generalization of the fast Fourier transform that uses an nth primitive root of unity from a quotient ring instead of the field of complex numbers.
It is commonly used to speed up computer arithmetic, such as the multiplication of large integers and large degree polynomials. The pipeline produced here is based upon this paper, which also provides some background information on NTT.
The NTT pipeline frontend lives in the ntt folder in the Calyx repository and generates the pipeline for the NTT transform.
The gen-ntt-pipeline.py
file contains the entire program required to
generate NTT pipelines. In order to generate a pipeline with
bit width 32
, input size 4
, and modulus value 97
:
./frontends/ntt-pipeline/gen-ntt-pipeline.py -b=32 -n=4 -q=97
Installation
Install the calyx-py library.
The generator also produces a table to illustrate which operations are occurring during each stage of the pipeline. This requires installing PrettyTable:
pip3 install prettytable numpy
Fud Stage
The NTT pipeline defines an external fud stage to transform configuration files into Calyx programs. To install, run:
fud register ntt -p frontends/ntt-pipeline/fud/ntt.py && fud check
This should report the newly installed ntt
stage in the configuration.
Configuration Files
Configuration files simply specify command line parameters:
{
"input_bitwidth": 32,
"input_size": 4,
"modulus": 97
}
Command Line Options
The command line options configure the bit width, size, and modulus value of the pipeline.
- --input_bitwidth: The bit width of each value in the input array.
- --input_size: The length (or size) of the input array.
- --modulus: The (prime) modulus value used during the transformation.
- --parallel_reduction: Decreases fan-out by reducing the number of groups executed in parallel by this factor.
MrXL
The MrXL frontend is a toy frontend developed for the frontend tutorial. As such, it is less rigorously tested and might have bugs.
MrXL is an example DSL for demonstrating Calyx. MrXL programs consist of map
and reduce
operations on arrays. For example, this is a dot product implementation:
input avec: int[1024]
input bvec: int[1024]
output dot: int
prodvec := map 16 (a <- avec, b <- bvec) { a * b }
dot := reduce 4 (a, b <- prodvec) 0 { a + b }
The numbers that come right after map
and reduce
are parallelism factors that guide the generation of hardware.
Install
Install the calyx-py library.
The MrXL implementation is in Python and uses Flit.
First, install flit (pip install flit or similar), and then type the following inside frontends/mrxl:
flit install --symlink
This creates a symbolic link to the mrxl directory and installs the mrxl command line tool.
By default, fud looks for the mrxl
executable to enable
the mrxl
compilation stage.
Type fud check
to make sure fud
reports that the mrxl
compiler has been
found.
Interpreter
To run the interpreter, do this:
mrxl <program> --data <indata> --interpret
where <program>
is a MrXL source code file and <indata>
is a JSON file containing values for all the variables declared as input
in the program. The interpreter dumps the output
variables as JSON to stdout.
You can try this, for example:
mrxl test/dot.mrxl --data test/dot.json --interpret
Compiling to Calyx
To run the compiler, leave off the --interpret
and --data
flags:
mrxl test/dot.mrxl
In order to run the compiler through fud
, pass the --from mrxl
flag:
fud e --from mrxl <mrxl file> --to futil
To simulate the Verilog generated from the mrxl compiler, set the -s verilog.data
as usual:
fud e --from mrxl <mrxl file> --to dat -s verilog.data <data file>
Runt
Runt (Run Tests) is the expectation testing framework for Calyx. It organizes collections of tests into test suites and specifies configuration for them.
Runt uses runt.toml
to define the test suites and configure them.
Cheatsheet
The runt workflow involves two things:
- Running tests and comparing differences
- Saving new or changed golden files
To run all the tests in a directory, run runt
with a folder containing runt.toml
.
The following commands help focus on specific tests to run:
- -i: Include files that match the given pattern. The pattern is matched against <suite name>:<file path> so it can be used to filter both test suites and specific paths. General regex patterns are supported.
- -x: Exclude files that match the pattern.
- -o: Filter reported test results based on test status. Running with miss will only show the tests that don't have an .expect file.
Differences. -d
or --diff
shows differences between the expected test output and the generated output. Use this in conjunction with -i
to focus on particular failing tests.
Saving Files. -s is used to save test outputs when they have expected changes. In the case of miss tests, i.e. tests that currently don't have any expected output file, this saves a completely new .expect file.
Dry run. The -n flag shows the commands that runt will run for each test. Use this when you want to run the command for a test directly.
For other options, look at runt --help
which documents other features in runt
.
For instructions on using runt, see the official documentation.
exp
Generator
The exp
generator uses a Taylor series approximation to calculate the fixed point value of the natural
exponential function e^x
. The Maclaurin series
for the function can be written as:
e^x = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!
where n
is the nth degree or order of the polynomial.
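For instance, evaluating the degree-4 polynomial at x = 1 gives 1 + 1 + 1/2 + 1/6 + 1/24 ≈ 2.7083, already close to the true value e ≈ 2.71828.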
For signed values, we can take the reciprocal value:
e^(-x) = 1/e^x
The gen_exp.py
file can generate an entire Calyx program for testing purposes.
The main
component contains memories x
(for the input) and ret
for the result of e^x
.
In order to generate an example program with degree 4
, bit width 32
, integer bit width 16
, and x
interpreted as a signed value:
./calyx-py/calyx/gen_exp.py -d 4 -w 32 -i 16 -s true
Similarly, it provides a function to produce only the necessary components to be dropped into other Calyx programs.
Installation
Install the calyx-py library.
Command Line Options
The command line options configure the degree (or order) of the Taylor series, bit width, integer bit width, and sign.
- --degree: The degree of the Taylor polynomial.
- --width: The bit width of the value x.
- --int_width: The integer bit width of the value x. The fractional bit width is then inferred as width - int_width.
- --is_signed: The signed interpretation of the value x. If x is a signed value, this should be true; otherwise, false.
Editor Highlighting
Vim
The vim extension highlights files with the extension .futil
.
It can be installed using a plugin manager such as vim-plug using a
local installation.
Add the following to your vim plug configuration:
Plug '<path-to-calyx>/tools/vim'
And run:
:PlugInstall
Emacs
futil-mode implements highlighting for .futil files in Emacs.
It is located in <repo>/tools/emacs/futil-mode
.
Clone the repository, add the above path to your load path, and require
futil-mode
.
If you use Spacemacs, this looks like adding the following lines to
dotspacemacs/user-config
in your .spacemacs
file:
(push "~/.emacs.d/private/local/futil-mode" load-path)
(require 'futil-mode)
I imagine it looks very similar for pure Emacs, but I haven't actually tried it myself.
Visual Studio Code
Add a link to the Calyx VSCode extension directory to your VSCode extensions directory.
cd $HOME/.vscode/extensions
ln -s <calyx root directory>/tools/vscode calyx.calyx-0.0.1
Restart VSCode.
Contributors
Here is a list of all the people who have worked on Calyx:
Current Contributors
- Sam Thomas
- Rachit Nigam
- Griffin Berlstein
- Chris Gyurgyik
- Adrian Sampson
- YoungSeok Na
- Jan-Paul Ramos
- Jiaxuan (Crystal) Hu
Previous Contributors
- Neil Adit
- Kenneth Fang
- Zhijing Li
- Yuwei Ye
- Ted Bauer
- YooNa Chang
- Karen Zhang
- Andrii Iermolaiev
- Alma Thaler
If you're missing from this list, please add yourself!