Getting Started

Calyx is an intermediate language and infrastructure for building compilers that generate custom hardware accelerators. These instructions will help you set up the Calyx compiler and associated tools. By the end, you should be able to compile and simulate hardware designs generated by Calyx.

Compiler Installation

Install Rust (it should automatically install cargo).

Clone the repository:

git clone

Then build the compiler:

cargo build

You can invoke the compiler in one of two ways:

cargo run -- --help # Rebuilds the compiler if the sources changed
./target/debug/futil --help # Default debug build of the compiler

Running Core Tests

The core test suites tests the Calyx compiler passes. Install the following tools for running the core tests:

  1. runt hosts our testing infrastructure. Install with: cargo install runt
  2. jq is a command-line JSON processor:
    • Ubuntu: sudo apt install jq
    • Mac: brew install jq
    • Other platforms: JQ installation

Build the compiler:

cargo build

Then run the core tests with:

runt -i core

If everything has been installed correctly, this should not produce any failing tests.

Installing the Command-Line Driver

The Calyx driver wraps the various compiler frontends and backends to simplify running Calyx programs.

Install Flit:

pip3 install flit

Install fud (from the root of the repository):

flit -f fud/pyproject.toml install -s --deps production

Configure fud:

fud config global.futil_directory <full path to Calyx repository>

Check the fud configuration:

fud check

fud will report certain tools are not available. This is expected.


There are three ways to run Calyx programs: Verilator, Icarus Verilog, and Calyx's native interpreter. You'll want to set up at least one of these options so you can try out your code.

Icarus is an easy way to get started on most platforms. On a Mac, you can install Icarus Verilog with Homebrew by typing brew install icarus-verilog. On Ubuntu, install from source. Then install the relevant fud support by running:

fud register icarus-verilog -p fud/icarus/

Type fud check to make sure the new stage is working. (Some missing tools are expected; just pay attention to the report for stages.icarus-verilog.exec.)

You can instead consider setting up Verilator for faster long-running simulations or using the interpreter to avoid RTL simulation altogether.

Running a Hardware Design

We're all set to run a Calyx hardware design now. Run the following command:

fud e examples/tutorial/language-tutorial-iterate.futil \
  -s examples/tutorial/data.json \
  --to dat --through icarus-verilog -v

(Change the last bit to --through verilog to use Verilator instead.)

This command will compile examples/tutorial/language-tutorial-iterate.futil to Verilog using the Calyx compiler, simulate the design using the data in examples/tutorial/data.json, and generate a JSON representation of the final memory state.

Congratulations! You've simulated your first hardware design with Calyx.

Where to go next?

Calyx Language Tutorial

This tutorial will familiarize you with the Calyx language by writing a minimal program by hand. Usually, Calyx code will be generated by a frontend. However, by writing the program by hand, you can get familiar with all the basic constructs you need to generate Calyx yourself!

Complete code for each example can be found in the tutorial directory in the Calyx repository.

Get Started

The basic building block of Calyx programs is a component which corresponds to hardware modules (or function definitions for the software-minded).

Here is an empty component definition for main along with an import statement to import the standard library:

import "primitives/core.futil";

component main(go: 1) -> (done: 1) {
  cells {}
  wires {}
  control {}

Put this in a file—you can call it language-tutorial-mem.futil, for example. (The futil file extension comes from an old name for Calyx.)

You can think of a component as a unit of Calyx code roughly analogous to a function: it encapsulates a logical unit of hardware structures along with their control. Every component definition has three sections:

  • cells: The hardware subcomponents that make up this component.
  • wires: A set of guarded connections between components, possibly organized into groups.
  • control: The imperative program that defines the component's execution schedule: i.e., when each group executes.

We'll fill these sections up minimally in the next sections.

A Memory Cell

Let's turn our skeleton into a tiny, nearly no-op Calyx program. We'll start by adding a memory component to the cells:

  cells {
    @external mem = std_mem_d1(32, 1, 1);

This new line declares a new cell called mem and the primitive component std_mem_d1 represents a 1D memory. You can see the definition of std_mem_d1, and all the other standard components, in the primitives/core.futil library we imported.

This one has three parameters: the data width (here, 32 bits), the number of elements (just one), and the width of the address port (one bit).

The @external syntax is an extra bit of magic that allows us to read and write to the memory.

Next, we'll add some assignments to the wires section to update the value in the memory. Insert these lines to put a constant value into the memory:

  wires {
      mem.addr0 = 1'b0;
      mem.write_data = 32'd42;
      mem.write_en = 1'b1;
      done = mem.done;

These assignments refer to four ports on the memory component: addr0 (the address port), write_data (the value we're putting into the memory), write_en (the write enable signal, telling the memory that it's time to do a write), and done (signals that the write was committed). Constants like 32'd42 are Verilog-like literals that include the bit width (32), the base (d for decimal), and the value (42).

Assignments at the top level in the wires section, like these, are "continuous". They always happen, without any need for control statements to orchestrate them. We'll see later how to organize assignments into groups.

The complete program for this section is available under examples/tutorial/language-tutorial-mem.futil.

Compile & Run

We can almost run this program! But first, we need to provide it with data. The Calyx infrastructure can provide data to programs from JSON files. So make a file called something like data.json containing something along these lines:

  "mem": {
    "data": [10],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32

The mem key means we're providing the initial value for our memory called mem. We have one (unsigned integer) data element, and we indicate the bit width (32 bits).

If you want to see how this Calyx program compiles to Verilog, here's the fud incantation you need:

fud exec language-tutorial-mem.futil --to verilog

Not terribly interesting! However, one nice thing you can do with programs is execute them.

To run our program using Icarus Verilog, do this:

fud exec language-tutorial-mem.futil --to dat --through icarus-verilog \
    -s data.json

Using --to dat asks fud to run the program, and the extra -s <filename> argument tells it where to find the input data. The --through icarus-verilog option tells fud which Verilog simulator to use (see the chapter about fud for alternatives such as Verilator). Executing this program should print:

  "cycles": 0,
  "memories": {
    "mem": [

Meaning that, after the program finished, the final value in our memory was 42.

Note: Verilator may report a different cycle count compared to the one above. As long as the final value in memory is correct, this does not matter.

Add Control

Let's change our program to use an execution schedule. First, we're going to wrap all the assignments in the wires section into a name group:

  wires {
    group the_answer {
      mem.addr0 = 1'b0;
      mem.write_data = 32'd42;
      mem.write_en = 1'b1;
      the_answer[done] = mem.done;

We also need one extra line in the group: that assignment to the_answer[done]. Here, we say that the_answer's work is done once the update to mem has finished. Calyx groups have compilation holes called go and done that the control program will use to orchestrate their execution.

The last thing we need is a control program. Add one line to activate the_answer and then finish:

  control {

If you execute this program, it should do the same thing as the original group-free version: mem ends up with 42 in it. But now we're controlling things with an execution schedule.

If you're curious to see how the Calyx compiler lowers this program to a Verilog-like structural form of Calyx, you can do this:

fud exec language-tutorial-mem.futil --to futil-lowered

Notably, you'll see control {} in the output, meaning that the compiler has eliminated all the control statements and replaced them with continuous assignments in wires.

The complete program for this section is available under examples/tutorial/language-tutorial-control.futil.

Add an Adder

The next step is to actually do some computation. In this version of the program, we'll read a value from the memory, increment it, and store the updated value back to the memory.

First, we will add two components to the cells section:

  cells {
    @external(1) mem = std_mem_d1(32, 1, 1);
    val = std_reg(32);
    add = std_add(32);

We make a register val and an integer adder add, both configured to work on 32-bit values.

Next, we'll create three groups in the wires section for the three steps we want to run: read, increment, and write back to the memory. Let's start with the last step, which looks pretty similar to our the_answer group from above, except that the value comes from the val register instead of a constant:

    group write {
      mem.addr0 = 1'b0;
      mem.write_en = 1'b1;
      mem.write_data = val.out;
      write[done] = mem.done;

Next, let's create a group read that moves the value from the memory to our register val:

    group read {
      mem.addr0 = 1'b0; = mem.read_data;
      val.write_en = 1'b1;
      read[done] = val.done;

Here, we use the memory's read_data port to get the initial value out.

Finally, we need a third group to add and update the value in the register:

    group upd {
      add.left = val.out;
      add.right = 32'd4; = add.out;
      val.write_en = 1'b1;
      upd[done] = val.done;

The std_add component from the standard library has two input ports, left and right, and a single output port, out, which we hook up to the register's in port. This group adds a constant 4 to the register's value, updating it in place. We can enable the val register with a constant 1 because the std_add component is combinational, meaning its results are ready "instantly" without the need to wait for a done signal.

We need to extend our control program to orchestrate the execution of the three groups. We will need a seq statement to say we want to the three steps sequentially:

  control {
    seq {

Try running this program again. The memory's initial value was 10, and its final value after execution should be 14.

The complete program for this section is available under examples/tutorial/language-tutorial-compute.futil.


Next, we'd like to run our little computation in a loop. The idea is to use Calyx's while control construct, which works like this:

while <value> with <group> {

A while loop runs the control statements in the body until <value>, which is some port on some component, becomes zero. The with <group> bit means that we activate a given group in order to compute the condition value that determines whether the loop continues executing.

Let's run our memory-updating seq block in a while loop. Change the control program to look like this:

  control {
    seq {
      while lt.out with cond {
        par {
          seq {

This version uses while, the parallel composition construct par, and a few new groups we will need to define. The idea is that we'll use a counter register to make this loop run a fixed number of times, like a for loop. First, we have an outer seq that invokes an init group that we will write to set the counter to zero. The while loop then uses a new group cond, and it will run while a signal lt.out remains nonzero: this signal will compute counter < 8. The body of the loop runs our old seq block in parallel with a new incr group to increment the counter.

Let's add some cells to our component:

    counter = std_reg(32);
    add2 = std_add(32);
    lt = std_lt(32);

We'll need a new register, an adder to do the incrementing, and a less-than comparator.

We can use these raw materials to build the new groups we need: init, incr, and cond. First, the init group is pretty simple:

    group init { = 32'd0;
      counter.write_en = 1'b1;
      init[done] = counter.done;

This group just writes a zero into the counter and signals that it's done. Next, the incr group adds one to the value in counter using add2:

    group incr {
      add2.left = counter.out;
      add2.right = 32'd1; = add2.out;
      counter.write_en = 1'b1;
      incr[done] = counter.done;

And finally, cond uses our comparator lt to compute the signal we need for our while loop. We use a comb group to denote that the assignments inside the condition can be run combinationally:

    comb group cond {
      lt.left = counter.out;
      lt.right = 32'd8;

By comparing with 8, we should now be running our loop body 8 times.

Try running this program again. The output should be the result of adding 4 to the initial value 8 times, so 10 + 8 × 4.

The complete program for this section is available under examples/tutorial/language-tutorial-iterate.futil.

Take a look at the full language reference for details on the complete language.

Multi-Component Designs

Calyx designs can define and instantiate other Calyx components that themselves encode complex control programs.

As an example, we'll build a Calyx design that uses a simple Calyx component to save a value in a register and use it in different component.

We define a new component called identity that has an input port in and an output port out.

component identity(in: 32) -> (out: 32) {
  cells {
    r = std_reg(32);
  wires {
    group save { = in;
      r.write_en = 1'd1;
      save[done] = r.done;

    // This component always outputs the current value in r
    out = r.out;
  control {

The following line defines a continuous assignment, i.e., an assignment that is always kept active, regardless of the component's control program being active.

    // This component always outputs the current value in r
    out = r.out;

By defining this continuous assignment, we can execute our component and later observe any relevant values.

Next, we can instantiate this component in any other Calyx component. The following Calyx program instantiates the id component and uses it to save a value and observe it.

component main() -> () {
  cells {
    // Instantiate the identity element
    id = identity();
    current_value = std_reg(32);
  wires {
    group run_id {
      // We want to "save" the value 10 inside the identity group. = 32'd10;
      // All components have a magic "go" and "done" port added to them.
      // Execute the component.
      id.go = 1'd1;
      run_id[done] = id.done;
    group use_id {
      // We want to "observe" the current value saved in id.
      // The out port on the `id` component always shows the last saved
      // element. We don't need to set the `go` because we're not executing
      // and control. = id.out;
      current_value.write_en = 1'd1;
      use_id[done] = current_value.done;
  control {
    seq { run_id; use_id; }

Our first group executes the component by setting the go signal for the component to high and placing the value 10 on the input port. The second group simply saves the value on the output port. Importantly, we don't have to set the go signal of the component to high because we don't need to save a new value into it. The component executes the two groups in-order.

To see the output from running this component, run the command:

fud e examples/futil/multi-component.futil --to vcd_json

Passing Cells by Reference

One question that may arise when using Calyx as a backend is how to pass a memory "by reference" between components. In C++, this might look like:

#include <array>
#include <cstdint>

// Adds one to the first element in `v`.
void add_one(std::array<uint32_t, 1>& v) {
  v[0] = v[0] + 1;

int main() {
  std::array<uint32_t, 1> x = { 0 };
  add_one(x); // The value at x[0] is now 1.

There are two steps to passing a memory by reference in Calyx:

  1. Define the component in a manner such that it can accept a memory by reference.
  2. Pass the desired memory by reference.

The language provides two ways to doing this.

The Easy Way

Calyx uses the ref keyword to describe cells that are passed in by-reference:

component add_one() -> () {
  cells {
    ref mem = std_mem_d1(32, 4, 3); // A memory passed in by reference.

This component define mem as a memory that is passed in by reference to the component. Inside the component we can use the cell like any other cell in the program.

Next, to pass the memory to the component, we can use the invoke syntax:

component add_one() -> () { ... }
component main() -> () {
  cells {
    A = std_mem_d1(32, 4, 3); // A memory passed in by reference.
    one = add_one();
  wires { ... }
  control {
    invoke one[mem = A]()(); // pass A as the `mem` for this invocation

The Calyx compiler will correctly lower the add_one component and the invoke call such that the memory is passed in by-reference. In fact, any cell can be passed in by-reference in a Calyx program. Read the next section if you're curious about how this process is implemented.

The Hard Way

We recommend using the ref syntax is almost all cases since it enables the compiler to perform more optimizations.

In the C++ code above, we've constructed an "l-value reference" to the array, which essentially means we can both read and write from x in the function add_one.

Now, let's allow similar functionality at the Calyx IR level. We define a new component named add_one which represents the function above. However, we also need to include the correct ports to both read and write to x:

Read from xWrite to x
address portswrite_data
address ports

Since we're both reading and writing from x, we'll include the union of the columns above:

component add_one(x_done: 1, x_read_data: 32) ->
                 (x_write_data: 32, x_write_en: 1, x_addr0: 1) {

One tricky thing to note is where the ports belong, i.e. should it be an input port or an output port of the component? The way to reason about this is to ask whether we want to receive signal from or send signal to the given wire. For example, with read_data, we will always be receiving signal from it, so it should be an input port. Conversely, address ports are used to mark where in memory we want to access, so those are used as output ports.

We then simply use the given ports to both read and write to the memory passed by reference. Note that we've split up the read and write to memory x in separate groups, to ensure we can schedule them sequentially in the execution flow. We're also using the exposed ports of the memory through the component interface rather than, say, x.write_data.

    group read_from_x {
      x_addr0 = 1'd0;            // Set address port to zero. = x_read_data;  // Read the value at address zero.
      tmp_reg.write_en = 1'd1;
      read_from_x[done] = tmp_reg.done;
    group write_to_x {
      x_addr0 = 1'd0;            // Set address port to zero.
      add.left = one.out;
      add.right = tmp_reg.out;   // Saved value from previous read.

      x_write_data = add.out;    // Write value to address zero.
      x_write_en = 1'd1;         // Set write enable signal to high.

      write_to_x[done] = x_done; // The group is done when the write is complete.

Bringing everything back together, the add_one component is written accordingly:

component add_one(x_done: 1, x_read_data: 32) ->
                 (x_write_data: 32, x_write_en: 1, x_addr0: 1) {
  cells {
    one = std_const(32, 1);
    add = std_add(32);
    tmp_reg = std_reg(32);
  wires {
    group read_from_x {
      x_addr0 = 1'd0;            // Set address port to zero. = x_read_data;  // Read the value at address zero.
      tmp_reg.write_en = 1'd1;
      read_from_x[done] = tmp_reg.done;
    group write_to_x {
      x_addr0 = 1'd0;            // Set address port to zero.
      add.left = one.out;
      add.right = tmp_reg.out;   // Saved value from previous read.

      x_write_data = add.out;    // Write value to address zero.
      x_write_en = 1'd1;         // Set write enable signal to high.

      write_to_x[done] = x_done; // The group is done when the write is complete.
  control {
    seq { read_from_x; write_to_x; }

The final step is creating a main component from which the original component will be invoked. In this step, it is important to hook up the proper wires in the call to invoke to the corresponding memory you'd like to read and/or write to:

  control {
    invoke add_one0(x_done = x.done, x_read_data = x.read_data)
                   (x_write_data = x.write_data, x_write_en = x.write_en, x_addr0 = x.addr0);

This gives us the main component:

component main() -> () {
  cells {
    add_one0 = add_one();
    @external(1) x = std_mem_d1(32, 1, 1);
  wires {
  control {
    invoke add_one0(x_done = x.done, x_read_data = x.read_data)
                   (x_write_data = x.write_data, x_write_en = x.write_en, x_addr0 = x.addr0);

To see this example simulated, run the command:

fud e examples/futil/memory-by-reference/memory-by-reference.futil --to dat \
-s examples/futil/memory-by-reference/

Multi-dimensional Memories

Not much changes for multi-dimensional arrays. The only additional step is adding the corresponding address ports. For example, a 2-dimensional memory will require address ports addr0 and addr1. More generally, an N-dimensional memory will require address ports addr0, ..., addr(N-1).

Multiple Memories

Similarly, multiple memories will just require the ports to be passed for each of the given memories. Here is an example of a memory copy (referred to as mem_cpy in the C language), with 1-dimensional memories of size 5:

import "primitives/core.futil";
component copy(dest_done: 1, src_read_data: 32, length: 3) ->
              (dest_write_data: 32, dest_write_en: 1, dest_addr0: 3, src_addr0: 3) {
  cells {
    lt = std_lt(3);
    N = std_reg(3);
    add = std_add(3);
  wires {
    comb group cond {
      lt.left = N.out;
      lt.right = length;
    group upd_index<"static"=1> {
      add.left = N.out;
      add.right = 3'd1; = add.out;
      N.write_en = 1'd1;
      upd_index[done] = N.done;
    group copy_index_N<"static"=1> {
      src_addr0 = N.out;
      dest_addr0 = N.out;
      dest_write_en = 1'd1;
      dest_write_data = src_read_data;
      copy_index_N[done] = dest_done;
  control {
    while lt.out with cond {
      seq {

component main() -> () {
  cells {
    @external(1) d = std_mem_d1(32,5,3);
    @external(1) s = std_mem_d1(32,5,3);
    length = std_const(3, 5);
    copy0 = copy();
  wires {
  control {
    seq {
      invoke copy0(dest_done=d.done, src_read_data=s.read_data, length=length.out)
                  (dest_write_data=d.write_data, dest_write_en=d.write_en, dest_addr0=d.addr0, src_addr0=s.addr0);

Calyx Language Reference

Top-Level Constructs

Calyx programs are a sequence of import statements followed by a sequence of extern statements or component definitions.

import statements

import "<path>" has almost exactly the same semantics to that of #include in the C preprocessor: it copies the code from the file at path into the current file.

extern definitions

extern definitions allow Calyx programs to link against arbitrary RTL code. An extern definition looks like this:

extern "<path>" {

<path> should be a valid file system path that points to a Verilog module that defines the same names as the primitives defined in the extern block. When run with the -b verilog flag, the Calyx compiler will copy the contents of every such Verilog file into the generated output.

primitive definitions

The primitive construct allows specification of the signature of an external Verilog module that the Calyx program uses. It has the following syntax:

[comb] primitive name<attributes>[PARAMETERS](ports) -> (ports);

The syntax for primitives resembles that for components, with some additional pieces:

  • The comb keyword signals that the primitive definition wraps purely combinational RTL code. This is useful for certain optimizations.
  • Attributes specify useful metadata for optimization.
  • PARAMETERS are named compile-time (metaprogramming) parameters to pass to the Verilog module definition. The primitive definition lists the names of integer-valued parameters; the corresponding Verilog module definition should have identical parameter declarations. Calyx code provides values for these parameters when instantiating a primitive as a cell.
  • The ports section contain sized port definitions that can either be positive number or one of the parameter names.

For example, the following is the signature of the std_reg primitive from the Calyx standard library:

  primitive std_reg<"state_share"=1>[WIDTH](
    @write_together(1) in: WIDTH,
    @write_together(1) @static(1) @go write_en: 1,
    @clk clk: 1,
    @reset reset: 1
  ) -> (
    @stable out: WIDTH,
    @done done: 1

The primitive defines one parameter called WIDTH, which describes the sizes for the in and the out ports.

Calyx Components

components are the primary encapsulation unit of a Calyx program. They look like this:

component name<attributes>(ports) -> (ports) {
  cells { ... }
  wires { ... }
  control { ... }

Like primitive definitions, component signatures consist of a name, an optional list of attributes, and input/output ports. Unlike primitives, component definitions do not have parameters; ports must have a concrete (integer) width. A component encapsulates the control and the hardware structure that implements a hardware module.


A port definition looks like this:

[@<attr>...] <name>: <width>

Ports have a bit width but are otherwise untyped. They can also include optional attributes. For example, this component definition:

component counter(left: 32, right: 32) -> (@stable out0: 32, out1: 32) { .. }

defines two input ports, left and right, and two output ports, out0 and out1. All four ports are 32-bit signals. Additionally, the out0 port has the attribute @stable.


A component's cells section instantiates a set of sub-components. It contains a list of declarations with this syntax:

[ref]? <name> = <comp>(<param...>);

Here, <comp> is the name of an existing primitive or component definition, and <name> is the fresh, local name of the instance. The optional ref parameter turns the cell into a by-reference cell. Parameters are only allowed when instantiating primitives, not Calyx-defined components.

For example, the following definition of the counter component instantiates a std_add and std_reg primitive as well as a foo Calyx component

component foo() -> () { ... }
component counter() -> () {
  cells {
    r = std_reg(32);
    a = std_add(32);
    f = foo();
  wires { ... }
  control { ... }

When instantiating a primitive definition, the parameters are passed within the parenthesis. For example, we pass 32 for the WIDTH parameter of the std_reg in the above instantiation. Since an instantiation of a Calyx component does not take any parameters, the parameters are always empty.

The wires Section

A component's wires section contains guarded assignments that connect ports together. The assignments can either appear at the top level, making them continuous assignments, or be organized into named group and comb group definitions.

Guarded Assignments

Assignments connect ports between two cells together, with this syntax:

<cell>.<port> = [<guard> ?] <cell>.<port>;

The left-hand and right-hand side are both port references, which name a specific input or output port within a cell declared within the same component. The optional guard condition is a logical expression that determines whether the connection is active.

For example, this assignment: = add.out;

unconditionally transfers the value from a port named out in the add cell to r's in port.

Assignments are simultaneous and non-blocking. When a block of assignments runs, they all first read their right-hand sides and then write into their left-hand sides; they are not processed in order. The result is that the order of assignments does not matter. For example, this block of assignments: = add.out;
add.left = y.out;
add.right = z.out;

is a valid way to take the values from registers y and z and put the sum into r. Any permutation of these assignments is equivalent.


An assignment's optional guard expression is a logical expression that produces a 1-bit value, as in these examples: = cond.out ? add.out; = !cond.out ? 32'd0;

Using guards, Calyx programs can assign multiple different values to the same port. Omitting a guard expression is equivalent to using 1'd1 (a constant "true") as the guard.

Guards can use the following constructs:

  • port: A port access on a defined cell
  • port op port: A comparison between values on two ports. Valid op are: >, <, >=, <=, ==
  • !guard: Logical negation of a guard value
  • guard || guard: Disjunction between two guards
  • guard && guard: Conjunction of two guards

Well-formedness: For each input port on the LHS, only one guard should be active in any given cycle during the execution of a Calyx program.

Continuous Assignments

When an assignment is appears directly inside a component's wires section, it is called a continuous assignment and is permanently active, even when the control program of the component is inactive.

group definitions

A group is a way to name a set of assignments that together represent some meaningful action:

group name<attributes> {
  name[done] = done_cond;

Assignments within a group can be reasoned about in isolation from assignments in other groups. Unlike continuous assignments, a group's encapsulated assignments only execute as dictated by the control program. These means that seemingly conflicting writes to the same ports are allowed:

group foo { = 32'd10;
  foo[done] = ...;
group bar { = 32'd22;
  bar[done] = ...;

However, group assignments must not conflict with continuous assignments defined in the component:

group foo { = 32'd10; ... // Malformed because it conflicts with the write below.
  foo[done] = ...
} = 32'd50;

Groups can take any (nonzero) number of cycles to complete. To indicate to the outside world when their execution has completed, every group has a special done signal, which is a special port written as <group>[done]. The group should assign 1 to this port to indicate that its execution is complete.

Groups can have an optional list of attributes.

Well-formedness: All groups are required to run for at least one cycle. (Sub-cycle logic should use comb group instead.)

comb group definitions

Combinational groups are a restricted version of groups which perform their computation purely combinationally and therefore run for "less than one cycle":

comb group name<attributes> {

Because their computation is required to run for less than a cycle, comb group definitions do not specify a done condition.

Combinational groups cannot be used within normal control operators. Instead, they only occur after the with keyword in a control program.

The Control Operators

The control section of a component contains an imperative program that describes the component's behavior. The statements in the control program consist of the following operators:

Group Enable

Simply naming a group in a control statement, called a group enable, executes the group to completion. This is a leaf node in the control program.


invoke acts like the function call operator for Calyx and has the following syntax:

invoke instance[ref cells](inputs)(outputs) [with comb_group];
  • instance is the name of the cell instance that needs to be invoked.
  • inputs and outputs define connections for a subset of input and output ports of the instance.
  • The with comb_group section is optional and names a combinational group that is active during the execution of the invoke statement.
  • ref cells is a list of cell names to pass to the invoked component's required cell reference. (It can be omitted if the invoked component contains no cell references.)

Invoking a instance runs its control program to completion before returning. Any Calyx component or primitive that implements the go-done interface can be invoked. Like the group enable statement, invoke is a leaf node in the control program.


The syntax for sequential composition is:

seq { c1; ...; cn; }

where each ci is a nested control statement. Sequences run the control programs in sequence, guaranteeing that each program runs fully before the next one starts executing. seq does not provide any cycle-level guarantees on when a succeeding group starts executing after the previous one finishes.


The syntax for parallel composition is:

par { c1; ...; cn; }

The statement runs the nested control programs in parallel, guaranteeing that each program only runs once. par does not provide any guarantees on how the execution of child programs is scheduled. It is therefore not safe to assume that all children begin execution at the same time.

Well-formedness: The assignments in the children should never conflict with each other.


The syntax is:

if <port> [with comb_group] {
} else {

The conditional execution runs either true_c or false_c using the value of <port>. The optional with comb_group syntax allows running a combinational group that computes the value of the port.

Well-formedness: The combinational group is considered to be running during the entire execution of the control program and therefore should not have conflicting assignments with either true_c or false_c.


The syntax is:

while <port> [with comb_group] {

Repeatedly executes body_c while the value on port is non-zero. The optional with comb_group enables a combinational group that computes the value of port.

Well-formedness: The combinational group is considered active during the execution of the while loop and therefore should not have conflicting assignments with body_c.

The go-done Interface

By default, calyx components implement a one-sided ready-valid interface called the go-done interface. During compilation, the Calyx compiler will add an input port marked with the attribute @go and an output port marked with the attribute @done to the interface of the component:

component counter(left: 32, right: 32, @go go: 1) -> (out: 32, @done done: 1)

The interface provides a way to trigger the control program of the counter using assignments. When the go signal of the component is set to 1, the control program starts executing. When the component sets the done signal to 1, its control program has finished executing.

Well-formedness: The go signal to the component must remain set to 1 while the done signal is not 1. Lowering the go signal before the done signal is set to 1 will lead to undefined behavior.

The clk and reset Ports

The compiler also adds special input ports marked with @clk and @reset to the interface. By default, Calyx components are not allowed to look at or use these signals. They are automatically threaded to any primitive that defines @clk or @reset ports.

Advanced Concepts

ref cells

Calyx components can specify that a cell needs to be passed in "by-reference":

// Component that performs mem[0] += 1;
component update_memory() -> () {
  cells {
    ref mem = std_mem_d1(...)
  wires { ... }
  control { ... }

When invoking such a component, the calling component must provide a binding for each defined cell:

component main() -> () {
  cells {
    upd = update_memory();
    m1 = std_mem_d1(...);
    m2 = std_mem_d2(...);
  wires { ... }
  control {
    seq {
      invoke upd[mem=m1]()(); // Pass `m1` by reference
      invoke upd[mem=m2]()(); // Pass `m2` by reference

As the example shows, each invocation can take different bindings for each ref cell. See the tutorial for longer example on how to use this feature.

Data Format

Calyx's fud-based workflows specifies a JSON-based data format which can be used with software simulators.

External memories

First, a Calyx program must mark memories using the @external attribute to tell the compiler that the memory is used as either an input or an ouput.

component main() {
    cells {
        @external ext = std_mem_d1(...);
        internal = std_mem_d1(...);

In the above program, the memory ext will be assumed to be an input-output memory by the compiler while internal is considered to be an internal memory. For all external memories, the compiler will generate code to read initial values and dump out final values. The external attribute is recognized for all the std_mem and seq_mem primitives. The @external attribute can only be used on the top-level component.

When the --synthesis flag is passed to the compiler, this behavior is disabled and instead the memories are turned into ports on the top-level component.

The Data Format

The JSON-based data format allows us to provide inputs for a memory and specify how the values should be interpreted:

  "mem": {
    "data": [10],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32

The data field represents the initial data for the memory and can use mutlidimensional arrays to describe it. For example, the following is the initial data for a two-dimensional (std_mem_d2) memory:

    "data": [ [1, 2], [3, 4] ],
    "format": {...}

The format specifier tells fud how to interpret the values in the data field. The original program specifies that all values should be treated as 32-bit, unsigned values.

In order to specify fixed-point values, we must specify both the total width and fraction widths:

"root": {
  "data": [
  "format": {
      "numeric_type": "fixed_point",
      "is_signed": false,
      "width": 32,
      "frac_width": 16

The format states that all values have a fractional width of 16-bits while the remainder is used for the integral part.

Note: fud requires that for each memory marked with the @external attribute in the program.

Using fud

All software simulators supported by fud, including Verilator and Icarus Verilog, as well as the Calyx interpreter can use this data format. To pass a JSON file with initial values, use the -s flag:

# Use Icarus Verilog
fud e --to dat --through icarus-verilog <CALYX FILE> -s <JSON>
# Use Verilator
fud e --to dat --through verilog <CALYX FILE> -s <JSON>
# Use the Calyx Interpreter
fud e --to interpreter-out <CALYX FILE> -s <JSON>

Generating Random Values

Often times, it can be useful to automatically generate random values for a large memory. The data-gen tool takes a Calyx program as an input and automatically generates random values for each memory marked with @external in the above data format.

Static Timing

By default, Calyx programs use a latency-insensitive model of computation. This means that the compiler does not track the number of cycles it takes to perform a computation or run a control operator In general, latency-insensitivity makes it easier to compose programs together and gives the compiler freedom to schedule operators however it wants. However, the generated hardware to schedule the execution may not be efficient–especially if the program can take advantage of the latency information.

More crucially, however, it is impossible for latency-insensitive programs to interact with latency-sensitive hardware implemented in RTL.

Calyx uses the @static attribute to provide latency information to various constructs and provides strong guarantees about the generated programs.


Tricks & Tips

Delay by n cycles

Sometimes it can be useful to delay the execution of a group by n cycles:

seq {
  @static(1) a; // Run in first cycle
  @static(2) b; // Run in the 10th cycle

A simple trick to achieve this is adding an empty group with @static(n) attribute on it:

cell {
  r = @std_reg(0);
@static(9) group delay_9 {
  delay_9[done] = r.out; // Don't use r.done here
seq {
    @static(1) a; // Run in first cycle
    @static(9) delay_9;
    @static(2) b; // Run in the 10th cycle

The static compilation pass tdst will never attempt to use the delay_9's done condition and since there are no assignments in the group, it'll not generate any additional hardware.

Experimental: Synchronization

Calyx's default semantics do not admit any predictable form of language-level synchronization in presence of parallelism. We're currently experimenting with a suite of new primitives that add synchronization to the language.

@sync attribute


Consider the following control program in calyx:

    par {
      /// thread A
      while lt.out with comp {
        seq {

      /// thread B
      while lt.out with comp {
        seq {

where groups add_r_to_accm and incr_r reads value and increments value in register r, respectively, as indicated by their names.

Because calyx does not make any guarantee of the order of execution for threads running in parallel, it is impossible for us to determine which thread will access r first for each iteration.

Nondeterminism when running parallel threads is beneficial on the compiler's end, as it will give the compiler more freedom for optimization. However, we sometimes do want to give parallel threads a measure of ordering while still taking advantage of the performance boost of parallelism. The @sync attribute allows us to do that.

Using the @sync attribute

Now we want to modify the program above so that in every iteration, thread A always reads after thread B finishes incrementing using the @sync attribute with the following:

    par {
      // thread A
      while lt.out with comp {
        seq {

      // thread B
      while lt.out with comp {
        seq {

First and foremost, always remember to import "primitives/sync.futil" when using the @sync attribute.

The @sync syntax can only be marked with empty statements. @sync means that the thread marked with a certain value, now called barrier index, for this attribute, must stop and wait for all other threads marked with the same barrier index to arrive, at which point they can proceed.

In the modified program above, we see that incr_idx and incr_r must both finish in order for either thread to go forth. Because add_r_to_accm is executed after incr_idx in thread A, we know that in each iteration, incr_r will always increment r before add_r_to_accm reads it. We've also inserted another barrier at the end of the while loops for each thread, which essentially means add_r_to_accm has to finish before either thread enters the next iteration.

Synchronization in Branches

We can also have "barriers" in if branches:

    par {
      // thread 1
      while lt.out with comp {
        if eq.out with st_0 {
          seq {
        else {
          seq {

      // thread 2
      while lt.out with comp {
        seq {

In this control program, both branches of thread 1 have statements marked with @sync(1), which syncs it up with thread 2.

Be really really careful when using the @sync attribute in conditional branches! If the other thread sharing one "barrier" with your thread is blocked unconditionally, then you would probably want to have the same @sync value in both branches; since having the same @sync value in only one branch would likely lead to a "deadlock" situation: if thread A is running in the unlocked branch while the thread B has a "barrier" that is expecting two threads, thread B may never proceed because thread A never arrives at the "barrier".

More Complex Example

If you want to see a more complex design using @sync, see sync-dot-product


Currently we only support two threads sharing the same "barrier", i.e., only two threads can have control with the @sync attribute marked with the same value.

Undefined Behaviors

Undefined behavior in Calyx is either intentional or unintentional. This page tracks the various issues discussing the undefined behaviors that are known to currently exist in Calyx.

Interface Signals

The go and done signals form the two core interface signals in Calyx. Their semantics are baked into the compiler and pervasively used to define the meaning of programs.

Undriven Ports

Calyx's continuous assignments do not make any static guarantees about which ports need to be driven when. Current efforts attempt to codify when reading from an undriven port is incorrect.

Semantics of par

par blocks in Calyx represent parallel execution of groups. Currently, there is no clear semantics for interactions between groups executing in parallel. The interpreter implements a form of lockstep semantics that disallows certain forms of concurrent reads and writes while the code generated by the compiler allows for arbitrary communication.

Isolation Guarantees

Calyx groups have a strong isolation guarantee---they must execute for at least one cycle and guarantee that signals inside are not visible after they are done executing. However, the isolation guarantees for combinational groups, continuous assignments, and with blocks is not clear.


Calyx has an attribute system that allows information to be associated with every basic Calyx construct. This information can then be used to optimize the program or change how the program is compiled.

Attributes can decorate lots of things in Calyx: components, groups, cells, ports, and control statements. The syntax looks like name<"attr"=value> for components and groups or @attr(value) for other constructs. Attributes always map keys to values. Because it's common to have a "Boolean" attribute that always maps to the value 1, the syntax @attr is a shorthand for @attr(1).

Here is the syntax for attributes in different parts of the AST:

Component and Port Attributes

component main<"static"=10>(@go go: 1) -> (@done done: 1) {

Cell Attributes

cells {
  @external mem = std_mem_d1(32, 8, 4);
  reg = std_reg(32);

Group Attributes

group cond<"static"=1> {

Control Attributes

control {
  @static(3) seq {
    @static(1) A;
    @static(2) B;

Meaning of Attributes


The entrypoint for the Calyx program. If no component has this attribute, then the compiler looks for a component named main. If neither is found, the compiler errors out.

go, done, clk and reset

These four ports are part of the interface to Calyx components. These are automatically added by the parser if they are missing from the component definition. go and done provide the mechanism for how an "outer" component invokes an "inner" cell that it contains. clk and reset thread through the global clock and resetting signal in a design.


By default, interface ports are automatically added to a component by the parser if they are missing. Adding this attribute disables this behavior.


The external attribute has meaning when it is attached to a cell. It has two meanings:

  1. If the externalize pass is enabled, the cell is turned into an "external" cell by exposing all its ports through the current component and rewriting assignments to the use the ports. See the documentation on See externalize for more information.
  2. If the cell is a memory and has an external attribute on it, the verilog backend (-b verilog) generates code to read <cell_name>.dat to initialize the memory state and dumps out its final value after execution.


Can be attached to components, groups, and control statements. They indicate how many cycles a component, group, or control statement will take to run and are used by -p static-timing to generate more efficient control FSMs.

The go and done attributes are, in particular, used by the infer-static-timing pass to configure which ports are used like go and done signals. Along with the static(n) attribute, this allows the pass to calculate when a particular done signal of a primitive will be high.


Used by the inline pass on cell definitions. Instructs the pass to completely inline the instance into the parent component and replace all invokes of the instance with the control program of the instance.


Used by the canonicalize pass. Only meaningful on output ports and states that their value is provided by a sequential element and is therefore available outside combinational time.

For example, after invoking a multiplier, the value on its out port remains latched till the next invocation.

For example

cells {
  m = std_mult_pipe(32);
wires {
  group use_m_out { // uses m.out }
control {
  invoke m(left = 32'd10, right = 32'd4)();

The value of m.out in use_m_out will be 32'd40.

This annotation is currently used by the primitives library and the Dahlia frontend.


Can be attached to a component and indicates that a component can be shared across groups. This is used by the -p cell-share to decide which components can be shared.


Can be attached to a component and indicates that a component can be shared across groups. Different than share since state_share components can have internal state. This is used by -p cell-share to decide which components can be shared. Specifically, a component is state shareable if each write to that component makes any previous writes to the component irrelevant. The definition of a "write to a component" is an activiation of the component's "go" port, followed by a read of the component's "done" port (in other words, the read of a "done" port still counts as part of a "write" to the component). For c1 and c2, instances of a state_shareable component: instantiate c1 instantiate c2 any write to c1 any write to c2 write value v to port p in c1 write value v to port p in c2 c1 and c2 should be equal.


Used in infer-static-timing and static-timing when the number of iterations of a While control is known statically, as indicated by n.


Added by ir::Builder to denote that the cell was added by a pass.


Marks the special clock signal inserted by the clk-insertion pass, which helps with lowering to RTL languages that require an explicit clock.


Used by the papercut pass. Defines a group n of signals that all must be driven together:

primitive std_mem_d2<"static"=1>[WIDTH, D0_SIZE, D1_SIZE, D0_IDX_SIZE, D1_IDX_SIZE](
  @write_together(2) addr0: D0_IDX_SIZE,
  @write_together(2) addr1: D1_IDX_SIZE,
  @write_together(1) write_data: WIDTH,
  @write_together(1) @go write_en: 1,
) -> (...);

This defines two groups. The first group requires that write_en and write_data signals together while the second requires that addr0 and addr1 are driven together.

Note that @write_together specifications cannot encode implication of the form "if port x is driven then y should be driven".


Used by papercut and canonicalize. Defines a combinational path n between a set of an input ports and an output port.

primitive std_mem_d1<"static"=1>[WIDTH, SIZE, IDX_SIZE](
  @read_together(1) addr0: IDX_SIZE, ...
) -> (
  @read_together(1) read_data: WIDTH, ...

This requires that when read_data is used then addr0 must be driven. Note that each group must have exactly one output port in it.


Marks a cell or a port as a purely datapath component, i.e., the output does not propagate into a guard or another control signal. See this issue for the full set of constraints.

When we have following two conditions:

  1. An input port is marked with @data in the component definitions, and
  2. The cell instance is marked as @data

The backend generate 'x as the default value for the assignment to the port instead of '0. Additionally, if the port has exactly one assignment, the backend removes the guard entirely and produces a continuous assignment.

This represents the optimization:

in = g ? out : 'x


in = out;

Since the value 'x can be replaced with anything.

fud: The Calyx Driver

Working with Calyx involves a lot of command-line tools. For example, an incomplete yet daunting list of CLI tools used by Calyx is:

  • All the Calyx frontends.
  • Calyx compiler and its various command line tools
  • Verilator, the Verilog simulation framework used to test Calyx-generated designs.
  • Waveform viewers to see the results of simulation

fud aims to provide a simple interface for using these toolchains and executing them in a pipeline. The source for fud is here.


Fud requires Python 3.9 or higher to work correctly.

You need Flit to install fud. Install it with pip3 install flit.

You can then install fud with

cd fud && flit install

(If using this method to install fud, pip3 should be version >= 20)

If you are working on fud itself, you can install it with a symlink with:

cd fud && flit install --symlink

You can also install fud with

flit build
pip3 install dist/fud-0.1.0-py3-none-any.whl

Finally, point fud to the root of the repository:

fud config global.futil_directory <full path to Calyx repository>


Fud uses a global configuration file to locate tool paths and default values. To view the configuration, use fud c or fud config.

Check. Fud can automatically check if your configuration is valid and can help you set certain variables. Perform this check with:

fud check

Viewing keys. To view the current value of a key, use fud config key. For example, the following shows the path to the Calyx compiler.

fud config stages.futil.exec

Updating keys. Keys can be updated using fud config key value. For example, the following command updates the path to the Calyx compiler.

fud config stages.futil.exec ./target/debug/futil

Adding Backends

fud wraps both frontends and backends for Calyx. For a useful fud installation, you need to configure the Verilator backend and accompanying tools.


We use the open source Verilator tool to simulate Verilog programs generated by the Calyx compiler. Install Verilator by following the official instructions. On macOS, you can use brew install verilator or, otherwise, compile from source:

git clone
cd verilator
git pull
git checkout master
sudo make install

Run fud check to make sure you have the right version. We currently require >=v5.002 of Verilator with fud.

By default, fud will use the verilator executable to run Verilator. To use a different binary, configure the path by:

fud config stages.verilog.exec <binary>

Vcdump. Vcdump is a tool for converting vcd (Value Change Dump) files to JSON for easier analysis with the command line.

Install it with:

cargo install vcdump

Icarus Verilog

fud also supports Icarus Verilog as a simulation backend. Since Icarus Verilog interprets programs, it can often be faster than verilator. First install Icarus Verilog and ensure that iverilog and vvp are on your path. Next, register the icarus-verilog external stage with fud:

fud register icarus-verilog -p fud/icarus/

To test the simulation backend, run:

fud e --to dat examples/tutorial/language-tutorial-iterate.futil -s examples/tutorial/data.json --through icarus-verilog

Registering this stage will define multiple paths in the fud transformation graph between futil and the vcd and dat stages. If you'd like fud to default to verilog stage, give it a high priority:

fud c stages.verilog.priority 1

Dahlia Frontend

In order to use the Dahlia frontend with Fud, first install Dahlia. Once Dahlia is compiled, point fud to the Dahlia compiler binary:

fud config stages.dahlia.exec <full path to dahlia repo>/fuse

Python frontends (Systolic array, NTT, MrXL, TVM Relay)

You need flit to install our Python frontends.

pip3 install flit

Our Python frontends use a Calyx ast library written in Python. Install with:

cd calyx-py && flit install -s

Frontend specific instructions:

  • Systolic array: Nothing else needed.
  • NTT:
    • Install dependencies: pip3 install prettytable
    • Install external fud stage: fud register ntt -p frontends/ntt-pipeline/fud/
  • MrXL:
    • Install mrxl binary: cd frontends/mrxl && flit install -s
    • Install mrxl external stage for fud: fud register mrxl -p frontends/mrxl/fud/
  • TVM Relay: See instructions.

Adding Synthesis Backends

fud supports wraps the Vivado (synth-verilog) and Vivado HLS (vivado-hls) tools to generate area and resource estimates for Calyx designs. See the instructions to configure them.

Working with Stages

Fud is structured as a sequence of stages that transform inputs of one form to outputs.

Stages. fud transforms a file in one stage into a file in a later stage. The --from and --to options specify the input and output stages to the fud exec subcommand. Use fud info to view all possible stages.

Guessing stages. fud will try to guess the starting stage by looking at the extension of the input file and output file (if specified using the -o flag). If it fails to guess correctly or doesn't know about the extension, you can manually set the stages using --to and --from.


Fud provides some very basic profiling tools through the use of the --dump_prof (or -pr) flag. You can get overall stage durations for a fud run by simply using -pr. For example,

  fud e examples/dahlia/dot-product.fuse --to dat \
  -s examples/dahlia/ \

will output:

stage                           elapsed time (s)
dahlia                          1.231
futil                           0.029
verilog                         4.915

If you want time elapsed for each step in a stage, you can also provide one or more stages after the flag. For example,

  fud e examples/dahlia/dot-product.fuse --to dat \
  -s examples/dahlia/ \
  -pr verilog

will output:

verilog                         elapsed time (s)
mktmp                           0.0
json_to_dat                     0.004
compile_with_verilator          5.584
simulate                        0.161
output_json                     0.003
cleanup                         0.003

If you want time elapsed for a single step, you can specify the given step, e.g.

  fud e examples/dahlia/dot-product.fuse --to dat \
  -s examples/dahlia/ \
  -pr verilog.simulate

will output:

verilog                         elapsed time (s)
simulate                        0.161

Lastly, the -csv flag will provide the profiling information in CSV format.


These commands will assume you're in the root directory of the Calyx repository.

Compiling Calyx. Fud wraps the Calyx compiler and provides a set of default compiler options to compile Calyx programs to Verilog.

# Compile Calyx source in the test simple.expect
# to Verilog. We must explicitly specify the input
# file type because it can not be guessed from
# the file extension.
fud exec examples/futil/simple.expect --from futil --to verilog

Fud can explain its execution plan when running a complex sequence of steps using the --dry-run option.

# Dry run of compiling the Dahlia dot product file
# to Calyx. As expected, this will *only* print
# the stages that will be run.
fud exec examples/dahlia/dot-product.fuse --to futil --dry-run

Simulating Calyx. Fud can compile a Calyx program to Verilog and simulate it using Verilator.

# Compile and simulate a vectorized add implementation
# in Calyx using the data provided,
# then dump the vcd into a new file for debugging.
# === Calyx:   examples/futil/vectorized-add.futil
# === data:    examples/dahlia/
# === output:  v-add.vcd
fud exec \
  examples/futil/vectorized-add.futil \
  -o v-add.vcd \
  -s examples/dahlia/

Simulating Dahlia. The following command prints out the final state of all memories by specifying --to dat.

# Compile a Dahlia dot product implementation and
# simulate in verilog using the data provided.
# === Dahlia: examples/dahlia/dot-product.fuse
# === data:   examples/dahlia/
#     (`.data` is used as an extension alias for `.json`)
fud exec \
  examples/dahlia/dot-product.fuse \
  --to dat \
  -s examples/dahlia/

Interpreting Calyx. In addition to Verilator, fud can execute Calyx programs using the experimental interpreter.

# Execute a Calyx program without compiling it,
# producing a JSON snapshot of the final state.
# === Calyx:   tests/correctness/while.futil
# === data:    tests/correctness/
fud exec \
  tests/correctness/while.futil \
  -s tests/correctness/ \
  --to interpreter-out

Xilinx Toolchain

Working with vendor EDA toolchains is never a fun experience. Something will almost certainly go wrong. If you're at Cornell, you can at least avoid installing the tools yourself by using our lab servers, Gorgonzola or Havarti.

fud can interact with the Xilinx tools (Vivado, Vivado HLS, and Vitis). There are three main things it can do:

  • Synthesize Calyx-generated RTL designs to collect area and resource estimates.
  • Compile Dahlia programs via C++ and Vivado HLS for comparison with the Calyx backend.
  • Compile Calyx programs for actual execution in Xilinx emulation modes or on real FPGA hardware.

You can set fud up to use either a local installation of the Xilinx tools or one on a remote server, via SSH.

Synthesis Only

The simplest way to use the Xilinx tools is to synthesize RTL or HLS designs to collect statistics about them. This route will not produce actual, runnable executables; see the next section for that.

Installing Dependencies

fud uses extra dependencies to invoke the Xilinx toolchains. Run the following command to install all required dependencies:

cd fud && flit install -s --deps all

Setting up Remote Tools

Follow these instructions if you're attempting to run vivado or vivado-hls on a server from your local machine. If you are working directly on a server with these tools, skip to the run instructions.

To set up to invoke the Xilinx tools over SSH, first tell fud your username and hostname for the server:

# Vivado
fud config stages.synth-verilog.ssh_host <hostname>
fud config stages.synth-verilog.ssh_username <username>

# Vivado HLS
fud config stages.vivado-hls.ssh_host <hostname>
fud config stages.vivado-hls.ssh_username <username>

The following commands enable remote usage of vivado and vivado-hls by default:

fud config stages.synth-verilog.remote 1
fud config stages.vivado-hls.remote 1

The server must have vivado and vivado_hls available on the remote machine's path. (If you need the executable names to be something else, please file an issue.)

Setting up Local Tools

To instead invoke the Xilinx tools locally, just let fud run the vivado and vivado_hls commands. You can optionally tell fud where these commands exist on your machine:

fud config stages.synth-verilog.exec <path> # update vivado path
fud config stages.vivado-hls.exec <path> # update vivado_hls path

Setting the remote option for the stages to 0 ensure that fud will always try to run the commands locally.

fud config stages.synth-verilog.remote 0
fud config stages.vivado-hls.remote 0


To run the entire toolchain and extract statistics from RTL synthesis, use the resource-estimate target state. For example:

fud e --to resource-estimate examples/futil/dot-product.futil

To instead obtain the raw synthesis results, use synth-files.

To run the analogous toolchain for Dahlia programs via HLS, use the hls-estimate target state:

fud e --to hls-estimate examples/dahlia/dot-product.fuse

There is also an hls-files state for the raw results of Vivado HLS.

Emulation and Execution

fud can also compile Calyx programs for actual execution, either in the Xilinx toolchain's emulation modes or for running on a physical FPGA. This route involves generating an AXI interface wrapper for the Calyx program and invoking it using Xilinx's PYNQ interface.

Set Up

As above, you can invoke the Xilinx toolchain locally or remotely, via SSH. To set up SSH execution, you can edit your config.toml to add settings like this:

ssh_host = "havarti"
ssh_username = "als485"
remote = 1

To use local execution, just leave off the remote = true line.

You can also set the Xilinx mode and target device:

mode = "hw_emu"
device = "xilinx_u50_gen3x16_xdma_201920_3"

The options for mode are hw_emu (simulation) and hw (on-FPGA execution). The device string above is for the Alveo U50 card, which we have at Cornell. The installed Xilinx card would typically be found under the directory /opt/xilinx/platforms, where one would be able to find a device name of interest.


The first step in the Xilinx toolchain is to generate an xclbin executable file. Here's an example of going all the way from a Calyx program to that:

fud e examples/futil/dot-product.futil -o foo.xclbin --to xclbin

On our machines, compiling even a simple example like the above for simulation takes about 5 minutes, end to end. A failed run takes about 2 minutes to produce an error.

By default, the Xilinx tools run in a temporary directory that is deleted when fud finishes. To instead keep the sandbox directory, use -s xclbin.save_temps true. You can then find the results in a directory named fud-out-N for some number N.


Now that you have an xclbin, the next step is to run it. Roughly speaking, the command you need is just a fud invocation that goes from the xclbin stage to the fpga stage:

fud e foo.xclbin --from xclbin --to fpga -s examples/dahlia/

Contrary to the name, the fpga stage works for both emulation and on-FPGA execution---fud's mode config option for this stage chooses which to use. The config option provides a normal fud-style JSON data input file for the run.

Currently, you will need to have a bunch of environment variables set up to point to the Xilinx tools before running this fud command. For example, on our group's havarti server, you can do this:

source /scratch/opt/Xilinx/Vitis/2020.2/
source /opt/xilinx/xrt/
export EMCONFIG_PATH=`pwd`

To prepare for hardware emulation of an xclbin compiled appropriately, run:

export XCL_EMULATION_MODE=hw_emu

If preparing for actual hardware execution, ensure the XCL_EMULATION_MODE environment variable is unset:


These steps source the setup scripts for both Vitis and XRT, and set a special EMCONFIG_PATH to your current directory so that fud can generate a special JSON configuration file for Xilinx emulation. You also need to tell PYNQ whether you want hw_emu (emulation) or the default on-device execution to occur by exporting or unsetting the XCL_EMULATION_MODE environment variable. Of course, it would be better if all this could come from fud's configuration itself instead of requiring you to set it up ahead of time; issue #872 covers this work.

Waveform Debugging

In emulation mode, this stage can produce a waveform trace in Xilinx's proprietary WDB file format as well as a standard VCD file.

Use the fud options -s fpga.waveform true -s fpga.save_temps true when emulating your program. The first option instructs XRT to use the batch debug mode and to dump a VCD, and the second asks fud not to delete the directory where the waveform files will appear.

Then, look in the resulting directory, which will be named fud-out-* for some *. In there, the Xilinx trace files you want are named *.wdb and *.wcfg. The VCD file is at .run/*/hw_em/device0/binary_0/behav_waveform/xsim/dump.vcd or similar.

How it Works

The first step is to generate input files. We need to generate:

  • The RTL for the design itself, using the compile command-line flags -b verilog --synthesis -p external. We name this file
  • A Verilog interface wrapper, using XilinxInterfaceBackend, via -b xilinx. We call this toplevel.v.
  • An XML document describing the interface, using XilinxXmlBackend, via -b xilinx-xml. This file gets named kernel.xml.

The fud driver gathers these files together in a sandbox directory. The next step is to run the Xilinx tools.

The rest of this section describes how this workflow works under the hood. If you want to follow along by typing commands manually, you can start by invoking the setup scripts for Vitis and XRT:

source <Vitis_install_path>/Vitis/2020.1/
source /opt/xilinx/xrt/

On some Ubuntu setups, you may need to update LIBRARY_PATH:

export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu

You can check that everything is working by typing vitis -version or vivado -version.

Background: .xo and .xclbin

In the Xilinx toolchain, compilation to an executable bitstream (or simulation blob) appears to requires two steps: taking your Verilog sources and creating an .xo file, and then taking that and producing an .xclbin “executable” file. The idea appears to be a kind of metaphor for a standard C compilation workflow in software-land: .xo is like a .o object file, and .xclbin contains actual executable code (bitstream or emulation equivalent), like a software executable binary. Going from Verilog to .xo is like “compilation” and going from .xo to .xclbin is like “linking.”

However, this analogy is kind of a lie. Generating an .xo file actually does very little work: it just packages up the Verilog source code and some auxiliary files. An .xo is literally a zip file with that stuff packed up inside. All the actual work happens during “linking,” i.e., going from .xo to .xclbin using the v++ tool. This situation is a poignant reminder of how impossible separate compilation is in the EDA world. A proper analogy would involve separately compiling the Verilog into some kind of low-level representation, and then linking would properly smash together those separately-compiled objects. Instead, in Xilinx-land, “compilation” is just simple bundling and “linking” does all the compilation in one monolithic step. It’s kind of cute that the Xilinx toolchain is pretending the world is otherwise, but it’s also kind of sad.

Anyway, the only way to produce a .xo file from RTL code appears to be to use Vivado (i.e., the actual vivado program). Nothing from the newer Vitis package currently appears capable of producing .xo files from Verilog (although v++ can produce these files during HLS compilation, presumably by invoking Vivado).

The main components in an .xo file, aside from the Verilog source code itself, are two XML files: kernel.xml, a short file describing the argument interfaces to the hardware design, and component.xml, a much longer and more complicated IP-XACT file that also has to do with the interface to the RTL. We currently generate kernel.xml ourselves (with the xilinx-xml backend described above) and then use Vivado, via a Tcl script, to generate the IP-XACT file.

In the future, we could consider trying to route around using Vivado by generating the IP-XACT file ourselves, using a tool such as DUH.

Our Workflow

The first step is to produce an .xo file. We also use a static Tcl script, gen_xo.tcl, which is a simplified version of a script from Xilinx's Vitis tutorials. The gist of this script is that it creates a Vivado project, adds the source files, twiddles some settings, and then uses the package_xo command to read stuff from this project as an "IP directory" and produce an .xo file. The Vivado command line looks roughly like this:

vivado -mode batch -source gen_xo.tcl -tclargs xclbin/kernel.xo

That output filename after -tclargs, unsurprisingly, gets passed to gen_xo.tcl.

Then, we take this .xo and turn it into an .xclbin. This step uses the v++ tool, with a command line that looks like this:

v++ -g -t hw_emu --platform xilinx_u50_gen3x16_xdma_201920_3 --save-temps all:all:all --profile.exec all:all:all -lo xclbin/kernel.xclbin xclbin/kernel.xo

Fortunately, the v++ tool doesn't need any Tcl to drive it; all the action happens on the command line.

AXI Interface Generation

Calyx currently generates a fairly complex AXI interface that can be daunting to deal with if confronting for the first time. The following is an overview of how the generation occurs and how the output AXI interface behaves as of 2022-9-11.

In general, when fud is asked to create an .xclbin file a kernel.xml,, and toplevel.v are created as intermediate steps for xilinx tools to properly work. contains the SystemVerilog needed for our computations to perform correctly. It is implemented as an FSM that derives from the original Calyx program. kernel.xml defines register maps and ports of our toplevel xilinx tools needs. toplevel.v wraps our computation kernel and contains the AXI interface for each memory defined in a Calyx program and corresponds to the standard Calux lowering process.

For more info on file generation see how the Xilinx Toolchain works


Our toplevel is generated through files in src/backend/xilinx/.

AXI memory controller

Here, the toplevel component of a Calyx program is queried and memories marked @external are turned into AXI buses. To note, separate AXI interfaces are created for each memory (meaning, there is no shared bus between memories). Each memory has its own (single port) BRAM which writes data taken from an mi_axi_RDATA wire where i is the index of the memory. Eventually the BRAMs are read and fed into the computation kernel of, which outputs results directly into the relevant memories as defined in the original Calyx program. Address and data widths and sizes are determined from cell declerations.

There is always the possiblity that something is hardcoded as a remnant from previous versions of our AXI generation. If something is hardcoded where it shouldn't be please open an issue.

AXI memory controllers are constructed as (full) AXI4 managers that lack a small amount of functionality. For example, xPROT signals are not currently supported. Additionally, things like bursting are not currently supported, but should be easy to implement due to the existing infrastructure and generation.

A list of current signals that are hardcoded follows:

  • xLEN is set to 0, corresponding to a burst length of 1.
  • xBURST is set to 01, corresponding to INCR type of bursts.
  • xSIZE is set to the width of the data we are using in bytes.
  • xPROT is not generated, and is therefore not supported.
  • xLOCK is not generated, defaulting to 0 (normal accesses).
  • xCACHE is not generated, making accesses non-modifiable, non-bufferable.
  • xQOS is not generated. See QoS signaling.
  • xREGION is not generated. See Multiple region signaling.
  • No low power signals are generated.

Subordinate AXI control controller

In addition to our manager memory controllers, a subordinate controller for our control module is also generated. This module is responsible for signaling our computational kernel to start working, as well as calculating the correct base addresses to use for our memory controllers. Things like address and data widths are hard coded at the moment. It is suspected that this hardcoding is okay for the types of programs we generate. But more work needs to be done to see if our control structure works for arbitrary programs or needs to be changed to allow this.

External Stages

fud supports using stages that aren't defined in its main source tree. These are known as 'external stages' and the provide a mechanism for projects using Calyx to take advantage of fud. You can register an external stage with:

fud register stage_name -p /path/to/

Once an external stage is registered, it behaves exactly like any other stage.

You can remove an external stage with:

fud register stage_name --delete

The following defines a stage that transforms MrXL programs to Calyx programs.

from fud.stages import Stage, SourceType
from fud.utils import shell

class MrXLStage(Stage):
    Stage that invokes the MrXL frontend.

    name = "mrxl"

    def __init__(self):
        Initialize this stage. Initializing a stage *does not* construct its
        computation graph.
            description="Compiles MrXL to Calyx.",

    def pre_install():

    def defaults():
        Specify defaults that should be added to fud's configuration file when
        this stage is registered.
        return {"exec": "mrxl"}

    def _define_steps(self, input, builder, config):
        Define the steps that will execute in this stage. Each step represents
        a delayed computation that will occur when the stage is executed.

        # Commands at the top-level are evaluated when the computation is being
        # staged
        cmd = config["stages",, "exec"]

        # Computations within a step are delayed from being executed until
        # the full execution pipeline is generated.
        def run_mrxl(mrxl_prog: SourceType.Path) -> SourceType.Stream:
            return shell(f"{cmd} {str(mrxl_prog)}")

        # Define a schedule using the steps.
        # A schedule *looks* like an imperative program but actually represents
        # a computation graph that is executed later on.
        return run_mrxl(input)

# Export the defined stages to fud
__STAGES__ = [MrXLStage]

External stages must define default values for configuration keys using the Stage.defaults() static method and the name of the stage using the static name field.

Stage Configuration

Like normal stages, external stages can have persistent configuration information saved using fud config.

To add persistent stage configuration, run:

fud config stages.<stage-name>.<key> <value>

To dynamically override the value of a field during execution, use the -s flag:

fud e -s <stage-name>.<key> <value> ...

The override order for stage configuration is:

  1. Dynamic values provided by -s.
  2. Configuration value in the fud config.
  3. Default value provided by Stage.defaults()

Multiple Paths

fud can define a stage graph that can have multiple paths between the source and target. For example, if you register the Icarus Verilog simulator stage, then multiple paths can be used to generate VCD files from Dahlia programs:

% fud e --from dahlia --to vcd
[fud] ERROR: Multiple stage pipelines can transform dahlia to vcd:
dahlia → futil → verilog → vcd
dahlia → futil → icarus-verilog → vcd
Use the --through flag to select an intermediate stage

fud says that both the verilog and icarus-verilog stages can be used to generate the VCD file and you need to provide the --through flag to decide which stage to select.

The following command will simulate the program using the icarus-verilog stage:

% fud e --from dahlia --to vcd --through icarus-verilog

In general, the --through flag can be repeated as many times as needed to get a unique fud transformation pipeline.

Using Stage Priority

If the common workflow uses the same stage every time, it can be annoying to specify the stage name using the --through flag. You can specify a priority field in the configuration of a stage to ensure it fud automatically selects it when multiple paths exists.

For example, to always select the verilog stage, add the priority 1 to the stage:

fud c stages.verilog.priority 1

Now, the command fud e --from dahlia --to vcd is no longer ambiguous; fud will always choose the verilog stage to transform programs from Dahlia sources to VCD.

In case multiple paths have the same cost, fud will again require the --through flag to disambiguate paths.


An ongoing effort is under way to establish Calyx as a dialect in the LLVM umbrella project CIRCT. There is documentation about the Calyx dialect on the MLIR site. While semantically equivalent, they are syntactically different. Because the Calyx dialect is still under progress and does not include all the optimizations that the native Rust compiler supports, we have crafted an emitter from the Calyx dialect (MLIR) to the native compiler representation (used by the Rust compiler). This means you can lower from your favorite frontend in MLIR to the Calyx dialect, and continue all the way to SystemVerilog (with spunky optimizations) using the native compiler.

The native compiler also supports round-tripping back into the MLIR representation. We'll assume you've already built the Rust compiler and installed fud. Here are the steps below to round-trip:

MLIR to Native Representation

  1. Set up the CIRCT project with these instructions.

  2. There should be a circt-translate binary in <root-directory>/build/bin. To emit the native compiler representation, use the command:

path/to/circt-translate --export-calyx /path/to/file

For example, you can use the expected output of the test tests/backend/mlir/simple.expect:

calyx.program "main" {

calyx.component @main(%go: i1 {go=1}, %clk: i1 {clk=1}, %reset: i1 {reset=1}) -> (%out: i1, %done: i1 {done=1}) {, %r1.write_en, %r1.clk, %r1.reset, %r1.out, %r1.done = calyx.register @r1 : i1, i1, i1, i1, i1, i1
  %_1_1.out = hw.constant 1 : i1
  calyx.wires { @Group1 {
      calyx.assign = %_1_1.out : i1
      calyx.assign %r1.write_en = %_1_1.out : i1
      calyx.group_done %r1.done : i1

  calyx.control {
    calyx.seq {
      calyx.enable @Group1


Using the command:

# Don't worry too much about the file alias; this is used for testing purposes.
path/to/circt-translate --export-calyx tests/backend/mlir/simple.expect

This should output:

// -p well-formed -b mlir
import "primitives/core.futil";
component main(@go go: 1, @clk clk: 1, @reset reset: 1) -> (out: 1, @done done: 1) {
  cells {
    r1 = std_reg(1);
  wires {
    group Group1 { = 1'd1;
      r1.write_en = 1'd1;
      Group1[done] = r1.done;
  control {
    seq { Group1; }

Native Representation to MLIR

To round-trip back to the Calyx dialect, we can use fud:

fud exec path/to/file --to mlir

For example,

fud exec tests/backend/mlir/simple.futil --to mlir

This should emit the Calyx dialect once again.

The Calyx Interpreter

The experimental Calyx interpreter resides in the interp/ directory of the repository. The interpreter supports all Calyx programs---from high-level programs that make heavy use of control operators, to fully lowered Calyx programs. (RTL simulation, in contrast, only supports execution of fully lowered programs.)

There are two ways to use the interpreter: you can directly invoke it, or you can use fud.

Basic Use

To run an example program, try:

cd interp && cargo run tests/control/if.futil

You can see the available command-line options by typing cargo run -- --help.

Interpreting via fud

The interpreter is available as a stage in fud, which lets you provide standard JSON data files as input and easily execute passes on the input Calyx program before interpretation.

You'll want to build the interpreter first:

cd interp && cargo build

Here's how to run a Calyx program:

fud e --to interpreter-out interp/tests/control/if.futil

To provide input data, set the variable, like so:

fud e --to interpreter-out \
    -s tests/correctness/ \

By default, fud will not transform the Calyx code before feeding it to the interpreter. To run passes before the interpreter, use the futil.flags variable in conjunction with the -p flag. For example, to fully lower the Calyx program before interpreting it:

fud e --to interpreter-out \
    -s futil.flags '-p all' \

The Calyx Compiler

The source code documentation for the compiler can be found here.

The Calyx compiler has several command line to control the execution of various passes and backends.

Specifying Primitives Library

The compiler implementation uses a standard library of components to compile programs.

The only standard library for the compiler is located in:

<path to Calyx repository>/primitives

Specify the location of the library using the -l flag:

cargo run -- -l ./primitives

Primitive Libraries Format

The primitive libraries consist of a .futil file paired with a .sv file. The .futil file defines a series of Calyx shim bindings in extern blocks which match up with SystemVerilog definitions of those primitives. These libraries may also expose components written in Calyx, usually defined using primitives exposed by the file.

No Calyx program can work without the primitives defined in the Core Library.

Controlling Passes

The compiler is organized as a sequence of passes that are run when the compiler executes.

To get a complete list of all passes, run the following from the repository root:

cargo run -- --list-passes

This generates results of the form:

- collapse-control
- compile-control

- all: well-formed, papercut, remove-external-memories, ...

The first section list all the passes implemented in the compiler. The second section lists aliases for combination of passes that are commonly run together. For example, the alias all is an ordered sequence of default passes executed when the compiler is run from the command-line.

The command-line provides two options to control the execution of passes:

  • -p, --pass: Execute this pass or alias. Overrides default alias.
  • -d, --disable-pass: Disable this pass or alias. Takes priority over -p.

For example, we can run the following to disable the static-timing pass from the default execution alias all:

cargo run -- examples/futil/simple.futil -p all -d static-timing

Core Library

This library defines a standard set of components used in most Calyx programs such as registers and basic bitwise operations.


Numerical Operators


A WIDTH-wide register.


  • in: WIDTH - An input value to the register WIDTH-bits.
  • write_en: 1 - The one bit write enabled signal. Indicates that the register should store the value on the in wire.


  • out: WIDTH - The value contained in the register.
  • done: 1 - The register's done signal. Set high for one cycle after writing a new value.


A constant WIDTH-bit value with value VAL.

Inputs: None


  • out: WIDTH - The value of the constant (i.e. VAL)


A left bit shift. Performs LEFT << RIGHT. This component is combinational.


  • left: WIDTH - A WIDTH-bit value to be shifted
  • right: WIDTH - A WIDTH-bit value representing the shift amount


  • out: WIDTH - A WIDTH-bit value equivalent to LEFT << RIGHT


A right bit shift. Performs LEFT >> RIGHT. This component is combinational.


  • left: WIDTH - A WIDTH-bit value to be shifted
  • right: WIDTH - A WIDTH-bit value representing the shift amount


  • out: WIDTH - A WIDTH-bit value equivalent to LEFT >> RIGHT


Bitwise addition without a carry flag. Performs LEFT + RIGHT. This component is combinational.


  • left: WIDTH - A WIDTH-bit value
  • right: WIDTH - A WIDTH-bit value


  • out: WIDTH - A WIDTH-bit value equivalent to LEFT + RIGHT


Bitwise subtraction. Performs LEFT - RIGHT. This component is combinational.


  • left: WIDTH - A WIDTH-bit value
  • right: WIDTH - A WIDTH-bit value


  • out: WIDTH - A WIDTH-bit value equivalent to LEFT - RIGHT

std_slice<IN_WIDTH, OUT_WIDTH>

Slice out the lower OUT_WIDTH bits of an IN_WIDTH-bit value. Computes in[out_width - 1 : 0]. This component is combinational.


  • in: IN_WIDTH - An IN_WIDTH-bit value


  • out: OUT_WIDTH - The lower OUT_WIDTH bits of in


Given an IN_WIDTH-bit input, zero pad from the left to an output of OUT_WIDTH-bits. This component is combinational.


  • in: IN_WIDTH - An IN_WIDTH-bit value to be padded


  • out: OUT_WIDTH - The padded value

Logical Operators


Bitwise NOT. This component is combinational.


  • in: WIDTH - A WIDTH-bit input.


  • out: WIDTH - The bitwise NOT of the input (~in)


Bitwise AND. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: WIDTH - The bitwise AND of the arguments (left & right)


Bitwise OR. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: WIDTH - The bitwise OR of the arguments (left | right)


Bitwise XOR. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: WIDTH - The bitwise XOR of the arguments (left ^ right)

Comparison Operators


Greater than. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: 1 - A single bit output. 1 if left > right else 0.


Less than. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: 1 - A single bit output. 1 if left < right else 0.


Equality comparison. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: 1 - A single bit output. 1 if left = right else 0.


Not equal. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: 1 - A single bit output. 1 if left != right else 0.


Greater than or equal. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: 1 - A single bit output. 1 if left >= right else 0.


Less than or equal. This component is combinational.


  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument


  • out: 1 - A single bit output. 1 if left <= right else 0.



A one-dimensional memory.


  • WIDTH - Size of an individual memory slot.
  • SIZE - Number of slots in the memory.
  • IDX_SIZE - The width of the index given to the memory.


  • `addr0: IDX_SIZE - The index to be accessed or updated
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - One bit write enabled signal, causes the memory to write write_data to the slot indexed by addr0


  • read_data: WIDTH - The value stored at addr0. This value is combinational with respect to addr0.
  • done: 1: The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.


A two-dimensional memory.


  • WIDTH - Size of an individual memory slot.
  • D0_SIZE - Number of memory slots for the first index.
  • D1_SIZE - Number of memory slots for the second index.
  • D0_IDX_SIZE - The width of the first index.
  • D1_IDX_SIZE - The width of the second index.


  • addr0: D0_IDX_SIZE - The first index into the memory
  • addr1: D1_IDX_SIZE - The second index into the memory
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - One bit write enabled signal, causes the memory to write write_data to the slot indexed by addr0 and addr1


  • read_data: WIDTH - The value stored at mem[addr0][addr1]. This value is combinational with respect to addr0 and addr1.
  • done: 1: The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.


A three-dimensional memory.


  • WIDTH - Size of an individual memory slot.
  • D0_SIZE - Number of memory slots for the first index.
  • D1_SIZE - Number of memory slots for the second index.
  • D2_SIZE - Number of memory slots for the third index.
  • D0_IDX_SIZE - The width of the first index.
  • D1_IDX_SIZE - The width of the second index.
  • D2_IDX_SIZE - The width of the third index.


  • addr0: D0_IDX_SIZE - The first index into the memory
  • addr1: D1_IDX_SIZE - The second index into the memory
  • addr2: D2_IDX_SIZE - The third index into the memory
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - One bit write enabled signal, causes the memory to write write_data to the slot indexed by addr0, addr1, and addr2


  • read_data: WIDTH - The value stored at mem[addr0][addr1][addr2]. This value is combinational with respect to addr0, addr1, and addr2.
  • done: 1: The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.


A four-dimensional memory.


  • WIDTH - Size of an individual memory slot.
  • D0_SIZE - Number of memory slots for the first index.
  • D1_SIZE - Number of memory slots for the second index.
  • D2_SIZE - Number of memory slots for the third index.
  • D3_SIZE - Number of memory slots for the fourth index.
  • D0_IDX_SIZE - The width of the first index.
  • D1_IDX_SIZE - The width of the second index.
  • D2_IDX_SIZE - The width of the third index.
  • D3_IDX_SIZE - The width of the fourth index.


  • addr0: D0_IDX_SIZE - The first index into the memory
  • addr1: D1_IDX_SIZE - The second index into the memory
  • addr2: D2_IDX_SIZE - The third index into the memory
  • addr3: D3_IDX_SIZE - The fourth index into the memory
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - One bit write enabled signal, causes the memory to write write_data to the slot indexed by addr0, addr1, addr2, and addr3


  • read_data: WIDTH - The value stored at mem[addr0][addr1][addr2][addr3]. This value is combinational with respect to addr0, addr1, addr2, and addr3.
  • done: 1: The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.

The calyx Library

The implementation of the compiler separates the frontend from the internal data structures. This allows developers to use the compiler as a library. The calyx library exports all the data structures used by the compiler can be used to design new tools that make use of Calyx.

See the library documentation for an example of how to use the calyx library.

Dataflow Optimizations

In general, dataflow analysis uses the control and data flow of a program to compute various properties (liveness, reaching definitions, ...) at each point in a program.

For Calyx, dataflow analyses use the explicit control program and knowledge about the dataflow of each group to compute properties about each group.

Basic blocks vs. Groups

Normally, dataflow analyses compute a property at each basic block of a control flow graph (CFG). Calyx doesn't have a notion of basic blocks, and so Calyx computes a property at each group in a program.

Because Calyx separates the control flow of a program from the specification of groups, it's possible for a group to appear multiple times in the control program. For this reason we compute a property at each group enable rather than each group definition. The property at each group definition can easily be computed as the meet over all group enables.

Dataflow on an AST

Dataflow analyses are typically performed by finding the fixed point of a set of equations defined at each node of a control flow graph (CFG) using the worklist algorithm.

Because our control AST is little more than just the edges of a reducible cfg, we don't bother to build an explicit CFG and instead perform the dataflow analysis directly on the AST using Calyx's visitor infrastructure.

Abstract Algorithm

We model each control statement s as a function, f: p -> p where p is the type of the property. Control statements that have children define how information flows between its children.


f for enable A is similar to the transfer function in standard dataflow analysis. It uses information from the definition of group A to modify the input in some way. For example, if p is the set of live variables, the enable f is defined as:

f(enable A, inputs) = (inputs - kill(A)) | gen(A)


seq defines sequential control flow edges between its children. It is implemented by threading its input through all of its children to produce an output.

f(seq { A; B; C; ...; Z; }, inputs) =
     f(A, inputs)
  |> f(B, _)
  |> f(C, _)
  |> ...
  |> f(Z, _)

To implement a backwards dataflow analysis, all you need to do is reverse the order that seq pipes inputs to its children:

// reverse
f(seq { A; B; C; ...; Z; }, inputs) =
     f(Z, inputs)
  |> ...
  |> f(C, _)
  |> f(B, _)
  |> f(A, _)


if passes its inputs to its condition group and then feeds the result of this to both of its children. The output is the union of the outputs of both of its children. This is standard.

f(if some.port with G { True; } else { False; }, inputs) =
  f(True, f(G, inputs)) | f(False, f(G, inputs))


while statements are interesting because the outputs of the body may affect the input to the body. For this reason, we need to find a fixed point:

f(while some.port with G { body; }, inputs) =
  P = inputs;
  loop until P stops changing {
    P = f(body, f(G, inputs))


Par is the only statement that differs substantially from traditional dataflow because control flow graphs don't support nodes running in parallel. In other words, there is only ever one thing executing. However, par changes this and allows multiple things to execute at the same time. Consider the following example where we are computing the liveness of x to see why this is weird:

F; // wr x
par {
  A; // wr x
G; // rd x

Is x alive between X and the beginning of par? The answer is no because we know that both A and B will run. Therefore the write to x in F can not be seen by any group.

At first glance this doesn't seem like a problem. Great, we say, we can just take the union of the outputs of the children of par and call it a day.

This is wrong because B doesn't kill x and so x is alive coming into B. The union preserves this information and results in x being alive above par.

Taking the set intersection is not quite right here either. Consider adding another group C that reads from x.

We have no information about how this read is ordered with the write to x in A so we have to assume that x is alive above par. If we take the intersection here:

  live(A) & live(B) & live(C)
= {} & {x} & {x}
= {}

We get the wrong answer. More generally, we can see that union clobbers any writes and intersection clobbers any reads that happen in the par.

The solution to this problem is solved by passing the gen and kill sets along with the live sets. Then par can set its output to be

(union(live(children)) - union(kill(children))) | union(gen(children))

The final tricky bit is thinking about how information flows between siblings in a par statement. Consider again the picture above with three nodes: A, B, and C. Should x be live at B? For liveness it turns out to be yes, but bare with me for a second for a thought experiment and consider the case where we have the guarantee that statements running in parallel can not interact with each other. This lets us reorder statements in some cases. Then there seems to be an information trade-off for how to define the liveness of x at B:

  • You could say that x is dead at B because it doesn't see any previous writes to x and doesn't read from x. This implies that you could potentially replace writes to other registers with x. However, this by itself would cause x to be written to twice in parallel. You would have to reorder B to run before A. The takeaway here is that calling x dead at B gives the register reuse pass more information to work with. To compute this information B needs the gens and kills from all of its siblings (for the same reason that par) needed it. This is not particularly hard to implement, but it's worthy of noting.
  • If you say that x is live at B, then you can never rewrite B to use x instead of some other register, but you also don't have to worry about reordering statements in a par.

Leaving thought experiment land, in our case we can never reorder statements in a par because siblings may interact with each other in arbitrary ways. For this reason, we must say that x is live at B. However, it's not immediately clear to me that this will be true of all dataflow analyses. That's why I set this thought experiment down in writing.

Equivalence to worklist algorithm

In the normal worklist algorithm, we add a statement back to the worklist when its predecessor has changed.

The intuition for why this algorithm is equivalent to worklist algorithm is that because the only entry into the children for each parent control statement is the parent control statement itself. The only way that a child statement would need to be recomputed is if the inputs to the parent need to be recomputed. Anything above the parent in the tree will take care of this re-computation.


Calyx programs can provide wrong answers for two reasons:

  1. The program implements the wrong algorithm, i.e., it has a logical bug.
  2. The Calyx compiler incorrectly compiles the program, i.e., there is a compilation bug.

First make sure that the program generates the correct values with the Calyx Interpreter. If it produces the wrong values, your Calyx implementation of the algorithm is incorrect. You can use the Calyx Debugger to debug these problems.

If the interpreter produces the right values, try a different Verilog backed. We support both Verilator and Icarus Verilog. If both produce the wrong answer and the interpreter produces the right answer then you likely have a compilation bug on your hands. Use the debugging tips to narrow down the pass that causes the error.

The Calyx Interactive Debugger

The Calyx Interactive Debugger is a prototype debugging tool built on top of the [Calyx Interpreter][interp] which exposes a gdb-like interface for debugging Calyx programs.

Getting Started

If you are using fud getting started with the debugger is easy. Assuming you are trying to debug a program called my_program.futil with data file, invoke the debugger with the following command:

fud e --to debugger -q my_program.futil -s

This will open the target program in the interactive debugger. Note that fud uses the quiet flag, -q, here. This prevents the printing from the fud tool from conflicting the debugger as both tools interact with standard out.

Advancing Program execution


The simplest way to advance the program is via the step command which causes time to advance by a clock tick. It also has a shortcode: s.

 > step
 > s

The above snippet advances the program by two steps.


Another way to advance the program is via the step-over command. Unlike the step command, this command requires a second argument which is the name of the group to advance over. The step-over command then advances the program until the given group is no longer running.

If you want to use the command to advance the program past a group group_1, do the following:

 > step-over group_1

Note that the step-over command will do nothing if the given group is not running.

 > step-over other_group
 Group is not running


Finally, the continue command will run the program until either a breakpoint is hit or the program terminates. This command is used in conjunction with breakpoints and watchpoints to provide more targeted inspection. It may also be accessed with the shortcode c.

 > continue
Main component has finished executing. Debugger is now in inspection mode.


CIDR supports breakpoints on group definitions. This helps focus attention on suspect portions of the code.

Setting a breakpoint

Breakpoints may be set on the main component by simple specifying the group of interest.

 > break group_1

This is identical to

 > break main::group_1

For sub-components, the name of the sub-component must be included with the double colon separating the group name. To break on the do_mul group inside the pow sub-component:

 > break pow::do_mul

Managing breakpoints

To see a list of breakpoints:

 > info break


 > ib

This produces output like this:

 > ib
     Current breakpoints:
    1.  main::group_1  enabled
    2.  pow::do_mul enabled

All breakpoints have a number associated with them and they may be managed with this number or the group name.

To enable or disable a breakpoint:

 > disable group_1 2
 > enable 1 pow::do_mul

Note that this is equivalent to:

 > disable group_1
 > disable 2
 > enable 1
 > enable pow::do_mul

To delete a breakpoint:

 > delete 1
 > del pow::do_mul

Deleted breakpoints will be entirely removed while disabled breakpoints will remain until they are either enabled again or subsequently deleted. Disabled breakpoints will not cause program execution to halt when continue-ing.

Inspecting State


The display command dumps the full state of the main component without formatting. Use the print and print-state commands for targeted inspection with formatting.

Formatting codes

CIDR supports several different formatting codes which do the hard work of interpreting the data in human readable ways.

binaryThe default, a bit vector with the msb on the left
unsigned\uUnsigned bit-num formatting
signed\sTwo's Complement formatting
unsigned fixedpoint\u.NFor N >=1. Unsigned Fixed-point with N fractional bits. The remaining bits are for the integral component.
signed fixedpoint\s.NFor N >=1. Signed Fixed-point with N fractional bits. The remaining bits are for the integral component.

These commands allow inspecting instance state with optional formatting. Note that this is different from breakpoints which operate on definitions. For example to print the ports of the std_mul instance named mul in the pow instance pow_1 attached to the main component:

 > print main.pow_1.mul

as with breakpoints, the leading main may be elided:

 > print pow_1.mul

This will print all the ports attached to this multiplier instance with binary formatting.

Formatting codes may be supplied as the first argument.

 > print \u pow_1.mul

The print may also target specific ports on cells, rather than just the cell itself. To see only the output of the multiplier (with unsigned formatting):

 > print \u pow_1.mul.out

The print-state command works in the same way as the print command, except it displays the internal state of a cell, rather than port values. As such, it can only target cells and only those with some internal state, such as registers or memories. For example, if the main component has a memory named out_mem its contents may be viewed via:

 > print-state main.out_mem

or just

 > print-state out_mem

As with print, print-state supports formatting codes as an optional first argument. So to view the contents of out_mem with a signed interpretation:

 > print-state \s out_mem


Watchpoints are like breakpoints but rather than stop the execution when they are passed, they instead print out some information. Like breakpoints, they are set on group definitions, such as main::group_1 or pow::do_mul

Setting watchpoints

The general form of watchpoints looks like



  • GROUP is the group definition to be watched
  • PRINT-COMMAND is a full print or print-state command to be run by the watchpoint

The optional POSITION argument may either be before or after. This specifies whether the watchpoint should run when the group first becomes active (before) or when the group finishes running (after). This defaults to before if not set.

Managing watchpoints

Watchpoint management is similar to breakpoints. However there may be multiple watchpoints for a single group definition, so deleting watchpoints via the group name will delete all the watchpoints associated with the group. Watchpoints do not currently have an enable/disable state.

To view all the watchpoint definitions:

 > info watch


 > iw

To delete watchpoints:

 > delete-watch 1
 > del-watch main::group_1

Viewing the program counter

There where command (alias pc) displays the currently running portion of the control tree including active subcomponents. This can be used to more easily determine the currently active portion of the design as well as visualize how much of the execution is occurring in parallel at any given point.

Exiting the debugger

Use help to see all commands. Use exit to exit the debugger.

Debugging Compilation Bugs

These tips are directed towards compilation bugs. Before trying these, make sure your program produces the correct values with the Calyx Interpreter

Disabling Optimizations

The first step is disabling optimization passes and running the bare bones compilation pipeline.

To disable the passes, add the flag -p no-opt to compiler invocation:

  1. For the compiler: futil <filename> -p no-opt.
  2. For fud: fud ... -s futil.flags " -p no-opt".

If the output is still incorrect then one of the core compilation passes is incorrect. Our best bet at this point is to reduce the test file such that the output from the interpreter and the Calyx compiler still disagree and report the bug. We can use the waveform debugging to figure out which part of the compilation pipeline generates the incorrect result.

If the execution generates the right result, then one of the optimizations passes is incorrect. To identify which optimization pass is wrong, add back individual passes and see when the execution fails. The calyx/src/ file defines the compilation pipeline. Start by incrementally adding passes to this flag invocation:

-p validate -p remove-comb-groups -p <PASS 1> ... -p <PASS N> -p compile -p lower

Reducing Test Files

It is often possible to reduce the size of the example program that is generating incorrect results. In order to perform a reduction, we need to run the program twice, once with a "golden workflow" that we trust to generate the right result and once with the buggy workflow.

Inlining (Optional)

It is useful to inline all the code into the main component so that we can focus our energy on reducing one control program. This can be done by passing the following flags to the compiler:

-p well-formed -p inline -x inline:always -p post-opt

The -x inline:always flag tells the inlining pass to attempt to inline all components into one. If this command fails, we can just work with the original program and reduce control programs for each component.


At a high-level, we want to do the following:

  1. Delete some part of the control program
  2. See if the error still occurs
  3. If it doesn't, delete a smaller part of the control program
  4. Otherwise, continue deleting more parts of the control program

When deleting parts of the control program, the compiler may complain that certain groups are no longer being used. In this case, run the -p dead-group-removal pass before any other pass runs.

Next, if we've identified the problem to be in one of the Calyx passes, the "golden workflow" is running the program without the pass while the buggy workflow is running the program with the pass enabled. This case is so common that we've written a script that can run programs with different set of flags to the Calyx compiler and show the difference in the outputs after simulation.

The script is invoked as:

tools/ <calyx program> <data>

By default, the script will try to run the programs by simulating them through Verilator by providing fud with the target --to dat. If you'd like to use the Calyx Interpreter instead, run the following command:

tools/ <calyx program> <data> interpreter-out

Reducing Calyx Programs

The best way to reduce Calyx program deleting group enables from the control program and seeing if the generated program still generates the wrong output. While doing this, make sure that you're not deleting an update to a loop variable which might cause infinite loops.

By default, the compiler will complain if the program contains a group that is not used in the control program which can get in the way of minimizing programs. To get around this, run the dead-group-removal pass before the validation passes:

futil -p dead-group-removal -p validate ...

Reducing Dahlia Programs

If you're working with Dahlia programs, it is also possible to reduce the program with the script since it simply uses fud to run the program with the simulator. As with Calyx reduction, try deleting parts of the program and seeing if the flag configurations for the Calyx program still generate different outputs.

Waveform Debugging

Waveform debugging is the final way of debugging Calyx programs. A waveform captures the value of every port at every clock cycle and can be viewed using a wave viewer program like GTKWave or WaveTrace to look at the wave form. Because of this level of granularity, it generates a lot of information. To make the information a little more digestible, we can use information generated by Calyx during compilation.

For waveform debugging, we recommend disabling the optimization passes and static timing compilation (unless you're debugging these passes). In this debugging strategy, we'll do the following:

  1. Dump out the control FSM for the program we're debugging.
  2. Find the FSM states that enable the particular groups that might be misbehaving.
  3. Open the waveform viewer and find clock cycles where the FSM takes the corresponding values and identify other signals that we care about.

Consider the control section from examples/futil/dot-product.futil:

    seq {
      while le0.out with cond0 {
        seq {
          par {

Suppose that we want to make sure that let0 is correctly performing its computation. We can generate the control FSM for the program using:

  futil <filename> -p top-down-cc

This generates a Calyx program with several new groups. We want to look for groups with the prefix tdcc which look something like this:

group tdcc {
  let0[go] = !let0[done] & fsm.out == 4'd0 ? 1'd1; = fsm.out == 4'd1 ? le0.out;
  cs_wh.write_en = fsm.out == 4'd1 ? 1'd1;
  cond0[go] = fsm.out == 4'd1 ? 1'd1;
  par[go] = !par[done] & cs_wh.out & fsm.out == 4'd2 ? 1'd1;
  let1[go] = !let1[done] & cs_wh.out & fsm.out == 4'd3 ? 1'd1;
  let2[go] = !let2[done] & cs_wh.out & fsm.out == 4'd4 ? 1'd1;
  upd2[go] = !upd2[done] & cs_wh.out & fsm.out == 4'd5 ? 1'd1;
  upd3[go] = !upd3[done] & cs_wh.out & fsm.out == 4'd6 ? 1'd1;

The assignments to let0[go] indicate what conditions make the let0 group execute. In this program, we have:

let0[go] = !let0[done] & fsm.out == 4'd0 ? 1'd1;

Which states that let0 will be active when the state of the fsm register is 0 along with some other conditions. The remainder of the group defines how the state in the fsm variable changes:

  ... = fsm.out == 4'd0 & let0[done] ? 4'd1;
  fsm.write_en = fsm.out == 4'd0 & let0[done] ? 1'd1; = fsm.out == 4'd1 & cond0[done] ? 4'd2;
  fsm.write_en = fsm.out == 4'd1 & cond0[done] ? 1'd1;

For example, we can see that when the value of the FSM is 0 and let0[done] becomes high, the FSM will take the value 1.

Once we have this information, we can open the VCD file and look at points when the fsm register has the value 1 and check to see if the assignments in let0 activated in the way we expected.

Emitting Calyx from Python

Our frontends are written in Python3 and make use of the calyx library to generate their code.

To install the library, run the following from the repository root (requires flit installation):

cd calyx-py && flit install -s

The library provides an example:

python calyx-py/test/

Building a Frontend for Calyx

In this tutorial, we're going to build a compiler for a small language called MrXL.

MrXL Overview

MrXL provides constructs to create arrays, and perform map and reduce operations on those arrays. Here's an example of a dot product implementation in MrXL:

input avec: int[4]
input bvec: int[4]
output dot: int[4]
prodvec := map 1 (a <- avec, b <- bvec) { a * b }
dot := reduce 1 (a, b <- prodvec) 0 { a + b }

We define the interface of program by specifying input and output arrays. Input arrays have their values populated by an external harness while the output arrays must be computed using the program.

A map expression iterates over multiple arrays of the same element and produces a new vector using the function provided in the body. In the above example, the map expression multiplies the values of avec and bvec. map 1 states that the operation has a parallelism factor of 1 which means that the loop iterations are performed sequentially.

reduce expressions walk over memories and accumulate a result into a register. In the above code snippet, we add together all the elements of prodvec and place them in a register named dot. Since the reduce parallelism factor is also 1, the reduction is performed sequentially.

Run a MrXL Program

Once we have installed the mrxl command line tool, we can run MrXL programs using fud.

To provide MrXL program with input values, we use fud's JSON-based data format. Let's try to run this program, which has a parallelism factor of two:

input foo: int[4]
output baz: int[4]
baz := map 2 (a <- foo) { a + 5 }

In order to take advantage of the parallelism in the program, the MrXL compiler automatically partitions the input memory foo into two different physical banks: foo_b0 and foo_b1. Therefore, we split up our logical foo input of [1, 2, 3, 4] into [1,2] and [3,4]:

  "foo_b0": {
    "data": [
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
  "foo_b1": {
    "data": [
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32

Our complete data file similarly splits up the input for baz.

Run the program with the complete data by typing:

fud exec frontends/mrxl/test/add.mrxl \
    --from mrxl \
    --to vcd -s frontends/mrxl/test/

Compiling MrXL to Calyx

This guide will walk you through the steps to build a Python program that compiles MrXL programs to Calyx code. The guide assumes some basic familiarity with Calyx. Take a look at Calyx tutorial if you need a refresher.

To simplify things, we'll make a few assumptions about MrXL programs:

  • Every array in a MrXL program has the same length.
  • Every integer in our generated hardware will be 32 bits.
  • Every map and reduce body will be either a multiplication or addition of either an array element or an integer.

The following sections will outline these two high level tasks:

  1. Parse MrXL into a representation we can process with Python
  2. Generate Calyx code

You can find our complete implementation in the Calyx repository.

Parse MrXL into an AST

To start, we'll parse this MrXL program into a Python AST representation. We chose to represent AST nodes with Python dataclass. A program is a sequence of array declarations followed by computation statements:

class Prog:
    decls: List[Decl] # Memory declarations
    stmts: List[Stmt] # Map and reduce statements

Decl nodes correspond to array declarations like input avec: int[1024], and carry data about whether they're an input or output array, their name, and their type:

class Decl:
    input: bool  # Otherwise, output.
    name: str
    type: Type

Stmt nodes represent statements such as dot := reduce 4 (a, b <- prodvec) 0 { a + b }, and contain more nested nodes representing their function header and body, and type of operation.

class Stmt:
    dest: str
    op: Union[Map, Reduce]

The complete AST defines the remaining AST nodes required to represent a MrXL program.

Generate Calyx Code

The skeleton of a Calyx program has three sections, and looks like this:

component main() -> {
  cells {}
  wires {}
  control {}

The cells section instantiates hardware units like adders, memories and registers. The wires section contains groups that connect together hardware instances to perform some logical task such as incrementing a specific register. Finally, the control section schedules the execution of groups using control operators such as seq, par, and while.

We perform syntax-directed compilation by walking over nodes in the above AST and generating cells, wires, and control operations.

Calyx Embedded DSL

To make it easy to generate the hardware, we'll use Calyx's builder module in Python:

import calyx.builder as cb

prog = cb.Builder() # A Calyx program
main = prog.component("main") # Create a component named "main"

Decl nodes

Decl nodes instantiate new memories and registers. We need these to be instantiated in the cells section of our Calyx output. We use Calyx's std_reg and std_mem_d1 primitives to represent registers and memories:

import "primitives/core.futil"; // Import standard library

component main() -> () {
  cells {
    // A memory with 4 32-bit elements. Indexed using a 6-bit value.
    foo = std_mem_d1(32, 4, 6);
    // A register that contains a 32-bit value
    r = std_reg(32);

For each Decl node, we need to determine if we're instantiating a memory or a register, and then translate that to a corresponding Calyx declaration and place that inside the cells section of our generated program.

If a memory is used in a parallel map or reduce, we might need to create different physical banks for it. We define a function to walk over the AST and compute the parallelism factor for each memory:

    # Collect banking factors.
    par_factor = compute_par_factors(prog.stmts)

Using this information, we can instantiate registers and memories for our inputs and outputs:

            for i in range(par):
                main.mem_d1(f"{name}_b{i}", 32, arr_size // par, 32, is_external=True)

The main.mem_d1 call is a function defined by the Calyx builder module to instantiate memories for a component. By setting is_external=True, we're indicating that a memory declaration is a part of the program's input-output interface.

Compiling Map Operations

For every map or reduce node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and then stores the result of that computation. For map operations, we'll perform a computation on an element of an input array, and then store the result in a result array. We can use Calyx's while loops to iterate over an input array, perform the map's computation, and store the final value. At a high level, we want to generate the following pieces of hardware:

  1. A register to store the current value of the loop index.
  2. A comparator to check of the loop index is less than the array size.
  3. An adder to increment the value of the index.
  4. Hardware needed to implement the loop body computation.

Loop Condition

We define a combinational group to perform the comparison idx < arr_size that uses an lt cell.

    group_name = f"cond_{suffix}"
    cell = f"lt_{suffix}"
    lt = comp.cell(cell, Stdlib.op("lt", 32, signed=False))
    with comp.comb_group(group_name):
        lt.left = idx.out
        lt.right = cb.const(32, arr_size)

Index Increment

    group_name = f"incr_idx_{suffix}"
    adder = comp.add(f"incr_{suffix}", 32)
    with as incr:
        adder.left = idx.out
        adder.right = 1
        idx.in_ = adder.out
        idx.write_en = 1
        incr.done = idx.done

The loop index increment is implemented using a [group][lf-group] and an adder (adder). We provide the index's previous value and the constant 1 to the adder and write the adder's output into the register. Because we're performing a stateful update of the register, we must wait for the register to state that it's committed the write by setting the group's done condition to the register's done signal.

Body Computation

The final piece of the puzzle is the body's computation. The corresponding group indexes into the input memories:

        with"eval_body_{suffix}") as ev:
            # Index each array
            for bind in stmt.bind:
                # Map bindings have exactly one dest
                mem = comp.get_cell(f"{name2arr[bind.dest[0]]}")
                mem.addr0 = idx.out

Because the builder module is an embedded DSL, we can simply use Python's for loop to generate all the required assignments for indexing.

This code instantiates an adder or a multiplier depending on the computation needed using the expr_to_port helper function:

            # Provide inputs to the op
            op.left = expr_to_port(body.lhs)
            op.right = expr_to_port(body.rhs)

And writes the value from the operation into the output memory:

            out_mem = comp.get_cell(f"{dest}_b{bank}")
            out_mem.addr0 = idx.out
            out_mem.write_data = op.out
            # Multipliers are sequential so we need to manipulate go/done signals
            if body.op == "mul":
                op.go = 1
                out_mem.write_en = op.done
                out_mem.write_en = 1
            ev.done = out_mem.done

This final operation is complex because we must account for whether we're using an adder or a multiplier. Adders are combinational–they produce their output immediately–while multipliers are sequential and require multiple cycles to produce its output.

When using a mutliplier, we need to explicitly set its go signal to one and only write the output from the multiplier into the memory when its done signal is asserted. We do this by assigning the memory's write_en (write enable) signal to the multiplier's done signal. Finally, the group's computation is done when the memory write is committed.

Generating Control

Once we have generated the hardware needed for our computation, we can schedule its computation using control operators:

                CompPort(CompVar(port), "out"),

We generate a while loop that checks that the index is less than the array size. Then, it sequentially executes the computation for the body and increments the loop index.

Add Parallelization

MrXL allows you to parallelize your map and reduce operations. Let's revisit the map example from earlier:

input foo: int[4]
output baz: int[4]
baz := map 4 (a <- foo) { a + 5 }

The number 4 specifies that four copies of the loop bodies should be executed in parallel. Our implementation already creates memory banks to allow for parallel accesses. At a high-level, we can change the compilation for the map operation to produce n copies of the hardware we generate above and generate a control program that looks like this:

par {
  while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
  while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
  while le_b2.out with cond_b2 { seq { eval_body_b2; incr_idx_b2; } }
  while le_b3.out with cond_b3 { seq { eval_body_b3; incr_idx_b3; } }

The par operator executes all the loops in parallel. The full implementation shows the necessary code to accomplish this which simply creates an outer loop to generate distinct hardware for each copy of the loop.


Hopefully this should be enough to get you started with writing your own MrXL compiler. Some more follow-up tasks you could try if you're interested:

  • Read the code for compiling reduce statements and extend to support parallel reductions using reduction trees.
  • Implement code generation that allows memories that differ from one another in size.
  • Implement complex function body expressions. We only support binary operations with two operands, like a + 5.
  • Add a new filter operation to MrXL.

Frontend Compilers

Several compiler generate Calyx programs from other high-level languages.


Dahlia is an imperative, loop-based programming language for designing hardware accelerators.


First, install sbt and scala.

Then, clone the repository and build the Dahlia compiler:

git clone
cd dahlia
sbt compile
sbt assembly
chmod +x ./fuse

The Dahlia compiler can be run using the ./fuse binary:

./fuse --help

Finally, configure fud to use the Dahlia compiler:

fud c stages.dahlia.exec <path to Dahlia repository>/fuse

Use fud to check if the compiler was installed correctly:

fud check

fud should report that the Dahlia compiler is available and has the right version.

If something went wrong, try following the instructions to build the Dahlia compiler from its repository.

Compiling Dahlia to Calyx

Dahlia programs can be compiled to Calyx using:

fud e --from dahlia <input file> --to futil

The Dahlia backed for Calyx is neither complete nor stable. If you find a confusing error or wrong program, please open an issue.

Systolic Array

Systolic arrays are commonly used to implement fast linear-algebra computations. See this paper for an overview on systolic arrays.

The systolic array frontend lives in the systolic-lang folder in the Calyx repository and generates systolic arrays that can perform matrix multiplies.

The contains the entire program required to generate systolic arrays. In order to generate an 8 X 8 systolic array, run:

./frontends/systolic-lang/ -tl 8 -td 8 -ll 8 -ld 8


Install the calyx-py library.

Command Line Options

The command line options configure the dimensions of the generated systolic array. There are no other properties of the systolic array that can be configured.

  • --top-length, --left-length: The length of top and left sides of the systolic array.
  • --top-depth, --left-depth: The length of the input streams from top and left sides of the array.

TVM Relay

TVM is a compiler for machine learning frameworks that can optimize and target kernels to several different backends. Relay is a high level intermediate representation for the TVM framework. The goal of Relay is to replace old computation graph based IRs with a more expressive IR. More information can be found in this paper.

The TVM Relay frontend lives in the relay-lang folder in the Calyx repository and generates Calyx components from the Relay intermediate representation.


  1. Clone the TVM repository and checkout the tag v0.10.dev0:

     git clone
     cd incubator-tvm && git checkout v0.10.dev0
     git submodule init && git submodule update
  2. Set up to build (the default configuration is fine because we don't need any fancy backends like LLVM or CUDA):

     mkdir build && cd build
     cp ../cmake/config.cmake .
  3. Build TVM:

     cmake -G Ninja .. && ninja
  4. Install the tvm Python package by building a wheel:

     cd ../python && python3 bdist_wheel
     pip3 install --user dist/tvm-*.whl

    If you get an error with shutil, try deleting the python/ directory, restoring it, and rerunning the above command: cd .. && rm -rf python && git checkout -- python
    If you are on MacOS - Big Sur and are getting an error similar to "(wheel).whl is not a supported wheel on this platform", try changing part of the wheel's filename from 11_0 to 10_9. See this github issue for more information.

  5. Install ANTLR v4.7.2 (required for the Relay text format parser):

     pip3 install -Iv antlr4-python3-runtime==4.7.2
  6. To run the MLP net and VGG net examples, install pytest:

     pip3 install pytest
  7. Install Dahlia, which is used when lowering Relay call nodes to Calyx.

  8. Install the calyx-py library.

Run an Example

Try this to run a simple example:

cd calyx/frontends/relay
python3 tensor_add
  • -h: Help option; shows available examples.
  • -r: Dumps the Relay IR. Otherwise, it dumps the Calyx output.

Simulate an ONNX Model

A simple script is provided to run an Open Neural Network Exchange (ONNX) model. In addition to installing TVM Relay above, you'll need the following PIP installations for ONNX simulation and image pre-processing:

pip3 install opencv-python Pillow mxnet onnx simplejson

For example, we can simulate the LeNet ONNX model found here using the following command:

python3 frontends/relay/ \
-n "lenet" \
-d "MNIST" \
-i "/path/to/image.png" \
-onnx "/path/to/model.onnx" \
-o calyx
  • -n: The name of the input net. This is mostly used for naming the output files.
  • -d: The dataset for which the input will be classified against. This is necessary to determine what preprocessing should be done on the image. e.g. "mnist" or "imagenet".
  • -i: The file path to the input image which you want classified.
  • -onnx: The file path to the ONNX model.
  • -o: The type of output.
    1. tvm: Executes the ONNX model using the TVM executor. Prints the final softmax value to console. No postprocessing is conducted.
    2. relay: Output a file with the corresponding Relay program. <net_name>.relay
    3. calyx: Output a .data file and Calyx program for simulation. <net_name>.futil, <net_name>.data
    4. all: All the above.
  • -s: This is an optional boolean argument that signifies save_mem, and is set to true by default. If this flag is set to true, then it will produce a Calyx design that requires less internal memory usage compared to the design that is produced when this flag is false.

Number Theoretic Transform (NTT)

The number theoretic transform is a generalization of the fast Fourier transform that uses nth primitive root of unity based upon a quotient ring instead of a field of complex numbers.

It is commonly used to speed up computer arithmetic, such as the multiplication of large integers and large degree polynomials. The pipeline produced here is based upon this paper, which also provides some background information on NTT.

The NTT pipeline frontend lives in the ntt folder in the Calyx repository and generates the pipeline for the NTT transform.

The file contains the entire program required to generate NTT pipelines. In order to generate a pipeline with bit width 32, input size 4, and modulus value 97:

./frontends/ntt-pipeline/ -b=32 -n=4 -q=97


Install the calyx-py library.

The generator also produces a table to illustrate which operations are occurring during each stage of the pipeline. This requires installing PrettyTable:

pip3 install prettytable numpy

Fud Stage

The NTT pipeline defines an [external fud stage][../fud/] to transform configuration files into Calyx programs. To install, run:

fud register ntt -p frontends/ntt-pipeline/fud/ && fud check

This should report the newly installed ntt stage in the configuration.

Configuration Files

Configurations files simply specify command line parameters:

  "input_bitwidth": 32,
  "input_size": 4,
  "modulus": 97

Command Line Options

The command line options configure the bit width, size, and modulus value of the pipeline.

  • --input_bitwidth: The bit width of each value in the input array.
  • --input_size: The length (or size) of the input array.
  • --modulus: The (prime) modulus value used during the transformation.
  • --parallel_reduction: Decreases fan-out by reducing the number of groups executed in parallel by this factor.


The MrXL frontend is a toy frontend developed for the frontend tutorial. As such, it is less rigorously tested and might have bugs.

MrXL is an example DSL for demonstrating Calyx. MrXL programs consist of map and reduce operations on arrays. For example, this is a dot product implementation:

input avec: int[1024]
input bvec: int[1024]
output dot: int
prodvec := map 16 (a <- avec, b <- bvec) { a * b }
dot := reduce 4 (a, b <- prodvec) 0 { a + b }

The numbers that come right after map and reduce are parallelism factors that guide the generation of hardware.


Install the calyx-py library.

The MrXL implementation is in Python and uses Flit. First, install flit (pip install flit or similar), and then type the following inside frontend/mrxl:

flit install --symlink

This creates a symbolic link the mrxl directory and installs the mrxl command line tool.

By default, fud looks for the mrxl executable to enable the mrxl compilation stage. Type fud check to make sure fud reports that the mrxl compiler has been found.


To run the interpreter, do this:

mrxl <program> --data <indata> --interpret

where <program> is a MrXL source code file and <indata> is a JSON file containing values for all the variables declared as input in the program. The interpreter dumps the output variables as JSON to stdout.

You can try this, for example:

mrxl test/dot.mrxl --data test/dot.json --interpret

Compiling to Calyx

To run the compiler, leave off the --interpret and --data flags:

mrxl test/dot.mrxl

In order to run the compiler through fud, pass the --from mrxl flag:

fud e --from mrxl <mrxl file> --to futil

To simulate the Verilog generated from the mrxl compiler, set the -s as usual:

fud e --from mrxl <mrxl file> --to dat -s <data file>


Runt (Run Tests) is the expectation testing framework for Calyx. It organizes collections of tests into test suites and specifies configuration for them.

Runt uses runt.toml to define the test suites and configure them.


Runt workflow involves two things:

  1. Running tests and comparing differences
  2. Saving new or changed golden files

To run all the tests in a directory, run runt with a folder containing runt.toml.

The following commands help focus on specific tests to run:

  • -i: Include files that match the given pattern. The pattern is matched against <suite name>:<file path> so it can be used to filter both test suites or specific paths. General regex patterns are supported.
  • -x: Exclude files that match the pattern
  • -o: Filter out reported test results based on test status. Running with miss will only show the tests that don't have an .expect file.

Differences. -d or --diff shows differences between the expected test output and the generated output. Use this in conjunction with -i to focus on particular failing tests.

Saving Files. -s is used to save test outputs when the have expected changes. In the case of miss tests, i.e. tests that currently don't any expected output file, this saves a completely new .expect file.

Dry run. -n flag shows the commands that runt will run for each test. Use this when you directly want to run the command for the test directly.

For other options, look at runt --help which documents other features in runt.

For instruction on using runt, see the official documentation.

Data Gen

Data Gen is a tool that can generate a memory json from a Calyx file. It reads the Calyx file and generates an entry in the json for each cell marked with the @external attribute. Currently, there are two types of number representation formats that can be generated as the data for the memory json: 1) unsigned 32 bit bitnums all equal to 0 and 2) signed, 32 bit fixed point numbers with frac_width = 16, and the data is randomly generated.

How to Run

The following command can be run to generate unsigned, 32-bit zeroes:
cargo run -p data_gen -- <calyx file>

To generate random fixed point numbers, run:
cargo run -p data_gen -- <calyx file> -f true

It will print the json in the command line

Current Limitations

As you can see, the tool is right now pretty limited, because it only supports 2 different representations of numbers. What if you want to generate random, 8 bit, unsigned ints in your memory? Data Gen currently wouldn't support that. Ideally, we would want each Calyx memory cell to have its own attribute(s) which can hold information about what type of number representation format the memory wants. This github issue goes into more detail about future improvements for the tool.

exp Generator

The exp generator uses a Taylor series approximation to calculate the fixed point value of the natural exponential function e^x. The Maclaurin series for the function can be written as:

e^x = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!

where n is the nth degree or order of the polynomial.

For signed values, we can take the reciprocal value:

e^(-x) = 1/e^x

The fp_pow_full component can calculate the value of b^x where b and x are any fixed point numbers. This can be calculated by first observing:

b^x = e^(ln(b^x)) = e^(x*ln(b))

Therefore, we just calculate x*ln(b), and then we can feed the result into the exp
component to get our answer.

To calculate ln(p) for fixed point values p, we use the second order Padé Approximant of ln(p). We calculated the approximant using Wolfram Alpha.

The file can generate an entire Calyx program for testing purposes. can generate two different types of designs, depending on the base_is_e flag: if base_is_e is true, then the design can only caclulate values for e^x. The main component contains memories x (for the input) and ret for the result of e^x. If base_is_e is false, then the design can calculate values for b^x for any base e. Therefore, the main component contains memories x (the exponent input), b (the base intput), and ret for the result of b^x. In order to generate an example program (that can only calculate exponent values with base e), with degree 4, bit width 32, integer bit width 16, and x interpreted as a signed value:

./calyx-py/calyx/ -d 4 -w 32 -i 16 -s true -e true 

Similarly, it provides a function to produce only the necessary components to be dropped into other Calyx programs.


Install the calyx-py library.

Command Line Options

The command line options configure the degree (or order) of the taylor series, bit width, integer bit width, and sign.

  • --degree: The degree of the Taylor polynomial.
  • --width: The bit width of the value x.
  • --int_width: The integer bit width of the value x. The fractional bit width is then inferred as width - int_width.
  • --is_signed: The signed interpretation of the value x. If x is a signed value, this should be true and otherwise, false.
  • --base_is_e: A boolean that determines whether or not to generate components needed to just calculate e^x, or to generate components needed to calculate b^x for any base b.

Editor Highlighting


The vim extension highlights files with the extension .futil. It can be installed using a plugin manager such as vim-plug using a local installation. Add the following to your vim plug configuration:

Plug '<path-to-calyx>/tools/vim'

And run:



futil-mode is implements highlighting for .futil files in emacs. It is located in <repo>/tools/emacs/futil-mode.

Clone the repository, add the above path to your load path, and require futil-mode. If you use Spacemacs, this looks like adding the following lines to dotspacemacs/user-config in your .spacemacs file:

(push "~/.emacs.d/private/local/fuse-mode" load-path)
(require 'fuse-mode)

I imagine it looks very similar for pure emacs, but haven't actually tried it myself.

Visual Studio Code

Add a link to the Calyx VSCode extension directory to your VSCode extensions directory.

cd $HOME/.vscode/extensions
ln -s <calyx root directory/tools/vscode calyx.calyx-0.0.1

Restart VSCode.


Here is a list of all the people who have worked on Calyx:

Current Contributors

Previous Contributors

If you're missing from this list, please add yourself!