Chapter 11: Baking Silicon: Open ISAs and Custom Accelerators
If you survived the previous chapter, you now have a highly optimized, quantized int8 neural network model. But if you run that model on a standard, general-purpose microcontroller, you are still fighting the architecture. General-purpose CPUs are designed to do a little bit of everything; they are not designed to blast through millions of matrix multiplications per second.
To achieve the massive energy efficiency and performance required for edge AI, we must follow the ultimate rule of Domain-Specific Architectures (DSAs): we must build custom silicon tailored directly to the math. In this chapter, we are going to look at why the open-source RISC-V architecture fundamentally changes the rules of hardware-software co-design, and we will write the actual hardware code to build a custom Tensor Coprocessor.
11.1 Open ISAs: The RISC-V Revolution
Historically, if you wanted to build a custom Cyber-Physical System (CPS) chip, your options were severely limited. The Instruction Set Architectures (ISAs) that dominate the industry, such as x86 and ARM, are proprietary intellectual property. If you wanted to add a custom matrix-multiply instruction to an ARM core to speed up your AI workload, you couldn’t simply modify the silicon. You would have to negotiate a profoundly expensive architectural license, assuming the vendor would even allow you to fragment their ecosystem.
RISC-V changes everything.
Developed in 2011 at the University of California, Berkeley by computer scientists including Krste Asanović and David Patterson, RISC-V (pronounced “risk-five”) is an entirely open-source, free-to-use ISA specification. Unlike legacy ISAs that evolved by endlessly piling on new instructions to maintain backward compatibility, RISC-V started with a clean slate.
The architecture is strictly modular. It defines a mandatory, minimal base integer instruction set—such as RV32I (32-bit) or RV64I (64-bit)—which contains fewer than 50 standard instructions. Everything else is an optional extension. If your edge device needs floating-point math, you add the F and D extensions; if it needs compressed 16-bit instructions to save memory, you add the C extension.
But the true game-changer for hardware-software co-design is that the RISC-V specification explicitly reserves opcodes for custom, user-defined instructions and coprocessors.
System-on-Chip (SoC) designers can now tightly integrate custom Domain-Specific Architectures (DSAs) directly into the CPU’s execution pipeline. Because the instruction set is free and open, you can download an open-source RISC-V core (like PicoRV32 or VexRiscv), modify the Verilog to recognize your custom neural-network opcodes, and compile your software using the standard open-source GCC or Clang toolchains. You don’t have to sign a contract, and you avoid the “Turing tax” of trying to force a general-purpose processor to do a specialized job.
11.2 Building a Custom AI Accelerator
If we want to build a custom AI accelerator—a Tensor Coprocessor—to attach to our RISC-V core, we need to design the digital logic.
Traditionally, hardware engineers use Register Transfer Languages (RTLs) like Verilog or VHDL. However, as SoC complexity has exploded, these legacy languages have shown their age. Verilog and VHDL are verbose, require meticulous manual wiring between components, and offer very little help with managing concurrency or parameterization. Building a highly scalable, parameterized systolic array in raw Verilog is a nightmare of repetitive code.
To solve this, modern hardware designers are moving toward Hardware Construction Languages (HCLs).
Enter Chisel
One of the most prominent HCLs is Chisel (Constructing Hardware in a Scala Embedded Language). Chisel is embedded in Scala, a powerful, modern programming language.
By using Chisel, you are not writing raw logic gates; you are writing a Scala program that generates the hardware logic gates. This gives you the full power of object-oriented and functional programming to define your silicon. Chisel automatically infers bus widths, simplifies hierarchical wiring, and provides native support for powerful hardware paradigms, eventually compiling down into highly optimized, synthesizable Verilog.
TIP: Hardware is Not Software When writing Chisel, remember that even though the syntax looks like software, you are describing physical wires, multiplexers, and flip-flops. An
ifstatement in software evaluates a condition and branches execution. Awhenblock in Chisel physically instantiates a hardware multiplexer in silicon, wiring both potential data paths into a logic gate.
Defining the Tensor Coprocessor (MAC Unit)
The heart of any AI accelerator is the Multiply-Accumulate (MAC) unit. As we learned in Chapter 10, deep neural networks are essentially gigantic matrices being multiplied together. A MAC unit takes two inputs (a weight and an activation), multiplies them, and adds the product to a running accumulator register.
Let’s use Chisel to build a highly efficient MAC engine for the quantized int8 data we created earlier.
Here is the Chisel code to define a basic, synchronous hardware MAC coprocessor:
import chisel3._
import chisel3.util._
// 1. Define the Hardware Interface (IO)
class TensorMacIO extends Bundle {
// Inputs from the RISC-V core
val activation = Input(SInt(8.W)) // 8-bit signed integer input
val weight = Input(SInt(8.W)) // 8-bit signed integer weight
val clear = Input(Bool()) // Signal to reset the accumulator
val enable = Input(Bool()) // Signal to perform the MAC operation
// Output back to the RISC-V core
val result = Output(SInt(32.W)) // 32-bit accumulated result
}
// 2. Define the Hardware Module
class TensorCoprocessor extends Module {
val io = IO(new TensorMacIO)
// 3. Create a 32-bit hardware register to hold the accumulated sum.
// It is automatically clocked by the system clock.
val accumulator = RegInit(0.S(32.W))
// 4. Describe the Hardware Logic
when (io.clear) {
// If the clear signal is high, route a zero into the register
accumulator := 0.S
} .elsewhen (io.enable) {
// If enable is high, multiply the 8-bit inputs and add to the accumulator.
// Chisel automatically provisions the hardware multiplier and adder circuits.
accumulator := accumulator + (io.activation * io.weight)
}
// 5. Wire the register's current value to the output pin
io.result := accumulator
}
Code Walkthrough: From Scala to Silicon
Let’s break down exactly what this Chisel code physically constructs on the FPGA or ASIC:
1. The Interface (TensorMacIO):
In hardware, a component must have physical pins. We define a Bundle to group our wires. We declare two 8-bit signed integer (SInt) inputs for our neural network data, boolean control wires (clear and enable), and a 32-bit output wire to prevent our accumulated sum from overflowing.
2. The Module (TensorCoprocessor):
We extend the Chisel Module class, which tells the compiler to synthesize this as a distinct hardware block.
3. The State Register (RegInit):
Unlike combinational logic which has no memory, a MAC unit needs to remember the running total. The RegInit(0.S(32.W)) command physically instantiates a 32-bit D-type flip-flop register, initialized to zero upon system reset. Because Chisel assumes synchronous design, the clock and reset lines are implicitly routed to this register for us.
4. The Control Logic (when / .elsewhen):
This block creates the data path. The when construct generates a hardware multiplexer. If the RISC-V processor asserts the io.clear wire, the multiplexer routes a hardwired 0 into the accumulator register’s input. If io.enable is asserted, the logic instantiates an 8-bit hardware multiplier connected to a 32-bit hardware adder, routing the sum back into the register.
By defining this module in Chisel, we can easily parameterize it. With a few loops in Scala, we could instantiate an entire 16x16 systolic array of these MAC units, generating thousands of dedicated logic gates. We then map this coprocessor to one of RISC-V’s reserved custom instruction opcodes.
In the next section, we will see how to write the specific inline assembly instructions in our C/C++ firmware to fire data into this newly forged silicon and retrieve the AI inferences.
11.3 Executing Custom Instructions
If you followed along in the last section, you used Chisel to define a custom Tensor Coprocessor. The hardware synthesizes beautifully, the multiplexers are wired, and the 32-bit accumulator flip-flops are ready and waiting on the silicon.
But there is a problem. You are writing your drone’s flight control software in C or C++. If you compile result = activation * weight;, the standard GCC compiler will look at its target architecture (RV32I) and generate a standard sequence of integer load, multiply, add, and store instructions. The compiler is completely blind to your shiny new hardware accelerator.
To bridge the gap between the custom Domain-Specific Architecture (DSA) we just designed and the high-level software stack, we have to bypass the C compiler.
The RISC-V Custom Opcodes
If you were working with a proprietary Instruction Set Architecture (ISA) like x86 or ARM, adding a new instruction would be impossible without a multi-million-dollar architectural license. But RISC-V was built for exactly this scenario.
To support DSAs and custom hardware, the RISC-V specification explicitly reserves four opcodes—custom-0, custom-1, custom-2, and custom-3—exclusively for user-defined instructions. Each of these opcodes can be further extended using standard 3-bit and 7-bit function codes (funct3 and funct7), leaving room for literally thousands of custom instructions in your silicon.
Let’s assume we mapped the hardware MAC unit from the previous section to a custom instruction we will call VMAG (Vector Multiply-Accumulate Generator).
Bridging the Gap with Inline Assembly
In Chapter 2, we used inline assembly as a scalpel to access system control registers and put the CPU to sleep. Now, we will use that exact same scalpel to invoke our custom VMAG instruction.
You might think that to use a new instruction, you have to download the source code for the GNU Assembler, add your instruction mnemonic to its parsing tables, and recompile the entire toolchain. Thankfully, the RISC-V GNU toolchain provides a brilliant backdoor: the .insn directive. This directive allows you to construct raw machine instructions on the fly directly inside your C code.
Here is the exact inline assembly required to wrap our new custom silicon in a callable C function:
#include <stdint.h>
// A C-callable wrapper for our custom hardware instruction
static inline int32_t execute_vmag(int8_t activation, int8_t weight) {
int32_t accumulator_result;
// Inject the raw instruction into the execution pipeline
__asm__ volatile (
// The RISC-V .insn directive formats a raw R-type instruction:
// .insn r opcode, funct3, funct7, rd, rs1, rs2
".insn r custom0, 0, 0, %0, %1, %2 \n\t"
: "=r" (accumulator_result) // Output operand (mapped to %0)
: "r" (activation), // Input operand 1 (mapped to %1)
"r" (weight) // Input operand 2 (mapped to %2)
// No clobber list needed; we only modify the output register
);
return accumulator_result;
}
Line-by-Line Walkthrough
Let’s look at exactly how this code ties the software to the silicon:
1. The static inline Wrapper
We wrap the assembly in a static inline C function. This means whenever you call execute_vmag() in your neural network loop, the compiler won’t actually perform a function call (which burns clock cycles pushing the return address to the stack). Instead, it will drop our custom machine instruction directly into the C code at that exact location.
2. The .insn r Directive
The .insn r directive tells the assembler, “I am building a standard RISC-V R-type instruction (Register-to-Register).” We provide it with the custom0 opcode, and set funct3 and funct7 to 0 (since we only have one custom instruction right now).
3. The Operand Placeholders (%0, %1, %2)
This is where the magic happens. We don’t hardcode CPU registers like x10 or x11. We let the C compiler’s register allocator do its job.
%1is mapped to theactivationinput. The compiler will automatically load the activation data into whatever general-purpose register happens to be free, and seamlessly substitute that register number into the%1slot of ourVMAGinstruction.%2is mapped to theweightinput.%0is mapped to theaccumulator_resultoutput.
When the CPU pipeline encounters this instruction, the hardware decoder instantly recognizes the custom0 opcode. Instead of routing the activation and weight registers to the standard Arithmetic Logic Unit (ALU), the CPU’s internal crossbar routes those register values directly to the physical Chisel input pins we defined in Section 11.2. On the next clock edge, the hardware multiplies them, adds them to the accumulator, and routes the 32-bit result back over the bus into the destination register (%0).
TIP: The True Power of Hardware-Software Co-Design This tiny inline assembly block represents the holy grail of Domain-Specific Architectures. You have effectively extended the C programming language to natively understand a neural network hardware accelerator. To the software engineer writing the AI model,
execute_vmag()just looks like a fast C function. But beneath the surface, you have bypassed the fetch-decode-execute overhead of thousands of standard instructions, directly stimulating custom digital logic to do the heavy lifting at a fraction of the power cost.