Chapter 3: Moving Data: Interconnects, AMBA AXI, and DMA

3.1 The Myth of the Bus

If you look at the motherboard of a vintage 1990s PC, you can literally see the “bus”—thick parallel traces of copper running from the CPU slot to the memory and expansion cards. It was a shared electrical pathway. If the CPU was reading a file from the hard drive, it seized the bus, and every other device had to politely wait its turn.

When we talk about computer architecture today, we still use the word “bus.” We talk about the memory bus, the peripheral bus, and the system bus. But if you are building a modern Cyber-Physical System (CPS) on a System-on-Chip (SoC), the bus is a lie.

As SoCs grew to encompass multiple processor cores, GPU accelerators, and dozens of high-bandwidth peripherals, the concept of a single shared electrical bus collapsed under its own weight. If a shared bus has multiple initiators (like a CPU and a DMA controller), they must constantly arbitrate for access. Only one requestor-completer pair can communicate at any given time, forcing high-speed devices to stall and waste precious clock cycles while low-speed devices finish their transfers.

To solve this contention, engineers moved to Full Crossbar Switch Architectures. A crossbar layers a physical switch in front of every completer (target) in the system. Instead of sharing one set of wires, a crossbar provides a matrix of connections, allowing multiple requestors to talk to multiple completers simultaneously. If Core 0 wants to talk to the memory controller while the DMA engine talks to the UART, a crossbar allows both transactions to happen at the exact same time.

However, crossbars scale quadratically. If you have 64 requestors and 64 completers, you need 4,096 switching nodes. In a modern SoC, that consumes a massive amount of physical silicon area and creates impossibly dense wiring congestion.

Welcome to the Network-on-Chip (NoC)

To escape the physical limits of crossbar wiring, modern high-performance SoCs use a Network-on-Chip (NoC).

Rather than holding open a dedicated electrical circuit between the CPU and the memory, a NoC operates exactly like a miniature Internet. It is a packet-switched network right on the silicon. When your processor writes a value to a memory-mapped peripheral, that write command is digitized into a packet, broken down into smaller flow-control units called flits, and routed through a mesh of microscopic switching elements (routers).

The NoC multiplexes all forms of data—memory reads, peripheral writes, and cache coherency messages—onto a shared network fabric. This yields massive bandwidth, but introduces a new reality for the software engineer: latency is no longer deterministic. Just like pinging a remote server on the Internet, a memory read on a NoC might arrive instantly, or it might be delayed because the network routers are congested with other traffic.

WAR STORY: The Hidden Cost of the NoC In a shared bus, if you trigger a peripheral write, it happens sequentially. On a NoC, if you fire off a write to a motor controller, and immediately fire a write to a brake controller, those two packets might take different physical routes through the chip’s switching fabric. If the motor’s route is clear but the brake’s route is congested, they might arrive out of order. This is why you must explicitly use memory barrier (fence) instructions when the sequence of hardware events is critical.

3.2 Inside AMBA AXI

To standardize how IP blocks communicate over these complex fabrics, ARM introduced the Advanced Microcontroller Bus Architecture (AMBA). While early versions like AHB (Advanced High-performance Bus) still relied on older shared-bus paradigms, the need for extreme performance led to the creation of the AXI (Advanced eXtensible Interface) protocol.

AXI is the de facto standard for high-performance SoC design. If you are configuring a custom FPGA or programming a high-end ARM Cortex-A processor, your data is moving over AXI.

AXI completely abandons the idea of a single set of shared wires. Instead, it breaks every connection into five independent, unidirectional channels.

Write Address Channel (AW): The manager sends the target memory address and control information.
Write Data Channel (W): The manager sends the actual payload data.
Write Response Channel (B): The subordinate confirms if the write succeeded or failed.
Read Address Channel (AR): The manager sends the target memory address to read from.
Read Data Channel (R): The subordinate sends the requested data (and status) back to the manager.

Why five channels? Because decoupling the addresses from the data allows the hardware to pipeline transactions. A CPU doesn’t have to wait for a write to finish before starting a read. It can blast out ten write addresses on the AW channel, simultaneously accept read data on the R channel, and eventually push the write payloads down the W channel. Furthermore, because data flows in only one direction per channel, hardware engineers can easily insert pipeline registers to hit massive clock frequencies.

(Note: You might wonder why there are three channels for writing but only two for reading. In a read, data flows from the subordinate to the manager, so the status response simply piggybacks on the Read Data channel. In a write, data flows from the manager to the subordinate, so a dedicated reverse channel is needed just to send the success/fail response back.)

The VALID/READY Handshake

At the absolute core of AXI is the flow-control mechanism used independently on all five channels: the VALID/READY handshake.

Because AXI connects devices operating at vastly different speeds, the protocol must support strict backpressure to prevent fast devices from overwhelming slow devices.

The sender (source) drives the data lines and asserts the VALID signal when the data is legitimate.
The receiver (destination) asserts the READY signal when it has the buffer space to accept new data.

A successful transfer only occurs on the rising edge of the clock cycle where both VALID and READY are HIGH.

Let’s look at a text-based timing diagram of an AXI Write Data transfer where the subordinate is applying backpressure:

Clock Cycle:      |   T1   |   T2   |   T3   |   T4   |   T5   |
                  |        |        |        |        |        |
Manager WDATA:    |  0xAA  |  0xBB  |  0xBB  |  0xBB  |  0xCC  |
Manager WVALID:   |  HIGH  |  HIGH  |  HIGH  |  HIGH  |  HIGH  |
Subord. WREADY:   |  HIGH  |  LOW   |  LOW   |  HIGH  |  HIGH  |
                  |        |        |        |        |        |
Transfer Occurs?  |  YES   |  NO    |  NO    |  YES   |  YES   |

What happened here?

T1: The Manager places 0xAA on the bus and yells VALID. The Subordinate is READY. The clock ticks, and the data is successfully transferred.
T2: The Manager wants to send the next byte, 0xBB. It asserts VALID. But the Subordinate’s internal FIFO is full, so it drops READY to LOW. The clock ticks, but no transfer occurs.
T3: The Subordinate is still choking. The Manager must hold 0xBB on the data bus and keep VALID asserted. It cannot cancel the transaction or change the data.
T4: The Subordinate clears its buffer and raises READY. Because VALID is still high, the clock ticks and 0xBB is finally transferred.
T5: The Manager immediately pushes 0xCC, the Subordinate is still READY, and the transfer continues seamlessly.

TIP: Debugging AXI Hangs If you are writing a bare-metal driver or integrating custom FPGA logic, the most common hardware bug is a system lockup caused by a botched AXI handshake. If your CPU writes to a peripheral address and the whole system freezes, it almost always means the processor asserted WVALID, but the peripheral’s state machine crashed and never asserted WREADY. The CPU will wait until the end of time for that handshake to complete. When debugging with a logic analyzer or simulation trace, always look for unmatched VALID signals!

3.3 Direct Memory Access (DMA)

WARNING: Polling Burns CPU Cycles and Batteries If you take away only one lesson from this chapter, let it be this: never use the CPU to babysit a peripheral. If you write a while(1) loop that constantly polls a UART’s “TX Ready” bit just to send the next byte of an array, you are keeping the processor pipeline running at full throttle. On a battery-powered device, this continuous switching activity burns maximum dynamic power and will drain your battery in hours. On a thermally constrained system, it generates waste heat for absolutely no computational gain. Stop stuffing envelopes yourself. Let the hardware do the work.

In a traditional fetch-decode-execute cycle, moving a block of data from memory to an I/O device requires the processor to execute a load instruction, followed by a store instruction, over and over again. If you have a 4-kilobyte audio buffer that needs to be sent to a Digital-to-Analog Converter (DAC), having the CPU execute a loop of 4,096 loads and stores is a massive waste of cycles.

To solve this, SoC architects include a dedicated hardware block known as a Direct Memory Access (DMA) controller. You can think of the DMA as a tiny, highly specialized co-processor whose only job in life is to copy data from point A to point B over the system interconnect without any CPU intervention. Once you configure it and pull the trigger, the CPU can either go do crunch some heavy math, or completely power down its instruction pipeline and go to sleep.

3.3.1 The DMA Register Interface

At the hardware level, a single-channel DMA controller is usually controlled by four fundamental Memory-Mapped I/O (MMIO) registers:

Source Pointer Register: The physical address where the data originates.
Destination Pointer Register: The physical address where the data is going.
Length Register: The number of bytes or words to transfer.
Control/Status Register: Used to configure the transfer parameters (like whether to increment the pointers) and to start the transfer.

Let’s look at how we configure this in C. Suppose we have a sensor reading array that we need to blast out over a serial port. We want the DMA to handle the transfer, and we want the CPU to drop into a deep sleep until the transfer is entirely finished.

3.3.2 Blasting Data with C

Here is the comprehensive, bare-metal C code to configure the DMA controller, kick off the transfer, and put the processor to sleep:

#include <stdint.h>

// 1. Define the MMIO addresses for our DMA Controller based on the datasheet
#define DMA_BASE_ADDR   0x40020000
#define DMA_SRC_REG     (*(volatile uint32_t*) (DMA_BASE_ADDR + 0x00))
#define DMA_DEST_REG    (*(volatile uint32_t*) (DMA_BASE_ADDR + 0x04))
#define DMA_LEN_REG     (*(volatile uint32_t*) (DMA_BASE_ADDR + 0x08))
#define DMA_CTRL_REG    (*(volatile uint32_t*) (DMA_BASE_ADDR + 0x0C))

// Define the MMIO address for the target peripheral (e.g., UART TX FIFO)
#define UART0_TX_REG    (*(volatile uint32_t*) 0x40001000)

// 2. Define bit-masks for the DMA Control Register
#define DMA_ENABLE           (1 << 0)  // Start the transfer
#define DMA_INT_ENABLE       (1 << 1)  // Fire an interrupt when finished
#define DMA_SRC_INCREMENT    (1 << 2)  // Increment source address after each read
#define DMA_DEST_INCREMENT   (1 << 3)  // Increment destination address after each write

void send_data_via_dma(uint32_t* data_array, uint32_t element_count) {
    
    // 3. Set the source address to the beginning of our array in RAM
    // We must cast the pointer to a raw 32-bit integer for the hardware register
    DMA_SRC_REG = (uint32_t) data_array;
    
    // 4. Set the destination address to the UART Transmit Register
    DMA_DEST_REG = (uint32_t) &UART0_TX_REG;
    
    // 5. Tell the DMA how many elements to transfer
    DMA_LEN_REG = element_count;
    
    // 6. Configure the behavior and pull the trigger!
    // - We want the source address to increment so we march through the array.
    // - We DO NOT increment the destination address (all data goes to the same UART FIFO).
    // - We enable the completion interrupt to wake up the CPU.
    // - We enable the DMA to start immediately.
    DMA_CTRL_REG = DMA_SRC_INCREMENT | DMA_INT_ENABLE | DMA_ENABLE;
    
    // 7. The DMA is now running. Put the CPU to sleep to save battery.
    // The processor pipeline halts here until a hardware interrupt fires.
    __asm__ volatile ("wfi"); 
    
    // 8. When we reach this line, the DMA has finished and the interrupt woke us up.
}

3.3.3 Line-by-Line Walkthrough

Let’s break down exactly what is happening between the software and the silicon here.

Steps 1 & 2: The MMIO Definitions We map the four registers of the DMA controller by casting hardcoded physical addresses into volatile uint32_t pointers, just as we did in Chapter 1. We also define bit-masks that match the layout of the Control Register in the hardware’s datasheet.

Steps 3, 4 & 5: Loading the Pointers We load the DMA_SRC_REG with the physical address of data_array. Because the DMA hardware bypasses the CPU pipeline, it doesn’t understand C pointers—it only understands raw bus addresses. We load DMA_DEST_REG with the address of the UART’s transmit register, and set the length.

Step 6: The Increment Trap This is a classic embedded systems gotcha. When you copy an array from one part of RAM to another, you want both the source and destination addresses to increment after every byte. However, when streaming data to a peripheral, the peripheral’s data register is a fixed address representing a hardware FIFO.

If we accidentally set the DMA_DEST_INCREMENT bit here, the DMA would write the first byte to 0x40001000 (the UART), the second byte to 0x40001004 (some random hardware configuration register), the third byte to 0x40001008 (perhaps disabling the UART entirely), and then crash the system. By leaving DMA_DEST_INCREMENT cleared, the DMA intelligently reads data_array, data_array, data_array, but blasts every single read into the exact same destination address: the UART TX FIFO.

Step 7: Going Dark The moment we write DMA_ENABLE to the control register, the DMA controller asserts its bus mastery. Behind the scenes, it negotiates with the AXI interconnect, stealing unused bus cycles or bursting data over its own dedicated channel.

Meanwhile, the CPU executes the wfi (Wait For Interrupt) instruction. The processor’s clock gating kicks in, the instruction pipeline freezes, and dynamic power consumption plummets to near zero.

The DMA silently marches through the data_array in SRAM. For each element, it reads the data across the bus, then writes it directly to the UART, decrementing its internal length counter.

When the length counter hits zero, the DMA hardware pulls the interrupt line high. The processor’s Nested Vectored Interrupt Controller (NVIC) detects the signal, wakes the CPU from its deep sleep, and instruction execution resumes exactly where it left off. You successfully moved a massive block of data with almost zero CPU overhead.

Keyboard shortcuts

Foundations of Computer Architecture and Cyber-Physical Systems 2.0