Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Chapter 9: The End of Moore’s Law and the Rise of DSAs

If you have been writing software for a while, you are used to what David Patterson affectionately calls the “La-Z-Boy era” of programming. For decades, if your code was too slow, you didn’t really need to optimize it. You just sat back, waited 18 months, and bought a new processor that ran twice as fast.

That era is officially dead.

The free lunch of infinitely scaling hardware performance has crashed headfirst into the laws of physics. If we are going to build the next generation of Cyber-Physical Systems (CPS)—where we need to process complex neural networks for vision and control on battery-powered edge devices—we have to fundamentally change how we think about computer architecture.

In this chapter, we are going to look at the quantitative reality of modern silicon. We will explore why general-purpose CPUs have hit a thermal brick wall, why simply adding more cores breaks your heart, and why Domain-Specific Architectures (DSAs) are the absolute only path forward for edge AI.

9.1 Hitting the Power Wall: Dennard Scaling and Dark Silicon

To understand why CPUs stopped getting faster, we have to look at two semiconductor rules that drove the industry for 50 years, and how they both recently failed us.

First is Moore’s Law, Gordon Moore’s famous 1965 prediction that the number of transistors on a chip would double every year (later amended to every two years). For a long time, this held true. But physics is unforgiving. As transistors shrink to atomic scales, the doubling time has stretched significantly. If the historical Moore’s Law trend had continued perfectly, a high-end microprocessor in 2022 would have had over 100 billion transistors; instead, SOTA chips like the Apple M2 delivered around 20 billion—off by a factor of 5.

But the real killer was the death of Dennard Scaling.

In 1974, Robert Dennard observed that as transistors shrank, their power density remained constant. Because the dimensions were smaller, you could drop the operating voltage and current. This meant you could cram more transistors into the same area and clock them much faster, all without increasing the total power consumption of the chip. It was magic: processors got smaller, faster, and more energy-efficient simultaneously.

Around 2004, Dennard Scaling abruptly ended. As transistors reached deep submicron sizes, threshold voltages could no longer be dropped without causing massive current leakage. Suddenly, smaller transistors still consumed significant power.

This created the Power Wall. You can still pack billions of transistors onto a silicon die, but if you attempt to clock them all at 5 GHz simultaneously, the chip will literally melt. This phenomenon birthed the concept of Dark Silicon: we now have the ability to build processors with so many transistors that we simply cannot afford to turn them all on at the same time.

9.2 Amdahl’s Heartbreaking Law

When CPU designers hit the Power Wall in 2004, they panicked. They canceled their high-frequency uniprocessor projects and pivoted entirely to multicore designs. The idea was simple: if we can’t build one massive, fast 10 GHz core, we will give the programmer four (or forty) slower, highly efficient 2.5 GHz cores.

But throwing multiple cores at a problem immediately runs into a mathematical buzzkill known as Amdahl’s Law.

Amdahl’s Law calculates the maximum theoretical speedup of a system when you only improve a fraction of it. In the context of multicore processing, the equation looks like this:

$$ \text{Speedup} = \frac{1}{(1 - F) + \frac{F}{S}} $$

Where:

  • $F$ is the fraction of your code that can be parallelized.
  • $S$ is the speedup factor (the number of cores).

Here is why it breaks your heart. Suppose you are writing an object detection algorithm for a drone. You manage to refactor your code so that an incredible 90% of it runs perfectly in parallel across multiple cores. Only 10% of the code remains strictly serial (perhaps setting up memory, or calculating the final bounding box logic).

If you run this on a 100-core processor, what is your speedup? $$ \text{Speedup} = \frac{1}{(1 - 0.90) + \frac{0.90}{100}} = \frac{1}{0.10 + 0.009} \approx 9.17 $$

Read that again. You bought 100 cores, but you only got a 9x speedup.

WARNING: The Serial Bottleneck Amdahl’s Law dictates that the execution time of the sequential portion of your program places a hard ceiling on your maximum performance. If just 10% of your robot’s control loop is serial, your absolute maximum speedup is 10x—even if you throw a million cores at it. Multicore is not a silver bullet.

9.3 The Rise of Domain-Specific Architectures (DSAs)

If we can’t crank the clock frequency (because of the Power Wall) and we can’t just slap a thousand cores on a chip (because of Amdahl’s Law), how do we get the massive computational leaps required to run AI models on the edge?

We have to abandon the “general-purpose” part of the CPU.

A general-purpose CPU is incredibly inefficient. To execute a single 32-bit addition, an out-of-order CPU spends up to 99% of its energy fetching the instruction, decoding it, renaming registers, checking for pipeline hazards, and moving data. The actual math costs almost nothing compared to the bureaucratic overhead.

The only path left to improve energy-cost-performance is specialization. We build a Domain-Specific Architecture (DSA).

A DSA trades away the flexibility of a CPU to achieve massive energy efficiency for a very narrow set of tasks. A DSA might be terrible at running a database or a web browser, but it will perform matrix multiplications for a Deep Neural Network (DNN) orders of magnitude faster, and at a fraction of the power.

The 5 Guidelines for Designing a DSA

When designing DSAs (like Google’s Tensor Processing Unit or Apple’s Neural Engine), top hardware architects follow five quantitative guidelines:

1. Use dedicated memories. Moving data across a chip costs exponentially more energy than doing math. A general-purpose CPU uses complex, power-hungry, multi-level cache hierarchies to guess what data you might need next. A DSA ditches the caches. Instead, it uses software-controlled scratchpad memories dedicated specifically to the domain, keeping the data exactly where the arithmetic logic units (ALUs) need it.

2. Invest saved resources into massive arithmetic units. By stripping out all the complex microarchitecture required for general-purpose code (like branch predictors, out-of-order schedulers, and speculative execution logic), you free up a massive amount of silicon area and power. You reinvest that power directly into raw computation. While a CPU might have a handful of ALUs, a DSA like the TPU packs in thousands of multiply-accumulate (MAC) units.

3. Use the easiest form of parallelism. Instead of complex, thread-level parallelism that falls victim to Amdahl’s Law, DSAs exploit the natural parallelism of the domain. For edge AI, this usually means massively wide Single Instruction, Multiple Data (SIMD) architectures or 2D systolic arrays that blast through matrix math in a highly structured, predictable way.

4. Reduce data size and types. A desktop CPU is built around 64-bit floating-point math. But neural networks don’t need IEEE 754 double-precision accuracy. DSAs aggressively shrink data down to 16-bit floats, 8-bit integers, or even 4-bit integers. Narrower data allows you to pack exponentially more ALUs onto the chip and dramatically reduces the energy burned transferring data over the memory bus.

5. Program with Domain-Specific Languages (DSLs). You do not program a DSA in raw C or assembly. Attempting to write a C compiler that automatically figures out how to map a for loop onto a 65,000-ALU systolic array is a fool’s errand. Instead, developers write code in high-level DSLs like TensorFlow, PyTorch, or JAX. These frameworks understand the high-level matrix operations natively, making it trivial for the software stack to map the math down to the custom silicon.

Summary

The era of relying on semiconductor physics to automatically make our code faster is over. To bring high-performance intelligence to the physical edge, we must bridge the gap between software and silicon. In the next chapter, we will put Guideline #4 to the test. We will take a massive, power-hungry floating-point AI model and brutally shrink it down using Quantization and TensorFlow Lite, preparing it to run on the bare metal of an edge microcontroller.