Chapter 14: Resilient Control and Byzantine Faults

In Chapter 13, we locked down the front door. We implemented Secure Boot, added hardware trust anchors, and encrypted our Over-The-Air firmware updates. But what happens if the attacker doesn’t break into the main CPU? What if, instead, they compromise the sensors feeding data to the CPU?

If your drone’s flight controller is completely secure, but it receives data telling it the drone is falling out of the sky, the perfectly secure flight controller will confidently and securely drive the motors to maximum throttle, inadvertently launching the drone into the stratosphere.

In this chapter, we bridge the gap between cybersecurity and control theory. We will explore Cyber-Physical Threat Intelligence, learn how to survive sensors that actively lie to us using resilient consensus algorithms, and write C code to implement an autonomous hardware fallback.

14.1 Cyber-Physical Threat Intelligence (CPTI)

Historically, organizations treated physical security and cybersecurity as completely separate silos. If an intruder broke through a fence, the physical security team handled it. If a firewall dropped a malicious packet, the IT team handled it. But in a modern Cyber-Physical System (CPS), this separation is a fatal vulnerability.

As demonstrated by the Stuxnet worm that destroyed nuclear centrifuges, or the 2015 cyberattacks that caused massive power blackouts in Ukraine, modern adversaries exploit vulnerabilities in the digital realm to attack physical assets. To counter this, security operations must adopt Cyber-Physical Threat Intelligence (CPTI), an integrated approach that models security knowledge holistically, correlating cyber and physical events to understand cascading effects.

From a control systems perspective, CPTI allows us to categorize attacks based on how they impact the physical plant. As defined by Yan et al., an adversary taking over communication networks can launch a deception attack (also known as a false data injection attack). In a deception attack, the adversary maliciously modifies transmitted sensor data. Because the system appears to be operating normally, the receiver is fooled into believing an incorrect version of reality and takes physical actions that benefit the attacker.

To survive a deception attack, your software cannot simply trust its inputs. It must be mathematically resilient by design.

14.2 Resilient Consensus: Surviving the Byzantine Generals

In distributed computing, a Byzantine fault is the worst-case scenario: a component doesn’t just fail; it actively and maliciously lies to the rest of the system.

Suppose you are designing a flight controller for a drone. To ensure reliability, you equip the drone with 5 identical velocity sensors. Under normal conditions, you would read all 5 sensors, add the values together, and divide by 5 to get a smooth, averaged velocity.

But what happens if a cyber-physical attack successfully compromises 2 out of the 5 sensors? If the drone is hovering at 0 m/s, the 3 benign sensors will report 0. But the 2 compromised sensors, attempting to crash the drone, report +10,000 m/s. If you use a simple average, the math looks like this: (0 + 0 + 0 + 10000 + 10000) / 5 = 4000 m/s

The flight controller averages the data, concludes the drone is ascending at Mach 11, and instantly cuts the motors. The drone drops like a rock.

WARNING: The Average Will Kill You Standard mathematical averaging (or standard least-squares filtering) offers absolutely zero Byzantine fault tolerance. A single compromised sensor capable of reporting an infinitely large value can completely control the output of an arithmetic mean.

The “Middle Points” Algorithm

To survive malicious nodes in multi-agent systems, we must use a resilient consensus algorithm. The goal of resilient consensus is to guarantee that the final agreed-upon value always remains within the convex hull (the safe, bounded range) formed strictly by the benign, uncompromised sensors.

As detailed in Yan’s research on secure coordination, if you have $m$ sensors and you suspect that up to $f$ of them might be actively malicious, you cannot use an average. Instead, you use a sorting algorithm.

If we have $m = 5$ sensors, and $f = 2$ are actively lying, the algorithm dictates that the controller must sort the 5 received values from lowest to highest. It then entirely discards the $f$ largest values and the $f$ smallest values. By slicing off the extreme edges of the dataset, the algorithm physically bounds the remaining data. For a 1-dimensional array, discarding the 2 highest and 2 lowest values out of 5 leaves you with exactly one value: the median.

By utilizing this “middle points” approach, even if the 2 compromised sensors scream +10,000 or -10,000, their data is mathematically guaranteed to be discarded. The controller’s chosen value is forced to remain securely within the boundaries of the 3 benign sensors.

14.3 Reactive Autonomy and the Fallback Brake

What happens when an attack is so severe that it thrusts the system completely outside of its safe operating regime? What if the attacker manages to blind the primary sensors entirely, or bypasses the consensus layer?

In these high-impact, low-probability adversarial events, standard primary controls are myopic—they cannot see the bigger picture and will fail. To save the physical equipment, the system must trigger Reactive Autonomy.

Reactive autonomous control acts as a minimally invasive add-on layer. It continuously monitors critical system states independently of the primary control loop. If an unforeseen event (like a massive cyberattack) pushes the system state outside of the safe operating limits, the reactive controls immediately hijack the setpoints. The objective of reactive decentralized control is to steer the system back to the safe region, guaranteeing the earliest robust return to safety. Once the system is back within safe bounds, reactive autonomy hands control back over to the primary (proactive) systems.

Implementing the Resilient Fallback in C

Let’s put this into practice. We will write a C function for a cyber-physical braking system. It reads our 5 velocity sensors, applies the resilient consensus algorithm to defeat up to 2 Byzantine attackers, and then implements a Reactive Autonomy fallback: if the consensus velocity violates the absolute safety limits (indicating the system has been thrust out of the safe regime), it overrides the primary controller and forcefully applies the emergency brakes.

#include <stdint.h>
#include <stdbool.h>

// Define hardware memory-mapped registers for the brake actuator
#define BRAKE_ACTUATOR_REG (*((volatile uint32_t*) 0x40021000))
#define BRAKE_ENGAGE       0x01
#define BRAKE_RELEASE      0x00

// Safe operating limits for the physical system
#define SAFE_VELOCITY_MAX  100.0f 

// Helper function to sort an array of 5 floats (Bubble sort for simplicity)
void sort_sensors(float* arr, int n) {
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j + 1]) {
                float temp = arr[j];
                arr[j] = arr[j + 1];
                arr[j + 1] = temp;
            }
        }
    }
}

/**
 * @brief Computes a Byzantine-resilient velocity and enforces Reactive Autonomy.
 * @param sensor_readings Array of 5 raw velocity sensor readings.
 * @return The resilient consensus velocity.
 */
float compute_resilient_velocity_and_protect(float sensor_readings) {
    
    // 1. Sort the readings from lowest to highest
    sort_sensors(sensor_readings, 5);
    
    // 2. Resilient Consensus ("Middle Points" Algorithm)
    // We assume f = 2 sensors might be Byzantine (maliciously lying).
    // We discard the f highest (indices 3, 4) and f lowest (indices 0, 1).
    // This leaves us with the median value at index 2, which is mathematically
    // guaranteed to be within the convex hull of the benign sensors.
    float consensus_velocity = sensor_readings;

    // 3. Reactive Autonomy Check
    // If an unforeseen adversarial event has thrust the system out of the safe 
    // operating regime, we must hijack the controls to guarantee a robust return 
    // to the safe region.
    if (consensus_velocity > SAFE_VELOCITY_MAX) {
        
        // The system is moving dangerously fast, overriding primary myopic controls.
        // We trigger the hardware brake to steer the system back to safety.
        BRAKE_ACTUATOR_REG = BRAKE_ENGAGE;
        
    } else {
        // We are within the safe operating regime. Proactive autonomy remains in control.
        BRAKE_ACTUATOR_REG = BRAKE_RELEASE;
    }

    // Return the safe, filtered velocity to the rest of the RTOS tasks
    return consensus_velocity;
}

By layering these concepts—using resilient consensus to discard Byzantine sensor data at the input, and deploying a reactive autonomous safety mechanism at the output—you isolate the physical operation of your system from the chaos of the cyber domain. The software refuses to be fooled, and the hardware refuses to self-destruct.