CrowdStrike Falcon: anatomy of Channel File 291

Image: a CrowdStrike blue screen at LaGuardia airport.

19 July 2024, 04:09 UTC. Through the Rapid Response Content channel, CrowdStrike pushes Channel File 291, a configuration file for the Falcon sensor that adds detection rules over named pipes. The kernel-mode driver CSAgent.sys loads it. Its interpreter tries to read 21 input fields from an array for which the sensor code only prepared 20, follows a pointer from outside the array, and Windows responds with the only thing it can when a driver with kernel privileges touches memory that isn’t its own: a BSOD.

Microsoft estimates 8.5 million Windows machines affected. Delta cancels more than 7,000 flights over five days and loses around 550 million dollars. Hospitals reschedule surgeries, broadcasters go off air, airport terminals sit there blue-screened in front of passengers. Recovery is manual (Safe Mode, delete the file, reboot) because the machine is in a reboot loop before anything can reach it over the network.

CrowdStrike publishes the Preliminary Post Incident Review on 24 July and the external Root Cause Analysis on 6 August. What follows is that RCA told in order, with the detail that’s missing when the story is just “a bad update”.

The analysis of the CSAgent.sys binary is based on public copies of the driver and on the crash dumps examined by Patrick Wardle and other researchers. No access to production systems required.

The product: two content channels, not one

To understand the bug you have to understand the Falcon deployment model. The sensor has two classes of content updated through separate channels:

Sensor Content. Permanent capabilities: native driver code, embedded ML models, Template Types. A Template Type is a schema: it defines which fields a detection of a certain kind carries, in what order, with what semantics. Sensor Content ships with each sensor release (N, N-1, N-2). The customer approves it and rolls it out with whatever testing they want.
Rapid Response Content. Template Instances: concrete data that fills in an existing Template Type. It’s the layer that lets CrowdStrike push a new detection without touching the driver code. Distributed from the cloud, in minutes, to every connected sensor. The customer has no test ring for this: the sensor receives it and loads it.

Channel File 291 is Rapid Response Content. Specifically, the configuration file for the IPC Template Type: detections that operate on Inter-Process Communication patterns (named pipes, in this case). The IPC Template Type had been in production since March 2024; file 291 was the umpteenth Template Instance loaded on top of it.

The bug in one line

The IPC Template Type was designed with 21 input fields. The sensor code that feeds the interpreter, native code in CSAgent.sys, was compiled to supply 20. For months this mismatch didn’t blow up because every Template Instance going through the channel used a wildcard in the 21st field, so the interpreter never read it.

Channel File 291 is the first one to put a concrete condition (no wildcard) in field 21. The interpreter tries to read it, touches the pointer at position 21 of the input array, and that position isn’t initialised because the sensor only bothered to prepare 20. The external RCA describes it like this: the Content Interpreter performed an out-of-bounds read of the input array.

Patrick Wardle, working off the crash dump, narrows it down to the instruction level: the faulting instruction is mov r9d, [r8], with r8 pointing to an unmapped address. The pointer left the input array, the processor traps, the kernel has nobody to hand the exception to, and the system goes down.

Reproducing the pattern in C

The bug isn’t exotic: it’s a classic OOB read, the Hello World of memory-unsafe security errors. It reproduces with a mini-parser of a few lines:

// oob.c — reproduction of the Channel File 291 pattern
#include <stdio.h>
#include <string.h>

#define EXPECTED_FIELDS 21
#define ACTUAL_FIELDS   20

typedef struct {
    char *name;
    char *value;
} field_t;

// The template_type says "I have 21 fields", like the IPC Template Type.
int template_type_field_count = EXPECTED_FIELDS;

// The instance only carries 20 valid fields. Field 21 doesn't exist.
field_t instance[ACTUAL_FIELDS] = {
    {"f00", "ok"}, {"f01", "ok"}, {"f02", "ok"}, {"f03", "ok"},
    {"f04", "ok"}, {"f05", "ok"}, {"f06", "ok"}, {"f07", "ok"},
    {"f08", "ok"}, {"f09", "ok"}, {"f10", "ok"}, {"f11", "ok"},
    {"f12", "ok"}, {"f13", "ok"}, {"f14", "ok"}, {"f15", "ok"},
    {"f16", "ok"}, {"f17", "ok"}, {"f18", "ok"}, {"f19", "ok"},
    // No [20]. But the code is going to read it.
};

int main(void) {
    // The interpreter iterates over field_count from the Template Type,
    // not over the real size of the instance. Just like CSAgent.sys in July.
    for (int i = 0; i < template_type_field_count; i++) {
        // OOB read at i == 20.
        printf("field[%d] = %s -> %s\n", i,
               instance[i].name, instance[i].value);
    }
    return 0;
}

Compiled with AddressSanitizer, the error report shows up as expected:

$ gcc -fsanitize=address -g -O0 oob.c -o oob && ./oob
field[0] = f00 -> ok
field[1] = f01 -> ok
...
field[19] = f19 -> ok
=================================================================
==12345==ERROR: AddressSanitizer: global-buffer-overflow on address 0x...
READ of size 8 at 0x... thread T0
    #0 0x... in main /tmp/oob.c:28

In user mode with ASan, it ends with a message and a non-zero exit code. In kernel mode, with no ASan and no bounds check on the hot path, it ends with a PAGE_FAULT_IN_NONPAGED_AREA BSOD.
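The mini-parser walks all 21 slots unconditionally, so it fails on the first run. The real mismatch stayed latent for months because of the wildcard detail. Here is a minimal sketch of that part, assuming a hypothetical criterion_t type in which a NULL pattern stands for a wildcard (the names and layout are illustrative, not CrowdStrike's). It dereferences nothing. It only reports which input slot a matcher would have to read.

```c
#include <assert.h>
#include <stddef.h>

#define TYPE_FIELDS     21  /* fields declared by the Template Type */
#define SUPPLIED_INPUTS 20  /* inputs the sensor code actually prepares */

/* Hypothetical criterion: a NULL pattern stands for a wildcard. */
typedef struct {
    const char *pattern;
} criterion_t;

/*
 * Returns the highest input index a matcher would dereference for this
 * instance, or -1 if every criterion is a wildcard. No input is read
 * here, so the function is safe to run on any instance.
 */
int highest_input_read(const criterion_t *criteria, size_t n_criteria)
{
    assert(criteria != NULL);
    int highest = -1;
    for (size_t i = 0; i < n_criteria; i++) {
        if (criteria[i].pattern != NULL)  /* wildcards skip the read */
            highest = (int)i;
    }
    return highest;
}

/* An instance is dangerous when it reads at or past the supplied count. */
int reads_past_inputs(const criterion_t *criteria, size_t n_criteria)
{
    return highest_input_read(criteria, n_criteria) >= SUPPLIED_INPUTS;
}
```

With a wildcard in the last slot, reads_past_inputs returns 0 however many instances go through. Put a concrete pattern at index 20 and it returns 1: same code, same array, different data.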

Why the validator let it through

CrowdStrike has a component called the Content Validator that runs Rapid Response Content against a test matrix before pushing it to the production channel. The version active in July assumed the Template Type would be used in full: it validated against a mock that did provide 21 entries. To the validator, field 21 was legitimate. The instance passed.

The external RCA admits it plainly: the Content Validator evaluated the new Template Instances, but based its assessment on the expectation that the IPC Template Type would be provided with 21 inputs. In other words, the validator doesn’t check that the sensor in production can interpret what it’s validating; it checks that what it validates is syntactically correct.
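One way to close that gap is to validate each instance against what every supported sensor build really supplies, not against the Type's declared field count. The sketch below assumes a hypothetical contract table (sensor_contract_t) and a hypothetical criterion_t in which NULL means wildcard. The version labels and the table itself are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical contract: how many inputs each supported sensor build
 * supplies for the IPC Template Type. */
typedef struct {
    const char *sensor_version;
    size_t      supplied_inputs;
} sensor_contract_t;

/* Hypothetical criterion: a NULL pattern stands for a wildcard. */
typedef struct {
    const char *pattern;
} criterion_t;

/*
 * Returns NULL when every listed sensor build can evaluate the instance,
 * otherwise the version label of the first build that would have to read
 * past the end of its input array.
 */
const char *first_incompatible_sensor(const criterion_t *criteria,
                                      size_t n_criteria,
                                      const sensor_contract_t *fleet,
                                      size_t n_fleet)
{
    assert(criteria != NULL && fleet != NULL);
    for (size_t s = 0; s < n_fleet; s++) {
        /* Only criteria beyond this build's input count can overrun. */
        for (size_t i = fleet[s].supplied_inputs; i < n_criteria; i++) {
            if (criteria[i].pattern != NULL)
                return fleet[s].sensor_version;
        }
    }
    return NULL;
}
```

A validator built this way would have flagged the instance with a concrete 21st criterion against any build that supplies 20 inputs, regardless of what the mock provided.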

Three things break at the same time:

No versioned contract between Template Type and sensor. No check at build or deploy time verifies that the field_count declared by the Type matches the number of fields the sensor in a specific version actually knows how to process.
No bounds check in the interpreter. The interpreter uses field_count from the Type as the loop bound, not the real size of the buffer it has at hand.
No canary deployment. Rapid Response Content rolls out straight to the whole estate. No 1% first, 10% ten minutes later, 100% an hour in. The philosophy is fast delivery at the expense of blast radius.

Those three together turn a trivial OOB read into a global event.
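Of the three, the missing bounds check is the cheapest to sketch. The version below assumes hypothetical names (criterion_t, match_t, match_instance) and plain string equality as the matching rule. It is an illustration of the technique, not the sensor's code. The point is that the loop over criteria never indexes the input array without first comparing against the count the caller actually holds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { MATCH_NO = 0, MATCH_YES = 1, MATCH_REJECTED = -1 } match_t;

/* Hypothetical criterion: a NULL pattern stands for a wildcard. */
typedef struct {
    const char *pattern;
} criterion_t;

/*
 * Evaluates an instance against the inputs the caller really has.
 * n_inputs is passed explicitly; the field count declared by the
 * Template Type is never used as a bound on `inputs`.
 */
match_t match_instance(const criterion_t *criteria, size_t n_criteria,
                       const char *const *inputs, size_t n_inputs)
{
    assert(criteria != NULL && inputs != NULL);
    for (size_t i = 0; i < n_criteria; i++) {
        if (criteria[i].pattern == NULL)
            continue;                 /* wildcard: nothing to compare */
        if (i >= n_inputs)
            return MATCH_REJECTED;    /* refuse the instance, do not read */
        if (inputs[i] == NULL || strcmp(criteria[i].pattern, inputs[i]) != 0)
            return MATCH_NO;
    }
    return MATCH_YES;
}
```

Returning a distinct MATCH_REJECTED value is a design choice in this sketch: a malformed instance is a content problem worth reporting, and should not be confused with a detection that simply did not fire.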

Why kernel-mode

The question that keeps coming up on X over the following days: why does CrowdStrike live in the kernel? Short answer: because Windows EDR lives in the kernel. To observe operations such as NtCreateFile, NtCreateProcessEx and NtMapViewOfSection, the primitives an EDR needs to monitor, you have to sit in a kernel callback. Microsoft provides documented APIs for this (PsSetCreateProcessNotifyRoutineEx, ObRegisterCallbacks, FltRegisterFilter), which long predate Windows 10 and are all called from a kernel-mode driver. The alternative would be consuming ETW (Event Tracing for Windows) in user mode, but as of July 2024 ETW doesn’t cover everything an EDR needs and is easier to silence.

The long answer is that the kernel-mode decision is Windows’s own: on macOS, Apple deprecated kernel extensions and forced CrowdStrike, SentinelOne and the others to move to EndpointSecurity.framework, which lives in user mode with special privileges. There is no Windows equivalent.

In September 2024, Microsoft brings the main EDR vendors to a closed-door summit, and in November it publishes the Windows Resiliency Initiative, announcing work with vendors on a delegate API to run EDR in user mode with comparable telemetry, so that a sensor fault no longer knocks the machine over. The roadmap is 2025 onwards.

What Channel File 291 teaches

The blast radius of fast delivery is the model, not the accident. Any product that pushes kernel content to 8.5M machines in minutes with no prior test ring has this risk profile by design. The bug is an excuse; the problem is the deployment curve.
The validator of a critical system can’t assume the runtime: it has to run the real runtime. “The file is syntactically valid” and “the sensor in production can process this file without crashing” are two different things. The July validator checked the first.
The customer approval model should extend to Rapid Response Content. The customer approves which sensor version they install; they don’t approve which Rapid Response Content they load. After the incident, CrowdStrike adds an option in the console to stage the content channels, not just the sensor. It’s what large operators asked for in 2018, a request that was ignored for go-to-market reasons.
Kernel mode is neither problem nor solution by itself. The same out-of-bounds read would have been just as much of a bug in user mode; what the kernel changes is that the fault takes the OS down with it instead of a single process. Moving the agent out of the kernel isn’t security, it’s resilience. Two different things that often get conflated.

Detection and mitigation of the actual incident

For those stuck in the reboot loop on 19 July, the official mitigation was:

Boot into Safe Mode or Windows Recovery Environment.
Navigate to C:\Windows\System32\drivers\CrowdStrike.
Delete the file C-00000291*.sys.
Reboot normally.
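The wildcard in step 3 matters: only files whose names start with C-00000291 and end in .sys are to be removed, and the other channel files in that directory stay. A minimal sketch of that match, with a hypothetical helper name (is_channel_file_291) and a case-sensitive comparison as a stated simplification:

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if name matches the pattern C-00000291*.sys from the
 * mitigation steps: fixed prefix, any middle part, ".sys" suffix.
 * The comparison is case-sensitive, which is a simplification:
 * Windows file names are not.
 */
int is_channel_file_291(const char *name)
{
    static const char prefix[] = "C-00000291";
    static const char suffix[] = ".sys";
    assert(name != NULL);
    size_t len  = strlen(name);
    size_t plen = sizeof prefix - 1;
    size_t slen = sizeof suffix - 1;
    if (len < plen + slen)
        return 0;   /* too short to hold both prefix and suffix */
    return strncmp(name, prefix, plen) == 0 &&
           strcmp(name + len - slen, suffix) == 0;
}
```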

On machines with BitLocker, step 1 required the recovery key, which multiplied the recovery hours at organisations where the keys were stored in an Active Directory that was itself affected and therefore unreachable. CrowdStrike later released a USB recovery tool to automate the steps, but the first reaction was manual.

In the longer term, what changes after July:

CrowdStrike offers Sensor Content Update Controls: besides pinning the sensor to N-2, customers can now pick the adoption pace of content updates.
Microsoft opens the conversation with EDR vendors about alternatives to kernel-mode.
CISA publishes guidance on resilient deployment applicable to any software-as-a-fleet, not just EDR.
Cyber insurance policies start to include specific vendor outage clauses, separate from security incident clauses.