CrowdStrike has published a post-incident review (PIR) of the buggy update that brought down 8.5 million Windows machines last week. The detailed post blames a bug in CrowdStrike’s content validation software for failing to properly validate the content update that was deployed to millions of machines on Friday. CrowdStrike is promising to test its content updates more thoroughly, improve its error handling, and implement a staggered rollout to prevent this disaster from happening again.
CrowdStrike’s Falcon software is used by companies around the world to combat malware and security vulnerabilities on millions of Windows machines. On Friday, CrowdStrike released a content configuration update to its software that was supposed to “gather telemetry on potential new threat techniques.” Such updates are released regularly, but this particular configuration update caused Windows to crash.
CrowdStrike typically releases configuration updates in two different ways. There is what is called Sensor Content, which directly updates CrowdStrike’s Falcon sensor that runs at the kernel level in Windows, and separately there is Rapid Response Content, which updates how that sensor behaves to detect malware. A small 40KB Rapid Response Content file is the source of Friday’s issue.
Actual sensor updates don’t come from the cloud and typically include AI and machine learning models that help CrowdStrike improve its detection capabilities over the long term. Some of these capabilities include what are known as Template Types, which are code that enables new detections and are configured by the separate type of Rapid Response Content that was delivered on Friday.
On the cloud side, CrowdStrike manages its own system that performs validation checks on content before it’s released, with the aim of preventing an incident like Friday’s. CrowdStrike released two Rapid Response Content updates last week, or what it also calls Template Instances. “Due to a bug in the content validator, one of the two template instances passed validation despite containing problematic content data,” CrowdStrike says.
While CrowdStrike does some automated and manual testing on Sensor Content and Template Types, it doesn’t appear to do as much testing on the Rapid Response Content that was delivered Friday. A March rollout of new Template Types gave “confidence in the checks performed in the content validator,” so CrowdStrike appears to have assumed that the Rapid Response Content rollout wouldn’t be problematic.
This assumption led the sensor to load the problematic Rapid Response content into its content interpreter and trigger an out-of-bounds memory exception. “This unexpected exception could not be handled properly, resulting in a Windows operating system crash (BSOD),” CrowdStrike explains.
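To make the two failure points concrete, here is a minimal sketch of how a validator check and defensive error handling in an interpreter could work together. It is purely illustrative and is not CrowdStrike’s code or data format: the idea that content is a list of fields, the `EXPECTED_FIELDS` constant, and the `validate` and `interpret` functions are all assumptions made up for this example.

```python
# Hypothetical illustration only; not CrowdStrike's code or content format.
EXPECTED_FIELDS = 3  # assumed number of fields the interpreter reads


def validate(instance: list[str]) -> bool:
    """Cloud-side check: accept content only if it has the expected field count."""
    return len(instance) == EXPECTED_FIELDS


def interpret(instance: list[str]) -> str:
    """Sensor-side read: refuse malformed content instead of reading past the end."""
    if len(instance) < EXPECTED_FIELDS:
        return "rejected"  # graceful failure path rather than a crash
    fields = [instance[i] for i in range(EXPECTED_FIELDS)]
    return "loaded:" + ",".join(fields)
```

In this sketch, content that slips past a faulty validator is still caught by the bounds check in the interpreter, which is the general idea behind pairing better validation with better error handling.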
To prevent this from happening again, CrowdStrike promises to enhance its Rapid Response Content testing by using local developer testing, content update and rollback testing, as well as stress testing, fuzzing, and fault injection. CrowdStrike will also perform stability testing and content interface testing on Rapid Response Content.
CrowdStrike is also updating its cloud-based content validation tool to better verify rapid response content releases. “A new check is underway to prevent this type of problematic content from being deployed in the future,” CrowdStrike says.
On the driver side, CrowdStrike will “enhance existing error handling in the content interpreter,” which is part of the Falcon sensor. CrowdStrike will also implement a staggered rollout of Rapid Response Content, so that updates reach gradually larger portions of its install base instead of going to all systems at once. Security experts have recommended both driver improvements and staggered rollouts in recent days.
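A gradual rollout of this kind is often built by assigning each machine to a stable bucket and widening the set of eligible buckets stage by stage. The sketch below shows that general technique under stated assumptions; the stage percentages, the `bucket` and `receives_update` functions, and the use of a host ID string are hypothetical and do not describe how CrowdStrike will implement its rollout.

```python
import hashlib

# Assumed rollout stages, as cumulative percentages of the install base.
STAGES = [1, 10, 50, 100]


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket from 0 to 99 using a hash."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def receives_update(host_id: str, stage: int) -> bool:
    """A host gets the update once its bucket falls under the stage's percentage."""
    return bucket(host_id) < STAGES[stage]
```

Because the bucket is derived from the host ID, a machine that receives the update at an early stage stays included at every later stage, and a problem spotted at the 1 percent stage can halt the rollout before most machines are touched.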