CrowdStrike Outage - thoughts about updates
CrowdStrike issued a problematic update that impacted about 8.5 million Windows computers, sending them into a Blue Screen of Death and taking down millions of vital systems worldwide, including airlines and 911 call centers.
Shortly after the CrowdStrike incident, the Cybersecurity and Privacy Institute at the University of Waterloo helped publish an article, “Human factors of security and cybersecurity professor at University of Waterloo reflects on CrowdStrike global tech outage”, by Kami Vaniea (me) and Regina Ashna Singh.
Below are my more detailed views.
At its core, the CrowdStrike outage happened because two related issues were not handled well: testing and update management.
What happened
Normally, when software needs to be changed, the vendor creates an update, tests it, and releases it for other people to install. Large organizations that depend on software running correctly typically conduct further tests on the new update before installing it on their systems. Airlines, for example, have dedicated test machines on which they install the update first to make sure it works. Even then, they will likely also do a “staged deployment”, which means they update the least critical computers first, make sure everything is operating as expected, and only then update the more critical computers. These tests are done because, even assuming the original software vendor did an excellent job of testing, the large organization may have specialized software installed that the update also needs to be tested against.
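For readers who prefer to see the idea as code, here is a minimal Python sketch of a staged deployment. It is an illustration only, not how any particular airline or vendor does it: the function `staged_rollout`, the callbacks `install` and `is_healthy`, and the machine names are all hypothetical.

```python
from typing import Callable, Dict, List


def staged_rollout(
    stages: List[List[str]],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Install an update one stage at a time, least critical machines first.

    The rollout halts after the first stage in which any machine fails its
    health check, so later (more critical) stages never receive the update.
    """
    updated: List[str] = []
    for index, machines in enumerate(stages):
        for machine in machines:
            install(machine)
            updated.append(machine)
        failed = [m for m in machines if not is_healthy(m)]
        if failed:
            # Everything in the later stages is left untouched.
            skipped = [m for later in stages[index + 1:] for m in later]
            return {"updated": updated, "failed": failed, "skipped": skipped}
    return {"updated": updated, "failed": [], "skipped": []}


# Hypothetical fleet: a test machine, then kiosks, then check-in desks.
stages = [["test-01"], ["kiosk-01", "kiosk-02"], ["checkin-01", "checkin-02"]]
installed: List[str] = []

# Pretend the update only breaks the kiosks (assumption for the example).
result = staged_rollout(
    stages,
    install=installed.append,
    is_healthy=lambda machine: not machine.startswith("kiosk"),
)
print(result["skipped"])  # the check-in desks never received the update
```

The point of the design is that a bad update is caught while it can only hurt the least critical machines. The cost is time: each stage has to run long enough for problems to show up before the next one starts.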
Security tools, such as antivirus software, often bypass these tests when they update configuration files rather than software code. Antivirus tools rely on lists of bad things that they are looking for, known as signatures. These lists are sets of patterns or computer-friendly descriptions of what known bad things look like. Some computer viruses can attack very quickly. The Slammer worm in 2003 compromised about 75,000 computers in 10 minutes. So when a security company learns about a new attack, it wants to send out an update to the signature lists quickly and understandably does not want to wait for testing.
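The split between code and signature lists can be sketched in a few lines of Python. This is a deliberately simplified illustration: real products use far richer patterns than raw byte strings, and the signature names, byte patterns, and the `scan` function below are all made up for the example.

```python
from typing import Dict, List

# Hypothetical signature list: a name mapped to a byte pattern to look for.
# In a real product this data would ship separately from the scanning code.
SIGNATURES: Dict[str, bytes] = {
    "example-worm-a": b"\xde\xad\xbe\xef",
    "example-dropper-b": b"EVIL_PAYLOAD",
}


def scan(data: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose pattern appears in the data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


suspicious_file = b"header...NEW_ATTACK...footer"

# With the old list, the new attack goes unnoticed.
print(scan(suspicious_file, SIGNATURES))

# A signature update changes only the data, not the scan() code.
updated_signatures = {**SIGNATURES, "example-new-c": b"NEW_ATTACK"}
print(scan(suspicious_file, updated_signatures))
```

Because only the list changes, it is tempting to treat a signature update as low risk and skip the usual testing. As the outage showed, data that the code reads can still break the code that reads it.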
At some point in the past, CrowdStrike issued a software update that contained an “out-of-bounds memory read” error. Their internal testing should have caught the error, but it did not, because the error only happens if a certain type of configuration file is present, and that type of configuration file was not used when testing the code. Then, on Friday, July 19th, they pushed an update to the configuration file that caused their code to incorrectly try to read something that was not there (an out-of-bounds memory read). Windows correctly identified the abnormal behavior and shut down to avoid starting in an unsafe state. In other words, Windows experienced a Blue Screen of Death: it showed a blue screen with an error during startup and had to be shut down.
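A rough analogy in Python may help show how a bug like this can hide from testing. This is not CrowdStrike's code: the function names, the idea of a configuration as a list of field positions, and the sample values are all assumptions made for illustration. Python also stops with an exception here, whereas low-level code running inside Windows would read memory it should not touch.

```python
from typing import List


def evaluate_rule(field_indexes: List[int], inputs: List[str]) -> List[str]:
    """Unchecked version: trusts that every index in the configuration is valid."""
    return [inputs[i] for i in field_indexes]


def evaluate_rule_checked(field_indexes: List[int], inputs: List[str]) -> List[str]:
    """Checked version: rejects a configuration that points past the inputs."""
    for i in field_indexes:
        if not 0 <= i < len(inputs):
            raise ValueError(
                f"configuration refers to field {i}, "
                f"but only {len(inputs)} fields were supplied"
            )
    return [inputs[i] for i in field_indexes]


inputs = ["process-name", "file-path", "user"]  # the code supplies 3 fields

# The configuration used in testing only refers to fields that exist,
# so the unchecked code looks fine.
print(evaluate_rule([0, 2], inputs))

# A later configuration refers to a fourth field that was never supplied.
try:
    evaluate_rule([0, 3], inputs)
except IndexError as error:
    print("unchecked code failed while running:", error)

# The checked version refuses the configuration before using it.
try:
    evaluate_rule_checked([0, 3], inputs)
except ValueError as error:
    print("configuration rejected:", error)
```

The flaw sits in the code the whole time, but it only shows itself when a configuration arrives that exercises it. That is why testing the code against the kinds of configuration files it will actually receive matters as much as testing the code itself.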
Normally, security issues like this are fixed quickly and the public only has to experience a downtime of a few hours. The CrowdStrike issue took much longer to fix because it damaged how Windows starts up. Because CrowdStrike’s code is security software, it has to be loaded and run early in the Windows startup process. Once a Windows computer had downloaded the problematic configuration file, the only way to start it again was to remove or replace the file. That meant that a person had to visit each computer individually, start it in something called “safe mode”, and remove the problematic file. This process takes time and cannot be done remotely or automatically. There were some innovative solutions, such as the person who realized they could automate some of the process using a barcode reader.
Testing and update management
As a result of the incident, CrowdStrike has decided to “add additional deployment layers and acceptance checks” and “allow customers to control how Rapid Response Content is deployed”. In other words, they have decided to no longer release configuration files worldwide all at once and instead do a form of staged deployment. They have also decided to allow the system administrators at companies to decide when and how the configuration files are deployed on their own computers. So it may now be possible for admins to test configuration files in advance of deploying them.