Monoculture of Insecurity: How CrowdStrike's Outage Exposes the Risks of Unchecked Complexity in Cybersecurity

by Miklos Tomka and Isabella Leandersson on Aug 1st, 2024

The CrowdStrike update that caused global chaos in July was a seismic event in the IT world, and everyone is talking about it. There are many great articles and blog posts dissecting the event and suggesting ways to avoid a repeat. Rather than add our voice to the chorus and explain how a small change could have avoided the entire palaver, we will approach the topic more broadly.

While it is helpful to understand what happened with CrowdStrike, the next major outage will likely arise from a different flaw altogether. Since we expect another major event to come, it’s essential to consider the risk factors behind these events and ways to reduce them.

The cybersecurity sector is constantly facing new threats and challenges. How can we transform these obstacles into opportunities for growth and improvement, ensuring greater protection for those who rely on our services? This blog post explores the answers to this question, presenting some lesser-known solutions that deserve consideration, and providing a fresh perspective on staying ahead of emerging threats and building more resilient defences.

What We Know So Far

The outage hit at least 8.5 million Windows machines and significantly disrupted the aviation, broadcasting, and healthcare industries. The BBC labelled it “probably the largest ever cyber-event” and “one of the worst cyber-incidents in history.” But what actually happened?

The outage arose from something as commonplace as a software update for the Falcon platform. CrowdStrike’s own preliminary Post Incident Review indicates that a security content configuration update (the now infamous Channel File 291) delivered an undetected error to user machines. The error slipped through the validation checks due to a bug in the validator, and trust in those checks allowed a faulty file that triggered an out-of-bounds memory read to reach production. At its root, the global outage appears to have been caused by an unsafe parser, a classic source of errors.

What Are the Best Practices for the Cybersecurity Sector?

Now that we’ve covered this background, it’s time to look at some of the underlying factors that played a role in the CrowdStrike outage and that may play a role in outages yet to come. The global outage served as an excellent wake-up call to the entire industry about the necessity of maintaining the checks and balances that keep our global systems secure.

Supply Chain Delivery

Let’s start with the fundamentals: how did the flawed update reach so many machines? When deploying updates at scale, industry best practices such as staggered deployments and rollback options help mitigate risk. In the best case, updates are staggered, or rolled out incrementally, starting with 1% of machines and, if all goes well, moving on to 5%, and so on. One can also use automated systems that roll back updates when a fault is detected, undoing the damage and keeping users’ machines operational. With either method, and preferably both in combination, a flaw in an update has a much smaller impact, avoiding the kind of international chaos we saw in July.

Another critical aspect of the supply chain is the internal testing performed before deployment to ensure quality and safety. The recent chaos would likely have been prevented if the internal testing processes had caught the flaw earlier. While cost-cutting measures may be tempting, they can ultimately lead to much greater costs in the event of a cybersecurity breach or production downtime.
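To make the idea concrete, here is a minimal sketch of a staggered rollout with an automatic rollback decision. It is not how CrowdStrike or any particular vendor deploys updates: the stage percentages beyond 1% and 5%, the `healthy` callback (a stand-in for real telemetry), and the failure threshold are all assumptions made for illustration.

```ocaml
(* Hypothetical staged rollout; every name here is illustrative. *)

type outcome =
  | Completed
  | Rolled_back of { at_percent : int; failure_rate : float }

(* Cumulative share of the fleet (in percent) reached at each stage. *)
let stages = [ 1; 5; 25; 50; 100 ]

(* Number of hosts covered at a given stage (at least one). *)
let cohort_size ~fleet_size percent = max 1 (fleet_size * percent / 100)

(* Widen the deployment stage by stage. [healthy host] is assumed to
   report whether host number [host] is still healthy after the update.
   At each stage, every host updated so far is checked again. *)
let rollout ~fleet_size ~max_failure_rate ~healthy =
  let rec go = function
    | [] -> Completed
    | percent :: rest ->
      let n = cohort_size ~fleet_size percent in
      let failures = ref 0 in
      for host = 0 to n - 1 do
        if not (healthy host) then incr failures
      done;
      let failure_rate = float_of_int !failures /. float_of_int n in
      if failure_rate > max_failure_rate then
        (* A real system would revert the cohort to the previous version here. *)
        Rolled_back { at_percent = percent; failure_rate }
      else go rest
  in
  go stages

let () =
  (* An update that breaks every host is stopped at the first stage. *)
  match
    rollout ~fleet_size:10_000 ~max_failure_rate:0.01 ~healthy:(fun _ -> false)
  with
  | Rolled_back { at_percent; _ } ->
    Printf.printf "rolled back at %d%%\n" at_percent
  | Completed -> print_endline "completed"
```

In this sketch, an update that crashes every machine is caught after touching 1% of the fleet instead of all of it, which is the whole point of staggering.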

OS Monocultures: Do You Really Need a Full Windows OS Stack to Run an Airport Screen?

The IT industry is increasingly dominated by a few major operating systems, primarily Microsoft Windows. This has led to the emergence of 'OS monocultures', with one dominant provider overshadowing a few large ones (Linux and macOS) and leaving little space for diverse, smaller suppliers. While these monocultures offer benefits, they also pose significant risks, since any vulnerability introduced can cause widespread damage.

This is what happened with CrowdStrike. The cybersecurity firm has around 20% market share, and because so many of its customers run the same stack, a single bug can have an enormous impact. One way to reduce the risk of a shared stack is to generate a unique stack for each application, so that bugs are contained in that stack. This can be achieved by using unikernels to build a small, highly specialised stack with only what is required to run the application. MirageOS is a library operating system that constructs unikernels to create secure, high-performance network applications with small attack surfaces. You don’t actually need to install a generic operating system to manage a single-purpose appliance, such as an airport screen.

Formal Verification, Testing, and Organisational Change

The road to security and reliability is paved with organisational changes that improve the overall stability of systems and reduce risk, affecting the way we develop software from start to finish. Scott Hanselman, the VP of Developer Community at Microsoft, highlights this dynamic in one of his posts on X:

“It’s always one line of code, but it’s NEVER one person... Engineering practices failed to find a bug multiple times, regardless of the seniority of the human who checked that code in. Solving the larger system thinking SDLC matters more than the null pointer check.”

But what does changing the software development life cycle on an organisational level look like? For one, it involves investing the time and money needed to create comprehensive tests that catch the bugs developers easily miss, before they ever reach production.

Formal verification, for example of a device driver and the code it executes, is another practice that can prevent faults from reaching production. Using formal verification, developers can mathematically prove that a program behaves according to its formal specification and satisfies a defined property.

As it relates to the CrowdStrike incident, formal verification of parsers is challenging but not impossible. For example, the EverParse framework emits secure, formally verified code for parsers that can be used in programs, including OCaml programs. Creating a software development culture that includes formal verification, fuzz testing, and other tests decreases the risk of failures slipping through the net.

The Role of Type Safety

Finally, let’s look at perhaps a more obvious topic in the light of the fault in Channel File 291. Using type- and memory-safe languages, like OCaml and Rust, prevents out-of-bounds memory errors and a whole class of other bugs.
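As a small illustration of what this means in practice, consider reading a field from a list of values supplied by a content file. The record layout and the `field` function below are made up for this example and do not reflect Falcon’s actual file format. In OCaml, array accesses are bounds-checked by default, so asking for a field that is not there raises an exception rather than reading arbitrary memory, and the type system lets us go one step further and force callers to handle the missing case.

```ocaml
(* Hypothetical lookup: the names and layout are invented for illustration. *)

(* Return the field at [index], or an error describing the mismatch.
   The caller cannot use the value without handling the [Error] case. *)
let field (fields : string array) (index : int) : (string, string) result =
  if index >= 0 && index < Array.length fields then Ok fields.(index)
  else
    Error
      (Printf.sprintf "field %d requested, but only %d supplied" index
         (Array.length fields))

let () =
  let fields = [| "alpha"; "beta" |] in
  (match field fields 2 with
   | Ok value -> print_endline value
   | Error message -> print_endline message);
  (* Even the unchecked-looking access is safe: it raises an exception
     instead of reading past the end of the array. *)
  match fields.(2) with
  | value -> print_endline value
  | exception Invalid_argument _ -> print_endline "out-of-bounds access refused"
```

The safety net only holds as long as bounds checking is left switched on, which is the default.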

The key takeaway we can all learn from, however, is about complexity. The assertion that “the central enemy of reliability is complexity ... complex systems tend to not be entirely understood by anyone” from a cybersecurity paper authored by several industry specialists holds true in this case. By eliminating a whole class of errors at compile time, a language with a strong type system, like OCaml, reduces the kind of complexity that leads to cyber-insecurity and simplifies the developer workflow when it comes to catching bugs.

Most importantly, from an industry standpoint, and as mentioned above, the faulty file that caused the outage remained undetected, even in the face of extensive stress tests. Building critical systems with secure-by-design principles means using building blocks that contribute to the robustness of the entire system by preventing faults and reducing complexity. One piece of this puzzle is to use languages immune to certain kinds of bugs, such as type- and memory-safe languages, but that is not enough. Varied and rigorous tests, like fuzz testing, are a necessary complement to any language. For example, the MirageOS network stack has had fuzz testing performed on it to prevent parser issues, providing another layer of safety on top of the already type-safe OCaml.
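To give a feel for what fuzz testing involves, here is a deliberately simple sketch using only the OCaml standard library. The length-prefixed format and the `parse` function are toys invented for this example, and this is not the tooling used on the MirageOS network stack; in practice one would more likely reach for a coverage-guided fuzzer such as AFL. The principle is the same, though: throw large amounts of random input at a parser and check that it never crashes and never returns something inconsistent.

```ocaml
(* Toy format, invented for illustration: the first byte gives the
   payload length, and the payload follows. *)
let parse (input : string) : string option =
  if String.length input < 1 then None
  else
    let len = Char.code input.[0] in
    if String.length input < 1 + len then None
    else Some (String.sub input 1 len)

(* Build a random byte string of at most [max_len] bytes. *)
let random_input max_len =
  String.init (Random.int (max_len + 1)) (fun _ -> Char.chr (Random.int 256))

(* Feed [runs] random inputs to [parse]. Return the first input that
   makes it raise or return a payload of the wrong length, if any. *)
let fuzz ~seed ~runs =
  Random.init seed;
  let rec loop i =
    if i >= runs then None
    else
      let input = random_input 64 in
      match parse input with
      | None -> loop (i + 1)
      | Some payload when String.length payload = Char.code input.[0] ->
        loop (i + 1)
      | Some _ -> Some input
      | exception _ -> Some input
  in
  loop 0

let () =
  match fuzz ~seed:42 ~runs:10_000 with
  | None -> print_endline "no failing input found"
  | Some input -> Printf.printf "failing input: %S\n" input
```

Removing the length check in `parse` would make `String.sub` raise on truncated inputs, and a run like this one would be expected to surface a failing input quickly.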

Join in the Conversation

In essence, the causes of the CrowdStrike outage, and the lessons to draw from it, are more complex than they may seem at first glance. It’s easy to reduce it to a line of faulty code, but the real takeaway is that the entire industry needs to implement better practices to safeguard users from risk. This outage was not the result of a cyber attack or malware, but the next one could be, and we cannot let the fate of our global networks rest entirely on endpoint security measures like antivirus programs, firewall management, and VPNs. We need to build foundationally secure systems from the ground up.

In this context, and in light of global calls for change in how cybersecurity is addressed, it is the right time to have these conversations and strengthen the sector from within. We believe the best approach is to adopt a secure-by-design strategy implemented in a type-safe and reliable language like OCaml.

There are many perspectives on this, however, and we want to hear yours. Connect with us on X (formerly Twitter) and LinkedIn and share your thoughts – we look forward to hearing from you!

Tarides champions open-source development. We create and maintain key features of the OCaml language in collaboration with the OCaml community. To learn more about how you can support our open-source work, discover our page on GitHub.

We are always happy to discuss commercial opportunities around OCaml. We provide core services, including training, tailor-made tools, and secure solutions. Contact us today to learn more about how Tarides can help your teams realise their vision.