What we can learn from the CrowdStrike incident
On 19 July 2024, a major global IT disruption occurred, impacting numerous sectors including airports, hospitals and train stations. Systems running Windows displayed the following white letters on a blue background: "Recovery - it looks like Windows didn't load correctly". What was the cause?
Experts refer to this as the "Blue Screen of Death", indicating a complete system crash. The affected Windows 10 clients had CrowdStrike Falcon installed.
The issue was first reported in Australia and spread rapidly across Asia, Europe and America, with the travel industry hit particularly hard. Media reports described numerous companies whose IT devices crashed and became unresponsive, and the disruption quickly reached global proportions. The root cause was identified as an update to CrowdStrike Falcon, software designed to detect and block attacks on systems. Paradoxically, a security tool intended to improve cyber stability caused significant operational delays.
Understanding the cause
Endpoint Detection and Response (EDR) programs, such as CrowdStrike Falcon, are designed to monitor, detect and respond to threats on endpoints such as computers and mobile devices. The Falcon system operates its modules in kernel mode. Devices generally have two modes for running programs: user mode and kernel mode.
- User Mode: A less privileged mode where application software runs. Applications in this mode don't have direct access to hardware or system memory but interact with them through the operating system's Application Programming Interface (API). This design protects the system from faulty or malicious applications by limiting their functionality and controlling their requests.
- Kernel Mode: A privileged mode where core components of the operating system run. Software in this mode has unrestricted access to all system resources, including hardware and memory.
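The protective effect of user mode can be illustrated with a small sketch. The snippet below starts a deliberately faulty program as an ordinary user-mode process and shows that only that process dies, while the parent and the operating system carry on. The names `FAULTY_APP` and `run_isolated` are made up for this illustration. A comparable fault inside a kernel-mode driver has no such safety net, which is why it can bring down the whole machine.

```python
import subprocess
import sys

# Child program that terminates abnormally, standing in for a faulty application.
FAULTY_APP = "import os; os.abort()"


def run_isolated(code: str) -> int:
    """Run the given code in a separate user-mode process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,  # hide the child's abort message
    )
    return result.returncode


status = run_isolated(FAULTY_APP)
# The exact status value is platform dependent; it is non-zero for an abnormal exit.
print(f"child exited abnormally (status {status}); parent process keeps running")
```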
On the morning of 19 July 2024, CrowdStrike deployed an update to all active systems. This update involved a "channel file" that controls the software's behavioural protection mechanisms, similar to signature updates in signature-based antivirus software. These files are distributed regularly, often several times a day, to keep the EDR system up to date against the latest cyber threats. Unfortunately, this particular channel file, number 291, contained a logic error that caused the driver to crash when the file was deployed locally, leading to widespread system failures.
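To make this class of failure more concrete, here is a minimal sketch of a loader that validates a content update before handing it to the detection engine. The record format, the field count and all names (`EXPECTED_FIELDS`, `parse_content`, `load_update`) are hypothetical and do not reflect CrowdStrike's actual channel file format. The point is only that malformed content is rejected and the previously active rules stay in place, instead of the content being interpreted blindly.

```python
from dataclasses import dataclass

EXPECTED_FIELDS = 4  # hypothetical: number of fields every rule must carry


@dataclass(frozen=True)
class Rule:
    rule_id: int
    fields: tuple


class InvalidContentError(ValueError):
    """Raised when a content update does not match the expected layout."""


def parse_content(text: str) -> list:
    """Parse a hypothetical content file: one rule per line, 'id|f1|f2|f3|f4'."""
    rules = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        rule_id, *fields = line.split("|")
        if not rule_id.isdigit():
            raise InvalidContentError(f"line {line_no}: rule id is not numeric")
        if len(fields) != EXPECTED_FIELDS:
            raise InvalidContentError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        rules.append(Rule(int(rule_id), tuple(fields)))
    if not rules:
        raise InvalidContentError("update contains no rules")
    return rules


def load_update(text: str, active_rules: list) -> list:
    """Return the new rules if the update is valid, otherwise keep the active ones."""
    try:
        return parse_content(text)
    except InvalidContentError:
        return active_rules
```

With this approach, an update whose rules carry three fields instead of the expected four is discarded and the endpoint continues to run with its last valid rule set.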
Recommendations for affected users
A number of mitigation measures have been published by official and unofficial sources, ranging from deleting a specific ".sys" file to updating the software. Users should rely only on mitigation recommendations and incident information from validated websites. A number of fraudulent websites are in circulation that redirect users to suspicious sites in order to conduct phishing, spread malware or collect data. Under no circumstances should software or "updates" be downloaded from unverified websites. Instead, we recommend following guidance directly from CrowdStrike or the Federal Office for Information Security (BSI).
What can we learn from this?
While the impact was confined to Windows systems, there is a possibility that other operating systems could also be affected by similar errors. According to Microsoft, approximately 8.5 million systems were impacted by this issue. CrowdStrike Falcon is primarily used by commercial customers. The disruption was particularly severe in critical infrastructure sectors.
This global outage of critical IT systems clearly demonstrated the problems that can arise when EDR systems are implemented without a comprehensive recovery strategy. In the field of IT security, the protection goals of Confidentiality, Integrity, and Availability (also known as the "CIA triad") are of the utmost importance. These principles are integral to every aspect of a company's cyber strategy. However, EDR systems tend to prioritise the confidentiality and integrity of information over its availability. This is evident from the fact that, in an emergency, essential system components may be deleted or quarantined to preserve data integrity, rendering them unusable.
How can availability still be ensured?
Efficient threat hunting must respond quickly to emerging cyber threats by promptly providing Indicators of Compromise (IoCs), particularly to counter zero-day attacks. In emergency situations, it may be necessary to halt ongoing processes or block network connections. Operating EDR software components in kernel mode is justified by the advantages it offers in terms of insight, control and responsiveness. However, this approach also carries inherent risks: as this incident showed, errors in kernel mode can lead to system crashes or severe security vulnerabilities. Several approaches can help to mitigate this risk.
Robust error and exception handling mechanisms in kernel-mode code are essential for ensuring system stability in the event of errors. It is also advisable to develop failsafe mechanisms that deactivate kernel-mode code and revert to secure default functions in the event of critical errors. Kernel-mode code should operate with only the minimum necessary privileges and affect as few system resources as possible. A modular architecture, in which kernel-mode components are isolated from one another, can limit the impact of potential errors or attacks. Additionally, limiting response measures to specific paths, settings, or networks can mitigate the effects of EDR actions.
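The failsafe idea described above can be sketched as follows. This is an illustrative user-space model, not kernel code and not how any particular EDR product works. The class `ContentGuard`, the state file layout and the threshold `MAX_FAILED_STARTS` are assumptions made for this example. The guard counts start attempts with a new content version that were never confirmed as healthy and, once the threshold is reached, falls back to the last version known to work.

```python
import json
from pathlib import Path
from typing import Optional

MAX_FAILED_STARTS = 2  # hypothetical threshold before falling back


class ContentGuard:
    """Tracks unconfirmed starts and falls back to a known-good content version."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        self.state = {"candidate": None, "known_good": None, "failed_starts": 0}
        if state_file.exists():
            self.state.update(json.loads(state_file.read_text()))

    def _save(self) -> None:
        self.state_file.write_text(json.dumps(self.state))

    def stage(self, version: str) -> None:
        """Record a newly received content version as an unproven candidate."""
        self.state["candidate"] = version
        self.state["failed_starts"] = 0
        self._save()

    def begin_start(self) -> Optional[str]:
        """Call early at startup; returns the content version that should be loaded."""
        if self.state["candidate"] and self.state["failed_starts"] >= MAX_FAILED_STARTS:
            # The candidate never reached a healthy state: discard it.
            self.state["candidate"] = None
            self.state["failed_starts"] = 0
            self._save()
        if self.state["candidate"] is None:
            return self.state["known_good"]
        # Count the attempt before loading, so a crash leaves the counter raised.
        self.state["failed_starts"] += 1
        self._save()
        return self.state["candidate"]

    def confirm_healthy(self) -> None:
        """Call once the system has run stably with the loaded content."""
        if self.state["candidate"]:
            self.state["known_good"] = self.state["candidate"]
            self.state["candidate"] = None
        self.state["failed_starts"] = 0
        self._save()
```

Because the counter is persisted before the candidate is loaded, a crash during loading is still recorded. After two unconfirmed starts with a new version, the third start returns the previous known-good version instead of repeating the crash.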
If all of these measures fail, a recovery strategy is essential. System snapshots and backups should be created regularly so that a system can be returned swiftly to a previous stable state after a kernel-mode failure. The corresponding recovery procedures must also be implemented and tested in advance, so that systems can be restored quickly when kernel-mode code fails.