How to Repair Azure VMs After CrowdStrike Outage
- July 23, 2024
- Posted by: Trionx AI
- Category: Technology
Personal Experience with the CrowdStrike Outage on Azure Windows VMs
On July 18, 2024, our team faced a significant crisis when a routine update from CrowdStrike caused a global IT outage. The update shipped a problematic channel file matching “C-00000291*.sys,” which triggered a Blue Screen of Death (BSOD) on Windows systems and left them stuck in an indefinite boot loop. This post shares our experience navigating the turmoil, the solutions we explored, and the ultimate resolution for our Azure Windows VMs.
Problem Description
The issue began at around 19:00 UTC on July 18, 2024. The faulty channel file caused Windows client and server VMs running the CrowdStrike Falcon agent to crash, leaving the affected systems stuck in a restart loop and rendering them unusable. The outage affected numerous sectors globally:
- Airlines: United, Delta, and American Airlines issued global ground stops, leading to numerous flight cancellations and delays.
- Financial Institutions: Banks in Australia, India, and other parts of Asia faced significant operational disruptions.
- Media Services: TV networks, including Sky and the BBC’s CBBC channel, experienced broadcasting issues.
Our Journey Through the Crisis
Our team spent countless hours from the onset of the issue until the following Monday trying to remediate the problem. We attempted various solutions, some of which were more time-consuming than others. Here are the steps we followed:
Step-by-Step Resolution for Azure Windows VMs
- Creating a Rescue VM:
- We began by creating a rescue VM to troubleshoot the affected systems. This VM was created with the same specifications as the original VM and located in the same region. The command used was:
az vm repair create -g <YourResourceGroupName> -n <YourVMName> --verbose
- This step involved making a copy of the OS disk of the problematic VM and attaching it as a data disk to the rescue VM.
- Running Mitigation Script:
- Next, we ran the mitigation script on the rescue VM to address the issue on the attached OS disk:
az vm repair run -g <YourResourceGroupName> -n <YourVMName> --run-id win-crowdstrike-fix-bootloop --run-on-repair --verbose
- Restoring the Fixed OS Disk:
- After fixing the OS disk, we restored it to the original VM. The fixed OS disk was detached from the rescue VM and reattached to the original VM, which was stopped but not deallocated during this process:
az vm repair restore -g <YourResourceGroupName> -n <YourVMName> --verbose
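For reference, here is the whole sequence as a single sketch. The resource group (rg-prod), VM name (vm-web01), and rescue-VM credentials are hypothetical placeholders, and the commands assume the Azure CLI vm-repair extension is installed; adjust them to your environment before running.

# One-time setup: install the vm-repair extension
az extension add --name vm-repair

# 1. Create the rescue VM; a copy of the broken OS disk is attached to it as a data disk
az vm repair create -g rg-prod -n vm-web01 --repair-username rescueadmin --repair-password '<StrongPassword>' --verbose

# 2. Run the CrowdStrike boot-loop mitigation script against the attached disk copy
az vm repair run -g rg-prod -n vm-web01 --run-id win-crowdstrike-fix-bootloop --run-on-repair --verbose

# 3. Swap the repaired OS disk back onto the original VM
az vm repair restore -g rg-prod -n vm-web01 --verbose

After the restore completes, starting the VM and checking the serial console or boot diagnostics (for example with az vm boot-diagnostics get-boot-log) confirms whether it boots past the BSOD.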
Alternative Recovery Methods
Depending on the VM configurations, different recovery methods were employed:
- From Backup:
- For VMs with a recent backup, restoring from backup was the quickest method. This involved rolling back to the latest stable recovery point to bring the systems back online (a rough CLI sketch is included after this list).
- Manual Deletion of CrowdStrike Files:
- In some cases, we manually deleted the problematic CrowdStrike files as per Microsoft and CrowdStrike recommendations. This involved:
- Booting into Safe Mode or the Windows Recovery Environment (WinRE).
- Navigating to the %WINDIR%\System32\drivers\CrowdStrike directory.
- Locating and deleting the files matching “C-00000291*.sys” (the exact command is shown just after these steps).
- Rebooting the system normally.
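Once you are at an elevated command prompt, the deletion itself is a single command; in WinRE, substitute the OS volume’s Windows directory (typically C:\Windows) for %WINDIR%, since the variable resolves to the recovery environment there.

rem In Safe Mode %WINDIR% points at the OS volume; in WinRE use the OS drive letter explicitly (e.g. C:\Windows)
del %WINDIR%\System32\drivers\CrowdStrike\C-00000291*.sys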
- Automation Script:
- To speed up the process and work on several machines in parallel, we developed an automation script. It automated creating the rescue VMs, running the mitigation script, and restoring the fixed OS disks across multiple affected VMs (a simplified sketch follows after this list).
- Using a Recovery USB Tool:
- For some VMs, using Microsoft’s recovery USB tool proved effective. This method involved creating a recovery USB, booting the affected VM from the USB, and following on-screen instructions to repair the system.
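For the backup route, the CLI flow looked roughly like the sketch below. The vault, storage account, and resource names are hypothetical, and the exact parameters may differ depending on your backup configuration; restore-disks places the restored disks (and a deployment template) in a staging storage account, from which the VM can be rebuilt or its OS disk swapped.

# Hypothetical names throughout: rg-prod, vault-prod, vm-web01, strecovery01
# List recovery points for the VM and pick one (verify the timestamp before restoring)
az backup recoverypoint list --resource-group rg-prod --vault-name vault-prod \
  --container-name vm-web01 --item-name vm-web01 -o table

# Restore the disks from the chosen recovery point into the staging storage account
az backup restore restore-disks --resource-group rg-prod --vault-name vault-prod \
  --container-name vm-web01 --item-name vm-web01 --rp-name <RecoveryPointName> \
  --storage-account strecovery01 --target-resource-group rg-prod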
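The automation mentioned above was essentially a loop over the three az vm repair commands. A minimal sketch of the idea is shown here; the resource group, VM names, and rescue credentials are placeholders, and the error handling and reporting of the real script are omitted for brevity.

#!/usr/bin/env bash
# Repair a list of affected VMs in parallel. RESCUE_PASSWORD is expected in the environment.
RG="rg-prod"
VMS=("vm-web01" "vm-web02" "vm-app01")

repair_vm() {
  local vm="$1"
  az vm repair create -g "$RG" -n "$vm" --repair-username rescueadmin --repair-password "$RESCUE_PASSWORD" --verbose
  az vm repair run -g "$RG" -n "$vm" --run-id win-crowdstrike-fix-bootloop --run-on-repair --verbose
  az vm repair restore -g "$RG" -n "$vm" --yes --verbose
}

for vm in "${VMS[@]}"; do
  repair_vm "$vm" > "repair-${vm}.log" 2>&1 &   # one background job and log file per VM
done
wait   # block until every repair job finishes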
Final Thoughts
Navigating the CrowdStrike outage was a challenging experience, filled with long hours and numerous attempts to find a resolution. The various recovery methods highlighted above were instrumental in restoring normalcy to our Azure Windows VMs. Each method had its own set of challenges, but through perseverance and collaboration, we managed to bring our systems back online.
Conclusion
The CrowdStrike outage of July 2024 serves as a stark reminder of the complexities of IT infrastructure and the importance of robust incident response strategies. While the incident caused significant disruptions, it also showcased the resilience and adaptability of IT teams globally. By sharing our experience, we hope to provide valuable insights and practical solutions for those who may face similar challenges in the future.
For more detailed information, you can refer to sources such as Microsoft Tech Community, Tom’s Hardware, Shacknews, and Techepages.
For an in-depth guide on using a PowerShell script to fix Azure VM boot loop issues, check out our related article on TrioNxAI.