This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.

Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren’t revolutionary. They’re basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.

Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.

This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it’s ultimately the customers and their data that suffer.

171 points

Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common “basic hygiene” practices would’ve helped, especially with Crowdstrike reportedly ignoring and bypassing the rollout policies set by their customers.

Not saying the rest of your post is wrong, but this stood out as easily glossed over.

22 points

A decade ago I worked for a regional chain of gyms with locations in 4 states.

I was in TN. When a system would go down in SC or NC, we originally had three options:

  1. (The most common) have them put it in a box and ship it to me.
  2. I go there and fix it (rare)
  3. I walk them through fixing it over the phone (fuck my life)

I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and had little OptiPlex 160s running a software client that I shipped to each club. Then each machine at each club was configured to PXE boot from the FOG client.

The server contained images of every machine we commonly used. I could tell FOG which locations used which models, and it would keep the images cached on the client machines.

If everything was okay, it would chain the boot to the os on the machine. But I could flag a machine for reimage and at next boot, the machine would check in with the local FOG client via PXE and get a complete reimage from premade images on the fog server.
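For anyone who hasn't seen this kind of setup, here is a rough sketch of the check-in decision described above. It is not FOG's actual code or API; the `Machine` class, the `pxe_boot_action` function, and the action strings are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """A workstation known to the imaging server (hypothetical model)."""
    mac: str
    model: str
    reimage_flagged: bool = False


def pxe_boot_action(machine: Machine, images: dict[str, str]) -> str:
    """Decide what a machine should do when it checks in over PXE."""
    if not machine.reimage_flagged:
        # Nothing pending: hand off to the OS already on the local disk.
        return "chainload-local-disk"
    image = images.get(machine.model)
    if image is None:
        # No premade image for this model; fall back to a normal boot.
        return "chainload-local-disk"
    # One-shot flag: clear it so the next boot goes back to the local OS.
    machine.reimage_flagged = False
    return f"deploy:{image}"
```

The point is that the decision happens before the installed OS loads, so a boot-looping OS never gets a say.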

The corporate office was physically connected to one of the clubs, so I trialed the software at our adjacent club, and when it worked great, I rolled it out company wide. It was a massive success.

So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer. Since it ran in PXE, the condition of the OS didn’t matter at all. It never loaded the OS when it was flagged for reimage. It would even join the computer to the domain and set up that location’s printers and everything. All I had to tell the low-tech gymbro sales guy on the phone to do was reboot it.

This was free software. It saved us thousands in shipping fees alone. And brought our time to fix down from days to minutes.

There ARE options out there.

0 points

How removed from IT are you that you think FOG would have helped here?

5 points

How would it not have? You got an office or field offices?

“Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?

The damn thing’s even boot looping, so you don’t even have to reboot it.

I’m sure the user saved all their data in OneDrive like they were supposed to, right?

I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.

But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.

But to actually answer your question and be transparent, I’ve been doing Linux devops for 10 years now. I haven’t touched a windows server since the days of the gymbros. I DID say it’s been a decade.

-1 points

Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.

4 points

Now your fog servers are dead. What now

5 points

This is a good solution for these types of scenarios. Doesn’t fit all though. Where I work, 85% of staff work from home. We largely use SaaS. I’m struggling to think of a good method here other than walking them through reinstalling windows on all their machines.

-2 points

  1. Configure PXE to reboot into a recovery image, push out a command to remove the bad file. Reboot. Done. Workstation laptops usually have remote management already.

or

  2. Have a recovery image already installed. Have the user reboot and press a key to boot into recovery. Push out the fix. Done.

2 points

That’s still 15% less work though. If I had to manually fix 1000 computers, clicking a few buttons to automatically fix 150 of them sounds like a sweet-ass deal to me even if it’s not universal.

You could also always commandeer a conference room or three and throw a switch on the table. “Bring in your laptop and go to conference room 3. Plug in using any available cable on the table and reboot your computer. Should be ready in an hour or so. There’s donuts and coffee in conference room 4.” Could knock out another few dozen.

Won’t help for people across the country, but if they’re nearish, it’s not too bad.

26 points

This works great for stationary pcs and local servers, does nothing for public internet connected laptops in hands of users.

The only fix here is staggered and tested updates, and apparently this update bypassed even the deferred update settings that CrowdStrike themselves put into their software.

The only winning move here was to not use crowdstrike.

6 points

Absolutely. 100%

But don’t let perfect be the enemy of good. A fix that gets you 40% of the way there is still 40% less work you have to do by hand. Not everything has to be a fix for all situations. There’s no such thing as a panacea.

7 points

It also assumes that reimaging is always an option.

Yes, every company should have networked storage enforced specifically for issues like this, so no user data would be lost, but there’s often a gap between should and “has been able to find the time and get the required business side buy in to make it happen”.

Also, users constantly find new ways to do non-standard, non-supported things with business critical data.

-5 points

Almost all computers can be set to PXE boot, but work laptops usually even have more advanced remote management capabilities. You ask the employee to reboot the laptop and presto!

-1 points

what common “basic hygiene” practices would’ve helped

Not using a proprietary, unvetted, auto-updating, 3rd party kernel module in essential systems would be a good start.

Back in the day companies used to insist upon access to the source code for such things along with regular 3rd party code audits but these days companies are cheap and lazy and don’t care as much. They’d rather just invest in “security incident insurance” and hope for the best 🤷

Sometimes they don’t even go that far and instead just insist upon useless indemnification clauses in software licenses. …and yes, they’re useless:

https://www.nolo.com/legal-encyclopedia/indemnification-provisions-contracts.html#:~:text=Courts have commonly held that,knowledge of the relevant circumstances).

(Important part indicating why they’re useless should be highlighted)

0 points

It’s called EFI. How do you think your BIOS update from inside BIOS is working? ;)

EDIT: oh, and PXE boot + wol.
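For the WoL half, a minimal sketch of a Wake-on-LAN sender. The magic packet layout (six 0xFF bytes followed by the target MAC repeated 16 times) is standard; the function names and the default broadcast address and port are just illustrative choices. It only reaches machines on the same broadcast domain with WoL enabled in firmware, so it does nothing for laptops out on the public internet.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common choice)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```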

-1 points

I’d issue IPMI or remote management commands to reboot the machines. Then I’d boot into either a Linux recovery environment (yes, Linux can unlock BitLocker-encrypted drives) or WinPE (or Windows RE) and unlock the drives. Ideally that environment is already loaded on the drives; the machines could PXE boot instead, but PXE would cause delays. Just giving ideas here, but an ideal DR scenario would have an environment ready to load.

I’d then push a command or script that removes the update file that caused the issue and reboots. Having planned for a scenario like this already, total time to fix would be less than 2 hours.
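A minimal sketch of what that "remove the file" script could look like once a recovery environment has the Windows volume unlocked and mounted. The directory and the `C-00000291*.sys` pattern come from CrowdStrike's public remediation guidance, not from this thread; `volume_root`, the function name, and the dry-run default are my own assumptions, and unlocking BitLocker is not shown.

```python
from pathlib import Path

# Directory and pattern as given in CrowdStrike's public remediation guidance.
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")
BAD_CHANNEL_FILE_GLOB = "C-00000291*.sys"


def remove_bad_channel_files(volume_root: Path, dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) the faulty channel files on a mounted volume.

    volume_root is wherever the recovery environment mounted the Windows
    partition (hypothetical, e.g. /mnt/windows). Returns the matching paths.
    """
    driver_dir = volume_root / CROWDSTRIKE_DIR
    if not driver_dir.is_dir():
        return []  # sensor not installed here, or wrong volume
    matches = sorted(driver_dir.glob(BAD_CHANNEL_FILE_GLOB))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to a dry run is deliberate: on a fleet of thousands you want to see what would be deleted before you delete it.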

3 points

At my company I use a virtual desktop and it was restored from a nightly snapshot a few hours before I logged in that day (and presumably, they also applied a post-restore temp fix). This action was performed on all the virtual desktops at the entire company and took approximately 30 minutes (though, probably like 4 hours to get the approval to run that command, LOL).

It all took place before I even logged in that day. I was actually kind of impressed… We don’t usually act that fast.

1 point

Somebody give those workers that had their shit together a raise, for real.

4 points

Autopilot, intune. Force restart device twice to get startup repair, choose factory reset, share LAPS admin password and let the workstation rebuild itself.

15 points

Separate persistent data and operating system partitions, ensure that every local network has small pxe servers, vpned (wireguard, etc) to a cdn with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to pxe boot and redeploy the non-data partition.

Bitlocker keys for the OS partition are irrelevant because nothing of value is stored on the OS partition, and keys for the data partition can be stored and passed via AD after the redeploy. If someone somehow deploys an image that isn’t ours, it won’t have keys to the data partition because it won’t have a trust relationship with AD.

(This is actually what I do at work)
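To illustrate just the "validate by checksum before delivering" part, a minimal sketch using SHA-256. This is not the commenter's actual tooling; the function names are hypothetical, and it assumes the expected checksum comes from a manifest whose signature has already been verified against your CA (that step is not shown).

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte images don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_trusted(path: Path, expected_sha256: str) -> bool:
    """Return True only if the image matches the checksum from the manifest."""
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```

A PXE server would call `image_is_trusted` before serving an image and refuse to deliver anything that fails the check.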

3 points

But your pxe boot server is down, your radius server providing vpn auth is down, your bitlocker keys are in AD which is down because all your domain controllers are down.

3 points

Yes and no. In the best case, endpoints have enough cached data to get us through that process. In the worst case, that’s still a considerably smaller footprint to fix by hand before the rest of the infrastructure can fix itself.

7 points

Sounds good, but can you trust an OS partition not to store things in %programdata% etc that should be encrypted?

2 points

With enough autism in your overlay configs, sure, but in my environment that leakage is still encrypted. It’s far simpler to just accept leakage and encrypt the OS partition with a key that’s never stored anywhere. If it gets lost, you rebuild the system from PXE. (Which is fine, because it only takes about 20 minutes and no data we care about exists there.) If it’s working correctly, the OS partition is still encrypted and protects any inadvertent data leakage from offline attacks.

5 points

Separate persistent data and operating system partitions, ensure that every local network has small pxe servers, vpned (wireguard, etc) to a cdn with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to pxe boot and redeploy the non-data partition.

At that point why not just redirect the data partition to a network share with local caching? Seems like it would simplify this setup greatly (plus makes enabling shadow copy for all users stupid easy)

Edit to add: I worked at a bank that did this for all of our users and it was extremely convenient for termed employees, since we could simply give the manager access to the termed employee’s share and toss them a shortcut to said employee’s files. So if it turned out Janet had some business critical spreadsheet, it was easily accessible even after she was termed.

3 points

We do this in a lot of areas with fslogix where there is heavy persistent data, it just never felt necessary to do that for endpoints where the persistent data partition is not much more than user settings and caches of convenience. Anything that is important is never stored solely on the endpoints, but it is nice to be able to reboot those servers without affecting downstream endpoints. If we had everything locally dependant on fslogix, I’d have to schedule building-wide outages for patching.

2 points

I’ve been separating OS and data partitions since I was a kid running Windows 95. It’s horrifying that people don’t expect and prepare for machines to become unbootable on a regular basis.

Hell, I bricked my work PC twice this year just by using the Windows cleanup tool - on Windows 11. The antivirus went nuclear, as antivirus products do.

36 points

You’d have to have something even lower level like an OOB KVM on every workstation which would be stupid expensive for the ROI, or something at the UEFI layer that could potentially introduce more security holes.

8 points

Maybe they should offer a real time patcher for the security vulnerabilities in the OOB KVM, I know a great vulnerability database offered by a company that does this for a lot of systems world wide! /s

0 points

Lol 😋! Also, I need an “Out-of-Band Keyboard, Video, and Mouse” to your “OOB KVM” so I can steal the bank, er, improve security.

1 point

Vpro is usually $20 per machine and offers oob kvm.

-4 points

UEFI isn’t going away. Sorry to break the news to you.

9 points

I didn’t say it was, nor did I say UEFI was the problem. My point was additional applications or extensions at the UEFI layer increase the attack footprint of a system. Just like vPro, you’re giving hackers a method that can compromise a system below the OS. And add that in to laptops and computers that get plugged in random places before VPNs and other security software is loaded and you have a nice recipe for hidden spyware and such.

5 points

…you don’t have OOBM on every single networked device and terminal? Have you never heard of the buddy system?

You should probably start writing up an RFP. I’d suggest you also consider doubling up on the company issued phones per user.

If they already have an ATT phone, get them a Verizon one as well, or vice versa.

At my company we’re already way past that. We’re actually starting to import workers to provide human OOBM.

You don’t answer my call? I’ll just text the migrant worker we chained to your leg to flick your ear until you pick up.

Maybe that sounds extreme, but guess whose company wasn’t impacted by the Crowdstrike outage.

1 point

I mean, with the exception of the shackles, this is just logistics 101. The more something needs to stay working or not accidentally trigger a huge problem, the more resources you dedicate to picking up where the regular guy left off because the “fleffingbridge transport 1” company’s bus broke down in front of the regular guy and his bus got hit by a train. Solution? New bus, plant some trees. Prevention? Bridges and tunnels aren’t cheap, but clearly we need one there now. We can’t predict the future but we have to do our best to try or - simulated or real - the cost will be paid in blood. Obviously there’s moral limits, but hiring more staff is not in and of itself immoral nor the wrong approach.

If I was in charge of a real life logistics operation, I’d be devastated if anyone died because of me. I can’t say, however, that it can be avoided. Sometimes people die at random, that’s not yet 100% avoidable and might never be, but I do care. I’d hope people who actually end up in logistics could learn to indulge their empathy enough to remember there are lives on the line, but I can’t blame someone for being bitter that the actual work output is purely being fleeced for profit.

-1 points

Dual partitioning as Android does it might have helped. Install the update to partition B, reboot and if it’s alright swap A and B partitions to make B the default. Boot again to the default partition (A, formerly B).

It wouldn’t have booted correctly afaiu with the faulty update, and would have been reverted to use the untouched A partition.
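A toy sketch of that A/B scheme, assuming the update really is delivered through the slot mechanism (which is its own question for content pushed at runtime). The `ABSlots` class and its method names are invented for illustration and are not Android's actual update API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ABSlots:
    """Tracks which slot is known-good and which holds an untested update."""
    active: str = "A"
    pending: Optional[str] = None

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def stage_update(self) -> str:
        """Write the update to the inactive slot and mark it for a trial boot."""
        self.pending = self.inactive()
        return self.pending

    def boot(self, boots_ok: Callable[[str], bool]) -> str:
        """Try the pending slot once; promote on success, else fall back."""
        if self.pending is not None:
            candidate, self.pending = self.pending, None
            if boots_ok(candidate):
                self.active = candidate
        return self.active
```

With a faulty update, `boots_ok` fails for the trial slot and the machine comes back up on the untouched one.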

2 points

Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection,

Intel AMT.

2 points

Does Windows have a solid native way to remotely re-image a system like macOS does?

0 points

Yes.

-3 points

No.

Maybe with Intune and Autopilot, but I haven’t used it.

0 points

Windows ADK does this too, or any PXE server really… so yes, you can. The CS issue though didn’t require re-image. Merely removing a file. DR planning would usually have a recovery image pre-installed to automate booting into for lower-level fixes.

3 points

If you don’t know, don’t answer

2 points

Yes, but it is license-based and focused on business customers.

6 points

You are talking about how to fix the problem.

This person is talking about what caused the problem.

Completely different things.

  1. Bad thing happened, how do we fix bad thing and its effects.

Analogous to: A house is on fire; call the ambulances to treat any wounded, call the fire department, call insurance, figure out temporary housing.

This is basically immediate remedy or mitigation.

  2. Bad thing happened, but why did the bad thing happen and how do we prevent future occurrences of this?

Analogous to: Investigate the causes of the fire, suggest various safety regulations on natural gas infrastructure, home appliances, electrical wiring, building material and methods, etc.

This is much more complex and involves systemic change.

9 points

Rollout policies are the answer, and CrowdStrike should be made an example of if they were truly overriding policies set by the customer.

It seems more likely to me that nobody was expecting “fingerprint update” to have the potential to completely brick a device, and so none of the affected IT departments were setting staged rollout policies in the first place. Or if they were, they weren’t adequately testing.

Then - after the fact - it’s easy to claim that rollout policies were ignored when there’s no way to prove it.

If there’s some evidence that CS was indeed bypassing policies to force their updates I’ll eat the egg on my face.
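For anyone unsure what a staged rollout policy means in practice, a minimal sketch: update one ring of hosts at a time and stop before the next ring if too many fail. The ring names, `apply_update` callback, and failure threshold are all hypothetical; this is not how CrowdStrike's policies are implemented.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Update hosts ring by ring; stop before the next ring if too many fail.

    Returns the hosts that were updated successfully.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt: later rings never receive the update
    return updated
```

The whole value is in the `break`: a bad update costs you a canary ring instead of the fleet, but only if every kind of update actually goes through the rings.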

7 points

from what I’ve read/watched that’s the crux of the issue… did they push a ‘content’ update, i.e. signatures, or did they push a code update.

so you basically had a bunch of companies who absolutely do test all vendor code updates being slipped a code update they weren’t aware of, labeled as a ‘content’ update.

1 point

I’m one of the admins who manage CrowdStrike at my company.

We have all automatic updates disabled, because when they were enabled (according to the CrowdStrike best practices guide they gave us), they pushed out a version with a bug that overwhelmed our domain servers. Now we test everything through multiple environments before things make it to production, with at least two weeks of testing before we move a version to the next environment.

This was a channel file update, and per our TAM and account managers in our meeting after this happened, there’s no way to stop that file from being pushed, or to delay it. Supposedly they’ll be adding that functionality in now.
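Roughly, the promotion gate we apply to sensor versions looks like the sketch below. The environment names and function are hypothetical stand-ins, not our real tooling, and the catch is exactly what bit everyone: a gate like this only covers updates that pass through it, which the channel file did not.

```python
from datetime import date, timedelta
from typing import Optional

ENVIRONMENTS = ["dev", "test", "staging", "production"]  # hypothetical names
MIN_SOAK = timedelta(weeks=2)  # minimum time in an environment before promotion


def next_environment(current: str, entered_on: date, today: date) -> Optional[str]:
    """Return the environment a version may move to, or None if it must wait."""
    idx = ENVIRONMENTS.index(current)
    if idx == len(ENVIRONMENTS) - 1:
        return None  # already in production
    if today - entered_on < MIN_SOAK:
        return None  # still soaking in the current environment
    return ENVIRONMENTS[idx + 1]
```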

28 points

Was a Windows sysadmin for a decade. We had thousands of machines with endpoint management and BitLocker encryption. (I have since moved on to more cloud Kubernetes devops.) Nothing on a remote endpoint offers a basic “hygiene” solution that could remotely fix this mess automatically. I guess Intel’s BIOS remote connection (forget the name) could in theory allow at least some poor tech to remote in, given there is an internet connection and the company paid the exorbitant price.

All that to say, anything with end-user machines that won’t boot is a nightmare. And with BitLocker it’s even more complicated. (Hope your BitLocker key synced… Lol).

3 points

Bro. PXE boot image servers. You can remotely image machines from hundreds of miles away with a few clicks and all it takes on the other end is a reboot.

4 points

With a few clicks and being connected to the company network. Leaving anyone not able to reach an office location SOL.

20 points

You’re thinking of Intel vPro. I imagine some of the Crowdstrike victims, er, customers have this and a bunch of poor level 1 techs are slowly grinding their way through every workstation on their networks. But yeah, OP is deluded and/or very inexperienced if they think this could have been mitigated on workstations through some magical “hygiene”.
