• Delta Air Lines CEO Ed Bastian said the massive IT outage earlier this month that stranded thousands of customers will cost it $500 million.
  • The airline canceled more than 4,000 flights in the wake of the outage, which was caused by a botched CrowdStrike software update and took thousands of Microsoft systems around the world offline.
  • Bastian, speaking from Paris, told CNBC’s “Squawk Box” on Wednesday that the carrier would seek damages from the disruptions, adding, “We have no choice.”
22 points

Yeah… Maybe don’t put all your IT eggs in one basket next time.

Delta is the one that chose to use CrowdStrike on so many critical systems, so the fault still lies with Delta.

Every big company thinks that when they outsource a solution or buy software they’re getting out of some responsibility. They’re not. When that 3rd party causes a critical failure, the proverbial finger still points at the company that chose to use the 3rd party.

The shareholders of Delta should hold this guy responsible for this failure. They shouldn’t let him get away with blaming CrowdStrike.

17 points

So you think Delta should’ve had a different antivirus/EDR running on every computer?

9 points

I think what @riskable@programming.dev was saying is that you shouldn’t have multiple mission-critical systems all using the same 3rd-party services. Have a mix of at least two, so if one 3rd-party service goes down, not everything goes down with it.

12 points

That sounds easy to say, but in execution it would be massively complicated. Modern enterprises are littered with 3rd-party services all over the place. The alternative is writing and maintaining your own solution in house, which is an incredibly heavy lift to cover the entirety of all services needed in the enterprise. Most large enterprises are resource-starved as it is, and this suggestion of having redundancy for any 3rd-party service that touches mission-critical workloads would probably increase burden and costs by at least 50%. I don’t see that happening in commercial companies.

6 points

In this case, it’s a local third-party tool, and they thought they could control the cadence of updates. There was no reason to think there was anything particularly unstable about the situation.

This is closer to saying that half of your servers should be Linux and half should be Windows in case one has a bug.

CrowdStrike bypassed user controls on updates.
The normal responsible course of action is to deploy an update to a small test environment, test to make sure it doesn’t break anything, and then slowly deploy it to more places while watching for unexpected errors (see the sketch below).
CrowdStrike shotgunned it to every system at once without monitoring, with grossly inadequate testing, and entirely bypassed any user-configurable setting to avoid or opt out of the update.
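
To make the “deploy slowly and watch” idea concrete, here’s a minimal Python sketch of a ring-based rollout gate. The ring fractions, error budget, and the `deploy_to`/`error_rate` hooks are hypothetical placeholders for whatever deployment and telemetry plumbing an org actually runs, not anything CrowdStrike ships:

```python
import time

# Hypothetical rings: a tiny canary first, then progressively larger cohorts.
ROLLOUT_RINGS = [0.001, 0.01, 0.10, 0.50, 1.00]  # fraction of the fleet
ERROR_BUDGET = 0.001   # halt if >0.1% of updated hosts report faults
SOAK_SECONDS = 3600    # watch each ring before widening the blast radius

def staged_rollout(update, fleet, deploy_to, error_rate):
    """Deploy `update` ring by ring, halting if telemetry looks bad.

    `deploy_to(hosts, update)` pushes the update; `error_rate(hosts)`
    returns the observed failure fraction. Both are stand-ins here.
    """
    deployed = 0
    for fraction in ROLLOUT_RINGS:
        target = int(len(fleet) * fraction)
        deploy_to(fleet[deployed:target], update)
        deployed = target

        time.sleep(SOAK_SECONDS)  # let crash/health telemetry accumulate
        if error_rate(fleet[:deployed]) > ERROR_BUDGET:
            raise RuntimeError("rollout halted: error budget exceeded")
    return deployed
```

The exact numbers don’t matter; the point is that every widening step is gated on evidence from the previous one, which is exactly the safeguard a forced simultaneous push throws away.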

I was much more willing to put the blame on the organizations that had the outages for failing to follow best practices, before I learned that the way the update was pushed would have entirely bypassed any of those safeguards.

It’s unreasonable to say that an organization needs to run multiple copies of every service with different fundamental infrastructure choices for each in case one magics itself broken.

2 points

Adding another reply since I went on a bit of a rant in my other one… You’re actually missing the point I was trying to make: no matter what solution you choose, it’s still your fault for choosing it. There are a zillion mitigations and backup plans that can be used when you feel like you have no choice but to use a dangerous 3rd-party tool (e.g. one that installs kernel modules). Delta obviously didn’t do any of that due diligence.

3 points

A kernel module is basically the only way to implement this type of security software. It’s the only thing that has system-wide access to real-time filesystem and network events.

Yes, they’re ultimately liable to their customers, because that’s how liability works. But it’s really hard to argue that they’re at fault for picking a standard piece of software from a leading vendor, one that functions roughly the same as every other product in this space on every platform, which then bypassed all the configurations they could make to control updates, grabbed a corrupted update, and crashed the computer.
It’s like saying it’s the driver’s fault the brakes on their Toyota failed and they crashed into someone. Yes, they crashed and their insurance is going to have to cover it, but you don’t get angry at the driver for purchasing a common car in good condition and having it break in a way they can’t control.

What mitigations should they have had? All computer systems are mostly third-party tools. Your OS is a third-party tool. Your programming language is a third-party tool. Web server, database, load balancer, caching server: all third-party tools. Hardware drivers? Usually third party, though USB has made a lot of things more generic.

If your package manager decides to ignore your configuration and update your kernel to something mangled and reboot, your computer is going to crash and it’ll stay down until you can get in there to tell it to stop booting the mangled kernel.

2 points

Sounds like they executed their plans just fine.

And due diligence is “the investigation or exercise of care that a reasonable business or person is normally expected to take before entering into an agreement or contract with another party or an act with a certain standard of care”. Having BC/DR plans isn’t part of due diligence.

2 points

If I were in charge I wouldn’t put anything critical on Windows. Not only is it total garbage from a security standpoint, it’s also garbage from a stability standpoint. It’s always had these sorts of problems and it always will, because Microsoft absolutely refuses to break backwards compatibility, and that’s precisely what they’d have to do in order to move forward into the realm of a “modern OS”. Things like NTFS and the way file locking works would need to go. Everything being executable by default would need to end, and so, so much more low-level stuff would have to change that it would break just about everything.

Aside about stability: you just cannot keep Windows up and running for long before you have to reboot, due to the way file locking works (nearly all updates can’t apply until the process owning the locked files “lets go”, as it were, and that process usually involves kernel stuff… due to security hacks they’ve added on since WinNT 3.5, LOL). You can’t make it immutable. You can’t lock it down in any effective way without disabling your ability to monitor it properly (e.g. with EDR tools). It just wasn’t made for that… It’s a desktop operating system, meant for ONE user using it at a time (and one main application/service, really). Trying to turn it into a server that runs many processes simultaneously under different security contexts is just not what it was meant to do. The only reason that kinda sorta works is hacks upon hacks upon hacks, and very careful engineering around a seemingly endless array of stupid limitations that are a core part of the OS.

6 points

Please go read up on how this error happened.

This is not a backwards compatibility thing, or on Microsoft at all, despite the flaws you accurately point out. For that matter, the entire architecture of modern PCs is a weird hodgepodge of new systems tacked onto older ones.

  1. CrowdStrike’s signed driver was set to load at boot (edit: by CrowdStrike).
  2. CrowdStrike’s signed driver was running unsigned code at the kernel level, and it crashed. It crashed because the code was trying to read a pointer from the corrupt file data, and it had no protection at all against a bad file.

Just to reiterate: It loaded up a file and read from it at the kernel level without any checks that the file was valid.

  3. As it should, Windows treats any crash at the kernel level as a critical issue and bluescreens the system to protect it.

The entire fix is to boot into safe mode and delete the corrupt update file CrowdStrike sent.
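
For reference, the published fix really is that small. Here’s a minimal Python sketch of the deletion step, assuming the channel-file path and `C-00000291*` filename pattern from CrowdStrike’s remediation guidance; in practice most admins did this by hand in Safe Mode or WinRE rather than with a script:

```python
import glob
import os

# Path and filename pattern per CrowdStrike's remediation guidance.
# Must be run from Safe Mode / the Windows Recovery Environment,
# with administrator rights.
CHANNEL_FILE_GLOB = r"C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"

def remove_bad_channel_files():
    removed = []
    for path in glob.glob(CHANNEL_FILE_GLOB):
        os.remove(path)  # delete the corrupt channel file
        removed.append(path)
    return removed

if __name__ == "__main__":
    for path in remove_bad_channel_files():
        print(f"deleted {path}")
```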

3 points

I enjoy hating on Windows as much as the next guy who installed Linux on their laptop once, but the bottom line is 90 percent of businesses use it because it does work.

Blaming the people who made the decision to purchase arguably the most popular EDR solution on the planet and use it (those bastards!) does nothing but show a lack of understanding of how business-related IT decisions work.

1 point

Alternatively, they could have taken CrowdStrike’s offer of layered rollouts, but Delta declined this and wanted all updates delivered immediately to all devices.

3 points

CrowdStrike offers layered rollouts, but some executive declined them because they wanted the most up-to-date software at all times.

3 points

Not for the rapid update that broke everything.

See the post-incident report:

How Do We Prevent This From Happening Again?

Software Resiliency and Testing

  • Improve Rapid Response Content testing by using testing types such as:
      • Local developer testing
      • Content update and rollback testing
      • Stress testing, fuzzing and fault injection
      • Stability testing
      • Content interface testing
  • Add additional validation checks to the Content Validator for Rapid Response Content.
      • A new check is in process to guard against this type of problematic content from being deployed in the future.
  • Enhance existing error handling in the Content Interpreter.

Rapid Response Content Deployment

  • Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

  • Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.

  • Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.

  • Provide content update details via release notes, which customers can subscribe to.

Source: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
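
As a rough illustration of that last set of bullet points, a customer-side policy for Rapid Response Content might look something like the sketch below. To be clear, this is a hypothetical shape, not CrowdStrike’s actual API or config format:

```python
# Hypothetical policy object modeling the controls CrowdStrike describes:
# staggered delivery, per-group delays, and opt-in release notes.
rapid_response_policy = {
    "deployment": "staggered",            # vs. "immediate"
    "rings": [
        {"group": "it-test-lab",      "delay_hours": 0},
        {"group": "non-critical",     "delay_hours": 24},
        {"group": "mission-critical", "delay_hours": 72},
    ],
    "halt_on_crash_rate": 0.001,          # stop the rollout on bad telemetry
    "release_notes_subscription": True,
}
```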

62 points

Pretty sure their software’s legal agreement, and the corresponding enterprise legal agreement, already cover this.

The update was the first domino, but the real issue was the disarray of Delta’s IT operations and their inability to adequately recover in a timely fashion. Sounds like a customer skimping on their lifecycle and capacity planning so that Ed can get a slightly bigger bonus for meeting his budget numbers.

21 points

Couldn’t agree more.

And now that this occurred, and cost $500m, perhaps finally some enterprise companies may actually resource IT departments better and allow them to do their work. But who am I kidding, that’s never going to happen if it hits bonuses and dividends :(

2 points

Fucking lol.

10 points

We just lost 500 million - we can’t afford that right now! /s

4 points

According to the headhunters who are constantly trying to recruit me for inappropriate jobs, it’s starting to get traction with companies: they’re starting to actually hire fully skilled IT departments, as opposed to the ones merely willing to work for near minimum wage, which is what they had before.

In some ways it won’t really make a difference, because fully staffed IT departments also need to be listened to by management, and that doesn’t happen often in corporate environments. But still, they’ll pay the big bucks, so that’s good enough for me.

7 points

I wasn’t affected by this at all and only followed it through the news and memes, but I thought this was something that needed hands-on-keyboard work to fix, which I could see being the fault of IT if they had stopped planning for issues that couldn’t be handled remotely.

Was there some kind of automated way to fix all the machines remotely? Is there a way Delta could have gotten things working faster? I’m genuinely curious because this is one of those Windows things that I’m too Macintosh to understand.

17 points

All the servers and infrastructure should have “lights out management”. I can turn on a server, reconfigure the BIOS, and install Windows from scratch from the other side of the world.

Potentially all the workstations / endpoint devices would need to be repaired by hand, though.

The initial day or two I’ll happily blame on CrowdStrike. After that, it’s on their IT department for not having good DR plans.
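
For anyone unfamiliar, “lights out management” means a dedicated management controller (a BMC) that stays reachable even when the OS is dead. Here’s a minimal sketch of that kind of out-of-band control using the standard `ipmitool` CLI, with the hostname and credentials as placeholders:

```python
import subprocess

def ipmi(host, user, password, *args):
    """Run an ipmitool command against a server's BMC over the network."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# Example: make a dead server PXE-boot into a reinstall image and
# power-cycle it, with nobody physically touching the machine.
ipmi("bmc.example.com", "admin", "secret", "chassis", "bootdev", "pxe")
ipmi("bmc.example.com", "admin", "secret", "chassis", "power", "cycle")
```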

3 points

Hell I just did that with what’s effectively a black box this morning - if it’s critical, it gets done the right way or don’t bother doing it at all.

Edit: Bonus unnecessary word

2 points

There was no easy automated way if the systems were encrypted, which any sane organization mandates. So yes, it did require hands-on-keyboard work. But all the other airlines were up and running much faster, and they all had to perform the same fix.

Basically, in macOS terms: the OS fails to boot, so every system drops into recovery, and you need to manually enter the recovery lock and encryption password on every system to delete a file out of /System (which isn’t allowed in macOS because it’s read-only, but just go with it) before it will boot back into macOS. Hope you had those recorded/managed/backed up somewhere, otherwise it’s a complete system reinstall…

So yeah, not fun for anyone involved.
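
That “hope you had those recorded” part is what key escrow is for. Here’s a small sketch using `manage-bde`, Windows’ built-in BitLocker CLI, to dump a volume’s protectors (including the numerical recovery password) so they can be stored centrally before disaster strikes; it has to run as Administrator on each host:

```python
import subprocess

def get_bitlocker_recovery_info(volume="C:"):
    """Return the BitLocker protectors for a volume, including the
    numerical recovery password, as printed by manage-bde."""
    result = subprocess.run(
        ["manage-bde", "-protectors", "-get", volume],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Escrow this output somewhere central (AD, MDM, a vault) ahead of time.
    print(get_bitlocker_recovery_info("C:"))
```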

32 points

Negligence can make contracts a little less permanent.

8 points

Delta was the only airline to suffer a long outage. That’s why I say CrowdStrike is the kickoff, but the poor, drawn-out response and time to resolve it is totally on Delta.

5 points

Idk, CrowdStrike had a few screwups in their pocket before this one. They might be on the hook for costs associated with an outage caused by negligence. I’m not a lawyer, but I do stand next to one in the elevator.

23 points

I can’t wait to see CrowdStrike get liquidated over all of this. Microsoft is getting so much flak when this straight up wasn’t their fault.

1 point

CrowdStrike wouldn’t have a business model if the security of Microsoft Windows wasn’t so awful. Microsoft isn’t directly to blame for this, but they’re not blameless either.

0 points

Windows Defender for enterprise is a strong competitor in that market, and any CISO that went with CrowdStrike did it because the CrowdStrike sales team hosts really great lunches and sponsors lots of sports teams.

0 points

Why would they be liquidated?

1 point

Inability to pay the settlements on the inevitable lawsuits that will be coming their way for halting the world economy for a day

0 points

I’m sure their Terms of Service make it clear they have limited liability or need to go to arbitration.

8 points

Their stock is up 44% since July 2023; they might be fine.

3 points

Pure gambling

1 point

Lawsuits haven’t started yet; it’s too soon. Companies affected by the outage are still running numbers to see HOW affected they were.

12 points

The “reboot up to 15 times” workaround and the like are a bit on their side. But in general I agree: CrowdStrike and the industries that need that kind of service should know better.
