• Delta Air Lines CEO Ed Bastian said the massive IT outage earlier this month that stranded thousands of customers will cost it $500 million.
• The airline canceled more than 4,000 flights in the wake of the outage, which was caused by a botched CrowdStrike software update and took thousands of Microsoft systems around the world offline.
• Bastian, speaking from Paris, told CNBC’s “Squawk Box” on Wednesday that the carrier would seek damages from the disruptions, adding, “We have no choice.”
1 point

General practices aside, should they really not plan any backup system, though? CrowdStrike did not cause $500 million in damages to Delta; Delta’s disaster recovery response did.

Where we draw the line there, though, I’m not sure. If you set my house on fire but the fire department just stands outside and watches it burn for no reason, who should I be upset with?

1 point

Well, in your example you should be mad at yourself for not having a backup house. 😛

There are a lot of assumptions underpinning the statements about their backup systems. Namely, that they didn’t have any.
Most outage backups focus on datacenter availability, network availability, and server availability.
If your service needs one server to function, having six servers spread across two data centers, each with at least two ISPs, is cautious but prudent. Particularly if you’re set up to do rolling updates, so only one server should ever be “different” at a time, leaving you with a redundant copy at each location no matter what.
This goes wrong if someone magically breaks every redundant server at the same time. The underlying assumption around resiliency planning is that random failure is probabilistic in nature, and so by quantifying your failure points and their failure probability you can tune your likelihood of an outage to be arbitrarily low (but never zero).
If your failure isn’t random, like a vendor bypassing your update and deployment controls, then that model fails.
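To put rough numbers on that, here’s a minimal sketch of the model, not anything Delta actually uses. The function name, the per-server failure chance, and the “common-mode” probability are all made up for illustration. It shows why adding servers drives the independent part of the risk toward zero but can’t touch a failure that hits every server at once.

```python
def outage_probability(p_server: float, n_servers: int, p_common: float = 0.0) -> float:
    """Chance that every redundant server is down at the same time.

    p_server:  chance one server is down from an independent, random fault
               (hypothetical number, for illustration only)
    n_servers: how many redundant servers you run
    p_common:  chance of a common-mode event that breaks all of them at once,
               e.g. an update that bypasses your rollout controls (assumption)
    """
    # Independent faults: all n servers have to fail on their own.
    independent_all_down = p_server ** n_servers
    # Either the common-mode event happens, or it doesn't and the
    # independent faults still line up.
    return p_common + (1.0 - p_common) * independent_all_down


if __name__ == "__main__":
    for n in (1, 2, 6):
        print(n, outage_probability(0.01, n), outage_probability(0.01, n, p_common=0.001))
```

With `p_common` at zero, going from one server to six makes the outage chance vanishingly small. With any non-zero `p_common`, the result never drops below `p_common`, no matter how many servers you add.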

A second point: an airline uses computers that aren’t servers, and requires them for operations. The ticketing agents, the gate crew that manages seating and boarding, the ground crew that needs to manage routine inspection reports, the baggage handlers that put bags on the right cart to get them to the right plane, and the office workers who handle things like making sure fuel is paid for and that crews are ready when their plane shows up: all the stuff that goes into being an airline that isn’t actually flying planes.
All these people need computers, and you don’t typically issue someone a redundant laptop or desktop computer. You rely on hardware failures being random, and hire enough IT staff to manage repairs and replacement at that expected cadence, with enough staff and backup hardware to keep things running as things break.
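As a back-of-the-envelope sketch of that “expected cadence” planning: if failures are random and independent, the count in a given window is roughly Poisson, and you can size your spare pool from it. The fleet size and failure rate below are invented numbers, and both function names are hypothetical.

```python
import math


def expected_failures(fleet_size: int, annual_failure_rate: float, days: float) -> float:
    """Average number of machines expected to fail in a window of `days`."""
    return fleet_size * annual_failure_rate * days / 365.0


def spares_needed(mean_failures: float, confidence: float = 0.99) -> int:
    """Smallest k such that P(Poisson(mean_failures) <= k) >= confidence."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if not 0.0 <= mean_failures <= 700.0:
        raise ValueError("mean_failures out of range for this simple sketch")
    term = math.exp(-mean_failures)  # P(X = 0)
    cumulative = term
    k = 0
    while cumulative < confidence:
        k += 1
        term *= mean_failures / k  # P(X = k) from P(X = k - 1)
        cumulative += term
    return k


if __name__ == "__main__":
    # Hypothetical: 1,000 laptops, 5% fail per year, weekly restock.
    mean = expected_failures(1000, 0.05, 7)
    print(round(mean, 2), spares_needed(mean))
```

This only works while failures stay independent. A single event that takes out a large share of the fleet in one morning is far outside anything this kind of sizing plans for.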

Finally, if what you know is “computers are turning off and not coming back online”, your IT staff is swamped, systems are variously down or degraded, and staff in a bunch of different places are reporting that they can’t do their jobs, then your system is in an uncertain and unstable position. This is not where you want a system with strict safety requirements to be, and so the only responsible action is to halt operations, even if things start to recover, until you know what’s happening, why, and that it won’t happen again.

As more details have come out about the issues that Delta is having, it appears that it’s less about system resiliency, although needing to manually fix a bunch of servers was a problem, and more that the scale of flight and crew availability changes overloaded the crew scheduling system, making it difficult to get people and planes in the right place at the right time.
While the application should be able to more gracefully handle extremely high loads, that’s a much smaller failure of planning than not having a disaster recovery or redundancy plan.

So it’s more like I built a house with a sprinkler system, and then you blew it up with explosives. As the fire department and I piece it back together, my mailbox fills with mail and tips over into a creek, so I miss paying my taxes and need to pay a penalty.
I shouldn’t have had a crap mailbox, but it wouldn’t have been a problem if you hadn’t destroyed my house.

1 point

First thank you for taking the time to type all of that out.

I think I follow your theory well enough but (I know this is 2 weeks later so I won’t look up any new information) I was under the impression Delta was an outlier in their response compared to other airlines.

And one point about redundancies. Why shouldn’t they consider a single operating system as a single failure point? If all 6 servers in the multiple locations run Windows, and Windows fails, that’s awful, right? Can they not dual boot or have a second set of servers? I do this in my own home, but maybe that’s not something that scales well.

I’m interested if your opinion has changed now that there has been a bit of time to have some more data come out on it.

1 point

You are correct that Delta was an outlier, but it wasn’t with regard to the scale of the outage. It was that their scheduling software was down far longer and they handled a lot of the customer side of things significantly less well.

Generally, your protection against operating system issues is the aforementioned restriction on changes and how they go out.
If something is stable, you can expect it to remain stable unless something changes or random chance breaks something.
The operational cost of running multiple operating systems in production like you describe would be high. Typically software is only written to work on one platform, and while it can be modified to work on others, it’s usually a cost with no benefit outside of a consumer environment.
Different operating systems have different performance characteristics you need to factor in for load scaling, different security models, and different maintenance requirements.
Often, but not always, server administrators will focus on one OS, so adding more to the mix can mean people are rusty with whichever is your backup, which can be worse than just focusing on fixing the issue with the primary.
OS bugs are rare, and they usually manifest early or randomly. It’s why production deployments tend to use the OS as long as it’s supported: change means learning the new issues and you’ve probably already encountered all the bullshit with what you’re currently using. That’s why the Linux distros tend to have long term support versions, and Windows Server tends to just get support for a long time with terrible documentation.
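For what “restriction on changes and how they go out” looks like in practice, here’s a rough sketch of a one-at-a-time rolling update. Everything in it is hypothetical: `apply_update` and `is_healthy` stand in for whatever your deployment tooling actually provides.

```python
from typing import Callable, Iterable, List


def rolling_update(
    servers: Iterable[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update servers one at a time, halting at the first unhealthy one.

    Returns the servers that were updated and passed their health check.
    """
    updated: List[str] = []
    for server in servers:
        apply_update(server)
        if not is_healthy(server):
            # Stop here: at most one server is in a bad state,
            # and everything after it is still on the old version.
            break
        updated.append(server)
    return updated


if __name__ == "__main__":
    touched: List[str] = []
    done = rolling_update(["a", "b", "c"], touched.append, lambda s: s != "b")
    print(done, touched)
```

The protection comes entirely from every change going through this gate. A vendor update that lands on all machines at once skips the loop, so the health check never gets a chance to stop it.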

I’m a Linux guy, so defending Windows feels weird, and I want to include that I don’t think anyone should use it, particularly for a server, but the professional in me acknowledges that it’s a perfectly functional hammer.

As we’ve learned more, I’ve become more disparaging of Delta’s choices not to keep the scheduling system modernized in a way that could recover faster, and not to invest enough in making systems homogeneous across different airports. I still think that these issues are largely independent of their actual disaster recovery or resiliency plans.
Inevitably, the lawsuits will determine that the blame for the damage is split between the two of them. My bet is 70/30 CrowdStrike/Delta, since Delta can easily demonstrate that the issue was fundamentally caused by CrowdStrike and negatively impacted other airlines and businesses in general. Some of it was clearly Delta’s fault for failing to keep a system modernized enough to handle a massive shift like this; that system would have been similarly disrupted by any outage with flight cancellations.


News

!news@lemmy.world
