140 points

If a single person can make the system fail then the system has already failed.

19 points

If a single outsider can make your system fail, then it's already failed.

Now consider that in the context of supply-chain tuberculosis like npm.

10 points

Left-pad, right?
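For context: this refers to the 2016 incident where the left-pad package was unpublished from npm and builds broke across the ecosystem, because a huge dependency tree hinged on one tiny outside package. The original was JavaScript and roughly this much code (a TypeScript rendering from memory, not the verbatim package):

```typescript
// Roughly what the infamous left-pad npm package did: pad a string on the
// left with a fill character until it reaches the requested length.
function leftPad(str: string | number, len: number, ch: string = " "): string {
  let out = String(str);
  while (out.length < len) {
    out = ch + out;
  }
  return out;
}

console.log(leftPad(42, 5, "0")); // "00042"
```

When those few lines vanished, installs started failing worldwide for anything that depended on them, directly or transitively.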

132 points

Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.

21 points

Hey, why not just ask Dave Plummer, former Windows developer…

https://youtube.com/watch?v=wAzEJxOo1ts

Anywhere from 8.5 million to over a billion systems went down; the numbers I've read so far vary significantly. Still, that's way too much failure for a single borked update to a kernel-level driver, one not even made by Microsoft.

7 points

That's a huge sign that their rollout process is garbage.
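We don't know what CrowdStrike's pipeline actually looks like internally, but the gate everyone expected is something like a staged (canary) rollout: push to a tiny ring of machines, watch crash telemetry, and only widen if the canaries stay healthy. A minimal sketch, with every name, the soak time, and the crash threshold invented for illustration:

```typescript
// Hypothetical sketch of a staged rollout gate. Ring membership, the soak
// time, and the 1% crash threshold are all made-up numbers.
interface Ring {
  name: string;
  hosts: string[];
}

const rings: Ring[] = [
  { name: "canary", hosts: ["host-001", "host-002"] }, // a tiny slice of the fleet
  { name: "early", hosts: ["host-003", "host-004"] },  // opt-in early adopters
  { name: "broad", hosts: [] },                        // everyone else
];

async function rollout(
  update: Uint8Array,
  deploy: (host: string, update: Uint8Array) => Promise<void>,
  crashRate: (hosts: string[]) => Promise<number>,
): Promise<void> {
  for (const ring of rings) {
    await Promise.all(ring.hosts.map((host) => deploy(host, update)));

    // Let crash telemetry accumulate before judging the ring.
    await new Promise((resolve) => setTimeout(resolve, 60 * 60 * 1000)); // 1h soak

    if ((await crashRate(ring.hosts)) > 0.01) {
      throw new Error(`Halting rollout: crash spike in ring "${ring.name}"`);
    }
  }
}
```

A globally simultaneous push to kernel-level software skips every one of those gates at once.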

19 points

That means spending time and money on developing such a system, which means increasing costs in the short term… which is kryptonite for current-day CEOs.

8 points

Right. More than money, I'd say it's about incentives. You might change the entire C-suite, management, and engineering teams, but if the incentives remain the same (e.g. developers are evaluated by number of commits), the new staff is bound to make the same mistakes.

18 points

I strongly believe in no-blame mindsets, but "blame" is not the same as "consequences", and a lack of consequences is definitely the biggest driver of corporate apathy. Every incident should trigger a review of systemic and process failures, but in my experience corporate leadership either sucks at this, does not care, or will bury suggestions that involve spending man-hours on a complex solution when the problem lives in that "low likelihood, big impact" corner.
Because when the problem happens (again), they'll likely be able to sweep it under the rug (again), or they'll have moved on to greener pastures.

What the author of the article suggests is actually a potential fix: if developers (in a broad sense of the word, including POs and such) were accountable (both responsible and empowered), then they would have the power to say no to shortsighted management decisions (and/or deflect blame in a way that would actually stick to whoever went against an engineer's recommendation).

3 points

Edit: see my response below; I realised the comment was about engineering accountability, which I 100% agree with. I'm leaving my original post untouched aside from a typo that was annoying me.

Coming from a reliability POV, I respectfully disagree: blaming a person won't address the culture or processes that enabled them to make the mistake. With the exception of malice or negligence, no one does something like this in a vacuum; insufficient or incorrect training, unreasonable pressure, poorly designed processes, and a culture that enables failure-prone actions all contribute.

An example I recall from when I worked in manufacturing: an operator ran a piece of equipment that joins pieces together in manual rather than automatic mode, failed to return a ready flag, and caused a line stop. The operator did something outside of process and caused an issue; clear cut, right? Send them home? No: that was a symptom, not a cause. The operator ran in manual because the auto cycle time was borderline, causing line stops, especially on the material being run. They were also using manual because some location sensors had issues with that material, on top of incoming quality problems, so running manually, while not standard procedure, was a workaround for processing issues. We also found that, culturally, a lot of the operators did not trust the auto cycles and would often override them. The operator was simply unlucky. If we had just put all the "accountability" on them, we'd never have started projects to improve reliability at that location, or changed the automation to flip that forgotten flag automatically whenever conditions were met.

Accountability is important, but it needs to be applied where appropriate: if someone is being negligent or malicious, yes, there are consequences, but it's limiting to focus only on that. You can give the devs accountability for any failure so they're "empowered", but if your culture doesn't enable them to say no, or make them feel comfortable doing so, you're not doing anything that will actually prevent an issue in the future.

Besides, I'd almost consider it a PPE control, and those sit at the bottom of the hierarchy of controls, with administrative controls just above them (yes, I'm applying OH&S thinking to software, because risk is risk conceptually). Automated tests, multi-phase approvals, and the like are all better controls than relying on a single developer saying no.
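To make the hierarchy-of-controls point concrete: a multi-phase approval is an administrative control that can be enforced mechanically instead of resting on one person's nerve. A toy sketch (entirely hypothetical; real systems enforce this in CI/CD or review tooling):

```typescript
// Toy sketch of a multi-phase approval control: a release only proceeds once
// distinct reviewers have signed off and every mandated role is covered.
type Role = "engineering" | "qa" | "release-management";

interface Approval {
  reviewer: string;
  role: Role;
}

function canRelease(approvals: Approval[], requiredRoles: Role[]): boolean {
  const reviewers = new Set(approvals.map((a) => a.reviewer));
  const roles = new Set(approvals.map((a) => a.role));
  // Distinct humans, and every required role represented.
  return reviewers.size >= requiredRoles.length &&
    requiredRoles.every((role) => roles.has(role));
}

// One engineering sign-off alone is not enough:
console.log(canRelease(
  [{ reviewer: "alice", role: "engineering" }],
  ["engineering", "qa"],
)); // false
```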

6 points

Oh, I was talking in the context of my specialty, software engineering. The main difference between an engineer and an operator is that one designs processes while the other executes them. Negligence/malice aside, the operator is never to blame.

If the dev is “the guy who presses the ‘go live’ button” then he’s an operator. But what is generally being discussed is all the engineering (or lack thereof) around that “go live” button.

As a software engineer, I get queasy when it is conceivable that even a noncritical component reaches production without the build artifact being thoroughly tested (with CI tests AND real usage in lower environments).
The fact that CrowdStrike even had a button that could push a DOA update to such a highly critical component points to processes so far outside industry standards that no software engineer would have signed off on them… if software engineers actually had the same accountability as civil engineers. If a bridge gets built outside the specifications of the civil engineer who signed off on the plans, and that bridge crumbles, someone is getting their tits sued off. Yet there is no equivalent accountability in software engineering (except perhaps in super safety-critical stuff like automotive/medical/aerospace/defense applications, and even there I think we'd be surprised).
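The usual shape of that engineering is artifact promotion: the exact build that passed CI and soaked in lower environments is the only thing allowed into production. A hypothetical sketch (generic stage names, not CrowdStrike's pipeline):

```typescript
// Hypothetical sketch of artifact promotion: the exact artifact (by checksum)
// that passed CI and soaked in lower environments is the only thing that can
// reach production.
type Environment = "ci" | "dev" | "staging" | "production";

const order: Environment[] = ["ci", "dev", "staging", "production"];

interface Artifact {
  checksum: string;               // identity of the built artifact
  promotedThrough: Environment[]; // environments it has already survived
}

function promote(artifact: Artifact, target: Environment): Artifact {
  const next = order[artifact.promotedThrough.length];
  if (next === undefined || next !== target) {
    throw new Error(`Cannot promote to "${target}"; next allowed stage is "${next}"`);
  }
  return { ...artifact, promotedThrough: [...artifact.promotedThrough, target] };
}

// A "go live" button calling promote(artifact, "production") throws unless
// this same checksum already passed ci, dev, and staging.
```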

-26 points

If you were a developer who knew you were responsible for ring-zero code, massively deployed across corporate systems around the world, then you should goddamned properly test an update before deploying it.

This isn't a simple glitch like a calculation rounding error or some shit; the programmers of any ring-zero code should be held fully responsible for not properly reviewing and testing the code before deploying an update.

Edit: Why not just ask Dave Plummer, former Windows developer…

https://youtube.com/watch?v=wAzEJxOo1ts

31 points

If your system depends on a human never making a mistake, your system is shit.

It's not by chance that, for example, accountants have always had something they call reconciliation, where transaction data entered from invoices and the like gets cross-checked against something produced differently, for example bank account transactions. Their system is designed with the expectation that humans make mistakes, hence there's a cross-check process to catch those mistakes.

Clearly CrowdStrike did not have a secondary part of the process designed to validate what the primary part produces (in software development that would usually be integration testing), so their process was shit.
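In software terms, that secondary check can be as blunt as independently re-validating whatever the primary process produced before it ships. A minimal sketch, with the file format (a four-byte record count followed by records) invented for illustration:

```typescript
// Minimal sketch of a secondary validation pass over whatever the primary
// build process produced, before it ships. The format is made up here.
declare function countRecords(body: Uint8Array): number; // record parser elided

function validateUpdateFile(contents: Uint8Array): void {
  if (contents.length < 4) {
    throw new Error("Update file truncated");
  }
  if (contents.every((b) => b === 0)) {
    throw new Error("Update file contains only null bytes");
  }
  // Reconciliation, software style: a number derived one way (the header's
  // declared record count) must agree with the same number derived another
  // way (actually parsing and counting the records).
  const view = new DataView(contents.buffer, contents.byteOffset, contents.byteLength);
  const declared = view.getUint32(0, true);
  const actual = countRecords(contents.subarray(4));
  if (declared !== actual) {
    throw new Error(`Header declares ${declared} records, file contains ${actual}`);
  }
}
```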

Blaming the human who made a mistake for, essentially, being human and hence making mistakes, rather than blaming the process around him or her that wasn't designed to catch human failure and stop it from having nasty consequences, is the kind of simplistic, ignorant "logic" that only somebody who has never made anything that has to be reliable could hold.

My bet, from decades of working in the industry, is that some higher-up at CrowdStrike didn't want to pay for the manpower needed for the secondary process that checks the primary one before pushing stuff out to production, because "it's never needed". Then the one time it was needed, it wasn't there, things really blew up massively, and here we are today.

-1 points

Indeed, I fully agree. They obviously neglected testing before deployment. So you can split the blame between the developer who goofed on the null-pointer dereference and the blank null file, and the higher-ups who apparently decided that proper testing before deployment wasn't necessary.
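Taking that description at face value, here's the shape of the bug in a TypeScript analogue; the real code is C++ running in kernel mode, where the same mistake takes down the whole machine rather than one process:

```typescript
// Analogue of the described failure mode: parse a definition file and use
// the result without checking that parsing actually succeeded.
interface Definition {
  signatures: string[];
}

function parseDefinition(raw: Uint8Array): Definition | null {
  if (raw.length === 0 || raw.every((b) => b === 0)) {
    return null; // blank or all-null file: nothing to parse
  }
  return JSON.parse(new TextDecoder().decode(raw)) as Definition;
}

function loadUnchecked(raw: Uint8Array): number {
  // The "!" asserts the result is non-null; on a blank file this line
  // throws "Cannot read properties of null" - the null-deref analogue.
  return parseDefinition(raw)!.signatures.length;
}

function loadChecked(raw: Uint8Array): number {
  const def = parseDefinition(raw);
  if (def === null) {
    // Fail in a controlled way instead of crashing mid-dereference.
    throw new Error("Rejecting blank/malformed definition file");
  }
  return def.signatures.length;
}
```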

Ultimately, it still boils down to human error.

1 point
Deleted by creator
1 point

Watch the video from Dave Plummer that I linked in my edit; he explains it rather well. The driver itself was signed; it was the rolling definition-file updates from CrowdStrike that were unsigned.
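Extending signing to the definition files would look roughly like this with Node's built-in crypto (Ed25519 chosen arbitrarily, key generation and distribution elided; a generic sketch, not CrowdStrike's actual scheme):

```typescript
// Generic sketch: verify a detached Ed25519 signature on a definition file
// before loading it, using Node's built-in crypto.
import { verify, type KeyObject } from "node:crypto";

function loadDefinitionFile(
  contents: Buffer,
  signature: Buffer,
  publisherKey: KeyObject,
): Buffer {
  // For Ed25519, the digest algorithm argument must be null.
  if (!verify(null, contents, publisherKey, signature)) {
    throw new Error("Definition file signature invalid; refusing to load");
  }
  return contents; // safe to hand to the parser only after verification
}
```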

83 points

Note: Dmitry Kudryavtsev is the article's author, and he argues that the real blame should go to the CrowdStrike CEO and other higher-ups.

24 points

Edited the title to have a "by" in front to make that a bit clearer.

51 points

Sure, it is the dev who is to blame, and not the clueless managers who evaluate devs based on the number of commits/reviews per day, or the CEOs who think such managers are on top of their game.

6 points

Is that the case at CrowdStrike?

12 points

I don't have any information on that; this was more of a criticism of where the world seems to be heading.

11 points
Removed by mod
49 points

CrowdStrike ToS, section 8.6 Disclaimer

[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]

It's about safety, but it's truly ironic how it mentions aircraft-related systems twice, as well as communication systems (a very broad category).

It certainly doesn't inspire confidence in the overall stability. But it's also generic ToS-speak, and may only seem noteworthy now, after the fact.

10 points

Weren’t the issues at airports because of the ticketing and scheduling systems going down, not anything with aircraft?

2 points

Yes, I think so.

7 points

That's just covering themselves, like a disclaimer saying your software is only intended to be used on the 29ᵗʰ of February. You don't expect anyone to actually follow that rule, but you expect the court to rule that the user was at fault.

Luckily, it doesn't always work that way, but we'll see how it turns out this time.

2 points

Lawful Masses with Leonard French (a copyright attorney) covered this yesterday; he starts the video with the opinion that the ToS wouldn't protect CrowdStrike.

1 point

I'm pretty sure that if a client pays for use in any of those, they'll shut up and take the money. Pretty ethical.

-2 points
Deleted by creator
