Repeated mistakes
When you have an incident and do a Root Cause Analysis (RCA), the ‘What?’ and ‘How?’ are usually easy to determine. Usually. But the most important question, the one that rebuilds trust and shows that you and your teams are in control, is ‘Why?’.
You need to understand and communicate the ‘Why?’ and how it won’t happen again. Sometimes the hardest incidents to report are the ones where someone just makes a stupid mistake. We’re human.
In reality, ‘Bob’ uploaded/deleted/turned off the wrong thing. He’s mortified, and everybody on the team is poking fun and trying to make “Doing a Bob” stick.
That turns into:
“Human error caused the issue. Additional training will be provided and second-eye checks performed.”
It becomes a real problem when it’s the third time this quarter that ‘Bob’ has made a mistake…
In April, CrowdStrike released an update that wrecked some Linux systems (Debian and Rocky Linux).
In June, CrowdStrike released an update that caused CPUs on Windows machines to max out at 100% and required rebooting.
We all know what happened this month.
EDR (Endpoint Detection and Response) vendors have a lot of unfettered access to systems through their products. It’s not exactly a transparent part of the sector either. Nobody wants to turn off daily updates, but customers and boards need to know the ‘Why?’.
CrowdStrike aren’t alone in the industry, but with that recent record they had better start doing some deep and transparent explaining soon, for their own sake. Other vendors also need to learn lessons from this.