Is the conclusion of your root cause analysis "human error"?
Companies get hacked all the time. After the incident is more or less over, people start looking for answers to the question "why did this happen?". The culprit is often deemed to be "human error". In this post we show how using the 5 Whys technique to drill down to the root causes of that human error can vastly improve the lessons learned from an incident. We want to treat the disease, not the symptoms!
The 5 Whys technique helps us treat the disease leading to a vulnerability instead of merely treating the symptom - the vulnerability itself.
The intention of the root cause analysis (RCA) matters. Such undertakings may have different purposes:
1. Learning: understanding why the incident could happen, so that similar incidents can be prevented in the future
2. Blame: finding a person or group to hold responsible for the incident
3. Appearance: producing a conclusion that satisfies management, lawyers, insurers or the media
Obviously, purpose #1 is the ideal one, but other, less admirable purposes often seep into the thinking around the RCA. In such cases the investigation often stops when the seemingly inevitable conclusion has been reached: "it was due to human error, nothing we could have done about that". Sometimes fingers are even pointed at someone: "it was the intern's bad password that caused it all". This is not very nice, nor is it very constructive.
A recent data breach in Norway was quickly presented in the media as "human error". The company Norkart, which sells GIS systems to the public sector, suffered a data breach in which personal data of more than 3 million Norwegians was likely stolen by a threat actor. According to news reports (in Norwegian), the attackers used an open port, and this port was open due to "human error". Norkart has not blamed any individuals (in public, at least) for this error, has told the media it is reviewing its practices, and has stressed the importance of getting help early and reporting serious incidents to the police.
In the 2020 SolarWinds attack, which led to more than 18,000 organizations being compromised - including the US government - the CEO blamed an intern for using and leaking a weak password in 2017 as the "cause of the incident" (The Hacker News). That blame game has received a lot of heat from security experts, and rightly so. Obviously it was not a good idea to use a weak password on a critical server, or to leak it in a private GitHub repository, but why did this happen? Why was it not discovered and stopped? Why were there no other security mechanisms stopping the threat actor from gaining access to a critical resource through a "single human error"?
That explanation violates the most important principle of barrier management: no single error should lead directly to unacceptable consequences.
What influences the probability of "human errors"?
Since it is so common to conclude that any undesired event was caused by "human error", it is a good idea to ask why we make errors. And, not surprisingly, there's been a lot of research into this topic. Smart people from fields spanning psychology, management, engineering, warfare, sociology, and probably many others have studied and written about human error and its causes and preconditions. This means that there is a lot of knowledge available to us about human errors that we can use to improve our understanding of cyber incidents.
A lot of effort has gone into understanding poor decisions in industrial control rooms. This type of analysis is often called human reliability analysis. There are many methodologies, ranging from qualitative approaches to more quantitative methods that try to pinpoint failure rates for certain decision types. These theories often describe factors that influence our decisions, whether those decisions are "quick and intuitive" or "based on in-depth analysis". One widely accepted methodology is the SPAR-H human reliability analysis method, created and published by Idaho National Laboratory. The key insight to bring with us is that security decisions are heavily influenced by our human strengths and weaknesses, just like everything else in our lives!
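To make this concrete: SPAR-H estimates a human error probability (HEP) by starting from a nominal error rate and multiplying it with factors for so-called performance shaping factors (PSFs), such as available time, stress, and training. Here is a minimal sketch of that idea in Python. The nominal rate matches SPAR-H's value for action tasks, but the multiplier values are simplified illustrations, not the full SPAR-H tables.

```python
# Minimal sketch of the SPAR-H idea: a nominal human error probability (HEP)
# is adjusted by multipliers for performance shaping factors (PSFs).
# Multiplier values below are illustrative, not the official SPAR-H tables.

NOMINAL_HEP = 0.001  # SPAR-H nominal HEP for action tasks

# PSF multipliers for a hypothetical, stressed-out engineer:
psf_multipliers = {
    "available_time": 10,      # barely enough time
    "stress": 2,               # high stress
    "complexity": 2,           # moderately complex task
    "experience_training": 3,  # little relevant training
    "procedures": 5,           # no written procedure available
    "fitness_for_duty": 1,     # nominal (well rested, healthy)
}

def adjusted_hep(nominal: float, multipliers: dict) -> float:
    """Multiply the nominal HEP by all PSF multipliers, capping at 1.0."""
    hep = nominal
    for multiplier in multipliers.values():
        hep *= multiplier
    return min(hep, 1.0)

if __name__ == "__main__":
    hep = adjusted_hep(NOMINAL_HEP, psf_multipliers)
    print(f"Estimated probability of error on this task: {hep:.1%}")
    # With the example values: 0.001 * 10*2*2*3*5*1 = 0.6, i.e. 60%!
```

The point is not the exact numbers, but that factors the organization controls - time, training, procedures - can move the error probability by orders of magnitude.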
Let's first consider the case where "Johnny from accounting made a decision that contributed to the threat actor succeeding with the attack".
The first thing we can do, if we are trying to understand why this decision was made and how we can avoid the same "error" in the future, is to classify the decision. Was it intentional or unintentional? Intentional decisions may lead to harm because the decision maker intends to inflict harm; in that case we are talking about a malicious insider threat. In most cases, however, we are talking about mistakes, where the decision maker is trying to decide something but makes the wrong decision. The reason could be a lack of knowledge. Could more training help prevent this in the future? Would it be helpful to have some rules to follow, to deal with the complexity of the situation? It is easy to blame the person making the decision for the bad outcome, but it is more useful to identify a missing procedure, or a training gap that needs to be filled. By drilling down into the underlying causes of a poor decision, it becomes possible to reduce the probability of similar errors being made again.
A lot of errors are due neither to malicious intent nor to a lack of knowledge. These are unintended actions, and they can have bad consequences too. Such errors are often referred to as lapses and slips. We can make errors because we simply don't have the right problem in mind, or because we forget important things. If the complexity of the situation exceeds our cognitive capacity, this is likely to happen: there are important aspects of a problem we are not thinking about because working memory is full. One reason can be that you have less cognitive capacity than ideal, for example due to stress or lack of sleep. The situation is similar, but with different causes, when a confusing problem is faced without the right tools to help make sense of it and make good decisions.
Sometimes our attention slips, and that can easily make us overlook important things and make mistakes. Too many distractions, or a lack of fitness for duty: many factors influence our ability to focus.
All of these factors contribute to the likelihood of human error. They can also contribute to high performance, if we seek to optimize them. Here is a list of factors that are known to influence the quality of our decisions:

- Available time to understand the situation and make the decision
- Stress and stressors
- The complexity of the task or situation
- Experience and training
- Procedures and checklists to support the decision
- Ergonomics and tooling, including the human-machine interface
- Fitness for duty (sleep, health, distractions)
- Work processes and organizational conditions (staffing, workload, firefighting)
The good thing about these factors is that organizations have a great deal of influence over them. If you don't stop your RCA at the level of "it was human error", but dive into why a poor decision was made, you may uncover issues related to these factors that you can actually change. Treating the cause (no training) is better than treating the effect (bad password).
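To show how this can feed directly into follow-up actions, here is a small, purely illustrative sketch: a lookup from the factor an RCA uncovered to countermeasures that treat the cause rather than the symptom. The factor names and measures are invented examples, not an official taxonomy.

```python
# Hypothetical mapping from performance shaping factors uncovered in an RCA
# to countermeasures that treat the cause rather than the symptom.
COUNTERMEASURES = {
    "experience_training": [
        "schedule recurring, role-specific security training",
        "pair junior staff with experienced engineers on risky changes",
    ],
    "procedures": [
        "write a checklist for firewall and access changes",
        "require peer review of production configuration changes",
    ],
    "stress": [
        "reduce on-call load and overtime",
        "stop routine firefighting by fixing recurring incidents",
    ],
    "fitness_for_duty": [
        "avoid scheduling risky changes late at night",
    ],
}

def suggest(root_causes: list[str]) -> list[str]:
    """Collect countermeasures for the factors an RCA uncovered."""
    suggestions = []
    for cause in root_causes:
        suggestions.extend(COUNTERMEASURES.get(cause, [f"no mapping for '{cause}'"]))
    return suggestions

# Example: the RCA found a training gap and a missing procedure.
print("\n".join(suggest(["experience_training", "procedures"])))
```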
The case of the open port
Let's consider a case where an e-commerce company called "DonkeyCom" has been hacked (this is a made-up company and a made-up story). The initial review of logs showed that port 22 (SSH) was open on their web server, and that a brute-force attack gave the attacker access to the web server. The attacker then moved laterally and connected to the MySQL database used by the application to store data, including user data such as profiles, password hashes, and purchase histories. All data in the database was downloaded by the threat actor. The CEO says in a press release that "Unfortunately, a human error led to a port being open to the Internet, that let the hackers in and they stole all our data. There was nothing we could have done to prevent this, and we are very sorry. The Internet is a dangerous place."
Obviously that statement is wrong: there is plenty that could have been done, and in particular a defense-in-depth approach was missing. SSH with a weak password on production servers? Dumping and downloading the whole database without being noticed? Data not encrypted at rest? There is clearly plenty that could have been done!
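One concrete layer of defense is simply knowing what you expose. Here is a minimal sketch, assuming a hypothetical host inventory, of a script that checks production hosts for risky ports that should never be reachable from the Internet (only scan hosts you own or have permission to test):

```python
import socket

# Hypothetical inventory: hosts and the only ports that SHOULD be exposed.
EXPECTED_OPEN = {
    "shop.donkeycom.example": {80, 443},
}

# Ports that should never face the Internet on these hosts.
RISKY_PORTS = {22: "SSH", 3306: "MySQL", 3389: "RDP"}

def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, expected in EXPECTED_OPEN.items():
    for port, service in RISKY_PORTS.items():
        if port not in expected and is_open(host, port):
            print(f"ALERT: {service} (port {port}) is reachable on {host}")
```

Running a check like this on a schedule turns "someone forgot to close a port" from a silent error into an alert within hours.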
But let's think about how we could drill down into the root causes of the vulnerabilities the company discovered in its incident analysis: the weak password and the open port. This was all blamed on Henry, the careless engineer. One technique we can use to understand a bit more about this case is 5 Whys: for each apparent cause, we ask "why" until we think we have uncovered the real cause of an observed effect.
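Such an exchange might look like this (an invented dialogue, for our invented company):

Investigator: Why was port 22 open to the Internet?
Henry: I opened it to troubleshoot a deployment problem, and forgot to close it afterwards.
Investigator: Why was it forgotten?
Henry: We have no checklist or procedure for rolling back temporary firewall changes.
Investigator: Why is there no such procedure?
Henry: Nobody on the team has had security training, so we didn't know we needed one. There has never been time for training anyway.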
In the exchange above, the investigator drills down into why the port was open. It shows only three iterations, but you could go on: why was there no time for training? We were always stuck in firefighting, we are understaffed, we work too much overtime, and so on. We are now drilling into the performance shaping factors discussed above!
The other obvious vulnerability found above was the weak SSH password, which allowed the attacker easy access over the open port. The 5 Whys technique could perhaps unravel why a password was used instead of key-based authentication, and show that this was related to easy sharing of credentials with freelancers. The point is: there are always reasons for bad security decisions, and drilling into the details can often lead to useful insights.
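As a sketch of what a concrete follow-up check could look like, the script below flags risky authentication settings in an OpenSSH server configuration file. The flagged directives and the default path are assumptions for the example; a real hardening review would cover far more.

```python
from pathlib import Path

# Directives we consider risky if enabled (illustrative, not exhaustive).
RISKY = {
    "passwordauthentication": "yes",  # allows brute-forceable password logins
    "permitrootlogin": "yes",         # allows direct root login
}

def audit_sshd_config(path: str = "/etc/ssh/sshd_config") -> list[str]:
    """Return a list of findings for risky sshd settings."""
    findings = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if RISKY.get(key) == value:
            findings.append(f"{parts[0]} {parts[1]} - consider key-based auth only")
    return findings

if __name__ == "__main__":
    for finding in audit_sshd_config():
        print("FINDING:", finding)
```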
Obviously, not exposing the port in question to the Internet, and not using a shared and very weak SSH password, would be reasonable follow-ups. Fixing those problems, however, only treats the symptoms. Overwork and lack of training will likely lead to further vulnerabilities that can be exploited. If the organization takes a broader look at how it can reduce the burden on its technical staff, provide relevant training, and provide decision aids such as checklists and written procedures, the effect will reach much further than the vulnerabilities that happened to be exploited this time.
Summary for busy people
As a champion of any proper RCA method, I have to say that the main objective of the RCA is to uncover the mechanism of the problem, so that a countermeasure can be implemented at the most effective and efficient point(s) of that mechanism. "Human error" and a few other pitfalls are dead ends in the RCA process, because that conclusion is always... wrong. The mechanism of the problem doesn't care what the standard procedure is; it doesn't care what training was or was not held. The mechanism of the problem is never about your documentation. If you ever see "human error" in an RCA, the facilitator of the RCA made a very human error! Poor causes often lead to poor solutions: the famous "three R's" - ReTrain, ReWrite, ReCommunicate. Repeating previous efforts rarely leads to innovation, especially if you stopped the RCA too early...