Is the conclusion of your root cause analysis "human error"?
Companies get hacked all the time. After the incident is more or less over, people start looking for answers to the question "why did this happen?". The culprit is often deemed to be "human error". In this post we show how using the 5 Whys technique to drill down to the root causes of that human error can vastly improve the lessons learned from an incident. We want to treat the disease, not the symptoms!
The 5 Whys technique helps us treat the disease leading to a vulnerability instead of merely treating the symptom - the vulnerability itself.
The intention of the root cause analysis (RCA) matters. Such undertakings may have different purposes:
1. Learning: understanding why the incident could happen, so that similar incidents can be prevented in the future
2. Blame: finding a person or group to hold responsible for the incident
3. Appearance: producing a conclusion that satisfies management, lawyers, insurers or the media
Obviously, purpose #1 is the ideal one, but other, less admirable purposes often seep into the thinking around the RCA. In such cases the investigation often stops when the seemingly inevitable conclusion has been reached: "it was due to human error, nothing we could have done about that". Sometimes fingers are even pointed at someone: "it was the intern's bad password that caused it all". This is not very nice, nor is it very constructive.
A recent data breach in Norway was quickly presented in the media as "human error". The company Norkart, which sells GIS systems to the public sector, suffered a data breach in which personal data of more than 3 million Norwegians was likely stolen by a threat actor. According to news reports (in Norwegian), the attackers used an open port, and this port was open due to "human error". Norkart has not blamed any individuals (in public, at least) for this error, has told the media it is reviewing its practices, and has stressed the importance of getting help early and reporting serious incidents to the police.
In the 2020 SolarWinds attack, which led to more than 18,000 organizations being compromised - including the US government - the CEO blamed an intern for using and leaking a weak password in 2017 as the "cause of the incident" (The Hacker News). That blame game has received a lot of heat from security experts, and rightly so. Obviously it was not a good idea to use a weak password on a critical server, or to leak it in a private GitHub repository, but why did this happen? Why was it not discovered and stopped? Why were there no other security mechanisms stopping the threat actor from gaining access to a critical resource through a "single human error"?
That explanation violates the most important principle of barrier management: no single error should lead directly to unacceptable consequences.
What influences the probability of "human errors"?
Since it is so common to conclude that any undesired event was caused by "human error", it is a good idea to ask why we make errors. And, not surprisingly, there's been a lot of research into this topic. Smart people from fields spanning psychology, management, engineering, warfare, sociology, and probably many others have studied and written about human error and its causes and preconditions. This means that there is a lot of knowledge available to us about human errors that we can use to improve our understanding of cyber incidents.
A lot of effort has gone into understanding poor decisions in industrial control rooms. This type of analysis is often called human reliability analysis. There are many methodologies, ranging from qualitative approaches to more quantitative methods that try to pinpoint failure rates for certain decision types. These theories often describe factors that influence our decisions, whether those decisions are "quick and intuitive" or "based on in-depth analysis". One widely accepted methodology is the SPAR-H human reliability analysis method, created and published by Idaho National Laboratory. The key insight to bring with us is that security decisions are heavily influenced by our human strengths and weaknesses, just like everything else in our lives!
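To make this concrete: SPAR-H estimates a human error probability (HEP) by starting from a nominal error rate and multiplying it with factors for so-called performance shaping factors (PSFs), such as available time, stress, and training. Here is a minimal sketch of that idea in Python. The nominal rate matches SPAR-H's value for action tasks, but the multiplier values are simplified illustrations, not the full SPAR-H tables.

```python
# Minimal sketch of the SPAR-H idea: a nominal human error probability (HEP)
# is adjusted by multipliers for performance shaping factors (PSFs).
# Multiplier values below are illustrative, not the official SPAR-H tables.

NOMINAL_HEP = 0.001  # SPAR-H nominal HEP for action tasks

# PSF multipliers for a hypothetical, stressed-out engineer:
psf_multipliers = {
    "available_time": 10,      # barely enough time
    "stress": 2,               # high stress
    "complexity": 2,           # moderately complex task
    "experience_training": 3,  # little relevant training
    "procedures": 5,           # no written procedure available
    "fitness_for_duty": 1,     # nominal (well rested, healthy)
}

def adjusted_hep(nominal: float, multipliers: dict) -> float:
    """Multiply the nominal HEP by all PSF multipliers, capping at 1.0."""
    hep = nominal
    for multiplier in multipliers.values():
        hep *= multiplier
    return min(hep, 1.0)

if __name__ == "__main__":
    hep = adjusted_hep(NOMINAL_HEP, psf_multipliers)
    print(f"Estimated probability of error on this task: {hep:.1%}")
    # With the example values: 0.001 * 10*2*2*3*5*1 = 0.6, i.e. 60%!
```

The point is not the exact numbers, but that factors the organization controls - time, training, procedures - can move the error probability by orders of magnitude.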
Let's first consider the case where "Johnny from accounting made a decision that contributed to the threat actor succeeding with the attack".
The first thing we can do, if we are trying to understand why this decision was made and how we can avoid the same "error" in the future, is to classify the decision. Was it intentional or unintentional? Intentional decisions may lead to harm because the decision maker intends to inflict harm; in that case we are talking about a malicious insider threat. In most cases, however, we are talking about mistakes, where the decision maker is trying to decide something but makes the wrong decision. The reason could be a lack of knowledge. Could more training help prevent this in the future? Would it be helpful to have some rules to follow, to deal with the complexity of the situation? It is easy to blame the person making the decision for the bad outcome, but it is more useful to identify a missing procedure, or a training gap that needs to be filled. By drilling down into the underlying causes of a poor decision, it becomes possible to reduce the probability of similar errors being made again.
A lot of errors are due neither to malicious intent nor to a lack of knowledge. These are unintended actions, and they can have bad consequences too. Such errors are often referred to as lapses and slips. We can make errors because we simply don't have the right problem in mind, or because we forget important things. If the complexity of the situation exceeds our cognitive capacity, this is likely to happen: there are important aspects of a problem we are not thinking about because working memory is full. One reason can be that you have less cognitive capacity than ideal, for example due to stress or lack of sleep. The situation is similar, but with different causes, when a confusing problem is faced without the right tools to help make sense of it and make good decisions.
Sometimes our attention slips, and that can easily make us overlook important things and make mistakes. Too many distractions, or a lack of fitness for duty: many factors influence our ability to focus.
All of these factors contribute to the likelihood of human error. They can also contribute to high performance, if we seek to optimize them. Here is a list of factors that are known to influence the quality of our decisions:

- Available time to understand the situation and make the decision
- Stress and stressors
- The complexity of the task or situation
- Experience and training
- Procedures and checklists to support the decision
- Ergonomics and tooling, including the human-machine interface
- Fitness for duty (sleep, health, distractions)
- Work processes and organizational conditions (staffing, workload, firefighting)
The good thing about these factors is that organizations have a great deal of influence over them. If you don't stop your RCA at the level of "it was human error", but dive into why a poor decision was made, you may uncover issues related to these factors that you can actually change. Treating the cause (no training) is better than treating the effect (bad password).
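To show how this can feed directly into follow-up actions, here is a small, purely illustrative sketch: a lookup from the factor an RCA uncovered to countermeasures that treat the cause rather than the symptom. The factor names and measures are invented examples, not an official taxonomy.

```python
# Hypothetical mapping from performance shaping factors uncovered in an RCA
# to countermeasures that treat the cause rather than the symptom.
COUNTERMEASURES = {
    "experience_training": [
        "schedule recurring, role-specific security training",
        "pair junior staff with experienced engineers on risky changes",
    ],
    "procedures": [
        "write a checklist for firewall and access changes",
        "require peer review of production configuration changes",
    ],
    "stress": [
        "reduce on-call load and overtime",
        "stop routine firefighting by fixing recurring incidents",
    ],
    "fitness_for_duty": [
        "avoid scheduling risky changes late at night",
    ],
}

def suggest(root_causes: list[str]) -> list[str]:
    """Collect countermeasures for the factors an RCA uncovered."""
    suggestions = []
    for cause in root_causes:
        suggestions.extend(COUNTERMEASURES.get(cause, [f"no mapping for '{cause}'"]))
    return suggestions

# Example: the RCA found a training gap and a missing procedure.
print("\n".join(suggest(["experience_training", "procedures"])))
```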
The case of the open port
Let's consider a case where an e-commerce company called "DonkeyCom" has been hacked (this is a made-up company and a made-up story). The initial review of logs showed that port 22 (SSH) was open on their web server, and that a brute-force attack gave the attacker access to the web server. The attacker then moved laterally and connected to the MySQL database used by the application to store data, including user data such as profiles, password hashes, and purchase histories. All data in the database was downloaded by the threat actor. The CEO says in a press release that "Unfortunately, a human error led to a port being open to the Internet, that let the hackers in and they stole all our data. There was nothing we could have done to prevent this, and we are very sorry. The Internet is a dangerous place."
Obviously that statement is wrong: there is plenty that could have been done, and in particular a defense-in-depth approach was missing. SSH with a weak password on production servers? Dumping and downloading the whole database without being noticed? Data not encrypted at rest? There is clearly plenty that could have been done!
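One concrete layer of defense is simply knowing what you expose. Here is a minimal sketch, assuming a hypothetical host inventory, of a script that checks production hosts for risky ports that should never be reachable from the Internet (only scan hosts you own or have permission to test):

```python
import socket

# Hypothetical inventory: hosts and the only ports that SHOULD be exposed.
EXPECTED_OPEN = {
    "shop.donkeycom.example": {80, 443},
}

# Ports that should never face the Internet on these hosts.
RISKY_PORTS = {22: "SSH", 3306: "MySQL", 3389: "RDP"}

def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, expected in EXPECTED_OPEN.items():
    for port, service in RISKY_PORTS.items():
        if port not in expected and is_open(host, port):
            print(f"ALERT: {service} (port {port}) is reachable on {host}")
```

Running a check like this on a schedule turns "someone forgot to close a port" from a silent error into an alert within hours.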
But let's think about how we could drill down into the root causes of the vulnerabilities the company discovered in its incident analysis: the weak password and the open port. This was all blamed on Henry, the careless engineer. One technique we can use to understand a bit more about this case is 5 Whys: for each apparent cause, we ask "why" until we think we have uncovered the real cause of an observed effect.
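Such an exchange might look like this (an invented dialogue, for our invented company):

Investigator: Why was port 22 open to the Internet?
Henry: I opened it to troubleshoot a deployment problem, and forgot to close it afterwards.
Investigator: Why was it forgotten?
Henry: We have no checklist or procedure for rolling back temporary firewall changes.
Investigator: Why is there no such procedure?
Henry: Nobody on the team has had security training, so we didn't know we needed one. There has never been time for training anyway.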
In the exchange above, the investigator drills down into why the port was open. It shows only three iterations, but you could go on: why was there no time for training? We were always stuck in firefighting, we are understaffed, we work too much overtime, and so on. We are now drilling into the performance shaping factors discussed above!
The other obvious vulnerability found above was the weak SSH password, which allowed the attacker easy access over the open port. The 5 Whys technique could perhaps unravel why a password was used instead of key-based authentication, and show that this was related to easy sharing of credentials with freelancers. The point is: there are always reasons for bad security decisions, and drilling into the details can often lead to useful insights.
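As a sketch of what a concrete follow-up check could look like, the script below flags risky authentication settings in an OpenSSH server configuration file. The flagged directives and the default path are assumptions for the example; a real hardening review would cover far more.

```python
from pathlib import Path

# Directives we consider risky if enabled (illustrative, not exhaustive).
RISKY = {
    "passwordauthentication": "yes",  # allows brute-forceable password logins
    "permitrootlogin": "yes",         # allows direct root login
}

def audit_sshd_config(path: str = "/etc/ssh/sshd_config") -> list[str]:
    """Return a list of findings for risky sshd settings."""
    findings = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if RISKY.get(key) == value:
            findings.append(f"{parts[0]} {parts[1]} - consider key-based auth only")
    return findings

if __name__ == "__main__":
    for finding in audit_sshd_config():
        print("FINDING:", finding)
```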
Obviously, not exposing the port in question to the Internet, and not using a shared and very weak SSH password, would be reasonable follow-ups. Fixing those problems, however, only treats the symptoms. Overwork and lack of training will likely lead to further vulnerabilities that can be exploited. If the organization takes a broader look at how it can reduce the burden on its technical staff, provide relevant training, and provide decision aids such as checklists and written procedures, the effect will reach much further than the vulnerabilities that happened to be exploited this time.
Summary for busy people
As a champion of any proper RCA method, I have to say that the main objective of the RCA is to uncover the mechanism of the problem, so that a countermeasure can be implemented at the most effective and efficient point(s) of that mechanism. "Human error" and a few other pitfalls are dead ends in the RCA process, because that conclusion is always... wrong. The mechanism of the problem doesn't care what the standard procedure is; it doesn't care what training was or was not held. The mechanism of the problem is never about your documentation. If you ever see "human error" in an RCA, the facilitator of the RCA made a very human error! Poor causes often lead to poor solutions: the famous "three R's" - ReTrain, ReWrite, ReCommunicate. Repeating previous efforts rarely leads to innovation, especially if you stopped the RCA too early...