Cybersecurity is a Product Reliability Issue
Thinking about reliability, availability, and cybersecurity
In common usage, we consider "reliability" a general product attribute. Will the car get us where we need to go - and back? Will the airline and airplane carry us to our destination promptly and safely? How often will we need to call the refrigerator repair service? Today, we expect far more reliability from our cars than we did from the 1980s-era Yugo. Generally speaking, once a product gets a reputation for unreliability, it stops getting buyers.
A significant exception is for computer systems. Software is flawed and crashes. Hardware experiences failures. Because "it's always been that way," we expect lower reliability from computer products. While far better than fifty years ago, systems are "buggy" and must be "debugged." Vendors sell software and hardware products with significant defects, and we pray we get "patches" to fix the problems. We do not tolerate bridges that collapse, but a crashing computer is normal because "that's what they do." Patches themselves can introduce more bugs, and if the vendor never creates a patch, we will suffer the problem forever.
While much better than the systems of decades past, computer hardware and software are still unreliable compared to almost everything else we buy.
Cybersecurity is a product reliability issue
Robust products are available and reliable
To create dependable and secure systems, we look at two parameters: Availability and Reliability. This is true of hardware, software, satellites, transportation, roads, bridges, and communications.
Availability is the probability that a resource will be usable when needed. When we encounter a closed road, that path is unavailable to us. If we fail to open a website in a browser, we say the page is "down." It doesn't matter why the road or website is inaccessible; we can't use them. When we go to use something, and we can't, it's unavailable.
Reliability is the probability of completing our task using the available resources. Something is unreliable if it doesn’t successfully and correctly complete its job.
An unreliable entity will also probably have lower availability.
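As a back-of-the-envelope illustration (standard textbook formulas, not figures for any particular product), availability is commonly expressed in terms of mean time between failures (MTBF) and mean time to repair (MTTR), while reliability over a mission time t assumes a constant failure rate λ = 1/MTBF:

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
\qquad
R(t) = e^{-\lambda t}, \quad \lambda = \frac{1}{\mathrm{MTBF}}
```

For example, a server with an MTBF of 1,000 hours and an MTTR of 10 hours is available about 99% of the time, yet its chance of finishing a 100-hour job without a failure is only about e^(-0.1), roughly 90%. Availability and reliability are related, but they measure different things.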
We do not accept unreliable cars or airplanes, but we tolerate failures in our computer systems.
Reliability and the CIA Triad
Cybersecurity professionals frame their work around the CIA Triad: Confidentiality, Integrity, and Availability.
The reality is that reliability affects all aspects of the CIA Triad. If we cannot use a system because it is down or unreachable, that is an availability failure rooted in a reliability problem. Tricking a human or a program into divulging sensitive information is also a reliability problem: our confidentiality mechanisms "didn't work." If someone tampers with our sensitive information or makes it unusable, that's an integrity failure, and our security mechanisms proved faulty or unreliable.
So, if we have more reliable software and hardware products, they will work better and - as a benefit - be more available and secure.
Reliability in the real world
Current events and news about problems with commercial aircraft inspired this blog. From late December 2023 to early January 2024, four airplanes were involved in three mishaps. Attention-grabbing headlines such as “What’s going on with … planes” entice readers with their salaciousness and poor grammar. The incidents are real, but the sensationalism is overwrought.
First, however, a disclaimer. There are, on average, about 95,000 commercial and cargo aircraft flights every day. The number may peak during the summer travel season and slump on Christmas Day, but there is a vast number of successful landings every day - almost 35 million safe flights each year. In other words, air travel is incredibly reliable.
As a percentage of all travel, aircraft reliability failures are incredibly rare. When there's an incident, it becomes significant news, as if a bridge collapsed. The rarity of air crashes or structural failures brings them to our attention.
Three newsworthy incidents, involving four aircraft, occurred between December 29th, 2023, and January 9th, 2024. Just after Christmas, a Ryanair mechanic noticed a nut missing from a bolt in the tail of one of the airline's Boeing 737 aircraft. There was no harm because the mechanic caught the failure before any serious problem could occur, and Boeing asked operators to inspect all similar aircraft. The missing nut compromised the reliability of the airplane; luckily, there were no failures or safety problems in this case.
A few days later, a Japan Coast Guard plane taxied onto a runway at Tokyo's Haneda Airport, directly into the path of a Japan Airlines Airbus A350. Five people aboard the Coast Guard plane died, but everyone escaped the JAL plane safely. There were at least two reliability failures in this crash. First, a human failure occurred when the Coast Guard pilot entered the runway without permission. Then, apparently, the public address system failed on the Airbus, and the crew had to resort to backup mechanisms.
Finally, on January 5th, an Alaska Airlines Boeing 737 MAX 9 experienced an explosive decompression when a "door plug" substituting for an emergency exit blew out of the airplane at about 16,000 feet. There were no injuries in the incident, but the airplane suffered a serious reliability failure. The Federal Aviation Administration has ordered all similarly configured aircraft grounded until investigators find the cause and operators fix it.
All of this sounds dramatic - and it is - but realize that there were almost 900,000 airplane landings in the same period. Considering this, commercial aircraft travel is remarkably safe.
Aircraft reliability failures are generally newsworthy. Computer system unreliability isn't.
Why are computer systems unreliable?
This unreliability comes in two forms: intrinsic to the product itself, and arising from how the product is used.
When an organization builds a product with the most reliable components, the product becomes more expensive because of the cost of the underlying parts. Reliable designs also cost more because of the focus on reliability and security rather than on just making a product that mostly works. In some instances, such as the James Webb Space Telescope, repair isn't possible, so the JWST had to be designed for maximum reliability from the start. The 1960s and 1970s were an era of regular aircraft crashes; today, incidents are both extremely rare and, for that reason, newsworthy.
The cost of making things more reliable
Except for aerospace, structural engineering, and critical infrastructure, design engineers must balance product cost with warranty expense and intrinsic reliability. More reliable products require better and costlier parts; those products are more expensive to design and build. Organizations budget their engineering and manufacturing costs against the eventual selling price and engineer products that meet consumers’ willingness to pay.
Cybersecurity is a product reliability problem reflected in a product's price.
Since the late Admiral Grace Hopper removed a moth from the workings of her computer in 1947, we’ve accepted bugs in the system.
Defects per thousand lines of code (KLOC) is one measurement of how bug-free a computer product is. We measure this "quality metric" for software, and we can also apply it to hardware with embedded programming code, such as the firmware in our Internet routers. Defects turn into bugs; those, in turn, become vulnerabilities. While somewhat controversial, this benchmark has been used in the computer industry for decades. A quick Internet search suggests that, in 2024, current products ship with between one and ten defects per KLOC. Companies that produce the highest-quality software achieve rates as low as one defect per 2,000 lines of code.
It sounds promising that the industry is producing low-defect software until we consider the massive size of the products we use. An operating system may have up to 100,000,000 lines of code, and an office suite may be half that size. Beyond that, Cloud implementations and Artificial Intelligence systems are enormously large, complex, and opaque.
The larger a product’s code base, the more defects it contains, and the more opportunities hackers have to exploit intrinsic vulnerabilities.
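A minimal sketch of that arithmetic, using the defect-density range quoted above (the code sizes and rates are illustrative assumptions, not measurements of any real product):

```python
# Back-of-the-envelope estimate of latent defects from published defect-density ranges.
# Every number here is an illustrative assumption, not a measurement of a real product.

DEFECTS_PER_KLOC = {
    "best_in_class": 0.5,   # ~1 defect per 2,000 lines
    "typical_high": 10.0,
}

products = {
    "operating_system": 100_000_000,   # ~100M lines of code (assumed)
    "office_suite": 50_000_000,        # roughly half that size (assumed)
    "router_firmware": 2_000_000,      # assumed
}

for name, lines_of_code in products.items():
    kloc = lines_of_code / 1_000
    low = kloc * DEFECTS_PER_KLOC["best_in_class"]
    high = kloc * DEFECTS_PER_KLOC["typical_high"]
    print(f"{name}: roughly {low:,.0f} to {high:,.0f} latent defects")
```

Even at the best-in-class rate of one defect per 2,000 lines, a 100-million-line operating system would still ship with on the order of 50,000 latent defects, and only a fraction of those need to be exploitable to matter.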
The best behavior we can expect from these defects is unexpected or incorrect results. For example, I use a grammar checker when I write, and it wants to correct “…an office suite may be half…” to “…an office suite maybe half…” in an earlier sentence. That’s clearly incorrect and undoubtedly annoying, but easily corrected.
On the other hand, security vulnerabilities become significant problems for organizations. The National Institute of Standards and Technology maintains the National Vulnerability Database (NVD). Each entry is scored for severity using the Common Vulnerability Scoring System (CVSS), from 0 (meaning "none") to 10 (denoting "critical"). Infrastructure companies such as VMware, Cisco, and F5, along with collaboration vendor Atlassian, all released patches during the second week of January 2024 for vulnerabilities with a CVSS score of 10. Cybersecurity professionals now need to watch for exploits that attack these vulnerabilities.
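A minimal sketch of the triage this implies, using hypothetical vulnerability records with CVSS scores (the entries, products, and threshold below are invented for illustration; they are not real NVD data):

```python
# Triage a list of vulnerability records by CVSS base score.
# The records below are hypothetical examples, not real NVD entries.

advisories = [
    {"id": "CVE-2024-0001", "product": "hypothetical-hypervisor", "cvss": 10.0},
    {"id": "CVE-2024-0002", "product": "hypothetical-router-os", "cvss": 7.5},
    {"id": "CVE-2024-0003", "product": "hypothetical-wiki", "cvss": 4.3},
]

def triage(records, critical_threshold=9.0):
    """Split advisories into 'patch now' and 'schedule' buckets by CVSS score."""
    patch_now = [r for r in records if r["cvss"] >= critical_threshold]
    schedule = [r for r in records if r["cvss"] < critical_threshold]
    return patch_now, schedule

urgent, later = triage(advisories)
for adv in urgent:
    print(f"PATCH NOW: {adv['id']} ({adv['product']}) CVSS {adv['cvss']}")
```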
Simply put, bugs lead to vulnerabilities; those make the products we use unreliable.
Unreliable people
When considering a product's reliability, we cannot discount the "stupid human trick." Whether or not the product is working correctly, people can trigger vulnerabilities. For example, a user with Administrator privileges can be tricked into installing malware; the computer then gets compromised, enabling the hacker's nefarious deeds. The same user might give away their password or recycle logins, giving the attacker an entrée. Thanks to human interaction, an attacker can take advantage of the system's unreliability. In other words, the attacker exploits the vulnerability - in both the system and the person.
Unreliable interactions
The English poet, scholar, and clergyman John Donne is best remembered for his metaphysical poetry and prose meditations. His most famous passage comes from "Meditation XVII," the source of the phrase "for whom the bell tolls," in which he writes, "No man is an island, entire of itself." While separate components may work relatively reliably, failures can occur in the interactions between systems.
Designers must also remove reliability problems when they consider a program's interactions. Whether program-to-program or interacting with people, cybersecurity depends on eliminating unreliable interplay between components.
Even if our systems work reasonably reliably and interact nicely, they still depend on shared infrastructure - the operating system, the network, and the libraries providing support functions - and a failure in any of those layers undermines everything built on top of them.
For the moment - and historically - software engineering differs from other technical disciplines because we accept "bugs" in hardware and software. Had things been different 80 years ago, at the start of the modern computer era, we might not accept this intrinsic unreliability today.
Hackers exploit fundamental unreliability
Attackers exploit those vulnerabilities, taking advantage of product reliability failures to mount successful attacks.
So, if we accept that hardware and software contain bugs, and that those defects can become vulnerabilities, how does an attacker exploit them?
GIGO
Since the beginning of computer history, we've lived with the rule called GIGO - Garbage In, Garbage Out. It means that the quality of a program's output depends on the correctness of its input. Assuming there are no bugs or hardware failures, computer programs are deterministic: the same input will produce the same output.
Bad input produces bad output. Our systems often need input in specific forms, whether the format is numeric or limited to a maximum length. When an attacker deliberately crafts malformed input, it is called a "tainted input" attack. A program that does not "sanitize its inputs" can process that data incorrectly, act erratically, or produce wrong results. If people or other systems unwittingly accept the wrong results, the consequences can be catastrophic.
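A minimal sketch of what "sanitizing inputs" means in practice, assuming a hypothetical form field that must be a short, purely numeric account ID (the field name and limits are invented to illustrate the idea):

```python
# Validate untrusted input before using it: enforce type, format, and length.
# The field and limits are hypothetical, chosen only to illustrate the idea.

MAX_LENGTH = 12

def parse_account_id(raw: str) -> int:
    """Accept only short, purely numeric account IDs; reject everything else."""
    if len(raw) > MAX_LENGTH:
        raise ValueError("input too long")
    if not raw.isdigit():
        raise ValueError("input must be numeric")
    return int(raw)

for attempt in ["1048576", "12345678901234567890", "1 OR 1=1"]:
    try:
        print("accepted:", parse_account_id(attempt))
    except ValueError as err:
        print("rejected:", repr(attempt), "-", err)
```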
Authorization Bypass
One of the first classes of tainted-input vulnerabilities allows improperly authorized access. People (and systems) should only be allowed access to the resources they need to do their jobs - whether that is network access, a connection to protected information, or broad access to an organization's data. "Authentication bypass" attacks let intruders view protected web pages without the required login, or retrieve the contents of an entire database instead of the proper record-at-a-time access.
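A classic illustration is a login query built by pasting untrusted input directly into SQL. The sketch below is deliberately simplified and hypothetical (the table and values are invented); it shows why the string-built version can be bypassed while the parameterized version treats the same tainted input as harmless data:

```python
import sqlite3

# Hypothetical users table, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'correct-horse')")

def login_unsafe(name, password):
    # Tainted input is pasted straight into the query string.
    query = f"SELECT * FROM users WHERE name = '{name}' AND password = '{password}'"
    return db.execute(query).fetchone() is not None

def login_safe(name, password):
    # Parameterized query: input is treated as data, never as SQL.
    query = "SELECT * FROM users WHERE name = ? AND password = ?"
    return db.execute(query, (name, password)).fetchone() is not None

# The classic bypass string defeats the unsafe version but not the safe one.
print(login_unsafe("alice", "' OR '1'='1"))  # True  -- authentication bypassed
print(login_safe("alice", "' OR '1'='1"))    # False -- rejected
```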
Denial of Service and Ransomware
Software takes many forms, from the apps on our phones to the presentation of websites to the applications supporting our business and personal tasks. Your email reader, for example, is essential to your daily life, and so is the text-messaging app on your phone. While these manipulate local data, they also exchange messages across various networks. Our network protocols and their supporting software are buggy, and attackers exploit those vulnerabilities.
As we’ve already seen, tainted input can result in Authentication Bypass or Information Disclosure. Another problem is Denial of Service. With a DoS, attackers who exploit a vulnerability can prevent a program, system, or network from providing its services. A classic example of this is called a "network flooding attack."
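On the defensive side, one common (and here greatly simplified) mitigation for flooding is to rate-limit requests per source. A minimal sketch, with the window and threshold chosen arbitrarily for illustration:

```python
import time
from collections import defaultdict, deque

# Simple sliding-window rate limiter: refuse a source that sends more than
# MAX_REQUESTS within WINDOW_SECONDS. The thresholds are arbitrary examples.
WINDOW_SECONDS = 1.0
MAX_REQUESTS = 100

_history = defaultdict(deque)   # source address -> timestamps of recent requests

def allow_request(source, now=None):
    """Return True if this source is under its request budget, else False."""
    now = time.monotonic() if now is None else now
    timestamps = _history[source]
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()             # forget requests that left the window
    if len(timestamps) >= MAX_REQUESTS:
        return False                     # looks like a flood; drop the request
    timestamps.append(now)
    return True
```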
Ransomware gets its name because the malware encrypts a user or organization’s data and then extorts payment in exchange for the decryption key. Before encryption, ransomware often steals valuable data over the Internet, and the cybercriminals threaten to expose this information if the extortion demands aren’t met. A side effect of ransomware is that it also acts like a DoS: once the malware encrypts critical information, the rightful owners of the data cannot do their work. A quick web search shows that this is an ongoing threat to healthcare organizations.
Attacks over the network
Hackers can abuse a system's unreliability by sending deliberately malformed network packets. When the receiving computer processes those attack packets, they trigger vulnerabilities that let hackers inject program code into the memory of the hardware and software - an "over-the-wire" attack. This is possible because we represent program code as bits of information, just like data; the only thing that distinguishes data from instructions is how the system uses the binary information.
When an attacker injects bad data that then overlays a program's computer (“machine”) instructions, the hacker can cause a DoS. Or, the victim system can execute the hacker’s program either locally or remotely. The worst case is when attackers insert specially designed code that bypasses a system's security mechanisms and operates at elevated privilege. When software vendors issue patches, these latter vulnerabilities earn the highest severity ratings.
Once an attacker carries out a code injection or privilege escalation attack, they gain control of the remote computer and can carry out their nefarious deeds. Or, they can simply break the system’s availability with a DoS attack. That crash can have potentially catastrophic results. The bugs that make a system or network unreliable and insecure also attack availability.
Again, designing systems to sanitize their inputs provides significant protection.
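A minimal sketch of that kind of defensive parsing, using a hypothetical packet format (a two-byte length field followed by a payload); the format and limits are invented purely to show the bounds check:

```python
import struct

MAX_PAYLOAD = 1024  # illustrative limit for our hypothetical protocol

def parse_packet(packet: bytes) -> bytes:
    """Safely parse a hypothetical [2-byte big-endian length][payload] packet."""
    if len(packet) < 2:
        raise ValueError("truncated header")
    (claimed_length,) = struct.unpack("!H", packet[:2])
    if claimed_length > MAX_PAYLOAD:
        raise ValueError("length field exceeds protocol maximum")
    payload = packet[2:]
    if len(payload) != claimed_length:
        # A naive parser that trusts the length field is where many
        # over-the-wire memory-corruption bugs begin in unsafe languages.
        raise ValueError("length field does not match actual payload size")
    return payload
```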
How do we build more reliable systems?
When we compare structural engineering and software engineering, for example, the former treats reliability as equal in importance to capability. The question is not, "How can we build a bridge across this river?" It is, instead, "How can we build a structurally sound bridge across this river?" People and organizations that develop and sell computer systems prioritize their products' features over their reliability. As we've seen, unreliable products are also insecure.
There are several ways to enhance system reliability. Building and selling a product begins with designing reliability into the product from the start. By eliminating failure points, we get systems that meet the availability needs and behave reliably. As a benefit, we get better security.
Reliable by design
We can also build security into the product by designing reliable systems from the start. US federal government guidance, such as the Risk Management Framework from the National Institute of Standards and Technology, helps us build security into both design and implementation. Complementary maturity models measure whether an organization's processes can produce reliable and secure systems.
Design for reliability and security; use more reliable parts.
Some car brands and models cost more than others because the more expensive ones have better components, engineering, design, and implementation. More expensive cars may go faster, carry more, or have upscale trim. The Yugo, mentioned earlier, was cheap and notoriously unreliable. In contrast, a top-of-the-line Volvo offers every nicety one could want and has a long-standing reputation for reliability and availability.
If you buy a cheap product designed and manufactured offshore, the vendor has neither the incentive nor the profit margin to develop reliable and secure systems. Likewise, high-reliability components cost more and may exceed a product's development budget, whether the product is mechanical or electronic. While higher prices don't necessarily indicate the use of better parts, longer warranties suggest that the builder had higher confidence in the product's reliability. We do not get the same reliability from bargain products.
Two phrases from popular culture may apply here: "You get what you pay for" and "Cheap is not what you pay, but what you get."
Zero Trust: the pessimist’s view
Pessimistically, we can assume that every system and network has been compromised or will be shortly. We can contain and limit damage from any cybersecurity incident by not trusting anything or anyone that connects to our networks and systems. Once we assume everything is hostile - until proven otherwise - we can decide about an organization's cybersecurity architecture.
The US Federal Government calls this cybersecurity approach Zero Trust. First proposed in 1994 and famously adopted by Google starting in 2009, Zero Trust continuously authenticates and authorizes every user, device, and access request. In May 2021, the Biden Administration made Zero Trust Architecture (ZTA) the standard for securing US federal government systems.
The four critical elements of Zero Trust are the concept of Least Privilege, continuous authentication and authorization, in-depth monitoring, and automated incident response.
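A minimal sketch of what the first three of those elements can look like in code (automated incident response is left aside here). This is a conceptual illustration only; the roles, permissions, and token check are hypothetical placeholders, not any particular product's API:

```python
# Conceptual Zero Trust check: every request is evaluated, every time.
# Roles, resources, and the token check are hypothetical placeholders.

ROLE_PERMISSIONS = {
    "nurse": {"read:patient_record"},
    "dba": {"read:database", "write:database"},
}

def token_is_valid(token: str) -> bool:
    # Placeholder for real authentication (e.g., verifying a signed token).
    return token == "valid-token-for-demo"

def log_for_monitoring(request: dict) -> None:
    print("audit:", request["role"], request["action"])

def authorize(request: dict) -> bool:
    """Authenticate and authorize a single request; never assume prior trust."""
    if not token_is_valid(request["token"]):
        return False                                  # authenticate every time
    allowed = ROLE_PERMISSIONS.get(request["role"], set())
    if request["action"] not in allowed:
        return False                                  # least privilege
    log_for_monitoring(request)                       # in-depth monitoring
    return True
```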
ZTA is a new way to think about cybersecurity. Beyond the US Government, several industries are also moving to Zero Trust. Those include sectors deemed part of our Critical Infrastructure in the US. The healthcare sector and financial firms are also adopting ZTA.
Least Privilege and Separation of Duties - Do you “Need to…?”
When we talk about Zero Trust, one concept plays a key role: the Principle of Least Privilege. People, devices, and programs should receive only the minimum access they need to do their jobs, and nothing more.
Further, along with Least Privilege, we segregate people's work by role. Someone logged in with Administrator capability can maintain the system but should not use that account for everyday activities such as processing email. A "normal user" carries out daily tasks but does not need elevated privilege and cannot alter the system.
We call this segregation Separation of Duties. It applies to people and the work that they do. But it also applies to computer systems and network functions. Just as an employee in the accounting department might never modify engineering specifications, server functions must be separated. For example, a database server must never be an email gateway. Likewise, “root” privilege must be limited to authorized administrators and never used for day-to-day tasks.
Once an organization defines these roles and their boundaries, perimeters keep compromised systems from affecting the rest of the environment. Zero Trust calls this Microsegmentation.
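A minimal sketch of microsegmentation as an explicit, default-deny allowlist of which workloads may talk to which (the segment names and ports are invented for illustration):

```python
# Microsegmentation as a default-deny allowlist between workloads.
# Segment names and ports are invented for illustration.

ALLOWED_FLOWS = {
    ("web_frontend", "app_server"): {443},
    ("app_server", "database"): {5432},
    ("mail_gateway", "app_server"): {25},
}

def flow_permitted(src: str, dst: str, port: int) -> bool:
    """Deny by default; permit only flows explicitly listed."""
    return port in ALLOWED_FLOWS.get((src, dst), set())

# The database never acts as an email gateway, so this flow is denied.
print(flow_permitted("database", "mail_gateway", 25))   # False
print(flow_permitted("app_server", "database", 5432))   # True
```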
Depending on the working style of some environments, authorized users may need broader access to sensitive information. For example, all caregivers in a medical institution might require access to every patient's electronic health records so that they are ready to care for any new patient; in that sense, healthcare professionals require open access to medical records. But just because they could access anyone's chart doesn't mean they should: they do not have a Need to Know. Originally from the military world of classified information, the Need-to-Know principle protects information confidentiality. The concept also extends to "Need to Go" - the data center should be (colloquially) off-limits unless there's a business need to enter - and to "Need to Be" and "Need to Do": even someone with Administrator privilege should not "be" using elevated rights unless they need to "do" system management functions.
Complexity is the enemy of security
Complexity contributes to unreliability, and attackers can use the resulting vulnerabilities to hack systems. Two aspects of complexity work against cybersecurity. First, more "moving parts" in a system means more attack opportunities; any one part may be vulnerable - perhaps unpatched, misconfigured, or built incorrectly. Second, statistics dictate that the more items in use, the greater the likelihood that one of them will fail at any given time. That, unfortunately, is how the mathematics of reliability works against us.
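The statistical point can be made concrete with the standard series-reliability formula (generic textbook math, not a measurement of any real system): if a system works only when all n of its components work, and each component works with probability r, then

```latex
R_{\text{system}} = r^{n}, \qquad \text{for example } 0.999^{100} \approx 0.905
```

A system built from 100 independent parts, each 99.9% reliable, works only about 90% of the time; every part we add lowers the odds.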
Simplifying - reducing complexity - increases reliability because the system is easier to understand. When a system is simpler, there's less to break. It's easier to map the components in a simple system and to identify all its possible interactions. Then, it is easier to detect vulnerabilities, as well.
Strengthen the “weakest link”
People also make systems unreliable and insecure. Humans take shortcuts, get careless, or fall for tricks that weaken an environment's security. Examples include leaving doors propped open, using and recycling weak passwords, and falling for a con artist's junk email.
Cybersecurity tools such as anti-malware and anti-spam products help prevent people from unthinkingly aiding cyberattacks. Organizations can also implement technologies such as Least Privilege to mitigate the effects of an attack. Ultimately, however, the human is the weakest link. Using a combination of policy and education, we can train people to recognize and avoid so-called "Social Engineering" attacks, so that computer users don't make our systems even more unreliable and insecure.
A public case for more secure and reliable systems
A final way to make our systems more reliable and secure is through public shame. When the CEO of a company that suffered a data breach appears before Congress, it affects the whole organization. A cybersecurity breach is often a "rude awakening."
After hackers steal confidential company information, a business’s post-incident analysis identifies vulnerabilities to resolve. While it's too late to protect the stolen data, other organizations and executives can learn from the experience. We also hope that the experience ultimately leads to better security implementations.
The simplest way to avoid insecure systems is to stop accepting their unreliability. Just because software and hardware crashed in the past doesn't mean they should in the future. The savvy consumer reads evaluations and review websites before making a purchase, whether picking a restaurant or buying a car. Similarly, people buying computer systems can evaluate a product's reliability and security from discussion websites and a vendor's support forums.
The promise of Artificial Intelligence
Artificial Intelligence provides hope that, in the future, we can have more reliable and secure systems. AI already shows promising signs in specialized areas such as medical diagnosis and pharmaceutical design. A web search for "AI for Software Reliability" offers diverse research and commercial tools and techniques. AI-based development and testing, in turn, can lead us to more secure systems.
Call to action: Stop accepting unreliable and insecure computer products
In current computer systems culture, the standard view is that reliability and cybersecurity are two separate realms. A reliable system usually produces correct answers and rarely crashes. A secure system, on the other hand, protects its information from unauthorized disclosure or modification nearly 100% of the time. The issue is that an unreliable system is vulnerable to having its cybersecurity violated. When our vendors provide us with more reliable systems, we'll also get more secure environments.