Are We Resilient? No ... no ... no!

Introduction

Pweh! What a scary week!

In a single day last week, it seemed like the Internet was crashing around us. First, on 8 July 2015, all United Airlines (UA) flights were grounded, followed by a computer crash on the NYSE (New York Stock Exchange); and then the Wall Street Journal site crashed. These were the kinds of things that would signal the start of a major problem, such as a major outage on the Internet or a large-scale cyber attack. Wired classified it as "Cyber Armageddon", and John McAfee has since pointed towards it being suspicious that it all happened on the same day, and that there could have been a major cyber attack.

No matter if it was a cyber attack or not, it does show:

how dependent our world is on information technology, and how a failure in any part of it could be devastating to both the economy and our lives.

Overall, the NYSE was down for over three hours, and it was reported to be a technical glitch (costing around $400 million in trades), but John McAfee (the founder of McAfee) outlines that, given the lack of any detail on the reasons and without a proper investigation, a cyber attack cannot be excluded at this point.

One thing that is true is that airlines and the stock exchange are two key parts of our critical infrastructure, and problems in either of these, on a long-term basis, could have a devastating effect. Unfortunately, few designers of systems take failover into account, as it can considerably increase the costs. Imagine you are quoting for an IT contract, and you say:

"Well that'll be a million to build, but we'll need another million to build it somewhere else and then there's the systems to flip them over, and then there's the load balancers ... and then ... hello ... are you still there?

Often terrorists and cyber attacks are cast as the threat actors that would trigger this chaos, but in most cases it will be a lack of thought, a lack of investment, and/or human error that are the likely causes, and these things should not be forgotten. While, in the UK, the banks have been toughening their infrastructure against attack, the whole back-end infrastructure needs to be examined, especially the security of the power sources, a failure of which will bring everyone down.

There are still two major things that most systems are not resilient against:

  • Long-term power failure.
  • Sustained DDoS (Distributed Denial of Service).

In the UK, especially, the private sector has seen significant funding for creating secure and robust infrastructures, but a lack of funding in the public sector, along with single vendors controlling key elements of it, shows that there may be cracks in the system.

A rating system for critical infrastructure providers?

In the UK, with the banks going through penetration testing from the Bank of England (under CBEST), should other companies and public sector agencies go through the same thing? Could we have a star system for our network and power supply providers, so that we can assess how they would cope with a major outage? Just now, we purchase the cheapest, and just hope for the best!

A 5-star company would have provable methods of providing alternative supplies, and offer 24x7 support for any failures. They would also provide on-site supplies, along with assessing the key risks within the environment. They would be open to audits of their systems at any time, and would provide audit and risk assessment facilities to their customers. They would also have a strong understanding of the costs to the business of a range of events, such as loss of business and brand damage.
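To make the idea a little more concrete, here is a minimal sketch of how such a rating could be scored. The criteria names, the one-star-per-criterion weighting, and the example provider are my own illustrative assumptions, not any existing standard.

```python
# Illustrative sketch of scoring a provider against the 5-star criteria above.
# The criteria keys and equal weighting are assumptions, not a defined standard.
CRITERIA = [
    "alternative_supplies",    # provable alternative power/network supplies
    "support_24x7",            # round-the-clock support for any failures
    "onsite_supplies",         # on-site supplies plus key risk assessment
    "open_to_audit",           # open to audits of their systems at any time
    "business_impact_model",   # understands loss-of-business and brand costs
]

def star_rating(provider: dict) -> int:
    """Award one star for each criterion the provider can demonstrably meet."""
    return sum(1 for criterion in CRITERIA if provider.get(criterion, False))

# Hypothetical provider that can evidence three of the five criteria.
example_provider = {
    "alternative_supplies": True,
    "support_24x7": True,
    "open_to_audit": True,
}
print(star_rating(example_provider), "stars")  # -> 3 stars
```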

Does anyone really understand failover?

We build systems and we make them work. We know they are not perfect, as they are built with interlinking dependencies. We often use risk models to understand where the most likely breakpoints will be, and then create failover routes for these. Two fundamental ones are often:

No Power and No Network

so we build in alternative supplies, such as using different supplies from different providers, or having a wireless failover system. Then there's denial-of-service, where there is an exhaustion of services within the infrastructure, so we build in load balancers and spin up new instances to cope, but eventually these defence mechanisms will exhaust themselves too.
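As a minimal sketch of the "No Network" side of this, the following checks a preferred route and falls back to a wireless alternative; the health-check URLs and the two-second timeout are purely illustrative assumptions.

```python
# Minimal sketch of a dual-provider failover check for the "No Network" case.
# The provider health-check URLs and timeout are illustrative assumptions.
import urllib.request

PROVIDERS = [
    "https://primary-isp.example.com/health",      # preferred wired route
    "https://backup-wireless.example.com/health",  # wireless failover route
]

def first_healthy_provider(timeout_secs: float = 2.0):
    """Return the first provider whose health endpoint responds, else None."""
    for url in PROVIDERS:
        try:
            with urllib.request.urlopen(url, timeout=timeout_secs) as response:
                if response.status == 200:
                    return url
        except OSError:
            continue  # this route is unreachable, so try the next one
    return None  # every route has failed: invoke the disaster plan

if __name__ == "__main__":
    active = first_healthy_provider()
    print("Active route:", active or "NONE - total outage")
```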

Overall organisations fit into a complex web of dependent services, which are also interconnected, and where a failure in any part of the infrastructure will cause problems (Figure 1).

Figure 1: The interconnected world

The weakest link ... people, power supplies and planning

So I've come up with my 3P theory of critical infrastructure ... people, power supplies and planning. Like it or not, the main mistakes that are made are by people, either writing bad code, designing weak systems, or just making an operational mistake. As for planning, few companies have strategic plans for a major event, and any senior executive must look at all of the factors that could affect their business, both internally and externally generated.

While many companies can cope with a strike on the transport system, or a slow-down in the payments system, there are two things that will grind them to a halt ... a lack of network connections to the Internet ... and a lack of power ... as:

No power ... no IT!

While many systems can run on generators, we are increasingly moving into an era where we use Cloud-based services, so a failure to connect to these can cause major problems.

The scale of this risk was highlighted last week by former US Secretary of Defense William Cohen, who outlined that the US power grid was at great risk of a large-scale outage, especially in the face of a terrorist attack:

The possibility of a terrorist attack on the nation's power grid — an assault that would cause coast-to-coast chaos — is a very real one.

As I used to be an electrical engineer, I understand the need for a safe and robust supply of electrical power, and that control systems can fail. Often, too, we would run alternative supplies to important pieces of equipment, in case one of the supplies failed. So if there is a single point of failure on any infrastructure that will cause large-scale problems, it is the humble electrical power supply.

With control systems, there are often three main objectives (or regions of operation):

  1. Make it safe (protect life and equipment)!
  2. Make it legal (comply with regulations)!
  3. Make it work and make it optimized (save money and time)!

So, basically, the first rule trumps the other ones, so that a system would shut down if there was a danger to life. Next, the objective would be to make it legal, so that it fitted with regulatory requirements (such as for emissions, noise or energy). Finally, the least important objective was to make it optimized, but if operation drifted back towards the first two regions, then the control system would again focus on making it safe and legal.
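As a rough sketch of this ordering, the following control step checks safety first, then compliance, and only then optimizes output; the sensor names and limits are illustrative assumptions rather than values from any real plant.

```python
# Minimal sketch of the "safe, then legal, then optimized" priority ordering.
# The temperature/emission limits and demand figure are illustrative assumptions.
SAFE_TEMP_LIMIT = 90.0       # above this, protect life and equipment: shut down
LEGAL_EMISSION_LIMIT = 50.0  # above this, throttle back to stay within regulation

def control_action(temperature: float, emissions: float, demand_pct: float) -> str:
    # 1. Make it safe: safety always trumps the other objectives.
    if temperature > SAFE_TEMP_LIMIT:
        return "SHUTDOWN"
    # 2. Make it legal: comply with emissions, noise or energy regulations.
    if emissions > LEGAL_EMISSION_LIMIT:
        return "THROTTLE_BACK"
    # 3. Make it work and optimize: only now do we chase efficiency.
    return f"RUN_AT {min(demand_pct, 100.0):.0f}% OUTPUT"

print(control_action(temperature=95.0, emissions=10.0, demand_pct=80.0))  # SHUTDOWN
print(control_action(temperature=60.0, emissions=55.0, demand_pct=80.0))  # THROTTLE_BACK
print(control_action(temperature=60.0, emissions=10.0, demand_pct=80.0))  # RUN_AT 80% OUTPUT
```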

Failover over failover over failover

The electrical supply is one of the key elements that will cause massive disruption to IT infrastructures, so the supply grid will try to provide alternative routes for the power when there is an outage on any part of it. In Figure 2, we can see that any one of the power supplies can fail, or any one of the transmission lines, or any one of the substations, and there will still be a route for the power. The infrastructure must also be safe, so there are detectors which sense when the system is overloaded, and which will automatically switch off the transmission of power when it reaches an overload situation. For example, a circuit breaker in your home can detect when too much current is being drawn and will disconnect the power before any damage is done. The "mechanical" device - the fuse - is a secondary fail-safe, but if both fail, you'll have a melted wire to replace, as cables heat up as they pass more current (power is current squared times resistance). If the cable gets too hot, it will melt.
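A small worked example of that current-squared-times-resistance heating, with an assumed cable resistance and breaker rating, shows why the trip has to be automatic rather than left to an operator.

```python
# Worked example of I^2 * R heating in a cable, and the point where a breaker
# would trip. The cable resistance and breaker rating are assumed values.
def dissipated_power_watts(current_amps: float, resistance_ohms: float) -> float:
    """P = I^2 * R : the heat generated along the cable run."""
    return current_amps ** 2 * resistance_ohms

BREAKER_TRIP_AMPS = 32.0      # assumed breaker rating for the circuit
CABLE_RESISTANCE_OHMS = 0.05  # assumed resistance of the cable run

for current in (10.0, 32.0, 64.0):
    heat = dissipated_power_watts(current, CABLE_RESISTANCE_OHMS)
    state = "TRIP" if current > BREAKER_TRIP_AMPS else "ok"
    print(f"{current:5.1f} A -> {heat:6.1f} W dissipated in the cable [{state}]")
```

Doubling the current from 32 A to 64 A quadruples the heat in the cable, which is why the breaker must act well before the cable melts.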

In Figure 2, the overload detector will send information back to the central controller, and operators can normally make a judgement on whether a transmission route is going to fail, and make plans for other routes. If it happens too quickly, or if an alarm goes unnoticed, transmission routes can fail, which increases the load on the other routes, which can then cause them to fail too, so the whole thing fails like a row of dominoes.

Figure 2: Failover of power supplies
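That domino effect can be sketched in a few lines: when a line trips, its load is shared across the remaining lines, which may then overload in turn. The line names, capacities and loads below are made-up illustrative values, and a real grid redistributes power far less evenly than this.

```python
# Rough sketch of a cascading (domino) failure: a tripped line's load is shared
# evenly across the remaining lines, which may then overload and trip as well.
# The line names, capacities and loads are made-up illustrative values.
def cascade(lines: dict, failed: str) -> list:
    """Trip `failed`, redistribute its load, and repeat until the grid is stable."""
    tripped = [failed]
    load_to_move = lines.pop(failed)["load"]
    while load_to_move and lines:
        share = load_to_move / len(lines)
        for name in lines:
            lines[name]["load"] += share
        load_to_move = 0.0
        for name in list(lines):            # check for newly overloaded lines
            if lines[name]["load"] > lines[name]["capacity"]:
                tripped.append(name)
                load_to_move += lines.pop(name)["load"]
    return tripped

grid = {
    "north": {"capacity": 100.0, "load": 70.0},
    "south": {"capacity": 100.0, "load": 80.0},
    "east":  {"capacity": 100.0, "load": 60.0},
}
print("Tripped in order:", cascade(grid, "south"))  # one failure takes out all three
```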

Large-scale power outage

So, in the US, former Secretary of Defense William Cohen has sent a cold sweat down many leaders' backs, including industry leaders, as a major outage on the power grid would cause large-scale economic and social damage. At the core is the limited ability to run for short periods of time on UPS (uninterruptible power supply), and then on generators, in order to keep networked equipment and servers running; a major outage would also affect the core infrastructure, which often does not have the robustness of corporate systems. His feeling is that an outage on the grid would cause chaos and civil unrest throughout the country.

Alarm bells have been ringing for a while. Janet Napolitano, the former Department of Homeland Security Secretary, outlined that a cyber attack on the power grid is a matter of "when", not "if", and Dr Peter Vincent Pry, a former senior CIA analyst, defined that the US was unprepared for an attack on its electrical supply network and that it could:

take the lives of every nine out of ten Americans in the process.

The damage that a devastating EMP (Electromagnetic Pulse), such as from a nuclear explosion, could do has been well known for some time, but many now think that the complex nature of the interconnected components of the network, and their control system infrastructure (typically known as SCADA - supervisory control and data acquisition), could be the major risk.

Perhaps a pointer to the problems that an outage can cause is the Northeast blackout of 14 August 2003, which affected 10 million people in Ontario and 45 million people in eight US states. It was caused by a software bug in an alarm system in a control room in Ohio. Some foliage touched one of the supply lines, which caused it to overload. The bug stopped the alarm from being displayed in the control room, where the operators would otherwise have re-distributed the power across other supplies. In the end, the power systems overloaded and started to trip, causing a domino effect across the rest of the connected network. Overall, it took two days to restore all of the power to consumers.

As the world becomes increasingly dependent on the Internet, we have created robustness in the ways that devices connect to each other, and in the multiple routes that packets can take. But, basically, a loss of electrical power will often disable the core routing functionality.

Control systems - the weakest link

As we move into an Information Age, we are becoming increasingly dependent on data for the control of our infrastructures, which leaves them open to attackers. Often critical infrastructure is obvious, such as the energy supplies for data centres, but it is often the least obvious elements that are the most open to attack. This could be the air conditioning system in a data centre, where a failure can cause the equipment to virtually melt (especially tape drives), or the control of traffic around a city. As we move towards using data to control and optimize our lives, we become more dependent on it.

Normally, in safety-critical systems, there is a failsafe control mechanism: an out-of-band control system which makes sure that the system does not operate outside its safe working range. In a process plant, this might be a vibration sensor on a pump: if the pump runs too fast, this will be detected, and the control system will place the overall system into a safe mode. For traffic lights, there is normally a vision capture of the state of the lights, and this is fed back to a failsafe system that is able to detect when the lights are incorrect. If someone gets access to the failsafe system, they can thus overrule safety and compromise the system. This article outlines a case where this occurred, and some of the lessons that can be learnt from it.
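As a hedged sketch of the traffic-light case, the check below compares the commanded state against what the camera actually observes, and drops the junction into a safe flashing mode on any mismatch or unsafe combination; the state names and two-direction junction layout are my own illustrative assumptions.

```python
# Sketch of an out-of-band failsafe for a traffic junction: the observed light
# colours (from the camera) are compared against the commanded state, and any
# mismatch or unsafe combination forces a safe mode. States are illustrative.
SAFE_MODE = {"north_south": "flashing_red", "east_west": "flashing_red"}

def failsafe_check(commanded: dict, observed: dict) -> dict:
    # Both directions showing green at the same time is never safe.
    if observed.get("north_south") == "green" and observed.get("east_west") == "green":
        return SAFE_MODE
    # The lights on the pole disagree with what the controller commanded.
    if observed != commanded:
        return SAFE_MODE
    return commanded  # everything matches, so keep running normally

commanded = {"north_south": "green", "east_west": "red"}
tampered  = {"north_south": "green", "east_west": "green"}  # e.g. after tampering
print(failsafe_check(commanded, tampered))  # -> flashing red in both directions
```

The important design point is that this check sits outside the main controller; as noted above, once an attacker can reach the failsafe itself, this last line of defence is gone.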

Traffic Light Hacking

Security researchers, led by Alex Halderman at the University of Michigan, managed to use a laptop and an off-the-shelf radio transmitter to control traffic light signals (https://jhalderm.com/pub/papers/traffic-woot14.pdf). Overall, they found many security vulnerabilities and managed to control over 100 traffic signals in Michigan City using a single laptop. In order to be ethical in their approach, they gained full permission from the road agency, and made sure that there was no danger to drivers. Their sole motivation was to show that traffic control infrastructure could easily be taken over.

Overall, they found a weak implementation of security, with the use of open and unencrypted radio signals, which allowed intruders to tap into the communications, and they then discovered the use of factory-default usernames and passwords. Along with this, there was a debugging port which could be easily compromised.

In the US, the radio frequencies used to control traffic lights are typically in the ISM bands at 900 MHz or 5.8 GHz, which makes it fairly easy to get equipment that can communicate with the radio system. The researchers used readily available wireless equipment and a single laptop to read the unencrypted data on the wireless network.

Figure 3 provides an overview of the control system, where the radio transmitter provides a live feed (and other sensed information) to the road agency. The induction loop is normally buried at each junction and detects cars as they pass over it, while the camera watches the traffic lights and feeds the colours of the lights back to the controller. In this way there is a visual failsafe.

Figure 3: Overview of traffic control system

Conclusions

With all our complex infrastructures, it is the simplest of things that can trip them all and cause large-scale chaos ... the electrical supply. Unfortunately, it's not an easy call to make, as the systems need to be safe, but this safety can lead to automated trips, and these are in turn at risk from operator error.

As we move, too, into a world of intercommunication between cars and the roadway, and between the cars themselves, it is important that we understand whether there are security problems, as with the flick of a switch an attacker could cause mass chaos.

I go back to my belief that few companies have plans for a major disaster, or for how their business and staff would cope with a major outage of either power or network provision. In a military environment, the two focal points for disabling the enemy are to take their power away - bomb the power plants - and to take out the communications network. These two things effectively disable the enemy, and in a modern business infrastructure, based on information processing and analysis, it is electrical power and network connectivity that are the most critical.

In conclusion, we are becoming an extremely small world, where we are all inter-connected and inter-linked in some way, so we need to understand our links, and perhaps look more at our infrastructure rather than always at our front-end systems.

So what?

In this article I outlined a star rating system for critical infrastructure protection, and I strongly believe in this, whether in Scotland or across the UK, as it will push providers to integrate robustness into their operations, while being able to properly articulate the costs of providing that support.

For my city (Edinburgh), I actually know the key weak points of the infrastructure ... but I'm not telling (though I hope that others know them too, especially those who are focused on protecting this amazing city!).

For us, we are taking it seriously and want to spark debate, and a key part of this is to create software systems which can be easily configured and moved. So if you are interested, here is the event we're hosting:

Book now ... and see the future [here].