The Domino Effect of IT Systems
It has been a week of IT failures, with Capita seeing a major outage and now BA's systems crashing. Unfortunately, the resilience of systems is often an afterthought, and the loss of a key service can bring down the whole of the infrastructure. In the Oil and Gas industry, engineers are well versed in understanding critical paths and in planning alternative routes, but in IT there is often little understanding of the interconnected nature of the infrastructure (both internally and externally). Few systems could actually cope with a major power outage or a DNS outage, as most systems use domain name lookups to find their resources.
No power ... no data
This week Capita's systems were knocked out for at least two days, affecting council and NHS infrastructures around the UK (including Sheffield City Council). It related to a power failure in West Malling, where generators failed and shut down the whole of the data centre. Capita is the largest public sector IT provider in the UK, with around 50% of the overall market (£1.9bn in 2016).
On 8 July 2015, all the flights for United Airlines were grounded, followed by a computer crash on the NYSE (New York Stock Exchange); and then the Wall Street Journal site crashed. These were the kind of events that would signal the start of a major problem, whether a major outage on the Internet or a large-scale cyber-attack. Wired classified it as "Cyber Armageddon", and McAfee have since pointed to it being suspicious that it all happened on the same day, and that there could have been a major cyber attack.
Whether it was a cyber attack or not, it does show:
how dependent our world is on information technology, and that a failure in any part of it could be devastating to both the economy and our lives.
Overall the NYSE was down for over three hours, and it was reported as a technical glitch (costing around $400 million in trades). The thing that it highlighted is that airlines and the stock exchange are two key parts of our critical infrastructure, and problems in either of these, on a long-term basis, could have a devastating effect. Unfortunately few designers of systems take failover into account, as it can considerably increase the costs. Imagine you are quoting for an IT contract, and you say:
"Well that'll be a million to build, but we'll need another million to build it somewhere else and then there's the systems to flip them over, and then there's the load balancers ... and then ... hello ... are you still there?
Often terrorists and cyber attacks are cited as the threat actors triggering this chaos, but in most cases it will be a lack of thought, a lack of investment, and/or human error that are the likely causes, and these things should not be forgotten. While, in the UK, the banks have been toughening their infrastructure against attack, the whole back-end infrastructure needs to be examined, especially the security of power sources, the failure of which will bring everyone down.
There are still two major things that most systems are not resilient against:
- Long-term power failure.
- Sustained DDoS (Distributed Denial of Service).
In the UK, especially, our private sector has seen significant investment in creating secure and robust infrastructures, but a lack of funding in the public sector, along with single vendors controlling key elements of it, shows that there may be cracks in the system.
A rating system for critical infrastructure providers?
In the UK, with the banks going through penetration testing from the Bank of England (with CBEST), should other companies and public sector agencies go through the same thing? Could we have a star system for our network and power supply providers, so that we can assess how they would cope with a major outage? At the moment, we purchase the cheapest and just hope for the best!
A 5-star company would have provable methods of providing alternative supplies, and would offer 24x7 support for any failures. They would also provide on-site supplies, along with assessing key risks within the environment. They would be open to audits of their systems at any time, and would also provide audit and risk assessment facilities to their customers. They would also have a strong understanding of the costs to the business of a range of things, such as loss of business and brand damage.
Does anyone really understand failover?
We build systems and we make them work. We know they are not perfect, as they are built with interlinked dependencies. We often use risk models to understand where the most likely breakpoints will be, and then create failover routes for these. Two fundamental ones are often:
No Power and No Network
so we build in alternative supplies, such as using supplies from different providers, or having a wireless failover system. Then there's denial-of-service, where there is an exhaustion of services within the infrastructure, so we build in load balancers and spin up new instances to cope, but, eventually, these defence mechanisms will exhaust themselves too.
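As a minimal sketch of that idea, the Python snippet below (with hypothetical primary and standby health-check URLs) shows the kind of loop that underpins most failover designs: keep routing to the primary while it answers its health check, and switch to the alternative when it stops responding.

```python
import time
import urllib.request

# Hypothetical health-check endpoints for a primary service and a standby replica.
PRIMARY = "https://primary.example.com/health"
STANDBY = "https://standby.example.com/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_route() -> str:
    """Prefer the primary route; fail over to the standby when it stops responding."""
    if is_healthy(PRIMARY):
        return PRIMARY
    if is_healthy(STANDBY):
        return STANDBY
    raise RuntimeError("Both routes are down - no failover path left")

for _ in range(3):   # poll a few times for illustration
    print("Routing traffic via:", pick_route())
    time.sleep(5)
```

The point of the sketch is the last line of pick_route(): once both routes are gone, there is nothing left to flip over to, which is exactly the situation a power or DNS outage creates.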
Overall, organisations fit into a complex web of interconnected, dependent services, where a failure in any part of the infrastructure will cause problems (Figure 1).
Figure 1: The interconnected world
The weakest link ... people, power supplies and planning
So I've come up with my 3P theory of critical infrastructure ... people, power supplies and planning. Like it or not, the main mistakes are made by people, whether writing bad code, designing weak systems, or just making an operational mistake. As for planning, few companies have strategic plans for a major event, and any senior executive must look at all of the factors that could affect their business, both internally and externally generated.
While many companies can cope with a strike on the transport system, or a slow-down in the payments system, there are two things that will grind them to a halt ... a lack of networked connections to the Internet ... and a lack of power ... as:
No power ... no IT!
While many systems can run on generators, we are increasingly moving into an era where we use Cloud-based services, so a failure to connect to these can cause major problems.
The risk of a large-scale outage was highlighted last week by former US Secretary of Defense William Cohen, who outlined that the US power grid was at great risk of a large-scale outage, especially in the face of a terrorist attack:
The possibility of a terrorist attack on the nation's power grid — an assault that would cause coast-to-coast chaos — is a very real one.
As I used to be an electrical engineer, I understand the need for a safe and robust supply of electrical power, and that control systems can fail. Often, too, we would run alternative supplies to important pieces of equipment, in case one of the supplies failed. So if there's a single point-of-failure on any infrastructure that will cause large-scale problems, it is the humble electrical power supply.
With control systems, there are often three main objectives (or regions of operation):
- Make it safe (protect life and equipment)!
- Make it legal (comply with regulations)!
- Make it work and make it optimized (save money and time)!
So basically the first rule trumps the other ones: a system would shut down if there was a danger to life. Next, the objective would be to make it legal, so that it fitted with regulatory requirements (such as for emissions, noise or energy requirements). Finally, the least important objective was to make it optimized, but if the system drifted back into the first two regions, the control system would focus on making it safe and legal again.
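A toy Python sketch of those three regions of operation, with purely illustrative thresholds, makes the ordering explicit: the safety check always runs first, then the regulatory check, and optimisation only happens when both are satisfied.

```python
# Illustrative limits only, not those of any real plant.
SAFE_TEMP_LIMIT = 90.0        # above this, life or equipment is at risk
LEGAL_EMISSIONS_LIMIT = 40.0  # above this, the plant breaches regulations

def control_step(temperature: float, emissions: float) -> str:
    # Rule 1: make it safe - trumps everything else.
    if temperature > SAFE_TEMP_LIMIT:
        return "SHUTDOWN"
    # Rule 2: make it legal - throttle back until compliant.
    if emissions > LEGAL_EMISSIONS_LIMIT:
        return "THROTTLE_BACK"
    # Rule 3: only now do we optimise for cost and time.
    return "OPTIMISE"

print(control_step(95.0, 10.0))   # SHUTDOWN - safety first
print(control_step(70.0, 55.0))   # THROTTLE_BACK - keep it legal
print(control_step(70.0, 20.0))   # OPTIMISE - free to save money and time
```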
Failover over failover over failover
The electrical supply is one of the key elements that will cause massive disruption to IT infrastructures, so the supply grid will try to provide alternative routes for the power when there is an outage on any part of it. In Figure 2, we can see that any one of the power supplies can fail, or any one of the transmission lines, or any one of the substations, and there will still be a route for the power. The infrastructure must also be safe, so there are detectors which sense when the system is overloaded, and which will automatically switch off the transmission of the power when it reaches an overload situation.
For example, a circuit breaker in your home can detect when too much current is being drawn, and will disconnect the power before any damage is done. The "mechanical" device - the fuse - is a secondary fail-safe, but if both fail you'll have a melted wire to replace, as cables heat up as they pass more current (power is current squared times resistance). If the cable gets too hot, it will melt.
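As a rough worked example of that relationship (illustrative figures only, assuming a 13 A breaker and a 0.5 ohm cable run), the snippet below shows how quickly cable heating grows with current, and why the breaker needs to act first:

```python
# Illustrative figures only: a 13 A breaker protecting a cable run with
# a total resistance of 0.5 ohms.
BREAKER_RATING_A = 13.0
CABLE_RESISTANCE_OHMS = 0.5

def cable_heat_watts(current_a: float) -> float:
    """Power dissipated in the cable: P = I^2 * R."""
    return current_a ** 2 * CABLE_RESISTANCE_OHMS

for current in (5.0, 13.0, 20.0):
    heat = cable_heat_watts(current)
    tripped = current > BREAKER_RATING_A
    print(f"{current:>5.1f} A -> {heat:6.1f} W dissipated, breaker {'TRIPS' if tripped else 'holds'}")
# Doubling the current quadruples the heat, which is why the breaker (and the
# fuse behind it) must act before the cable melts.
```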
In Figure 2, the overload detector will send information back to the central controller, and operators can normally make a judgement on whether a transmission route is going to fail, and make plans for other routes. If it happens too quickly, or if an alarm goes unnoticed, transmission routes can fail, which increases the demand on the other routes, which can then cause them to fail too, so the whole thing falls like a row of dominoes.
Figure 2: Failover of power supplies
Large-scale power outage
So, in the US, the former Secretary of Defense William Cohen has sent a cold sweat down many leaders' backs, including industry leaders, as a major outage on the power grid would cause large-scale economic and social damage. At the core is the limited ability to run for short periods of time on UPS (uninterruptible power supply), and then on generators, in order to keep networked equipment and servers running; a major outage would affect the core infrastructure, which often does not have the robustness of corporate systems. His feeling is that an outage on the grid would cause chaos and civil unrest throughout the country.
Alarm bells have been ringing for a while: Janet Napolitano, former Department of Homeland Security Secretary, outlined that a cyber attack on the power grid was a matter of "when," not "if," and Dr. Peter Vincent Pry, a former senior CIA analyst, defined that the US was unprepared for an attack on its electrical supply network, and that such an attack could:
take the lives of nine out of every ten Americans in the process.
The damage that a devastating EMP (Electromagnetic Pulse), such as from a nuclear explosion, could do has long been well known, but many now think that the complex nature of the interconnected components of the network, and of their control system infrastructure (typically known as SCADA - supervisory control and data acquisition), could be the major risk.
Perhaps a pointer to the problems that an outage can cause is the Northeast blackout of 14 August 2003, which affected 10 million people in Ontario and 45 million people in eight US states. It was caused by a software bug in an alarm system in a control room in Ohio. Some foliage touched one of the supply lines, which caused it to overload. The bug stopped the alarm from being displayed in the control room, where the operators would otherwise have re-distributed the power from other supplies. In the end the power systems overloaded and started to trip, causing a domino effect across the rest of the connected network. Overall it took two days to restore all of the power to consumers.
As the world becomes increasingly dependent on the Internet, we have created robustness in the ways that devices connect to each other, and in the multiple routes that packets can take. But a loss of electrical power will often disable the core routing functionality.
Control systems - the weakest link
As we move into an Information Age, we are becoming increasingly dependent on data for the control of our infrastructures, which leaves them open to attackers. Often critical infrastructure is obvious, such as the energy supplies for data centres, but it is often the least obvious elements that are the most open to attack. This could be the air conditioning system in a data centre, where a failure can cause the equipment to virtually melt (especially tape drives), or the control of traffic around a city. As we move towards using data to control and optimize our lives, we become more dependent on it.
Normally in safety-critical systems there is a failsafe control mechanism: an out-of-band control system which makes sure that the system does not operate outside its safe working range. In a control plant, this might be a vibration sensor on a pump, where, if it is run too fast, this will be detected and the control system will place the overall system into a safe mode. For traffic lights, there is normally a vision capture of the state of the lights, and this is fed back to a failsafe system that is able to detect when the lights are incorrect. If someone gets access to the failsafe system, they can thus overrule safety and compromise the system. This article outlines a case where this occurred, and some of the lessons that can be learnt from it.
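As a minimal sketch of an out-of-band failsafe (with a hypothetical read_vibration() sensor feed and an arbitrary limit), the watchdog below sits outside the main control loop and forces a safe mode when the reading leaves its safe working range. It also shows why the failsafe itself must be protected: whoever controls this small piece of code controls safety.

```python
import random
import time

VIBRATION_LIMIT = 7.0   # illustrative safe working limit

def read_vibration() -> float:
    """Stand-in for a real out-of-band sensor; returns a vibration level."""
    return random.uniform(0.0, 10.0)

def enter_safe_mode() -> None:
    print("Failsafe triggered: placing the system into safe mode")

def watchdog(poll_seconds: float = 0.5, cycles: int = 10) -> None:
    """Run outside the main control loop and trip to safe mode on a bad reading."""
    for _ in range(cycles):
        if read_vibration() > VIBRATION_LIMIT:
            enter_safe_mode()
            return
        time.sleep(poll_seconds)
    print("No unsafe readings seen in this run")

watchdog()
```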
DDoS
Over the past few months the number of DDoS attacks using DNS has been escalating. Without DNS, most accesses to the Internet would be difficult, as it performs the key task of resolving the IP address of the remote server. One method which is increasingly used within Cloud-based systems is Dynamic DNS (DynDNS), which resolves the IP address of a destination on a real-time basis. This method is often used to manage the allocation of network traffic to cloud-based services, so when there is a peak in traffic loads, DynDNS will allocate the traffic to under-used servers, and thus even out the load.
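One simple way to see this kind of DNS-based distribution at work is to resolve a name that fronts a pool of servers and note that it can return several addresses. The Python snippet below does just that; the hostname is a placeholder, and the exact answers will depend on the resolver you use.

```python
import socket

HOSTNAME = "www.example.com"   # placeholder: substitute any large, multi-homed service

# A name that fronts many servers typically resolves to several addresses,
# and resolvers rotate between them to spread the load.
addresses = {info[4][0] for info in
             socket.getaddrinfo(HOSTNAME, 443, proto=socket.IPPROTO_TCP)}
for ip in sorted(addresses):
    print(ip)
```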
Dyn is a company which manages the mapping of IP addresses for many companies, and uses DynDNS to manage workloads on accesses to Internet-based services. Recently there was a massive attack against Dyn which disrupted traffic on the East Coast of the US. This caused problems on sites which use Dyn, including GitHub, Twitter, SaneBox, Reddit, AirBnB, and Heroku.
The attack happened around 11:10 UTC, and it is currently not known who was behind it. It seemed to be focused on the East Coast of the US, where even Amazon's North Virginia AWS region experienced problems with its DNS resolution.
Many services, such as Twitter and Facebook, seemed to have been down between 8:15 a.m. and 8:38 a.m. EDT.
Scott Hilton, from Dyn, reported that:
This morning, October 21, Dyn received a global DDoS attack on our Managed DNS infrastructure in the east coast of the United States. DNS traffic resolved from east coast name server locations are experiencing a service degradation or intermittent interruption during this time. Updates will be posted as information becomes available.
With the increasing reliance on the Internet, there is a great risk that malicious agents could exhaust key network routes, cause a massive reduction in the quality of service, and even cause systems to crash. One of the bluntest tools for this is Distributed Denial of Service (DDoS), and the most common attack is for the system to effectively attack itself - a reflection attack. For this, DNS servers are asked for the IP address of a remote site, but the request is spoofed so that the DNS server responds back to the target system. This causes the DNS servers to swamp the target's internal network.
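A back-of-the-envelope calculation shows why reflection is so attractive to attackers; the query and response sizes below are rough ballpark figures rather than measurements from any particular attack.

```python
# Rough ballpark sizes, not measurements from any specific attack.
QUERY_BYTES = 60        # a small spoofed UDP DNS query
RESPONSE_BYTES = 3000   # a large DNS response sent to the victim

amplification = RESPONSE_BYTES / QUERY_BYTES
attacker_rate_mbps = 100          # bandwidth the attacker actually controls

victim_rate_mbps = attacker_rate_mbps * amplification
print(f"Amplification factor: {amplification:.0f}x")
print(f"{attacker_rate_mbps} Mbit/s of spoofed queries -> "
      f"~{victim_rate_mbps:,.0f} Mbit/s of responses hitting the target")
```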
Estonia thinking ahead
Just imagine if you woke up and there had been a massive cyber attack on your country by a large nation state, and, within hours, all of the major governmental systems had been hacked and the Internet infrastructure had failed. In this scenario, the data used by your government could have been destroyed. So, if your country has been progressive in its adoption of electronic services, the government would have no health care records, no birth certificates, no social care records ... all of it would be gone!
The threat of cyber warfare is real, and it is a threat which is increasing. Small nations are particularly afraid of an attack from larger ones, especially where there is control of key critical infrastructure, such as the power network, and in the face of a massive denial of service attack.
Previously, in 2007, Estonia was hit by a suspected cyber attack from another nation state, and it has now decided to proactively look at moving copies of its data to the UK for protection. Currently it has been using data stores in its embassies around the world to mirror its core governmental data, but this is the first move to formalise the storage of that data outside the country.
The data stored will include information related to its citizens, such as birth records, and also government documents. For Estonia, the UK is seen as a relatively safe place against attack, with a proactive approach to data protection. The 2007 attack took out its infrastructure for several days.
Overall, Estonia has embraced the Internet like few others, and has managed to put most of its government services on-line. This includes voting, paying taxes, and virtually everything else.
A fragile infrastructure?
Many strategists know that the most fragile part of the Internet infrastructure is the energy supply, and that networks will not survive if they cannot run on back-up power for sustained time periods. For a fault, this might be a few hours, but in a cyber attack, the attack may last for many days or weeks. The other area of robustness is the ability to cope with a sustained Distributed Denial of Service attack against the gateway elements of the UK network infrastructure.
While peak attacks are a worry, and have brought down sites like the BBC, a sustained attack can cause many problems, as it often involves valid connections being dropped. Verisign, for example, observed a sustained attack lasting over one month and peaking at 125Gbps.
In the Cyber Age our adversaries can come from many directions, including hacktivists and nation state actors. For serious cyber warfare, the starting point would often be to trip the power stations which provide the electrical power to the data centres involved with the core routing elements of the network. As was seen with the BT outage, this is likely to cause a disruption on the network, as UPSs and back-up generators sometimes do not work as they are meant to. Even with back-up power, most systems only use it to survive for short time periods, and will eventually shut themselves off after a few hours.
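A rough calculation shows why UPS cover is only ever a short-term bridge; the battery capacity and load figures below are illustrative, not those of any particular data centre.

```python
# Illustrative figures, not those of any particular data centre.
battery_capacity_kwh = 200.0    # assumed UPS battery bank
it_load_kw = 400.0              # assumed IT load the UPS has to carry

runtime_minutes = battery_capacity_kwh / it_load_kw * 60
print(f"UPS alone keeps the load up for ~{runtime_minutes:.0f} minutes")
# After that, everything depends on the generators starting (and on the
# fuel supply lasting for the duration of the outage).
```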
Sustained attack against a nation?
Apart from tripping the power supplies for the key elements of the core network infrastructure, a focus for cyber attackers is likely to be a Distributed Denial of Service attack against the main gateways of the country. For this, the Verisign Denial of Service report for 2016 gives some fairly shocking high-level statistics:
- Number of attacks: 111% increase (year on year).
- Peak attack: 274 Gbps and 56 million packets per second (a quick sanity check on these figures follows the list).
- Average peak attack: 19.37 Gbps (41% over 10 Gbps).
- Typical type: 64% were UDP streams (21% TCP, and 9% IP fragmentation).
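As promised above, dividing the peak bit rate by the peak packet rate gives an average packet size of roughly 600 bytes, which is consistent with reflection-style UDP floods:

```python
# 274 Gbit/s at 56 million packets per second implies an average packet
# size of roughly 600 bytes, typical of reflection-style UDP floods.
peak_bits_per_second = 274e9
peak_packets_per_second = 56e6

avg_packet_bytes = peak_bits_per_second / 8 / peak_packets_per_second
print(f"Average packet size at peak: {avg_packet_bytes:.0f} bytes")
```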
Attack throughput has generally increased over the years, along with the bandwidth of links, and 2016 saw a majority of attacks exceed 5Gbps. The number of attacks also increased greatly within 2016.
One thing that is sure is that the speed of the attacks has increased over the years, with a large proportion of the attacks within 2016 running at over 10Gbps, and over 87% at over 1Gbps. Most of the attacks used a reflection attack, where DNS and NTP servers, which provide core services within the network, are used to swamp the internal network with traffic.
DNS is required for domain name resolution, and without it the Internet just wouldn't work, as we would have no idea of the IP address of a destination server. It is, though, a weak protocol from a security point-of-view, yet essential for the network to work.
In a cyber attack, the core servers become the focus of a reflection attack, where the clients request a domain lookup for a site, but spoof the request so that the DNS server responds back to a targeted server. The traffic will then swamp network connections, and it is almost impossible to filter it out without bringing down the whole of the network.
Finger on the pulse of the Internet
Akamai Technologies provides content delivery for the Internet, and is responsible for around 30% of all Web traffic. They thus have their finger on the pulse of the Internet, and have servers all over the world providing fast delivery of requested content. There is a good chance that you're receiving this page via one of Akamai's servers. So, basically, if a user in Japan wants to access Facebook, then Akamai is likely to serve them the page from a server located in their own country.
Recently Akamai Technologies announced that Distributed Denial of Service attacks are the most worrying threat at the present time, and their level of sophistication increases by the day. They have even recently monitored a sustained attack of 363Gbps (57 million packets per second) against a media outlet. This is the type of attack which few companies, let alone a nation state, could cope with for a sustained period. The main focus is the reflection attack, where valid servers are used to generate large amounts of traffic, and the protocol most often used is DNS.
Gaming companies, such as Sony and Microsoft, and media organisations, such as the BBC, have all recently been hit by outages caused by DDoS. Even Pokemon GO was targeted by a hacking group, which took the game off-line for a considerable time.
Akamai report that the increasing number of botnets has made it difficult to stop DDoS and trace its source. The top sources of DDoS traffic are Vietnam, Brazil and Colombia.
Largest throttle
Verisign outline that multi-vector attacks were often used, with a range of protocols involved, which aim to completely exhaust the resources of the network infrastructure. For many companies, the capacity of network links has moved up through 1Gbps (Gigabit Ethernet) and on to 10Gbps (Ten Gigabit) on the core network. New switched networks are now moving up to 40Gbps (40G) and even 100Gbps (100G). While most of the current spend is on 10G networks, the majority of spending will focus on 40G and 100G by 2020.
The largest volumetric attack used UDP, Internet Control Message Protocol (ICMP) and TCP flood traffic (a multi-vector attack), along with a DNS and NTP reflection attack. At its peak it had rates of 274 Gbps, used UDP ports 53 and 80, and was sustained for over five hours with a flow of more than 200 Gbps.
Sustained attacks
A sustained attack can cause many problems, as it often involves valid connections being dropped. Verisign observed one such attack lasting over a month and peaking at 125Gbps. It used multiple methods, such as DNS reflection, fragmented packet attacks, ICMP floods and various TCP floods (SYN flood, Connect flood, and so on). Along with this, they observed a Christmas Tree flood.
Common targets
Cloud systems, such as Amazon and Azure, along with IT providers, are the most exposed, and they accounted for almost 32% of attacks, with the finance sector involved in 27%. The highest average throughput was targeted at telecommunications providers (38Gbps) and media outlets (32Gbps average). Telecommunications providers are obviously exposed, as they most often carry the traffic flows.
Ability to decimate a country
With our increasing dependence on the Internet, an all-out attack against a country is likely to bring down its critical infrastructure, especially around energy supplies. Without power, our data centres would crash, along with the control systems for our transport network. Few transport systems would still work in a large-scale cyber attack: the airlines would stop all their operations, and disruption of the traffic light infrastructure would bring deadlock within hours.
As noted earlier, former Secretary of Defense William Cohen has outlined that a major outage on the power grid would cause large-scale economic and social damage, and his feeling is that it would bring chaos and civil unrest throughout the country.
In the US, Senator Joe Lieberman (I-CT) wrote a bill named "Protecting Cyberspace as a National Asset Act of 2010", which was seen as the "kill switch bill", as it would give the President the power to take over parts of the Internet.
How would a country cope with the initial phases of a cyber attack?
Perhaps the recent attempted coup in Turkey gives a hint of the type of scenario we would see in a cyber attack, where internal control of the network would cause a disruption in service provision.
To deal with this, as with many finance companies, countries need to invest in a 24x7 SOC (Security Operations Centre) for critical infrastructure, which monitors the complete network and data infrastructure, and which can control and manage a potential attack.
In an attack, it is likely that the Security Operations Centre would take control of the internal network and limit access to services. Five initial stages could be:
- Stage 1: Take-over. The first thing that the network and security engineers will have to do in a cyber warfare attack will be to take over the control of the traffic, otherwise the country's own citizens will crash the internal infrastructure. The challenge in this phase is to control the internal forces while dealing with external pressures. These days, many services use Cloud systems, so throttling back external traffic could also disrupt the network.
- Stage 2: Coping with the threat. As Stage 1 happens, security analysts are likely to be analysing the external threats, such as coping with a large-scale Distributed Denial of Service, and trying to understand how they could cope with an external (or internal) attack without actually affecting the internal network. The plans would then have to be carefully intertwined to make sure that any control of the external threat does not affect the internal operation of the infrastructure. A large-scale crash would be almost impossible to cope with, as servers and services are normally interconnected, and the infrastructure would take a while to recreate itself. Like it or not, much of the infrastructure still requires a great deal of human intervention.
- Stage 3: Observation and large-scale control. At this stage we will see the Chernobyl Nuclear Power Plant effect, where the most important alarm on the system is swamped by other less important ones. Alerts will be coming in on system crashes and problems, so plans must be in place to filter these alerts so that only the most important ones are fed to analysts, who can then try to put plans in place to overcome the problems before the infrastructure collapses. While many have tried to model the complex behaviour of our network infrastructure, it is almost impossible to predict, so security and network analysts will have to cope with the large-scale disruption and make decisions on how to keep the core infrastructure up-and-running. A key focus of this stage would be to make sure that military, transport, energy, health and law enforcement systems were given the highest priority, along with financial systems (as the shock wave of a disruption to the economic infrastructure of the country could be long lasting). A small sketch of this kind of alert triage is given after this list.
- Stage 4: Observation and fine control. At this stage, we would move to a point where there was some stability, and where lesser alerts could be coped with. These might relate to services which are less important, but which need to be sustained. A key focus would be to protect the financial and commercial interests of the country.
- Stage 5: Coping and restoring. The final stage is likely to be the restoration of normality, and recovering the systems, which may have been damaged in some way. In a cyber warfare event, this could be an extremely costly process, especially if the country has not coped well with the attack.
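As a small illustration of the Stage 3 triage idea, the sketch below (with hypothetical sectors, scores and alerts) ranks incoming alerts so that the critical-infrastructure sectors listed above are seen first when analysts cannot read everything:

```python
from dataclasses import dataclass

# Hypothetical priority order: critical-infrastructure sectors first.
SECTOR_PRIORITY = {
    "military": 0, "energy": 1, "health": 2,
    "transport": 3, "law_enforcement": 4, "finance": 5, "other": 9,
}

@dataclass
class Alert:
    sector: str
    message: str
    severity: int   # 1 (low) to 5 (critical)

def triage(alerts, analyst_capacity):
    """Return only the alerts the analyst team can realistically handle."""
    ranked = sorted(alerts, key=lambda a: (SECTOR_PRIORITY.get(a.sector, 9), -a.severity))
    return ranked[:analyst_capacity]

incoming = [
    Alert("finance", "Payment gateway latency spike", 3),
    Alert("energy", "Substation telemetry lost", 5),
    Alert("other", "Intranet wiki unreachable", 2),
    Alert("health", "Hospital VPN concentrator down", 4),
]
for alert in triage(incoming, analyst_capacity=2):
    print(alert.sector, "-", alert.message)
```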
A failure to cope with the initial attack could make it extremely difficult to take control of the situation and also to recover from it.
Conclusions
With all our complex infrastructures, it is the simplest of things that can trip them all and cause large-scale chaos ... the electrical supply. Unfortunately it's not an easy call to make, as the systems need to be safe, but this safety can lead to automated trips, and these are vulnerable to operator error.
As we move, too, into a world of intercommunication between cars and the roadway, and between cars themselves, it is important that we understand whether there are security problems, as with the flick of a switch an attacker could cause mass chaos.
I go back to my belief that few companies have plans for a major disaster, or for how their business and staff would cope with a major outage of either power or network provision. In a military environment, the two focal points for disabling the enemy are to take their power away - bomb the power plants - and to take out the communications network. These two things effectively disable the enemy, and in a modern business infrastructure, based on information processing and analysis, it is electrical power and network connectivity that are the most critical.
In conclusion, we are becoming an extremely small world, where we are all inter-connected and inter-linked in some way, so we need to understand our links, and perhaps look more at our infrastructure rather than always at our front-end systems.