Challenges of Mitigating Vulnerabilities in Digital Substations

Industrial Control Systems (ICS) in the electric power sector, particularly those utilising the IEC 61850 standard for fully digital substations, face unique challenges in mitigating vulnerabilities. Ideally, asset owners have a program in place that provides timely information about ICS vulnerabilities. With the introduction of the SOCI Act Enhanced Cyber Security Obligations, and in maturing their cyber security programs towards AESCSF SP-3, electric utilities across the country have implemented OT network security monitoring systems to improve asset visibility, vulnerability and patch management, and near real-time threat awareness. This is a great effort to improve their OT cyber security posture and operations and to enhance their incident response and recovery capabilities. However, even with accurate vulnerability information, verifying whether a vulnerability actually applies to an ICS can be difficult (a minimal applicability-check sketch follows the list below). Mitigating these vulnerabilities can be even more complex due to the following challenges:

  1. Extensive testing must be performed before a mitigation (such as a patch) is applied, to ensure it does not affect critical system functions;
  2. If a patch, upgrade, or configuration change is considered viable, strategic planning and downtime are required to implement it. In high-availability control system environments, finding a suitable downtime window can be challenging; and
  3. Even after testing, the system must be monitored to ensure the mitigation is working as intended.
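
As a rough illustration of the applicability problem, the sketch below matches a vendor advisory feed against an asset inventory by firmware version. It is a minimal sketch only: the asset names, advisory fields, and version scheme are all hypothetical, and real advisories often require per-model, per-option-card analysis that no simple version comparison captures.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str      # e.g. a protection relay in an IEC 61850 bay
    vendor: str
    model: str
    firmware: str  # installed firmware version, e.g. "3.1.4"

@dataclass
class Advisory:
    cve_id: str
    vendor: str
    model: str
    fixed_in: str  # versions below this are assumed affected

def version_tuple(v: str) -> tuple:
    return tuple(int(p) for p in v.split("."))

def applicable(asset: Asset, adv: Advisory) -> bool:
    """True if the advisory plausibly applies to this asset's firmware."""
    return (asset.vendor == adv.vendor
            and asset.model == adv.model
            and version_tuple(asset.firmware) < version_tuple(adv.fixed_in))

# Hypothetical inventory and advisory feed entries.
inventory = [Asset("BAY1-PROT-A", "VendorX", "Relay-900", "3.1.4")]
feed = [Advisory("CVE-2024-00000", "VendorX", "Relay-900", "3.2.0")]

for asset in inventory:
    for adv in feed:
        if applicable(asset, adv):
            print(f"{adv.cve_id} may apply to {asset.name} (fw {asset.firmware})")
```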

The "Principles of operational technology cyber security | Cyber.gov.au" published by ASD acknowledges above challenges and highlights the need for a cyber security operation for continuous monitoring and robust patch management strategies with a team composed of protection, control, automation, networking, R&D, and cyber security engineers with deep and intricate knowledge of the power systems and its components. I would like to list three (3) of the key principles below:

Principle 1: Safety is paramount – Ensure the system is safe!

Safety is critical in physical environments. This includes safety of human life, safety of plant, equipment and the environment, and reliability and uptime of the process. Cyber security controls must be safe, and safety must be informed by the cyber threat environment.

The principle of “safety is paramount” implies the following incident response questions are significant:

  • If there is a cyber incident in an area that requires software running correctly for the work environment to be considered safe (safety and protection systems), is an organisation prepared to send staff to that site knowing that a bad actor has been, or is currently, on the network?
  • If there is a cyber incident in an area that requires software running correctly for the work environment to be considered safe (safety and protection systems), in many ways this means that paying a ransom cannot be an option, as there is no timely large-scale method to verify that the system has been returned to a safe state. Can an organisation be confident that the encryption process was the only modification to the files, given that a malicious actor is known to have been on the OT network?
  • Is restoring from backup an acceptable approach to mitigate cyber incidents? That is, if a malicious actor has been on the network for a period of time, can the backups be trusted? Is there a way to validate that the critical OT system is safe after recovery? (A minimal backup integrity-check sketch follows this list.)
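
One narrow, practical piece of the backup-trust question is whether backup files have been altered since they were taken. The sketch below is a minimal illustration, assuming a hash manifest is captured at backup time and stored somewhere an intruder on the OT network could not reach (for example, write-once media or a separate enclave). It verifies file integrity only; it cannot prove the backed-up configuration was clean when captured.

```python
import hashlib
import json

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(manifest_path: str) -> list:
    """Return the files whose current hash differs from the recorded one."""
    with open(manifest_path) as f:
        manifest = json.load(f)  # {"path/to/file": "sha256-hex", ...}
    return [path for path, want in manifest.items() if sha256_of(path) != want]

# Usage (hypothetical manifest location): any non-empty result means the
# backup set has changed since the manifest was written.
# tampered = verify_backup("/mnt/worm/backup-manifest.json")
```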

Safety of human life, safety of the plant equipment, safety of the environment, and the need to maintain reliability and uptime, are necessary systemic ways of thinking that need to permeate all tasks, even essential and common cyber hygiene tasks potentially considered unrelated, such as:

  • How to take a backup? Are there risks to executing backups over the same (potentially close-to-saturated) network as time-critical safety control messages?
  • How to do asset discovery? Are active methods acceptable, or is passive the only way? (A passive-discovery sketch follows this list.)
  • How to patch, and how to do change management in general? What are the system requirements for frequency, testing rigour, scope, roll-out strategies and roll-back strategies?
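
For context on the passive option, the sketch below listens on a mirror (SPAN) port and builds a MAC-to-IP table without transmitting a single frame, so it cannot disturb time-critical GOOSE or Sampled Values traffic the way an active scan might. It is a minimal sketch: the interface name is a placeholder, and a real deployment would also parse protocol details to fingerprint device types.

```python
# Requires scapy (pip install scapy) and access to a mirror/SPAN port;
# "span0" is a placeholder interface name. Listens only, never transmits.
from scapy.all import sniff, Ether, IP

seen = {}  # MAC address -> set of IP addresses observed for that MAC

def note(pkt):
    if pkt.haslayer(Ether):
        ips = seen.setdefault(pkt[Ether].src, set())
        if pkt.haslayer(IP):
            ips.add(pkt[IP].src)

# Capture for 60 seconds without keeping packets in memory.
sniff(iface="span0", prn=note, store=False, timeout=60)

for mac, ips in sorted(seen.items()):
    print(mac, sorted(ips))
```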

Principle 2: Knowledge of the business is crucial – Know and defend vital systems.

The more knowledge a business has about itself, the better that business can protect against, prepare for and respond to a cyber incident. The higher in the organisation there is an understanding, visibility and reporting of cyber risks, especially to OT systems, the better the outcome.

All critical infrastructure organisations should ensure they meet the following baselines:

  • Identify the vital systems the organisation needs to continue to provide their crucial services
  • Understand the OT system’s process, and the significance of each part of the process
  • Create an architecture that allows those vital systems and processes to be defended from other internal and external networks
  • Ensure that personnel responsible for designing, operating and maintaining OT systems understand the business context that the OT system operates within, including the physical plant and process connected to the OT system and how it delivers services to stakeholders.
  • Understand the dependencies vital systems have to be able to operate and where they connect to systems external to the OT system.

Examples and implications

A commonly agreed upon imperative of cyber security is to know what needs to be protected. The first part of this is to understand which elements of the business are essential for the organisation to be able to provide its critical services. The second part is to understand the systems and processes being protected. This may include (but is not limited to): systems engineering drawings, asset lists, network diagrams, knowing who can connect to what and from where, recovery procedures, software vendors, services and equipment, and, to the extent possible, software bills of material and the desired configuration state.
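
To make the "desired configuration state" idea concrete, the sketch below compares a device's reported firmware and the hash of its configuration file against an approved baseline. It is a minimal, hypothetical illustration: the record fields, device name, and hash value are placeholders, and a real system would pull the reported values from the devices themselves.

```python
import hashlib

# Hypothetical "desired state": per device, the approved firmware version and
# the SHA-256 of the approved configuration file (hash value is a placeholder).
desired_state = {
    "BAY1-PROT-A": {"fw": "3.1.4", "cfg_sha256": "ab12...placeholder"},
}

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_drift(device: str, fw_reported: str, cfg_path: str) -> list:
    """Compare what the device reports against the approved baseline."""
    want = desired_state[device]
    findings = []
    if fw_reported != want["fw"]:
        findings.append(f"firmware drift: {fw_reported} != {want['fw']}")
    if sha256_of(cfg_path) != want["cfg_sha256"]:
        findings.append("configuration file differs from approved baseline")
    return findings
```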

Knowing what parts of the business are essential to be able to provide a critical service requires both top-down and bottom-up thinking. Top-down thinking has historically led many organisations to seek to separate OT from IT. Bottom-up thinking provides an opportunity for an organisation to go further and discover the minimal set of OT equipment required for a critical function. For example, to be able to generate electricity, depending on the generator, it may be that the minimum requirement is the generator, a controller in a control panel, and a suitable fuel supply. For critical infrastructure entities, understanding what is needed to protect the absolute core functions - keeping the water flowing and the lights on - should then guide the effective layering of cyber security controls. This has implications for architecture, protection, detection, and backup of devices and files.

It is essential that OT-specific incident response plans and playbooks are integrated into the organisation’s other emergency and crisis management plans, business continuity plans, playbooks and mandatory cyber incident reporting requirements. The involvement of a process engineer is important, both when creating plans and playbooks and during any investigation, containment or recovery processes. There is also a need to provide an information pack to third parties before or when they are engaged, to quickly bring them up to speed. This third-party pack should include the likes of points of contact, naming conventions for servers, data sources, deployed tools, and what tools are acceptable to be deployed. All plans, playbooks, and third-party packs must be regularly exercised, updated by all relevant parties including legal, and protected due to their value to adversaries.

Physical aspects that aid staff to have knowledge of the OT system should also be considered. This may include colour coding cables, putting coloured banding on existing cables, or marking devices allowed in the OT environment in a highly visible way. Only authorised devices should be connected to the OT environment, to help ensure that only authorised code can be introduced to OT environments. Overt visual cues allow an organisation to better protect their environment by identifying unauthorised devices, and allow an organisation to quickly make correct decisions in response to cyber or intelligence-based events. Such markings would need to be periodically assessed and verified to ensure accuracy and currency.

Understanding the business context of the OT system is essential for assessing the impact and criticality of OT outages and cyber security compromises. It is also vital to determining recovery priorities during a critical incident. For organisations reliant on OT to be able to provide a critical service, an integrated OT cyber security function is a necessary part of the business. OT cyber security personnel are not expected to have the deep understanding of a physical system that an electrical, chemical, or process engineer may have, but they should have a working knowledge of plant operation and most importantly, maintain working relationships with those in the organisation responsible for the physical plant. Such relationships are critical both to the success of any cyber enhancement project as well as when there is a need to respond to a cyber event.

Principle 6: People are essential for OT cyber security

A cyber-related incident cannot be prevented or identified in OT without people that possess the necessary tools and training creating defences and looking for incidents. Once a cyber-related incident has been identified in OT, trained and competent people are required to respond.

A strong safety-based cyber security culture is critical to the on-going cyber resiliency of OT systems. There is a need for each organisation to reframe the requirements from these principles as workplace safety requirements, as opposed to cyber security requirements.

Staff, particularly field technicians and all other members of operating staff, are often the front line of defence and detection for an organisation.

Examples and implications

A mix of people with different backgrounds, with various skills, knowledge, experience and security cultures, is necessary to support effective OT cyber security practices. This includes members from infrastructure and cyber security teams (commonly found in IT), as well as control system engineers, field operations staff, and asset managers (commonly found in OT).

Developing a cohesive OT cyber security culture requires general alignment on the principles of OT throughout the organisation. Consider that members from different backgrounds will carry different inherent values and priorities. For example, the first principle of OT cyber security, “Safety is paramount”, often requires a fundamental shift in thinking for people from non-engineering or non-critical infrastructure backgrounds. It is important that team members with non-engineering backgrounds gain an understanding of OT challenges so the team can work cohesively in OT.

In most critical infrastructure OT sites, from electricity generation to water treatment facilities, staff are the front line of defence. They almost certainly will not be OT cyber security experts, nor people who work in corporate IT. Field operations staff rarely receive formal information technology or cyber security training and certification. Often, experience with the IT components of an Industrial Control System (ICS) will have been developed on-the-job, out of necessity due to the growing dependency of site operations on ICT infrastructure and IP-based communication.

As such, significant focus is required to develop cyber security awareness as a core component of field safety culture, so that operators feel confident and empowered to raise potential cyber concerns, without fear of ridicule or judgement. Further, there needs to be a process put in place where cyber-safety related observations can be raised quickly, with a culture of knowing that observations will be appreciated.

Potential strategies to develop security awareness and a cyber-safe culture amongst staff include:

  • Incorporating cyber security into safety assessments, factory acceptance testing (FAT), site acceptance testing (SAT), and the engineering change management process. Established methods include Cyber-Informed Engineering, Cyber PHA or HAZCADS.
  • Creating environments and processes that encourage local staff to identify and report suspicious behaviour. A common anti-pattern is for engineers to perform remote maintenance without informing on-site staff. The field operator will observe the engineer’s activities as a mouse moving on a local machine or visible interaction with the HMI. Local staff will grow to ignore such behaviour as being normal and legitimate.
  • Conditioning field operators to consider the possibility of cyber compromise when operational faults are identified. Historically, faults that engineers address have been due to engineering issues such as misconfiguration, device failure, corruption of data or the device working outside of tolerances. Typical responses include restarting the program, rebooting or resetting the device, re-flashing or loading a known good configuration, or replacing the device. Historically, malicious cyber actions have not been considered, meaning that cyber incidents may have been misidentified and dismissed as operational faults or missed entirely. The possibility that a fault has a cyber-related cause should also be considered. Most, if not all, of the traditional remediation steps listed will reset communication links and wipe volatile memory, which may have helped a cyber security investigation. Specific additional processes, and changes to long existing processes, are required for cyber identification, classification and investigations in OT.

Real-World Incidents

  1. Hatch Nuclear Power Plant (2008): An engineer installed a software update on a business network computer, inadvertently resetting the data on the plant’s control system and causing the reactor to shut down for 48 hours. This incident highlighted the risks of making changes to interconnected systems without thorough testing.
  2. Davis-Besse Nuclear Power Plant (2003): The SQL Slammer worm entered the plant network via a contractor connection and crashed it, disabling the Safety Parameter Display System for nearly five hours. A patch for the exploited vulnerability had been available for months but had not been applied, underscoring the importance of careful planning and timely, tested patching in critical environments.
  3. ICS Patching Incident (2023): A patch applied to an ICS environment in a manufacturing plant caused unexpected system reboots and downtime. The patch, intended to fix a security vulnerability, was not thoroughly tested in the specific ICS environment, leading to significant operational disruptions.
  4. Port Augusta Renewable Energy Park Incident, South Australia (2022): On 23 June 2022, the Port Augusta Renewable Energy Park experienced power system oscillations due to a coding error introduced during testing. The incident did not affect generation or customer load, but it highlighted the complexities and risks of integrating renewable energy sources into the grid.

These incidents illustrate the critical need for robust patch management and thorough testing in ICS environments.

Best Practices for ICS Patch Management

  1. Establish a Test Environment: Set up a separate test environment that closely mimics the production system to safely test patches before deployment.
  2. Use Hardware-in-the-Loop (HIL) Simulation: Incorporate real hardware components in the test environment to simulate actual conditions.
  3. Develop a Comprehensive Testing Plan: Create detailed test scenarios covering various conditions, including normal operations, peak loads, and failure conditions.
  4. Automate Testing: Use automated testing tools to run repetitive tests efficiently and identify issues quickly (a minimal post-patch smoke-test harness is sketched after this list).
  5. Collaborate with Vendors: Work closely with vendors to understand the potential impacts of patches and access vendor-specific testing tools.
  6. Conduct Pilot Testing: Apply patches to a small subset of the environment first and monitor the results closely before full deployment.
  7. Document and Review: Keep detailed records of all tests conducted and regularly review and update testing procedures.
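
As a flavour of what the automated portion can look like, the sketch below is a minimal pytest-style smoke test that waits for patched devices to come back online and expose their services. Hostnames, the port list, and the five-minute deadline are placeholders; real acceptance criteria (protection element behaviour, GOOSE publication rates, time synchronisation) would come from the protection and automation engineers for the system under test.

```python
# Run with pytest. Hostnames, ports and the deadline are placeholders.
import socket
import time

TEST_DEVICES = [("relay-lab-01", 102)]  # TCP/102: IEC 61850 MMS

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def test_services_restored_after_patch():
    # Allow up to five minutes for devices to finish rebooting post-patch.
    deadline = time.time() + 300
    for host, port in TEST_DEVICES:
        while not tcp_reachable(host, port):
            assert time.time() < deadline, f"{host}:{port} not back in time"
            time.sleep(5)
```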

Importance of Testing and Setting Up a Test Environment

Testing is crucial for ensuring that patches and updates do not introduce new vulnerabilities or disrupt operations. Here’s how to set up an effective test environment:

  1. Define Objectives: Clearly outline the goals of the test environment, such as testing patches, training staff, or simulating attacks.
  2. Inventory and Replicate: Create an inventory of all components in the production environment and replicate them in the test environment.
  3. Use Realistic Hardware: Use the same or similar hardware as in the production environment to ensure accurate testing.
  4. Network Segmentation: Isolate the test environment from the production network to prevent accidental disruptions (a minimal isolation check is sketched after this list).
  5. Simulate Real Processes: Implement simulations of actual processes controlled by the ICS.
  6. Implement Virtualization: Use virtualization to create multiple instances of ICS components for flexible testing.
  7. Develop Test Scenarios: Create detailed test scenarios to identify potential issues under different conditions.
  8. Automate Testing: Use automated tools to run tests efficiently and identify issues quickly.
  9. Document and Review: Keep detailed records of all tests and regularly review and update procedures.
  10. Training and Collaboration: Ensure staff are trained to use the test environment effectively and collaborate with vendors and industry groups.
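
One cheap, repeatable way to build confidence in item 4 is to verify from inside the test enclave that production endpoints are unreachable. The sketch below is a minimal illustration with placeholder addresses and ports (TCP/102 for IEC 61850 MMS, TCP/502 for Modbus); it proves only TCP reachability, so it complements, rather than replaces, a review of the firewall rules themselves.

```python
# Run from a host INSIDE the test enclave. Addresses/ports are placeholders.
import socket

PRODUCTION_ENDPOINTS = [("10.10.1.1", 102), ("10.10.1.2", 502)]

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

leaks = [(h, p) for h, p in PRODUCTION_ENDPOINTS if reachable(h, p)]
if leaks:
    raise SystemExit(f"ISOLATION FAILURE: test enclave can reach {leaks}")
print("Isolation check passed: no production endpoints reachable.")
```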

A well-setup test environment helps in thoroughly testing patches and updates, ensuring they do not introduce new vulnerabilities or disrupt operations. It also plays a crucial role in cybersecurity testing, allowing for vulnerability assessments, penetration testing, and incident response drills in a controlled setting.

Collaborating with Vendors: The SEL Advantage

Collaborating with vendors such as Schweitzer Engineering Laboratories (SEL) can be a highly effective strategy to overcome the challenges of mitigating vulnerabilities in ICS environments. Here’s why partnering with SEL could be the best choice:

  1. Comprehensive Control: SEL controls its hardware, software, and supply chain throughout the manufacturing lifecycle, ensuring high-quality, secure products.
  2. Global Expertise: SEL has access to a global talent pool of R&D, Protection, Control & Automation, and Cybersecurity Engineers, providing cutting-edge solutions and support.
  3. Engineering Services: SEL Engineering Services extends the manufacturing capability into the design and integration of power systems, offering end-to-end solutions.
  4. Cybersecurity Experience: The SEL Infrastructure Defense Cyber Services team has over 20 years of experience in OT cybersecurity for critical energy infrastructure, ensuring robust protection against cyber threats.
  5. Advanced Testing Capabilities: SEL owns Hardware-in-the-Loop (HIL) Simulators to create test environments that are replicas of power systems, enabling thorough and accurate testing of patches and updates.
  6. 24/7 Comprehensive Security Operation Center (SOC) Services: SEL offers tailored SOC services specifically designed for the needs of the electric power sector, providing continuous monitoring and rapid response to security incidents.

By partnering with SEL, organizations can leverage these advantages to enhance their ICS security and reliability, ensuring that vulnerabilities are effectively mitigated without compromising system performance and safety.
