Design for Mission Criticality
Flight deck of the Challenger space shuttle.

Technology is all around us, and today, more than ever, it is taking over tasks and activities that were previously entrusted to humans. It is becoming the norm, and there is hardly an area where it is not yet in use. This increased reliance comes with increased expectations: we expect technology to be available. Recent outages, such as the CrowdStrike outage on 19 July 2024 or the outage that hit WhatsApp and Facebook in early 2024, have shown how inextricably IT is now interwoven with our everyday lives. More and more everyday applications are hosted in a public cloud, and an outage affecting a component hosted in the US can cause an airport in Switzerland to cancel all takeoffs and landings, a hospital in Germany to cancel all non-urgent surgeries, or a retailer in the UK to be unable to sell items because its POS[1] is not working.

Technology is everywhere, and concepts like public cloud will further increase our dependency on globally connected systems. This means that we (across the IT industry) need to consider the unintended consequences of a failure even for non-mission critical systems. We must ensure that we apply techniques similar to those we use when designing mission critical systems, such as an in-flight control system or the software of a self-driving car. Of course, there are different levels of availability requirements: not being able to sell a pint of milk has a different customer impact from the in-flight control system of a passenger plane becoming “unavailable” due to an outage in mid-flight, or a hands-free in-car autopilot shutting down because of a failure whilst driving at 70mph in the middle lane of a busy motorway.

Time to consider Mission Criticality more widely

Designing mission critical systems is nothing new. Nuclear power stations, submarines and spacecraft are only a few examples where technology has delivered mission critical services since the 1970s. In the military, technology has long been an integral part of any operation. For instance, modern military vehicles rely heavily on mission-critical systems that enhance mission capabilities and underpin mission success. These systems include sensors, actuators, effectors, radars, and processing resources, including those controlling unmanned vehicles like the ASW Continuous Trail Unmanned Vessel (ACTUV).

However, designing and developing mission critical systems has always been a niche field. For an IT Architect during the 1990s and early 2000s, working on a mission critical system was something special. What is clear is that we can no longer treat mission criticality as a niche field; we (across the IT sector) must consider all levels of criticality when designing IT solutions.

What is mission criticality?

Mission criticality refers to how critical the service delivered by the overall IT system (user interface, integration, data and infrastructure components) is. Service criticality in this context is a measure of the business importance of the service: a failure of a mission critical service may cause immediate, serious, widespread and lasting impact to business operations, and cannot be tolerated for any extended period. In other words, a mission critical system is absolutely vital to the operation of the business.

In addition, a mission critical service is typically constantly available (24 hours a day, 7 days a week). These services typically also deliver an extremely responsive service (end-to-end response time typically less than 1 second) against a demand profile that is random and unpredictable, supporting thousands of transactions per hour, some of them of a secure or highly secure nature.
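To put “constant availability” into numbers, it can help to translate an availability target into the downtime it still permits. The following is a minimal sketch, assuming a 365-day year and a hypothetical helper named `allowed_downtime_minutes`; the targets in the loop are illustrative values only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float,
                             hours_in_period: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget (in minutes) implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * hours_in_period * 60


# Illustrative targets; these are example values, not requirements.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% available -> {minutes:.1f} minutes of downtime per year")
```

A target of 99.99%, for example, still leaves roughly 52.6 minutes of downtime per year, which may or may not be acceptable depending on when and how that downtime occurs.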

Examples are an in-flight control system for an airplane, the control system in a self-driving car or the overall command and control system in a nuclear power plant. But financial systems, like a core retail banking system allowing customers to withdraw money, or an online ordering and distribution system for a global retailer can also be seen as mission critical. Whether or not a service is mission critical is mainly down to the client, and sometimes it can be a challenge to establish the level of criticality in the first place.

Other types of criticalities

Various papers differentiate between safety, mission, business, and security criticality. A system is safety critical when a failure “may lead to loss of life, serious personal injury, or damage to the natural environment”. A failure in a mission critical system “may lead to an inability to complete the overall system or project objectives, e.g., loss of critical infrastructure or data”. A failure in a business-critical system “may lead to significant tangible or intangible economic costs, e.g., loss of business or damage to reputation”, and a failure in a security-critical system “may lead to loss of sensitive data through theft or accidental loss.”

As technology is becoming more and more the “business” of an organization, I do not differentiate between business and mission criticality; it is one and the same.

Business Criticality Framework

Awareness and understanding of the characteristics of each business service is important to ensure that systems and technology services are designed to meet all business critical requirements. Often, the criticality of a business service and/or an IT application is expressed only as a level of availability in terms of an absolute percentage, for instance, “my application has to be 99.99% available”. However, the level of criticality of an IT application depends on several characteristics covering 7 different categories, not just one or two (the list below is an example; variations are possible):

Platinum – Mission Critical

  • The service is vital to the operation of the business. Should the service fail then serious widespread impact to the business will result within a matter of minutes
  • Constant availability
  • Only one service outage per year
  • The service must deliver a real-time immediate response, typically under 2 seconds
  • Demand peaks randomly and with unpredictable volume. “Randomly” here relates to the nature of user access, where the demand is unpredictable
  • Demand for the service is typically measured in thousands of transactions per hour or more
  • The activities and/or information managed by this service are directly or indirectly subject to external regulatory controls that are outside of the Client’s direct control

Gold – Highly Critical

  • The service is very important to the operation of the business. Should the service fail then serious impact to the business will result within a matter of hours
  • Constant availability
  • The service must deliver a real-time immediate response, typically under 2 seconds
  • Predictable demand peaks adhering to known rhythm (daily, weekly, monthly, quarterly, etc) although volume of demand may vary
  • Demand for the service is typically measured in thousands of transactions per hour or more
  • The activities and/or information managed by this service are directly or indirectly subject to external regulatory controls

Silver – Critical

  • The service is vital to the operation of the business. Should the service fail then serious widespread impact to the business will result within a matter of minutes
  • Available 6:00am through 7:00pm, Monday – Saturday
  • The service must deliver a real-time immediate response, typically under 2 seconds
  • Consistent volume of demand at known frequencies – i.e. in a cyclic fashion
  • Demand for the service is typically measured in thousands of transactions per day
  • The activities and/or information managed by this service are directly or indirectly subject to external regulatory controls

Bronze – Important

  • The service is important to the operation of the business, but failure can be tolerated without serious impact to the business for up to a day
  • Available 6:00am through 7:00pm, Monday – Friday (Service is 8am to 6pm)
  • The service must deliver an interactive response, typically under 30 seconds
  • Demand peaks randomly and with unpredictable volume
  • Demand for the service is typically measured in thousands of transactions per day
  • The service is not directly or indirectly subject to external regulatory controls
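One way to work with a framework like this is to record each tier's characteristics as data and compare a service against all of them, rather than against availability alone. The sketch below is only an illustration under stated assumptions: the field names, the category labels (such as `"predictable_peaks"`), the match-count scoring and the `pos_service` example are all hypothetical and not part of the framework itself.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Characteristics:
    """Service characteristics; field names and values are illustrative assumptions."""
    impact_within: str           # "minutes", "hours" or "day"
    constant_availability: bool  # True for round-the-clock service
    max_response_seconds: int    # typical upper bound on response time
    demand_pattern: str          # "random", "predictable_peaks" or "cyclic"
    volume_per: str              # thousands of transactions per "hour" or "day"
    externally_regulated: bool


# Tier definitions transcribed from the list above, most critical first.
# The Platinum-only "one outage per year" limit is left out for brevity.
FRAMEWORK = {
    "Platinum": Characteristics("minutes", True, 2, "random", "hour", True),
    "Gold": Characteristics("hours", True, 2, "predictable_peaks", "hour", True),
    "Silver": Characteristics("minutes", False, 2, "cyclic", "day", True),
    "Bronze": Characteristics("day", False, 30, "random", "day", False),
}


def match_score(service: Characteristics, tier: Characteristics) -> int:
    """Count how many characteristics of the service equal those of the tier."""
    return sum(
        getattr(service, f.name) == getattr(tier, f.name)
        for f in fields(Characteristics)
    )


def best_fit(service: Characteristics) -> str:
    """Return the tier name with the most matching characteristics.

    max() keeps the first maximum and FRAMEWORK is ordered most critical
    first, so ties resolve to the more critical tier (an assumption).
    """
    return max(FRAMEWORK, key=lambda name: match_score(service, FRAMEWORK[name]))


# Hypothetical example: a store point-of-sale service.
pos_service = Characteristics("minutes", False, 2, "cyclic", "day", True)
print(best_fit(pos_service))  # prints "Silver"
```

A simple match count treats every characteristic as equally important; in practice the weighting would need to be agreed with the client.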

Summary

IT is now everywhere, and when we design IT solutions, we must understand all service characteristics to be able to cater for any relevant potential outage scenarios. Mission criticality is not a special case, but a characteristic that must be considered holistically.

Thanks for Reading


[1] POS = point of sale
