Workload Resiliency in the Cloud

As more of our daily lives move online, any service disruption becomes a real inconvenience because it interrupts our routines. In 2023, Singapore's DBS Bank, rated as the World's Best Bank, experienced multiple outages, with the latest one lasting more than four hours. In response, Singapore's central bank, the Monetary Authority of Singapore (MAS), has advised customers to keep alternative payment providers or even some cash on hand. This underscores the need for redundancy in payment options: having alternatives, or diversification, is always worthwhile.

Many businesses have moved to the public cloud because cloud providers have built highly resilient infrastructure; however, services still fail from time to time. It should come as no surprise that system outages are becoming increasingly common. As systems grow more complex and we grow complacent in the belief that the cloud, public or private, is infallible, we experience more frequent outages.

These outages have many causes: technology failures, human mistakes, or even a fire at the data center. Human error can happen anywhere. In 2017, a human error disrupted AWS S3. In October 2023, DBS went offline when cooling upgrades were carried out during peak hours at an Equinix data center.

Many of us have become reliant on e-services for daily transactions, making it essential for service providers to ensure reliability. Some food stalls in Singapore are now completely cashless. Consumers do not blame the cloud providers, third-party providers, or the data center when they cannot access their favourite services. They expect the service provider to deliver a reliable service, whether it is a government e-service or an e-banking application.

As we move our workloads to the cloud, let's look at some factors that contribute to these disruptions, including:

  • Multi-hybrid cloud complexity
  • Lack of choice leads to constraints
  • New technology means new skills are needed

The solution? Embracing application modernization through a common platform. Let’s look in more detail at each one of the points to see how a common platform can help.

Complexity of Multi-Hybrid Cloud

Moving to the public cloud is attractive because of the value it brings. Cloud providers have economies of scale and offer a shared responsibility model that companies can leverage. However, systems can fail at times, and we should be better prepared and mindful of cloud concentration risk. According to Gartner's latest survey, “The risk associated with cloud concentration is fast losing its ‘emerging’ status as it is becoming a widely recognized risk for most enterprises.”

Businesses need to assess and address potential concentration risks in their use of cloud services, since these can reduce service resilience. However, this assessment should apply only to the class of workloads that needs to be highly resilient, where any disruption can have a major negative impact on consumers.
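One rough way to make this assessment concrete is to measure how much of the critical workload inventory sits with a single provider. The following is a minimal sketch, not a regulatory method: the `Workload` record, the provider labels, the sample inventory, and the 50% threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Workload(NamedTuple):
    name: str
    provider: str   # hypothetical label such as "cloud-a" or "on-prem"
    critical: bool  # True if a disruption would seriously affect consumers


def concentration(workloads: list[Workload]) -> dict[str, float]:
    """Share of critical workloads hosted by each provider (0.0 to 1.0)."""
    critical = [w for w in workloads if w.critical]
    if not critical:
        return {}
    counts = Counter(w.provider for w in critical)
    return {provider: n / len(critical) for provider, n in counts.items()}


def over_concentrated(workloads: list[Workload], threshold: float = 0.5) -> list[str]:
    """Providers whose share of critical workloads exceeds the threshold."""
    shares = concentration(workloads)
    return sorted(p for p, share in shares.items() if share > threshold)


# Illustrative inventory; names and placements are made up.
inventory = [
    Workload("payments", "cloud-a", True),
    Workload("login", "cloud-a", True),
    Workload("statements", "cloud-a", True),
    Workload("atm-network", "on-prem", True),
    Workload("marketing-site", "cloud-b", False),
]

flagged = over_concentrated(inventory)
```

Only critical workloads count towards the share, which mirrors the point above that concentration risk matters most for the workloads that must be highly resilient. A real assessment would likely weigh workloads by business impact rather than counting them equally.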

Some of these workloads in certain industry verticals are deemed to be critical in nature. In Singapore, the Cyber Security Agency is the government agency that is responsible for the Cybersecurity Act. Under section 7(1) of the Cybersecurity Act, a Critical Information Infrastructure (CII) is a computer or a computer system located wholly or partly in Singapore, necessary for the continuous delivery of an essential service, and the loss or compromise of the computer or computer system will have a debilitating effect on the availability of the essential service in Singapore. The critical sectors are Energy, Water, Banking & Finance, Healthcare, Transport (which includes Land, Maritime, and Aviation), Government, Infocomm, Media, and Security & Emergency Services.

To enhance service resilience, enterprises should avoid putting all their workloads in a single cloud environment. Depending on the workload's nature, it may run on the cloud, on-premises, or extend to the edge. The challenge is reducing complexity and differences across these multiple environments.
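The benefit of spreading a workload across environments can be illustrated with simple availability arithmetic. This is a sketch under strong assumptions: the environments fail independently, failover is instant, and the four-hour figure is borrowed from the outage mentioned earlier purely as an example. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def availability_from_downtime(downtime_hours: float) -> float:
    """Fraction of a year the service is up, given total hours of downtime."""
    return 1 - downtime_hours / HOURS_PER_YEAR


def combined_availability(availabilities: list[float]) -> float:
    """Availability when the service is up if at least one environment is up.

    Assumes independent failures and instant failover, which real
    deployments only approximate.
    """
    return 1 - math.prod(1 - a for a in availabilities)


# One environment with a single four-hour outage in a year.
single = availability_from_downtime(4)

# The same workload running in two such environments.
dual = combined_availability([single, single])

# Expected yearly downtime of the dual setup, in minutes.
expected_downtime_minutes = (1 - dual) * HOURS_PER_YEAR * 60
```

Under these assumptions the expected downtime of the dual setup drops from four hours to well under a minute per year. In practice, shared dependencies such as DNS, identity, or a common data center reduce the gain, which is why the differences across environments need to be managed rather than ignored.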

Both containers and Kubernetes have obvious advantages that can help here; however, some workloads should simply stay on physical servers or virtual machines.

One Choice Is No Choice

Cloud providers are now dominant players in the industry. Whether starting with a greenfield project or modernizing an existing application, a cloud-first approach is crucial. It involves adopting cloud-first principles that align with cloud technologies for applications, platforms, and infrastructure. Key cloud-native principles include being distributed, scalable, multi-cloud capable, polyglot, disposable, and API-centric.

Having choices in the IT industry is vital, and this is a fundamental aspect of open-source software.

Closing the Cloud Skills Gap

Closing the skills gap in cloud infrastructures is crucial as these systems become more complex. Reports indicate that over 90% of organizations struggle to find the right talent, especially with the increasing adoption of emerging technologies such as containers, AI, and edge computing, as highlighted in a recent Red Hat survey.

Today, platform engineering teams focus on delivering platform-as-a-service for developers. They manage tasks like securing infrastructures, implementing cloud-native services, ensuring compliance, and operating complex hybrid and multi-cloud environments. However, the skill sets, tools, and processes needed for these services vary significantly. Managing multi-cloud environments demands extensive expertise, including proficiency in technologies like Kubernetes.

According to Red Hat's 2022 State of Enterprise Open Source report, 43% of developers lack the necessary skills for adopting containers, and 39% lack the resources to do so. To address this gap, organizations should invest in the right technologies and training programs, build an internal Community of Practice, and foster a culture of continuous learning.

Embracing Application Modernization through a Common Platform

While we want to adopt cloud practices, we must invest in the right technologies so that our people can focus on modernizing the infrastructure on a common substrate and avoid cognitive overload. Cognitive Load Theory is built on the idea that our working memory has a limited capacity. When we overload it with too much information or too many tasks, our performance suffers, stress levels rise, and our mental health can deteriorate.

Platform engineering teams must account for the differences between on-premises infrastructure and various cloud service providers as shown in Figure 1. By embracing a common substrate, teams can reduce complexity, enhancing service resilience and mitigating cognitive overload.

Figure 1: Various Kubernetes distributions

In Figure 2, a unified substrate enables organizations to seamlessly embrace cloud-first principles while modernizing applications to be more cloud-native. It provides a consistent foundation with common controls, compliance, and security, following a standardized operational model across environments. This approach empowers developers to concentrate on application development and ensures consistent deployment across environments. This consistency is vital in meeting business requirements such as building highly resilient services, adhering to regulatory standards, or positioning services closer to users for improved performance.

Figure 2: A common platform across hybrid-multi cloud including edge locations
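The idea of a consistent foundation can be sketched as a single application definition that is rendered for every environment, with only a small, explicit set of fields allowed to differ. This is an illustrative sketch rather than any real platform's API: `BASE_SPEC`, `OVERRIDES`, `ALLOWED_OVERRIDES`, the image name, and the policy labels are all hypothetical.

```python
from copy import deepcopy

# Hypothetical application definition shared by every environment.
BASE_SPEC = {
    "app": "payments-api",
    "image": "registry.example.com/payments-api:1.4.2",
    "replicas": 3,
    "policies": ["tls-required", "audit-logging"],
}

# Fields that an environment is allowed to change (illustrative choice).
ALLOWED_OVERRIDES = {"replicas"}

# Per-environment differences, kept deliberately small.
OVERRIDES = {
    "public-cloud": {"replicas": 6},
    "on-prem": {},
    "edge": {"replicas": 1},
}


def render(environment: str) -> dict:
    """Return the deployment spec for one environment.

    Raises ValueError if an override touches a field outside
    ALLOWED_OVERRIDES, so security and compliance settings stay common.
    """
    overrides = OVERRIDES[environment]
    illegal = set(overrides) - ALLOWED_OVERRIDES
    if illegal:
        raise ValueError(f"{environment} overrides restricted fields: {sorted(illegal)}")
    spec = deepcopy(BASE_SPEC)
    spec.update(overrides)
    spec["environment"] = environment
    return spec


specs = {env: render(env) for env in OVERRIDES}
```

Every rendered spec carries the same image and the same policies, while capacity is sized per location. Keeping the list of permitted differences short is what lets developers reason about one deployment model instead of one per environment.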

Summary

The evolving landscape of cloud services necessitates a fundamental shift towards a cloud-first approach to ensure resilience and reliability in an environment prone to disruptions. Adopting cloud-first principles is important, but so is investing in the relevant skills and technologies. The shortage of expertise, particularly in managing complex cloud infrastructures, is a significant challenge that must be addressed to navigate the multi-hybrid cloud environment effectively.

The use of a standardized approach and a unified platform across various cloud environments is a necessity. This strategy aims to streamline operations, reduce complexity, and maintain consistency in deploying applications. By implementing this unified infrastructure approach, businesses can be more resilient and provide consistent services.

In subsequent posts, I will explore how a common substrate using Kubernetes and its ecosystem can help improve service resilience.

Agree. Great post. I have been having many conversations with customers concerned about their application resiliency. It was always critical for business before the cloud era (multiple DCs and backups). It has become even more critical in the cloud era, especially with the proliferation of apps. Apps are everywhere and have become the key success factor and differentiator for many organisations.

Well done, Li Ming. I totally agree with your post and look forward to your second post on k8s and its ecosystem to help improve resiliency. Here is an earlier LinkedIn post from Foo-Bang Chan that speaks to the k8s ecosystem in multi-cluster and multi-cloud environments - https://www.dhirubhai.net/feed/update/urn:li:activity:7081771626584805376/
