Data Protection for Kubernetes

Overview

When I started my IT career, one of my daily tasks as a backup admin was to ensure all servers successfully completed their overnight backups. I had 100 application servers across the US, backing up various apps to tape. Yes, this was pre-virtualization and back when we used tape. My IT Director was a bit gruff and used to tell me, “If you miss a backup, you’re fired!” While I knew he wasn’t serious, he did emphasize the importance of backups to the organization.

The problem with that approach is that the emphasis wasn’t on recovery; the only concern was the backup. While I was responsible for the daily backups of over one hundred servers, I consistently experienced 3-5 backup failures each day. (Be grateful that tape is mostly dead.) Yes, I simply re-ran the backups and confirmed their completion, but I never tested the recovery, and I had no idea what our RPO/RTO times were.

In the world of Data Protection, there are many more concerns than getting a backup done. In this article we will discuss:

1. Recovery is more than just Restoration

2. RPO vs RTO

3. The rule of 3-2-1

4. The rise of Kubernetes

5. Where is Kubernetes deployed?

6. Moving to Stateful Kubernetes Apps

7. App Consistency

8. Application Awareness

9. Alternative data protection strategies (and their shortfalls)

10. How Kasten does it and eliminates the shortcomings of other solutions

11. Industry certifications, compliance, customer success

12. DEMO – How Kasten can protect K8s workloads and move them to another cluster


Recovery is more than Restoration

Both words are nouns, but they mean different things: restoration is the process of bringing an object back to its original state, while recovery is the act or process of regaining or repossessing something lost.

As those definitions indicate, the objective is not to restore data blobs; the goal is to recover your application(s). Yes, you can back up the PVC (your application’s data on a Persistent Volume Claim), but that alone doesn’t help you put Humpty-Dumpty back together and back on his feet.

RPO vs RTO

https://www.veeam.com/blog/rto-rpo-definitions-values-common-practice.html

Downtime is not an option for modern organizations that must fulfill their customers’ needs and expectations. Different types of incidents can occur and impact your business revenue or even existence. Whether it’s a ransomware attack, a power outage, flood or simply human mistakes, these events are unpredictable, and the best thing you can do is to BE PREPARED.

Preparedness means that you should have a documented and repeatable business continuity and disaster recovery plan (BCDR plan). That plan should be tested at least annually.

Two of the important parameters that define a BCDR plan are the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). For those of you who are not familiar with these terms, let me give you a brief description:

  • RPO limits how far to roll back in time and defines the maximum allowable amount of lost data, measured in time from the last valid backup to the failure occurrence.
  • RTO is related to downtime and represents how long it takes to restore from the incident until normal operations are available to users.

While RPO and RTO may sound similar, they serve different purposes and, in an ideal world, their values would be as close to zero as possible. However, back in our world, achieving zero RPO and RTO would be extremely expensive and might not be worth the effort.

Let’s take a closer look at recovery objectives. RPO is about how much data you can afford to lose before it impacts business operations. For example, for a banking system, 1 hour of data loss can be catastrophic because it processes live transactions. At a personal level, you can also think about RPO as the last moment you saved a document you are working on. If your system crashes and your progress is lost, how much of your work are you willing to lose before it affects you?

On the other hand, RTO is the timeframe within which applications and systems must be restored after an outage. It’s a good practice to measure the RTO starting with the moment the outage occurs, instead of the moment when the IT team starts to fix the issue. This is a more realistic approach as it represents the exact point when the users start to be impacted.
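
To make the two objectives concrete, here is a minimal Python sketch that measures the achieved data-loss window and downtime for one incident and compares them with targets. The timestamps, the 4-hour targets and the function names are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last valid backup and the failure."""
    return outage_start - last_good_backup


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime, measured from the moment the outage began (not when the fix started)."""
    return service_restored - outage_start


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the measured value is within the agreed objective."""
    return actual <= objective


# Hypothetical incident timeline, for illustration only.
last_backup = datetime(2023, 3, 1, 2, 0)    # nightly backup finished at 02:00
outage = datetime(2023, 3, 1, 9, 30)        # users start to be impacted
restored = datetime(2023, 3, 1, 13, 0)      # normal operations resume

data_loss = achieved_rpo(last_backup, outage)   # 7 hours 30 minutes
downtime = achieved_rto(outage, restored)       # 3 hours 30 minutes

print("RPO met:", meets_objective(data_loss, timedelta(hours=4)))  # False
print("RTO met:", meets_objective(downtime, timedelta(hours=4)))   # True
```

In this made-up timeline the nightly backup satisfies a 4-hour RTO but misses a 4-hour RPO, which is the kind of gap that only shows up when recovery is actually measured.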

The Rule of 3-2-1

https://www.veeam.com/blog/321-backup-rule.html

The 3-2-1 Rule is a rule to live by. Here at Veeam, we have been advocating this rule for many years to help organizations ensure recoverability when it is needed most. In this blog post I’m going to explain the 3-2-1 Rule and show you the way to upgrade it to a more modern and resilient way of thinking!

What is the 3-2-1 Rule?

The 3-2-1 Rule was first conceived by U.S. photographer Peter Krogh. It was a rather important innovation for the photography world, has deep implications for other technology disciplines and remains timeless to this day.

The 3-2-1 Rule, as I like to explain it, states the following:

  • There should be 3 copies of data
  • On 2 different media
  • With 1 copy being off site

With this base rule outlined, we can now upgrade it to work with modern critical data. However, let’s not forget the base rule’s best attributes:

  • It does not have any specific technology or hardware requirement
  • It can address nearly any failure scenario
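
Because the rule has no technology requirement, it can be checked against any inventory of copies. The following Python sketch does that. The `BackupCopy` type, the media labels and the example inventory are hypothetical, and the sketch assumes the production copy counts as one of the three.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # e.g. "disk", "object-storage", "tape"
    offsite: bool   # True if the copy lives outside the primary site


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Check 3 copies, on 2 different media, with 1 copy off site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory, for illustration only.
copies = [
    BackupCopy("production volume", "disk", offsite=False),
    BackupCopy("local backup repository", "disk", offsite=False),
    BackupCopy("cloud bucket", "object-storage", offsite=True),
]

print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: two copies, one media type, nothing off site
```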

The Rise of Kubernetes

Kubernetes is an open-source system for automating deployment, scaling, and operations of application containers across clusters of hosts. It provides basic mechanisms for deployment, maintenance, and scaling of applications. It has a large, vibrant, and active community of users and contributors.

Kubernetes was originally created by Google to manage its internal container orchestration needs. It was open-sourced in 2014 and has since become one of the most popular open-source projects in the world.

The rise of Kubernetes can be attributed to several factors, including:

  • The rise of cloud computing: Kubernetes is a natural fit for cloud-native applications, which are designed to be portable and scalable.
  • The growth of the container ecosystem: Kubernetes is one of the most popular container orchestration platforms, and it benefits from the growth of the container ecosystem.
  • The open-source community: Kubernetes has a large, vibrant, and active open-source community that contributes to its development and adoption.

Kubernetes is a powerful and versatile platform for managing containerized applications. It is used by a wide range of organizations and is well-positioned to continue to grow in popularity in the years to come.

Where is Kubernetes being deployed?

Kubernetes is in all verticals across the globe:

  • Retail (your hardware store, running K3s at the edge/retail store)
  • Cruise Ships (K3s clusters tracking your drink orders)
  • Government
  • Military
  • Finance
  • Software development
  • Conglomerates (who made your soap?)

To discover whether a company is running Kubernetes, all you need to do is look at LinkedIn. Simply go to https://www.dhirubhai.net/, search for a company, go to their LinkedIn company page, then keyword search for Kubernetes in the People tab. In one example, I searched for Sopra Steria and discovered that 976 employees at Sopra Steria have Kubernetes in their job title or job description. Do they have Kubernetes? Yes. That is 2% of the workforce, based on that SIMPLE search.

Moving to Stateful Kubernetes Apps

Kubernetes grew out of Borg, Google’s internal cluster manager. Initially, Kubernetes workloads were stateless. This was for two main reasons:

1) Kubernetes follows the architecture principles of Infrastructure as Code. Applications are deployed from code, and state was not traditionally kept on the Kubernetes cluster.

2) There was no standardized way to save application data (state) to (disk) volumes on the cluster.

The Container Storage Interface (CSI) changed that. CSI became generally available in Kubernetes 1.13, and at the end of 2019 Kubernetes 1.17 promoted CSI volume snapshots to beta, which opened the way for protecting stateful data in Kubernetes. A year later, in Kubernetes 1.20, snapshot functions became generally available. CSI drivers give all flavors of Kubernetes a standardized way to talk to any storage using native Kubernetes APIs and kubectl commands. Not only does a CSI driver provide a way for the cluster to use any storage, it can also take snapshots of those volumes. Any storage vendor or cloud provider can create a CSI driver by following the standard, so any Kubernetes cluster can talk to the storage without needing to know whether it is Nutanix, PureStorage, DellEMC, NetApp, Google, Amazon, Azure, etc. If there is a CSI driver for the storage, Kubernetes can talk to that storage with native Kubernetes APIs.
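
As a rough illustration of what "native Kubernetes APIs" means for snapshots, the Python sketch below builds a `VolumeSnapshot` manifest and prints it as JSON. The `snapshot.storage.k8s.io/v1` API group and the field names are the standard ones. The object name, namespace, PVC name and snapshot class are hypothetical, and the cluster is assumed to have the snapshot CRDs, the snapshot controller and a snapshot-capable CSI driver installed.

```python
import json


def volume_snapshot_manifest(name, namespace, pvc_name, snapshot_class):
    """Build a VolumeSnapshot object that asks the CSI driver to snapshot one PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects which CSI driver and parameters handle the snapshot.
            "volumeSnapshotClassName": snapshot_class,
            # The PVC must live in the same namespace as the snapshot.
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


# Hypothetical names, for illustration only.
manifest = volume_snapshot_manifest(
    name="orders-db-snap-1",
    namespace="shop",
    pvc_name="orders-db-data",
    snapshot_class="csi-snapclass",
)

print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -`. The same object works regardless of which vendor's CSI driver sits behind the snapshot class.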

Application Consistency – not just a snapshot of disk or a VM

Application-consistent backups can be enabled if the data service needs to be quiesced before a volume snapshot is initiated. To obtain an application-consistent backup, a quiescing function, as defined in the application blueprint, is first invoked and is followed by a volume snapshot. To shorten the time the application spends quiesced, it is unquiesced, based on the blueprint definition, as soon as the storage system has indicated that a point-in-time copy of the underlying volume has been started. The backup then completes asynchronously in the background: after unquiescing the application, K10 waits for the volume snapshot to complete. An advantage of this approach is that the database is not locked for the entire duration of the volume snapshot process.

https://docs.kasten.io/latest/kanister/testing.html
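
The ordering described above (quiesce, start the snapshot, unquiesce, then wait for completion) can be sketched in Python as follows. This is not K10 or Kanister code; every function here is a hypothetical stand-in that only records the order of operations.

```python
from contextlib import contextmanager

events = []  # records the order of operations, for illustration only


@contextmanager
def quiesced(app):
    """Hold the application quiesced only for the duration of the with-block."""
    events.append(f"quiesce {app}")
    try:
        yield
    finally:
        # Always unquiesce, even if starting the snapshot fails.
        events.append(f"unquiesce {app}")


def start_snapshot(volume):
    """Stand-in for the storage system acknowledging a point-in-time copy has begun."""
    events.append(f"snapshot started for {volume}")
    return {"volume": volume, "ready": False}


def wait_for_snapshot(snapshot):
    """Stand-in for polling until the snapshot is complete."""
    snapshot["ready"] = True
    events.append(f"snapshot complete for {snapshot['volume']}")


def application_consistent_backup(app, volume):
    with quiesced(app):
        snapshot = start_snapshot(volume)
    # The application is already unquiesced while the snapshot finishes.
    wait_for_snapshot(snapshot)
    return snapshot


snap = application_consistent_backup("orders-db", "orders-db-data")
print("\n".join(events))
```

The point of the design is visible in the event order: the unquiesce step lands before the snapshot completes, so the database is locked only briefly.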


Alternative data protection strategies

In the world of data protection, there are a few alternatives to Kasten. The primary alternative is Velero. Velero is an open-source project, funded by VMware. It’s a free backup tool. Let’s see how it compares to K10 – a product that has been in development for 6+ years.

If free is so great, why do customers come to Kasten after deploying Velero?


Security and Safety – K10 advantages

  • Backup data is encrypted in flight and at rest (Velero cannot)
  • No privileged DaemonSets required (Velero requires privileged DaemonSets)
  • RBAC + OIDC and Token Auth (Velero - no granular role-based access controls)
  • No agents or plugins

Correctness and Consistency – K10 advantages

  • No data loss risk with object stores - Velero runs “dirty” (inconsistent) backups when agents are used
  • Consistent backups
  • Global catalog to track state

Performance and Efficiency – K10 advantages

  • Up to 4X space efficiency on backups (dedupe and compression)
  • 3-5X throughput improvements (K10 is multi-threaded, Velero is single-threaded)
  • Reduced memory and CPU footprint

Enterprise Dashboard + K8s API – K10 advantages

  • Extremely easy-to-use UI
  • Native CLI (kubectl) and APIs
  • Day 2 support: Monitoring, alerting, logging

Consistent, extensible, multi-layered capture – K10 advantages

  • Native database layer support (logical dumps using native tools)
  • Velero - consistency dependent on underlying storage
  • Velero - no extensibility for application-awareness
  • Cluster configuration blueprints of etcd
Resources

  • No Backup Admin required – Velero requires at least 1 full-time employee for ongoing management, tracking and modifications.

What if you are considering one of the dinosaur (old) backup vendors? Consider this: most of them use Velero as an inorganic bolt-on to their 20-30+ year old backup architectures. These solutions are NOT cloud native, and if you lose your backup ecosystem, you’ll have to rebuild that whole ecosystem before you can start recovering your Kubernetes backup platform, and only then recover your applications. Hopefully they are consistent.

How Kasten protects the entire application (or set of applications)

Kasten captures the entire application, both the data and the metadata. The data is what resides on the PVC (the disk), and the metadata is everything that was used for the deployment and configuration of that application. By capturing everything listed below, K10 can not only back up the app but also redeploy it. This is done by installing the application, configuring the app as it was during the backup, recovering the PVC (the data) and marrying the data to the newly deployed app.

Kasten’s K10 data protection platform automatically discovers all your applications. By scanning the namespaces, K10 can discover:

  • Pods
  • Ingress
  • Registry
  • Service Account
  • Services
  • StatefulSet
  • PVCs
  • ConfigMaps
  • Authentication details
  • Deployment info (helm/operatorhub)

Non-namespaced dependencies can be captured too:

  • ClusterRole
  • ClusterRoleBindings
  • CRDs
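
The split between namespaced and non-namespaced resources matters when an app moves to another cluster, because the cluster-scoped dependencies must travel with it. The Python sketch below partitions a captured resource list by scope. The three cluster-scoped kinds are the ones listed above. The function name and the sample resources are hypothetical, and this is not how K10 itself is implemented.

```python
# Kinds from the list above that exist outside any namespace.
CLUSTER_SCOPED_KINDS = {"ClusterRole", "ClusterRoleBinding", "CustomResourceDefinition"}


def split_by_scope(resources):
    """Partition captured resources into (namespaced, cluster_scoped) lists."""
    namespaced, cluster_scoped = [], []
    for resource in resources:
        if resource["kind"] in CLUSTER_SCOPED_KINDS:
            cluster_scoped.append(resource)
        else:
            namespaced.append(resource)
    return namespaced, cluster_scoped


# Hypothetical capture of one application, for illustration only.
captured = [
    {"kind": "StatefulSet", "name": "orders-db"},
    {"kind": "PersistentVolumeClaim", "name": "orders-db-data"},
    {"kind": "ConfigMap", "name": "orders-db-config"},
    {"kind": "ClusterRole", "name": "orders-db-operator"},
    {"kind": "CustomResourceDefinition", "name": "databases.example.com"},
]

namespaced, cluster_scoped = split_by_scope(captured)
print([r["name"] for r in namespaced])
print([r["name"] for r in cluster_scoped])
```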

The app can be recovered (redeployed) back to its original namespace, into a new namespace (clone) or SOMEWHERE ELSE. And if you want to deploy that app somewhere else, we can help with the transformation: the changing of variables such as IP addresses, routes and storage, just to name a few.
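
To show the idea of a transformation on restore, here is a minimal Python sketch that rewrites a captured PVC manifest for a different namespace and storage class. It is not K10's transform feature or API. The function, the storage class names and the namespaces are hypothetical.

```python
import copy


def transform_for_target(resource, target_namespace, storage_class_map):
    """Return a copy of a captured manifest adjusted for the destination cluster."""
    out = copy.deepcopy(resource)  # never mutate the captured backup metadata
    out.setdefault("metadata", {})["namespace"] = target_namespace
    if out.get("kind") == "PersistentVolumeClaim":
        spec = out.setdefault("spec", {})
        old_class = spec.get("storageClassName")
        if old_class in storage_class_map:
            spec["storageClassName"] = storage_class_map[old_class]
    return out


# Hypothetical captured PVC from the source cluster.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "orders-db-data", "namespace": "shop"},
    "spec": {
        "storageClassName": "on-prem-block",
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

# Hypothetical mapping from source storage classes to destination ones.
moved = transform_for_target(pvc, "shop-clone", {"on-prem-block": "gp3"})

print(moved["metadata"]["namespace"], moved["spec"]["storageClassName"])  # shop-clone gp3
print(pvc["metadata"]["namespace"], pvc["spec"]["storageClassName"])      # shop on-prem-block
```

Working on a deep copy keeps the backup's metadata intact, so the same restore point can be redeployed to several destinations with different mappings.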

Customer Success stories

Government & commercial Kubernetes environments at a large aerospace vendor

In April 2021, the Kasten team was introduced to an existing Veeam customer focused on the aerospace industry. They needed help in protecting their stateful Kubernetes apps.

Kasten learned about their environment and explained how K10 could help protect and recover their K8s applications. The customer learned how our enterprise data protection platform can help them in both environments, Government cloud and commercial cloud, and empower them to use Kasten to migrate in the future.

The customer had been developing apps in Kubernetes. They were running two cluster environments with plans to consolidate to one cluster (Site2).

  • Site1: ‘OnPrem’ (Cloud IaaS)
  • Site2: Hyperscaler managed K8s

They were impressed with our capabilities but deferred purchasing until next year. They took our quote to use for budgeting for next year…

And then they had an incident in August. They had data loss. They had snapshots, but they didn’t have the expertise to get their apps back. They were asked why they didn’t have protection for those apps. It was then that they realized that the snapshots were not good enough AND they didn’t have the skillset to put Humpty-Dumpty back together.

Sopra Steria (search Kasten + Sopra Steria on YouTube)

https://www.kasten.io/kubernetes/resources/customer-case-studies/sopra-steria-backup-dr-for-red-hat-openshift

Leidos (search Kasten + Leidos on YouTube)

https://www.kasten.io/kubernetes/resources/case-study-leidos-launches-portable-secdevops-stack-on-aws-govcloud

Demo

EKS Application Migration <- Click link for YouTube demo

Application Mobility – Redeploying an application in Cluster B, backed up from Cluster A

Summary

Data protection is essential for any organization that uses Kubernetes, as it provides a way to recover from data loss or corruption. While the concept of data protection for stateful apps is not new, how we approach backups for cloud native Kubernetes applications is different. Kubernetes follows the DevOps methodology, and data protection for Kubernetes should follow a cloud native strategy too. There are several data protection strategies that can be used for Kubernetes, but the best approach will be:

  1. Cloud native
  2. Able to use native APIs (kubectl)
  3. Able to provide a clear path to recovery

