Using the CAP Theorem to Analyze Microservices

Navjot Bansal

Building Computer Vision Systems @Oracle | Software Architecture | System Design | ICPC Regionalist

Published: September 18, 2023

We engineers go through multiple learning curves and take on multiple ownerships while building software and backend systems. As that ownership expands, we become involved in designing the architecture of software and its development lifecycle. The architecture we design dictates the customer experience, and small flaws in our design lead to bad experiences and failed products.

The CAP theorem, although a widely discussed and fundamental theorem in system design, is somehow not reckoned with while we design our software.

This article aims to touch upon:

What the CAP theorem is
How CAP can accelerate software success

In distributed systems, software is containerized and modularized into different microservices, decoupled databases, and multiple replica sets that communicate through networks. Networks and backends, just like humans, can break down at any time and need time to recover.

CAP in a nutshell

The CAP theorem, also known as Brewer's theorem, states that "Distributed Systems can only provide two of three properties simultaneously: consistency, availability, and partition tolerance."

In short, designing software always involves tradeoffs: some functionality has to be prioritized over others.

Understanding CAP

In the figure above we have the following architecture:

Backend - PUTs values to and GETs values from the database.
Distributed Database - Implemented in a master-slave architecture. If you are not familiar with the master-slave pattern, here is the gist: the master is the main database that handles write (PUT) operations and sends updated values to the slave replica sets, while multiple slave databases handle read (GET) operations.

Situation: The network communication between the master and slave databases breaks after a PUT operation has happened, and the user then does a GET for the same key.

We have two options:

1. Return the outdated value that currently resides in the slave replica sets.

2. Return a 500 response stating that the DB operation failed and should be retried after a given time.

Option 1 keeps the system up (Available) and responding to queries, but its responses are not in sync (Consistent) with the database. Option 2 ensures we respond only when the backend is in sync (Consistent), at the cost of uptime (Availability).
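
The two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real database client: `Replica`, `read_ap`, `read_cp`, and `ReplicaOutOfSync` are hypothetical names, and the network partition is simulated with a simple flag.

```python
class ReplicaOutOfSync(Exception):
    """Raised by the CP read path; a backend would map this to an HTTP 500."""


class Replica:
    """Hypothetical read replica that may be cut off from the master."""

    def __init__(self):
        self.data = {}
        self.partitioned = False  # True while the link to the master is down

    def apply_replication(self, key, value):
        # Replicated writes only arrive while the network link is healthy.
        if not self.partitioned:
            self.data[key] = value


def read_ap(replica, key):
    """Option 1: always answer, even if the value may be stale."""
    return replica.data.get(key)


def read_cp(replica, key):
    """Option 2: refuse to answer while the replica cannot sync."""
    if replica.partitioned:
        raise ReplicaOutOfSync("DB operation failed, retry later")
    return replica.data.get(key)


replica = Replica()
replica.apply_replication("price", 100)   # replicated before the partition
replica.partitioned = True                # network link breaks
replica.apply_replication("price", 120)   # the master's new PUT never arrives

stale_value = read_ap(replica, "price")   # available, but outdated
try:
    read_cp(replica, "price")
    cp_failed = False
except ReplicaOutOfSync:
    cp_failed = True                      # consistent, but unavailable
```

Here `read_ap` still returns the old value written before the partition, while `read_cp` raises instead of serving data it cannot vouch for.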

Choosing uptime versus consistency makes a major difference in customer experience. To make the choice, you must understand what your customers' primary focus is. If you plan to onboard customers with different focuses, i.e. a few choose uptime while others choose consistency, you will have to trade off to eventual consistency, but GOOD LUCK with that :)

Consistency, availability, and partition tolerance explained

Consistency (C) - Consistency ensures that any replica set receiving a given query returns the exact same output every time. If a database update has happened, the backend should return the updated data to the user and reject the stale value. Consistency doesn't guarantee that all requests will be served, but it ensures that the requests that are served receive correct values.

Availability (A) - Availability ensures that the system is reachable and performing close to 100% of the time. Every request, regardless of internal network failures or component outages, should return an HTTP OK (200) response. Availability doesn't guarantee that the output is up to date with the backend DB.

Partition Tolerance (P) - A system should not halt if communication between services breaks. Ensuring partition tolerance means that even if there is a network partition (a "breakage") between services, the backend does not fail abruptly. To ensure a network partition doesn't halt the service, we replicate all backends and their state data across multiple network subnets, so that if one subnet fails we can use the others.
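
The subnet replication idea described under Partition Tolerance could look roughly like the sketch below. It assumes each backend replica holds its own copy of the state; `BackendReplica`, `SubnetUnreachable`, `get_with_failover`, and the subnet names are hypothetical and exist only for illustration.

```python
class SubnetUnreachable(Exception):
    """Simulates a network partition that cuts off one subnet."""


class BackendReplica:
    """Hypothetical backend instance deployed in one subnet."""

    def __init__(self, subnet, state):
        self.subnet = subnet
        self.state = dict(state)  # each replica holds its own copy of the state
        self.reachable = True

    def get(self, key):
        if not self.reachable:
            raise SubnetUnreachable(self.subnet)
        return self.state[key]


def get_with_failover(replicas, key):
    """Try each subnet in turn so one partitioned subnet does not halt reads."""
    for replica in replicas:
        try:
            return replica.subnet, replica.get(key)
        except SubnetUnreachable:
            continue  # this subnet is cut off, try the next one
    raise SubnetUnreachable("all subnets are unreachable")


state = {"order:42": "shipped"}
replicas = [BackendReplica("subnet-a", state), BackendReplica("subnet-b", state)]

replicas[0].reachable = False  # subnet-a is partitioned away
served_by, value = get_with_failover(replicas, "order:42")
```

The request is served from the second subnet, so the partition is tolerated. Whether the value served is also the latest one is exactly the consistency-versus-availability question discussed earlier.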

If you are not convinced about why the CAP theorem states that at most 2 of the 3 properties can be ensured, you should read this.

What to choose

It's a fair assumption to always consider P (Partition tolerance) an essential component, as microservices rely heavily on networks for day-to-day communication.

Between A (Availability) and C (Consistency), what should we choose?

I will walk through some examples from the tech giant Google: what each product abides by, and why it prefers one property over the other.

Google Search

Google Search receives a huge volume of hits every second.
Only a tiny fraction of people (assume <0.001%) might be impacted if the latest articles are not shown in the search results.
Thus, Availability has to be the primary focus, as halting the system to update indices would drastically impact customer experience and product adoption.

Gmail

Gmail is a communication tool, and users rely heavily on the right data reaching the destination.
A user will be willing to wait rather than send broken emails with broken text to the recipient.

Communication tools like Gmail and Facebook typically enforce only eventual consistency. Messages may take a while to travel between different servers, so different servers may temporarily have inconsistent views of the world depending on what messages they have seen so far. However, all servers will eventually have consistent states, since the correct state is defined by interpreting the messages in the order of their global timestamps, rather than the order in which they arrived.
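
The timestamp-ordering idea in the paragraph above can be illustrated with a small sketch. It is not how Gmail or Facebook are actually implemented; the `(timestamp, sender, body)` tuple layout and the `materialize` helper are assumptions made for the example.

```python
def materialize(messages):
    """Order messages by (global timestamp, sender), ignoring arrival order."""
    return [body for _timestamp, _sender, body in sorted(messages)]


# Each server stores messages in the order they happened to arrive.
server_a = [(1, "alice", "hi"), (3, "alice", "lunch?"), (2, "bob", "hello")]
server_b = [(2, "bob", "hello"), (1, "alice", "hi")]  # message 3 is still in flight

view_a = materialize(server_a)
view_b_before = materialize(server_b)   # temporarily behind server_a

server_b.append((3, "alice", "lunch?")) # the delayed message finally arrives
view_b_after = materialize(server_b)    # now identical to server_a's view
```

Both servers end up with the same view even though they received the messages in different orders, because the state is derived from the timestamps rather than from arrival order.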

Google Suite (Docs, Sheets, etc.)

I believe Google Docs used to enforce consistency at the cost of availability (if your master was down then your doc wasn't available), but has now also moved to an eventual consistency model that allows concurrent editing and eventually resolves conflicts.

For Devs!

Multi-billion dollar companies fall back on the basic experience customers expect when choosing and prioritizing architectural designs.

Distributed systems always have to deal with the multiple fallacies assumed while designing architectures.

Now that you understand the CAP theorem, I highly suggest you sketch a small diagram of your back end and evaluate whether all parameters of CAP are respected and whether they are in line with end-user expectations.

The above exercise will help you uncover loopholes and design flaws in the back end. Small improvements like internal caching, semaphores, and leader elections on multiple replica sets might help make the service better and reduce the impact network communication has on the overall service.
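
As one example of the internal caching mentioned above, the sketch below keeps the last known value for each key and serves it when the backend cannot be reached. It is a minimal illustration that favors availability over consistency: `CachedReader`, its `fetch` callback, and the TTL value are hypothetical, and a backend failure is assumed to surface as `ConnectionError`.

```python
import time


class CachedReader:
    """Read-through cache that falls back to stale data on backend failure."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch      # callable that reads a key from the backend
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}         # key -> (value, stored_at)

    def get(self, key):
        entry = self._cache.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]      # fresh hit, no network call needed
        try:
            value = self._fetch(key)
        except ConnectionError:
            if entry is not None:
                return entry[0]  # stale fallback keeps the service answering
            raise                # nothing cached, so the failure is surfaced
        self._cache[key] = (value, now)
        return value


backend_down = {"value": False}


def fetch_from_backend(key):
    # Stand-in for a real network call to the database.
    if backend_down["value"]:
        raise ConnectionError("backend unreachable")
    return "v1"


reader = CachedReader(fetch_from_backend, ttl_seconds=0.0)  # always revalidate
first = reader.get("config")       # fetched from the backend and cached
backend_down["value"] = True
second = reader.get("config")      # backend is down, cached value is served
```

A service that needs consistency instead would re-raise the error in the fallback branch rather than return the cached value.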

Happy Engineering!

Roman Siewko

Senior Vibe Coder | AI Therapist | DevOps Engineer

10 months ago

For anyone who doubts, there is a "Beating the CAP Theorem Checklist".

Here is why your idea will not work:
you are assuming that software/network/hardware failures will not happen
you pushed the actual problem to another layer of the system
your solution is equivalent to an existing one that doesn't beat CAP
you're actually building an AP system
you're actually building a CP system
you are not, in fact, designing a distributed system

Specifically, your plan fails to account for:
latency is a thing that exists
high latency is indistinguishable from splits or unavailability
network topology changes over time
there might be more than 1 partition at the same time
split nodes can vanish forever
a split node cannot be differentiated from a crashed one by its peers
clients are also part of the distributed system
stable storage may become corrupt
network failures will actually happen
hardware failures will actually happen
operator errors will actually happen
deleted items will come back after synchronization with other nodes
clocks drift across multiple parts of the system, forward and backwards in time

Source with the complete list: https://ferd.ca/beating-the-cap-theorem-checklist.html
