Scaling Up: Why did Amazon move from Distributed Microservices to a Monolith?

This post is based on the tech blog Amazon published about how they scaled up the Prime Video quality-monitoring service. The original post is not very elaborate, so here we will dissect it and read between the lines.

You can refer to the original article here: PrimeVideoTech

Caution: This article has not been verified by the Prime Video team. I have read the original blog extensively, and there are many things it does not explain well, so I used my own experience and knowledge to fill the gaps and write this.

Prime Video Monitoring Use Case:

Amazon’s Prime Video is one of the most successful video streaming services, where you can watch movies, shows, and live streams. We use devices like our phones and laptops to watch the content. These devices, or clients, interact with Amazon’s servers to fetch the content.

Many streaming services encrypt the content between the server and the client, and the client decrypts it before playback. A perceptual quality issue we observe at the output may therefore come from a bug in the server or client code, or it may just be a network issue. So the Prime Video team developed a monitoring service that re-processes the decrypted frames and analyzes them for defects.

Scale:

As we have seen, each frame of a customer's live stream needs to be processed. This can lead to millions of parallel upload streams, where each customer sends the decrypted frames back to the Amazon monitoring service for analysis.

However, it looks like they are not interested in monitoring every customer. They aimed to start with a few thousand streams and scale up as needed.

Distributed Processing Pipeline:

Initially, they designed a distributed pipeline using serverless Step Functions, as we discussed in my previous article on Serverless. The ease of use helped them design the pipeline quickly and scale it horizontally based on load:

  1. The client machine triggers the “Start Conversion” signal so that the orchestration engine can trigger the initial setup for receiving the stream.
  2. The uploaded live streams are captured by the “Media Conversion” service, which stores the individual frames in S3.
  3. S3 acts as intermediate storage for the images: each downstream service reads from and writes to it repeatedly.
  4. They are also likely using S3 event notifications, i.e., every write to S3 can be configured to emit an event that triggers serverless functions or Step Functions (see the sketch after this list).
  5. Each write then fans out into Step Functions executions that process the frame in parallel, running multiple detection algorithms.
  6. The detectors notify the customer of their results.
  7. All the results are aggregated and written back to S3, possibly to serve as logs for later issue analysis.
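
To make step 4 concrete, here is a minimal sketch of that fan-out pattern. This is my own illustration, not Prime Video's actual code: the state machine ARN, the bucket layout, and the idea of one execution per frame are all assumptions.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical ARN; the real Prime Video state machine is not public.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:FrameAnalysis"

def handler(event, context):
    """Lambda triggered by S3 ObjectCreated events: start one Step Functions
    execution per uploaded frame, passing only the S3 pointer (not the bytes)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"bucket": bucket, "key": key}),
        )
```

Note how the frame itself never travels through the event: every detector that needs the pixels has to read them back from S3, which is exactly the overhead discussed below.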

Bottlenecks with Distributed Pipeline:

As we have seen, the pipeline is based on microservices and a distributed framework. They used serverless components to scale the detection process horizontally with load, and S3 as the central storage that every process can refer to. Two bottlenecks follow from this:

  1. Each detector is a microservice/process that downloads the image from S3. I can understand the need for S3: the frames are too large to pass directly in the event payload, so they store them in S3 and pass a link/event instead. This is a standard pattern when working with orchestration pipelines, as in the sketch above. But in this case, the repeated S3 reads and writes added overhead and cost that outweighed the advantages.
  2. As we have discussed, serverless Step Functions are stateless, and AWS charges for every state transition in a workflow. So I assume this team was burning a lot of money by triggering Step Functions for each frame.
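
To get a feel for the order of magnitude, here is a back-of-the-envelope sketch. Every number is an assumption of mine (the Standard Workflows list price of $0.025 per 1,000 state transitions, 30 fps, and a hypothetical 5 transitions per frame); the team has not published its actual bill.

```python
# Rough cost of per-frame Step Functions orchestration. All inputs are my
# assumptions, not figures from the Prime Video team.
PRICE_PER_TRANSITION = 0.025 / 1000  # USD, Standard Workflows list price
FPS = 30                             # frames per second for one stream
TRANSITIONS_PER_FRAME = 5            # hypothetical: conversion + detectors + aggregation

transitions_per_day = FPS * TRANSITIONS_PER_FRAME * 86_400
cost_per_stream_per_day = transitions_per_day * PRICE_PER_TRANSITION
print(f"{transitions_per_day:,} transitions/day -> ${cost_per_stream_per_day:,.0f}/day per stream")
# ~13M transitions/day -> ~$324/day for a single stream, before Lambda and S3 costs
```

At a few thousand monitored streams, this alone would run into hundreds of thousands of dollars per day, which makes the 90% cost reduction in the original post's title very plausible.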

Solution and Tradeoff:

The major issue was the combination of the orchestration model and S3 for storing and reading data. It provides durability at the cost of performance: even if a worker crashes in the middle of the detection process, the frame is still in S3, so the process can restart and reuse it.

But if we think about it, they might not need durability, because these bugs do not change from customer to customer. If we take a random sample of a few customers, we should still be able to identify most of them. We can lose some frames with this approach and still detect the problems at a later point. So they decided to trade durability for performance by keeping images locally instead of sending them to S3.
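
A quick sanity check on that sampling argument (my own illustration, not from the original post): if a bug shows up in a fraction p of all streams, the chance that a random sample of n monitored streams catches at least one affected stream is 1 - (1 - p)^n.

```python
# Probability that a random sample of n streams contains at least one stream
# affected by a bug hitting fraction p of all streams (illustrative numbers).
def detection_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(f"{detection_probability(0.001, 3000):.2f}")  # ~0.95 for a 0.1% bug over 3,000 streams
```

So a few thousand streams are enough to catch even fairly rare bugs, and losing occasional frames barely moves that number, which is why durability is a reasonable thing to give up here.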

New Architecture:

They are still using an orchestration pipeline. But where previously each activity/task uploaded its data to S3 and the next task read it back, now they no longer store it in S3: the data stays local for the next process to consume.

  1. The client machine triggers the “Start Conversion” signal so that the orchestration engine starts the workflow.
  2. The first task the workflow triggers is “Media Conversion”, which processes the frame and keeps it in local (heap) memory.
  3. The next tasks the workflow triggers are the “detection” tasks, which read directly from that local memory and do their work (a sketch of this handoff follows the list).
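
Here is a minimal sketch of that in-memory handoff, in Python purely for illustration (the real service is not open source, and every name below is hypothetical): the converter and all detectors live in one process and share frames through bounded in-memory queues instead of S3.

```python
import queue
import threading
import time
from typing import Callable, List

class InMemoryPipeline:
    """Converter and detectors share frames inside one process: no S3 round trips."""

    def __init__(self, detectors: List[Callable[[bytes], str]]):
        # One bounded queue per detector, so every detector sees every frame
        # and a slow detector applies backpressure instead of dropping frames.
        self.queues = [queue.Queue(maxsize=100) for _ in detectors]
        for detect, q in zip(detectors, self.queues):
            threading.Thread(target=self._run, args=(detect, q), daemon=True).start()

    def submit_frame(self, frame: bytes) -> None:
        """Called by media conversion for each decoded frame; fans out in memory."""
        for q in self.queues:
            q.put(frame)  # blocks when a detector falls behind

    @staticmethod
    def _run(detect: Callable[[bytes], str], q: queue.Queue) -> None:
        while True:
            print(detect(q.get()))  # the real service would aggregate results

# Hypothetical usage: two toy detectors inspecting each frame.
pipeline = InMemoryPipeline([
    lambda f: f"block-corruption check on {len(f)} bytes",
    lambda f: f"frozen-frame check on {len(f)} bytes",
])
pipeline.submit_frame(b"\x00" * 1024)
time.sleep(0.1)  # give the daemon worker threads a moment to print
```

The price of this design is exactly the coupling described next: everything has to fit in one process on one machine.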

Problems I see with the new Architecture:

  1. The new architecture trades durability for performance. If a worker crashes, we may lose frames or have to restart processing from the very beginning.
  2. The bigger issue is that both processes must now run on the same machine, so MediaConversion, Detector1, and Detector2 all have to be packed onto a single host. If a detector ran on another machine, it would not have the frame to process.
  3. How can we scale this system to add more detection processes? What if we reach a point where we need to scale each detector horizontally? We cannot do that now, because they are tightly coupled. We lost the advantages of microservices by bundling everything into one big monolith.

Takeaways from the article:

System design is full of tradeoffs. Understand the requirements before designing your system. Do not fixate on microservices and fancy serverless by default. Make the decision that can help you scale the system for at least four years ahead.

Another major takeaway from my side: never be afraid to re-architect. Never be afraid to call out a problem as a problem, and always keep scale in mind.

References:

https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90



