Scaling Up: Why did Amazon move from Distributed Microservices to a Monolith?

This post is based on the tech blog Amazon published about how they scaled up the Prime Video quality-monitoring service. The original post is not very elaborate, so here we will dissect it and read between the lines.

You can refer to the original article here: PrimeVideoTech

Caution: This article has not been verified by the Prime Video team. I have read the original blog extensively, and there are many things it does not explain well, so I used my own experience and knowledge to fill the gaps and write this.

Prime Video Monitoring Use Case:

Amazon’s Prime Video is one of the most successful video streaming services, where you can watch movies, shows, and live streams. We use devices like our phones and laptops to watch the content. These devices, or clients, interact with Amazon’s servers to fetch the content.

Many streaming services encrypt the content between the server and the client, and the client decrypts it before playback. A perceptual quality issue we observe at the output may therefore come from a bug in the server or client code, or it may just be a network issue. So the Prime Video team developed a monitoring service that re-processes the decrypted frames and analyzes them for defects.

Scale:

As we have seen, each frame of a customer's live stream needs to be processed. This can lead to millions of parallel upload streams, where each customer sends the decrypted frames back to the Amazon monitoring service for analysis.

However, it looks like they are not interested in monitoring every customer. They aimed to start with a few thousand streams and scale up as needed.

Distributed Processing Pipeline:

Initially, they designed a distributed pipeline using serverless Step Functions, as we discussed in my previous article on Serverless. The ease of use helped them design the pipeline quickly and scale it horizontally based on load:

  1. The client machine triggers the “Start Conversion” signal so that the orchestration engine can trigger the initial setup for receiving the stream.
  2. The uploaded live streams are captured by the “Media Conversion” service, which stores the individual frames in S3.
  3. S3 acts as intermediate storage for the images: each downstream service reads from and writes to it repeatedly.
  4. They are also likely using S3 event notifications, i.e., every write to S3 can be configured to emit an event that triggers serverless functions or Step Functions (see the sketch after this list).
  5. Each write then fans out into Step Functions executions that process the frame in parallel, running multiple detection algorithms.
  6. The detectors notify the customer of their results.
  7. All the results are aggregated and written back to S3, possibly to serve as logs for later issue analysis.
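
To make step 4 concrete, here is a minimal sketch of that fan-out pattern. This is my own illustration, not Prime Video's actual code: the state machine ARN, the bucket layout, and the idea of one execution per frame are all assumptions.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical ARN; the real Prime Video state machine is not public.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:FrameAnalysis"

def handler(event, context):
    """Lambda triggered by S3 ObjectCreated events: start one Step Functions
    execution per uploaded frame, passing only the S3 pointer (not the bytes)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"bucket": bucket, "key": key}),
        )
```

Note how the frame itself never travels through the event: every detector that needs the pixels has to read them back from S3, which is exactly the overhead discussed below.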

Bottlenecks with Distributed Pipeline:

As we have seen, the pipeline is based on microservices and a distributed framework. They used serverless components to scale the detection process horizontally with load, and S3 as the central storage that every process can refer to. Two bottlenecks follow from this:

  1. Each detector is a microservice/process that downloads the image from S3. I can understand the need for S3: the frames are too large to pass directly in the event payload, so they store them in S3 and pass a link/event instead. This is a standard pattern when working with orchestration pipelines, as in the sketch above. But in this case, the repeated S3 reads and writes added overhead and cost that outweighed the advantages.
  2. As we have discussed, serverless Step Functions are stateless, and AWS charges for every state transition in a workflow. So I assume this team was burning a lot of money by triggering Step Functions for each frame.
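
To get a feel for the order of magnitude, here is a back-of-the-envelope sketch. Every number is an assumption of mine (the Standard Workflows list price of $0.025 per 1,000 state transitions, 30 fps, and a hypothetical 5 transitions per frame); the team has not published its actual bill.

```python
# Rough cost of per-frame Step Functions orchestration. All inputs are my
# assumptions, not figures from the Prime Video team.
PRICE_PER_TRANSITION = 0.025 / 1000  # USD, Standard Workflows list price
FPS = 30                             # frames per second for one stream
TRANSITIONS_PER_FRAME = 5            # hypothetical: conversion + detectors + aggregation

transitions_per_day = FPS * TRANSITIONS_PER_FRAME * 86_400
cost_per_stream_per_day = transitions_per_day * PRICE_PER_TRANSITION
print(f"{transitions_per_day:,} transitions/day -> ${cost_per_stream_per_day:,.0f}/day per stream")
# ~13M transitions/day -> ~$324/day for a single stream, before Lambda and S3 costs
```

At a few thousand monitored streams, this alone would run into hundreds of thousands of dollars per day, which makes the 90% cost reduction in the original post's title very plausible.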

Solution and Tradeoff:

The major issue was the combination of the orchestration model and S3 for storing and reading data. It provides durability at the cost of performance: even if a worker crashes in the middle of the detection process, the frame is still in S3, so the process can restart and reuse it.

But if we think about it, they might not need durability, because these bugs do not change from customer to customer. If we take a random sample of a few customers, we should still be able to identify most of them. We can lose some frames with this approach and still detect the problems at a later point. So they decided to trade durability for performance by keeping images locally instead of sending them to S3.
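
A quick sanity check on that sampling argument (my own illustration, not from the original post): if a bug shows up in a fraction p of all streams, the chance that a random sample of n monitored streams catches at least one affected stream is 1 - (1 - p)^n.

```python
# Probability that a random sample of n streams contains at least one stream
# affected by a bug hitting fraction p of all streams (illustrative numbers).
def detection_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(f"{detection_probability(0.001, 3000):.2f}")  # ~0.95 for a 0.1% bug over 3,000 streams
```

So a few thousand streams are enough to catch even fairly rare bugs, and losing occasional frames barely moves that number, which is why durability is a reasonable thing to give up here.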

New Architecture:

They are still using an orchestration pipeline. But where previously each activity/task uploaded its data to S3 and the next task read it back, now they no longer store it in S3: the data stays local for the next process to consume.

  1. The client machine triggers the “Start Conversion” signal so that the orchestration engine starts the workflow.
  2. The first task the workflow triggers is “Media Conversion”, which processes the frame and keeps it in local (heap) memory.
  3. The next tasks the workflow triggers are the “detection” tasks, which read directly from that local memory and do their work (a sketch of this handoff follows the list).
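
Here is a minimal sketch of that in-memory handoff, in Python purely for illustration (the real service is not open source, and every name below is hypothetical): the converter and all detectors live in one process and share frames through bounded in-memory queues instead of S3.

```python
import queue
import threading
import time
from typing import Callable, List

class InMemoryPipeline:
    """Converter and detectors share frames inside one process: no S3 round trips."""

    def __init__(self, detectors: List[Callable[[bytes], str]]):
        # One bounded queue per detector, so every detector sees every frame
        # and a slow detector applies backpressure instead of dropping frames.
        self.queues = [queue.Queue(maxsize=100) for _ in detectors]
        for detect, q in zip(detectors, self.queues):
            threading.Thread(target=self._run, args=(detect, q), daemon=True).start()

    def submit_frame(self, frame: bytes) -> None:
        """Called by media conversion for each decoded frame; fans out in memory."""
        for q in self.queues:
            q.put(frame)  # blocks when a detector falls behind

    @staticmethod
    def _run(detect: Callable[[bytes], str], q: queue.Queue) -> None:
        while True:
            print(detect(q.get()))  # the real service would aggregate results

# Hypothetical usage: two toy detectors inspecting each frame.
pipeline = InMemoryPipeline([
    lambda f: f"block-corruption check on {len(f)} bytes",
    lambda f: f"frozen-frame check on {len(f)} bytes",
])
pipeline.submit_frame(b"\x00" * 1024)
time.sleep(0.1)  # give the daemon worker threads a moment to print
```

The price of this design is exactly the coupling described next: everything has to fit in one process on one machine.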

Problems I see with the new Architecture:

  1. The new architecture trades durability for performance. If a worker crashes, we may lose frames or have to restart processing from the very beginning.
  2. The bigger issue is that both processes must now run on the same machine, so MediaConversion, Detector1, and Detector2 all have to be packed onto a single host. If a detector ran on another machine, it would not have the frame to process.
  3. How can we scale this system to add more detection processes? What if we reach a point where we need to scale each detector horizontally? We cannot do that now, because they are tightly coupled. We lost the advantages of microservices by bundling everything into one big monolith.

Takeaways from the article:

System design is full of tradeoffs. Understand the requirements before designing your system. Do not fixate on microservices and fancy serverless by default. Make the decision that can help you scale the system for at least four years ahead.

Another major takeaway from my side: never be afraid to re-architect. Never be afraid to call out a problem as a problem, and always keep scale in mind.

References:

https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90



