Scaling Up: Why did Amazon move Prime Video from distributed microservices to a monolith?
This post is based on the tech blog released by Amazon on how they scaled up the Prime Video quality-monitoring service. The original blog itself is not very elaborate, so here we will dissect it and read between the lines.
You can refer to the original article: PrimeVideoTech
Caution: This article is not verified by the Prime Video team. I have read the original blog extensively, but many things are not explained well, so I used my own experience and knowledge to fill the gaps and write this.
Prime Video Monitoring Use Case:
Amazon's Prime Video is one of the most successful streaming video services, where you can watch movies, shows, and live streams. We use devices like our phones and laptops to watch the content. These devices, or clients (more precisely), interact with Amazon's servers to fetch the content.
Many streaming services use encryption/decryption between the server and the client to secure the content in transit. A perceptual quality issue observed at the output may come from a bug in the server code, a bug in the client code, or just a network issue. So the Prime Video team developed a monitoring service that re-processes the decrypted frames and analyzes them for issues.
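To make the idea concrete, here is a minimal sketch of the kind of per-frame checks such a monitor might run. The defect names, thresholds, and frame representation below are my own illustrative assumptions, not Prime Video's actual detectors.

```python
# Illustrative per-frame quality checks (assumed, simplified):
# a frame is modeled as a flat list of pixel luma values.

def is_black_frame(frame, luma_threshold=16):
    """A frame whose pixels are all near-zero luma is likely a black-frame defect."""
    return all(pixel <= luma_threshold for pixel in frame)

def is_frozen_frame(frame, previous_frame):
    """Two identical consecutive frames suggest the video has frozen."""
    return previous_frame is not None and frame == previous_frame

def analyze_stream(frames):
    """Run every decrypted frame through the defect detectors."""
    defects = []
    previous = None
    for index, frame in enumerate(frames):
        if is_black_frame(frame):
            defects.append((index, "black_frame"))
        elif is_frozen_frame(frame, previous):
            defects.append((index, "frozen_frame"))
        previous = frame
    return defects
```

The key point is that this analysis runs per frame, which is exactly what makes the scale numbers below so demanding.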
Scale:
As we have seen, each frame of a customer's live stream needs to be processed. This leads to millions of parallel upload streams, where each customer's device sends the decrypted frames back to the Amazon monitoring service for analysis.
However, it looks like they did not intend to run this monitoring for all customers. They aimed to start with a few thousand streams and scale up as needed.
Distributed Processing Pipeline:
Initially, they designed a distributed pipeline using serverless Step Functions. As we discussed in my previous article on serverless, the ease of use helped them design the pipeline quickly and scale it horizontally based on load.
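A Step Functions-style pipeline is essentially a chain of independent workers: each state is a separate function (a Lambda in the real system), and the orchestrator passes each state's output to the next as input. The sketch below simulates that shape; the state names and payload fields are my assumptions based on the pipeline the blog describes, not the actual state machine.

```python
# Toy simulation of an orchestrated pipeline: each stage is an isolated
# worker, and the orchestrator moves the payload between them. In the real
# system every transition is a billed Step Functions state transition.

def convert_frames(payload):
    """Assumed stage: convert the uploaded frames into an analyzable format."""
    payload["converted"] = True
    return payload

def detect_defects(payload):
    """Assumed stage: run the defect detectors over the converted frames."""
    payload["defects"] = []  # detectors would populate this
    return payload

PIPELINE = [convert_frames, detect_defects]

def run_state_machine(payload):
    for state in PIPELINE:
        payload = state(payload)  # one orchestration transition per stage
    return payload
```

Because each stage is isolated, the payload (here a small dict, but in reality video frames) must be handed off through external storage between stages, which is where the cost shows up.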
Bottlenecks with Distributed Pipeline:
As we have seen, the pipeline is based on microservices and a distributed framework. They used serverless components to scale the detection process horizontally with load. Also, S3 was used as the main storage that every process could refer to.
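A quick back-of-envelope calculation shows why per-frame hand-offs through S3 become the bottleneck: with N frames and K pipeline stages, every stage writes its intermediate result to S3 and the next stage reads it back, so the S3 round-trips grow as roughly N × K. The frame rate and stage count below are illustrative assumptions, not Prime Video's published figures.

```python
# Rough count of S3 calls generated by a staged pipeline that hands off
# every frame through object storage (illustrative numbers only).

def s3_round_trips(frames_per_second, stages, seconds):
    frames = frames_per_second * seconds
    writes = frames * stages        # every stage persists its output
    reads = frames * (stages - 1)   # every stage after the first reads it back
    return writes + reads

# One stream at 30 fps through a 3-stage pipeline for one minute:
# 1800 frames -> 1800*3 writes + 1800*2 reads = 9000 S3 calls per stream-minute.
```

Multiply that by thousands of concurrent streams and the data-transfer and request costs dominate the actual detection work.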
Solution and Tradeoff:
The major issue was the combination of the orchestration approach and S3 for reading/storing data. It provides durability for the system at the cost of performance: even if a worker crashes in the middle of the detection process, the frame is still in S3, so the process can restart and reuse it.
But if we think about it, they might not need that durability, because these bugs do not vary per customer. If we take a random sample of a few customers, we should be able to identify most of them. We can lose some frames with this approach and still detect the problems at a later point. So they decided to trade off durability for performance by keeping images locally instead of sending them to S3.
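The sampling argument above can be put in numbers. If a player bug shows up in a fraction p of sessions, the probability that a random sample of n monitored customers never sees it is (1 − p)^n. The figures below are illustrative assumptions, just to show how fast that miss probability collapses.

```python
# Probability that a bug affecting a fraction p of sessions goes completely
# unseen across n independently sampled monitored customers.

def miss_probability(p, n):
    return (1 - p) ** n

# A bug hitting 1% of sessions, with 1000 sampled customers:
# miss_probability(0.01, 1000) is about 4.3e-5 -- vanishingly unlikely
# to go undetected, even though most streams are not monitored at all.
```

This is why losing individual frames (or even whole streams) is acceptable: the signal they are after is systemic, not per-customer.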
New Architecture:
They still use an orchestrated pipeline. But, as we saw earlier, each activity/task of the old pipeline uploaded its data to S3 for the next task to read. In the new architecture they do not store it in S3; they keep it locally, on the same instance, for the next process to consume.
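In code, the difference is that the same stages now run inside one process and hand frames to each other as in-memory objects instead of S3 keys. The stage names and placeholder detector below are my assumptions; the point is simply that no network hop or orchestration transition sits between steps any more.

```python
# Monolith version of the pipeline sketch: the stages from before, but
# composed as ordinary function calls inside a single process. All
# intermediate data stays in this process's memory.

def convert(frame):
    """Assumed stage: wrap/convert a raw frame for analysis."""
    return {"frame": frame, "converted": True}

def detect(item):
    """Assumed stage: placeholder defect check on the converted frame."""
    item["defect"] = item["frame"] == "black"
    return item

def process_stream(frames):
    results = []
    for frame in frames:            # no S3 writes, no state transitions
        results.append(detect(convert(frame)))
    return results
```

The cost of this design choice is that the stages now scale together as one unit instead of independently, which is the classic monolith trade-off.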
Problems I see with the new Architecture:
Takeaways from the article:
System design is full of tradeoffs. Understand the requirements before designing your system. Do not fixate on using microservices and fancy serverless. Make decisions that can help you scale the system for at least 4 years ahead.
Another major takeaway from my side: never be afraid to re-architect. Never be afraid to call a problem a problem, and always keep scale in mind.
References: