From Apache Spark to Ray: How Amazon Saved $100 Million with This Switch

Imagine a company so large that even the smallest performance improvements can lead to millions in savings. At Amazon, where data pipelines power everything from logistics to customer insights, even a minor tweak in efficiency can transform operations.

Recently, Amazon made a game-changing shift in its data processing strategy.

Amazon's Business Data Technologies (BDT) team migrated its data processing jobs from Apache Spark to Ray to tackle the challenges of exabyte-scale data.

Here's how they saved $100 million a year:

The Problem


[Image credit: Amazon]

  1. Merge Operation Struggles: As Amazon's data grew, "merge" operations became increasingly slow and unreliable, taking days or weeks to complete (the operation itself is illustrated in the sketch after this list).
  2. Compaction Job Delays: Spark's performance lagged at exabyte scale. Traditional Spark jobs took too long to finish, and options for scaling them further were limited.
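For intuition, a merge (or compaction) pass folds a log of row-level changes into a single deduplicated snapshot of a table. The Python sketch below is purely illustrative; the `compact` function and record layout are hypothetical, not Amazon's implementation:

```python
# Illustrative only: fold an ordered log of row-level changes
# (upserts and deletes) into a base table keyed by primary key,
# keeping just the latest version of each row.

def compact(base: dict, deltas: list) -> dict:
    table = dict(base)
    for change in deltas:
        if change["op"] == "delete":
            table.pop(change["pk"], None)
        else:  # upsert: insert or overwrite the row
            table[change["pk"]] = change["row"]
    return table

base = {1: {"sku": "A", "qty": 3}}
deltas = [
    {"op": "upsert", "pk": 2, "row": {"sku": "B", "qty": 1}},
    {"op": "upsert", "pk": 1, "row": {"sku": "A", "qty": 5}},
    {"op": "delete", "pk": 2},
]
print(compact(base, deltas))  # -> {1: {'sku': 'A', 'qty': 5}}
```

The same fold, run over billions of rows per table, is what made slow merge jobs so costly at Amazon's scale.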

The Solution

  1. Moving to Ray: BDT tested Ray and discovered it could handle datasets 12 times larger than Spark could, while being 91% more cost-efficient. Ray's advanced task orchestration and zero-copy shuffling contributed to better resource usage and faster processing speeds (see the sketch just after this list).
  2. Serverless Design: BDT switched to a serverless architecture, running Ray on EC2 and using DynamoDB and other AWS services for job tracking and management (a brief job-tracking sketch appears below).
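To make the Ray model concrete, here is a minimal sketch, assuming only `pip install ray` and `numpy`; the function name and data are invented for illustration and are not Amazon's code. It shows the two mechanics credited above: task orchestration via remote functions, and the shared-memory object store that lets tasks on the same node read large arrays without copying them:

```python
import numpy as np
import ray

ray.init()  # starts a local Ray runtime; on a cluster this attaches to it

@ray.remote
def sum_partition(partition: np.ndarray) -> float:
    # Tasks on the same node read the array directly from Ray's
    # shared-memory object store (zero-copy for numpy arrays).
    return float(partition.sum())

# Put one large array into the object store once...
data_ref = ray.put(np.ones(10_000_000, dtype=np.float64))

# ...then fan out tasks that all share it; Ray schedules them across
# the cluster and returns futures that resolve to each task's result.
futures = [sum_partition.remote(data_ref) for _ in range(8)]
print(sum(ray.get(futures)))  # 8 tasks x 10M ones -> 80000000.0
```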



[Image credit: Amazon]
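The job-tracking half of that serverless design can be pictured with a small boto3 sketch. The table name, key schema, and status values are hypothetical stand-ins, not Amazon's actual schema:

```python
import time
import boto3

# Hypothetical DynamoDB table with partition key "job_id" (string).
dynamodb = boto3.resource("dynamodb")
jobs = dynamodb.Table("compaction-jobs")

def record_job_status(job_id: str, status: str) -> None:
    # Each compaction run writes its latest state so a serverless
    # control plane can retry or resume work without a long-lived driver.
    jobs.put_item(Item={
        "job_id": job_id,
        "status": status,  # e.g. "PENDING", "RUNNING", "SUCCEEDED"
        "updated_at": int(time.time()),
    })

record_job_status("inventory-table/partition-0042", "RUNNING")
```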

Results


  • 82% more efficient: Ray sped up compaction, completing jobs in a fraction of the time Spark required.
  • $100 million savings: By shifting to Ray, Amazon reduced its computational costs by $100 million annually.
  • 250,000 vCPU-years saved: Ray cut compute needs by roughly 250,000 vCPU-years every year (at typical EC2 on-demand rates of about $0.04 to $0.05 per vCPU-hour, that lines up with the $100 million figure).
  • Improved reliability: Ray's reliability rose from 85% to 99.15%, closely approaching Spark's 99.91%.
  • Reduced memory usage: Ray ran at about 55% of server memory, improving utilization for large-scale operations.

Ray's speed and scalability allowed Amazon to meet its massive data processing needs while dramatically improving both cost and operational performance.

Future Outlook

Ray is a strong contender for large-scale data operations, particularly for solving specific, complex problems. The team is working on adapting Ray's compaction algorithm to integrate with Apache Iceberg, with support expected in 2025. Ray's flexibility makes it a valuable tool for organizations willing to invest in tailored solutions to challenging, costly problems.

