Aurora Serverless v2: AWS's Surprising Return to 'Obsolete' Tech, Challenging Its Own Rules to Improve the Game-Changing Database Service
Igor Aronov
Distinguished Engineer at McKinsey & Company | Cybersecurity & Cloud Innovator | Ex Wall St | CISSP | AWS 10X
Last week, AWS engineers pulled back the curtain and published a fascinating paper on the major advancements in the Aurora Serverless relational database service and the journey from the original ASv1 to v2.
Here are the most notable improvements:
While these achievements are impressive and will provide a more flexible, cost-efficient, and seamless serverless database experience for us, the customers, the engineer in me wondered how such a breakthrough was achieved.
The paper does not disappoint: it goes into substantial design detail, explaining innovations that, to an industry veteran like me, look more like reinventions of familiar technologies and a departure from AWS's own architectural dogmas.
Dogma 1: "Everything Fails, All the Time" (AWS CTO Werner Vogels)
From their early days, AWS departed from the view held by traditional IT vendors that infrastructure and platforms have to be intrinsically highly available, absolving applications of that duty. Through the Well-Architected Framework, they encouraged us to embrace failure and design around it at higher layers, closer to or built into the application. Their services rarely achieved the much-coveted five or six nines of availability we were used to in the traditional data center. Features like VMware's vMotion, which allowed workloads, unaware of the unstable world around them, to move live from host to host (or even across data centers), have become virtually obsolete.
I got used to this over the years; hence my surprise at finding out that one of the key innovations allowing ASv2 to scale so well is ... drum roll ... live migration of virtual machines. The paper does not discuss the philosophical shift away from AWS reliability principles and provides few details about the implementation. It does mention changes to the hypervisor (no surprise here) and the development of a brand-new Nitro- and SR-IOV-based instance type to support this operation.
Live migration of EC2 instances and their derivatives is rumored to be used by AWS to navigate around hardware maintenance events, but it has not been made available to customers to leverage in their own architectures. Perhaps, since it has been packaged as an instance type, this is an architectural element AWS will continue to use and eventually offer to customers? Let's wait for re:Invent 2024 and see.
Dogma 2: Capacity Is Fully Provisioned
When you create an instance of a service in AWS, be it EC2 or Lambda, one way or another you specify the amount of capacity (memory, storage, compute) you are willing to pay for. AWS will (ostensibly) allocate all of it to you, and that is what will show up on the billing report. Yet your application is unlikely to actually use all of it, or at least not all the time.
Enterprise data center owners have long capitalized on this phenomenon, implementing techniques such as thin provisioning, deduplication, transparent page sharing, and so on. Applications still have the illusion of full capacity, but what they do not actually use goes back into a common pool of resources and serves as a buffer for everyone's benefit. Resource socialism, in a way; it has been a significant factor in capacity-planning economics.
As you may be guessing by now, ASv2 introduces another key innovation: oversubscription. We learn that the sum of vCPUs across all Aurora instances on a host can exceed the total number of physical CPUs, and that the total memory corresponding to the instances' customer-configured max ACUs can exceed the host's physical memory.
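To picture what oversubscription means at the host level, here is a back-of-the-envelope sketch in Python. The ACU-to-resource mapping, the instance names, and every number are assumptions for illustration only, not AWS's actual accounting or placement policy:

```python
# Back-of-the-envelope oversubscription check for a single host.
# The ACU-to-resource mapping and all numbers below are assumptions for
# illustration; the paper's actual placement policies are far more elaborate.
from dataclasses import dataclass

GIB_PER_ACU = 2.0      # an ACU corresponds to roughly 2 GiB of memory (assumed here)
VCPUS_PER_ACU = 0.25   # purely illustrative CPU mapping, not an AWS figure

@dataclass
class AuroraInstance:
    name: str
    max_acus: float      # customer-configured ceiling
    current_acus: float  # what the instance is consuming right now

def host_commit_ratios(instances, physical_vcpus, physical_mem_gib):
    """Compare the sum of configured maxima against the host's physical capacity."""
    committed_mem = sum(i.max_acus * GIB_PER_ACU for i in instances)
    committed_vcpus = sum(i.max_acus * VCPUS_PER_ACU for i in instances)
    used_mem = sum(i.current_acus * GIB_PER_ACU for i in instances)
    return {
        "memory_commit_ratio": committed_mem / physical_mem_gib,  # > 1.0 means oversubscribed
        "vcpu_commit_ratio": committed_vcpus / physical_vcpus,    # > 1.0 means oversubscribed
        "memory_utilization": used_mem / physical_mem_gib,        # what is actually in use
    }

fleet = [
    AuroraInstance("orders-db", max_acus=128, current_acus=22),
    AuroraInstance("reporting-db", max_acus=64, current_acus=9),
    AuroraInstance("dev-db", max_acus=16, current_acus=1),
]
print(host_commit_ratios(fleet, physical_vcpus=48, physical_mem_gib=384))
# Committed memory: (128 + 64 + 16) * 2 = 416 GiB on a 384 GiB host (~1.08x),
# while actual use is only (22 + 9 + 1) * 2 = 64 GiB (~17%) -- the gap the
# resource manager can hand out as a shared buffer.
```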
While this is not a new idea, I am genuinely happy about this development but have two questions for AWS:
Are they planning to promote this approach broadly, as an internal optimization technique and/or a customer-facing feature?
... and ...
Will they make the efficiency gains visible to us via resource consumption metrics (CloudWatch) and stop charging us for what the applications running inside the instance are not actually using?
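On the metrics question: Aurora Serverless v2 already publishes per-instance ACU metrics in CloudWatch (ServerlessDatabaseCapacity and ACUUtilization in the AWS/RDS namespace), so part of the consumption picture is visible today. A minimal boto3 sketch; the instance identifier is a placeholder:

```python
# Minimal sketch: pull the ACU-level metrics Aurora Serverless v2 already
# exposes in CloudWatch. The instance identifier is a placeholder.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

for metric in ("ServerlessDatabaseCapacity", "ACUUtilization"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-asv2-instance"}],
        StartTime=start,
        EndTime=end,
        Period=3600,                       # hourly datapoints
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"], point["Maximum"])
```

What these metrics do not show is the host-level pooling, which is exactly the visibility question above.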
True to its title, the paper also goes in depth into a rather sophisticated resource manager featuring fleet-wide and within-host techniques, as well as token-bucket-based regulation of instance growth. The design is painfully reminiscent of VMware DRS, which places this architecture further along a trajectory familiar to us from the distant past.
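For readers unfamiliar with the mechanism, a token bucket limits how fast a consumer can grow by granting "tokens" that refill at a fixed rate. Below is a generic sketch of the idea applied to ACU growth; the class name, rates, and burst size are invented and bear no relation to the paper's actual parameters:

```python
# Generic token-bucket sketch of rate-limiting how fast an instance may grow
# its ACU allocation. Refill rate and burst size are invented numbers; the
# paper's actual resource manager is far more elaborate (fleet-wide placement,
# within-host arbitration, and so on).
import time

class GrowthTokenBucket:
    def __init__(self, refill_acus_per_sec: float, burst_acus: float):
        self.rate = refill_acus_per_sec
        self.capacity = burst_acus
        self.tokens = burst_acus
        self.last = time.monotonic()

    def try_grow(self, requested_acus: float) -> float:
        """Grant as much of the requested growth as the bucket currently allows."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        granted = min(requested_acus, self.tokens)
        self.tokens -= granted
        return granted

# Usage: an instance asking for a large step up gets it in regulated increments.
bucket = GrowthTokenBucket(refill_acus_per_sec=0.5, burst_acus=4.0)
print(bucket.try_grow(16))   # -> 4.0 (the burst); the rest must wait for refills
time.sleep(2)
print(bucket.try_grow(16))   # -> ~1.0 after two seconds of refill
```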
The goal of this article is not to criticize AWS or to shed tears for VMware or IBM's z/VM. I am actually impressed by the openness and unconstrained imagination of AWS's engineers. As part of this monumental effort, they developed genuinely new building blocks previously unavailable in the AWS ecosystem and collected a lot of useful data through careful empirical observation. Thinking outside the box, they evaluated, and eventually discarded, the Firecracker VMM, which underpins the iconic serverless Lambda and may ultimately be responsible for its limitations when it comes to high-performance workloads.
The evolution of Aurora Serverless from v1 to v2 mirrors the Japanese concept of Shuhari in cloud architecture. AWS initially followed its established rules, then broke from convention creatively, and finally transcended its own dogmas. This journey benefits customers through improved performance, cost efficiency, and a more seamless experience. By challenging norms, AWS has set a new standard for cloud database services, encouraging all of us to continually reassess and improve our approach to cloud architecture. This willingness to innovate beyond established principles not only pushes the boundaries of technology but also aligns more closely with evolving customer needs, potentially paving the way for future advancements in other cloud services.
This will definitely make public cloud architecture more attractive, and small and medium-sized companies will benefit more from an event-driven architecture as opposed to a capacity-provisioned one. Bigger companies using private cloud will be re-analyzing their decisions. It's a choice between security and cost, and it all depends on demand and on the speed with which private cloud providers start to work on serverless offerings. That said, I do believe Serverless v2 introduces cost unpredictability even if it appears to save costs. Companies like to know in advance how much they are going to spend every year, as most budgetary decisions are made by taking the previous year's spend and adding a percentage as a buffer for inflation. So unless budget calculations are also revamped (and go serverless :-) ), this kind of dynamic up/down cost variation will have accounting teams scratching their heads. I also believe there will still be latency issues during the scaling period, and there is potential for over-provisioning during high workloads because scaling is based on ACUs rather than raw compute resources.
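To make the budgeting point concrete, here is a toy comparison of a fixed provisioned instance against per-ACU-hour serverless billing under a spiky daily load. Every price and the load profile are invented for illustration; real Aurora pricing varies by region, configuration, storage, and I/O:

```python
# Toy cost comparison: fixed provisioned instance vs. per-ACU-hour serverless
# billing under a spiky daily load. All prices and the load profile are
# invented for illustration; they are not taken from any AWS price sheet.

HOURS_PER_MONTH = 730
PRICE_PER_ACU_HOUR = 0.12      # assumed serverless rate
PROVISIONED_MONTHLY = 700.00   # assumed cost of a comparable fixed instance

def daily_acu_profile():
    """8 busy business hours at 16 ACUs, the rest idling near a 0.5 ACU floor."""
    return [16.0 if 9 <= hour < 17 else 0.5 for hour in range(24)]

acu_hours_per_day = sum(daily_acu_profile())                        # 136 ACU-hours
serverless_monthly = acu_hours_per_day * (HOURS_PER_MONTH / 24) * PRICE_PER_ACU_HOUR

print(f"Serverless (spiky load): ~${serverless_monthly:,.0f}/month")   # ~$496
print(f"Provisioned (fixed):     ~${PROVISIONED_MONTHLY:,.0f}/month")  # $700
# With these made-up numbers, serverless comes in under the fixed instance
# (~$496 vs. $700), but the bill moves with demand -- which is exactly the
# budgeting unpredictability described above.
```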