Classloaders to the Rescue!

Carlos Arguelles

Published: September 2, 2024

How I used three esoteric, obscure features of Java to solve a real-world problem at Amazon

[cross post of my original 10/26/21 blog since that blog may be behind a paywall sometimes.]

Sometimes I read about a weird, esoteric feature in the JVM and I think: OK, sure, but when the heck is this ever actually used? Back in 2013, I was facing a tricky little problem at Amazon, and I ended up using three weird, esoteric features in the JVM to solve it.

It all started when I needed to load test my service, so I wrote a bunch of load generation code. It was responsible for executing transactions at a desired rate (so that I could answer questions like, “what happens when my system receives a load of 250,000 transactions per second?”). This sounds easy, but it gets tricky at high throughput and requires careful thread management (hint: ScheduledExecutorService on steroids, but also distributed among hundreds or thousands of machines). It also allowed for blending transaction types: if you wanted to load test an RPC service that had three APIs, X, Y, and Z, you could write one piece of code to simulate a customer calling X, one to simulate a customer calling Y, and one to simulate a customer calling Z. Then you could ask the platform to execute your code not just at a desired rate, but also with a desired blend of operations, e.g., run at 10,000 transactions per second for 30 minutes with 70% X, 20% Y, 10% Z. Lastly, the core engine aggregated metrics for the run, such as throughput, error rate, and percentile latency for all these RPC calls over time, and created dashboards.
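To make the "desired rate plus desired blend" idea concrete, here is a minimal single-machine sketch. It is not the platform's actual code: the class name `BlendedLoad` and its methods are hypothetical, and a real engine would need many threads and machines rather than one scheduler thread. It only shows the two core pieces: a fixed-rate schedule and a weighted pick among transaction types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: fixed-rate execution with a weighted blend of transactions.
final class BlendedLoad {
    private final Map<String, Double> blend = new LinkedHashMap<>();       // name -> fraction of traffic
    private final Map<String, Runnable> transactions = new LinkedHashMap<>();

    BlendedLoad add(String name, double fraction, Runnable transaction) {
        blend.put(name, fraction);
        transactions.put(name, transaction);
        return this;
    }

    /** Maps a roll in [0, 1) to a transaction name according to the blend. */
    String pick(double roll) {
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> entry : blend.entrySet()) {
            cumulative += entry.getValue();
            last = entry.getKey();
            if (roll < cumulative) {
                return last;
            }
        }
        return last; // fallback when fractions sum to slightly under 1 due to rounding
    }

    /** Fires one transaction every 1/tps seconds until the duration elapses. */
    void run(int tps, long durationMillis) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long periodNanos = TimeUnit.SECONDS.toNanos(1) / tps;
        // Note: if a transaction throws, scheduleAtFixedRate cancels later runs,
        // so real transactions should catch and record their own failures.
        scheduler.scheduleAtFixedRate(
                () -> transactions.get(pick(ThreadLocalRandom.current().nextDouble())).run(),
                0, periodNanos, TimeUnit.NANOSECONDS);
        Thread.sleep(durationMillis);
        scheduler.shutdownNow();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        new BlendedLoad()
                .add("X", 0.7, () -> System.out.println("call X"))
                .add("Y", 0.2, () -> System.out.println("call Y"))
                .add("Z", 0.1, () -> System.out.println("call Z"))
                .run(10, 1_000); // roughly 10 transactions per second for one second
    }
}
```

Keeping the pick as a pure function of the random roll makes the blend logic easy to check separately from the timing logic.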

I realized that most of the code I had written to load test my service was reusable, so others could leverage it. My framework was generic: engineers could write their product-specific code to be executed by it. So I decided to clean it up, refactor it, and offer it as a platform more broadly for other Amazonians (I didn’t know it back then, but this would eventually become the platform that tens of thousands of services use today to ensure they’re ready for peak traffic). The desire to open up the platform got me into the business of vending my software to other engineers at Amazon.

Some software companies, like Google, have their code in a giant monorepo. Everybody works in the same repository, with the same dependencies, and building off the same version (head). Other companies, like Amazon, have smaller granularity, often per team. Amazon calls these things versionsets (“VS”)— they are isolated build and runtime closures (your Java classpath). You create a versionset for your product, initially empty. And you add Java packages (JARs) to it. Those JARs bring their dependencies into your VS, because they need them to operate. Those dependencies bring their dependencies, which brings their dependencies, and so forth. So when you added a JAR to your VS it could bring hundreds of transitive JARs. Your VS could get bloated fast (and unless you regularly cleaned up unused packages they would live in it forever). To keep things stable, a VS only accepted specific versions of Java packages. For example, you brought Log4j 2.14 into your VS. If there was a Log4j 2.15 version, you had to bring it explicitly. This kept your VS from breaking when there were API changes in dependencies. That’s where “versionset” got its name: it was an isolated set of Java JARs and versions, to be used as build and runtime closure for a product.

In a monorepo world, you just say to your customers, “hey! grab my code from this path in the repository.” Done. In a versionset world, those customers need to import your Java package into their VS explicitly. That was the status quo of vending software at Amazon, to just say “hey bring my Java package version x.y into your versionset and start using it!”

[ If you’re wondering: why didn’t I just build a service? That would have solved this. There were other, more complicated reasons I couldn’t vend this as a service at that time. ]

I did not love that software vending model, for many reasons.

Being a good citizen. My dependencies got passed to customers’ versionsets. If I was going to have the power to bring hundreds of transitive dependencies into thousands of versionsets at Amazon, that was a big responsibility. Sometimes, you would end up with dependency conflicts (package A brought Foobar-2.1 into my VS, but package B brought Foobar-2.2 into my VS. Which one do I use, and what happens if they’re not backwards compatible? Ahrhghghggghhhh!!!). Your VS was broken, so you needed to drop everything you were doing and spend the next 5 hours in dependency hell, resolving conflicts one by one. If you were lucky, it broke at compile time; if you were unlucky, at runtime. It was a rite of passage at Amazon, involving crying, cursing, sacrificing your first-born child, and having your coworkers commiserate with you.
Speed of innovation. If bringing a dependency into my VS could get amplified a thousand-fold when that dependency got added to thousands of customer versionsets, that was going to make me think twice about each dependency I brought. This was going to stifle innovation and speed of delivery of changes.
Ability to fix a bug in a timely fashion. Once your code was imported into somebody’s VS, it wasn’t refreshed again until they explicitly refreshed it. Most service owners refreshed dependencies on a cadence, sometimes nightly, sometimes weekly, but also sometimes manually, and sometimes never. If I pushed a bad bug, it was impossible to ensure that 100% of the VSs that got the bug also picked up the fix. For every urgent bug fix, I was going to have to file tickets to all my customers to get them to refresh their VS ASAP, which would be a huge inconvenience to them.
Predictable operations. In terms of providing operational support for my platform, multiple versions of my platform out there meant the platform could behave differently for different customers, so it could become an operational nightmare, having to ask the person you’re helping what version of the platform they’re using and then having to remember the quirks of that specific build.
Customer experience & onboarding cost. The onboarding experience was toilsome: bring my JAR into your versionset, deal with all the dependency conflicts, change your ant build file, export and deploy your own main() as a standalone executable. On average, it took a couple of days. I wanted it instant: type a command and it just works!

I disliked everything about the status quo. I needed to think outside the box.

Classloaders to the Rescue!

The basic problem was that if you had a piece of code that was built in one VS (my framework’s), and you tried to use a JAR from a different VS (the customer’s), the JVM wasn’t happy, because they had different build and runtime closures. I was chatting with one of my mentors, Cary (a Principal at Amazon), and he casually mentioned classloaders. I didn’t know anything about classloaders, but my friend Trevor did and he helped me bootstrap the thing.

What are Java classloaders? “The Java Class Loader is a part of the Java Runtime Environment that dynamically loads Java classes into the Java Virtual Machine. Usually classes are only loaded on demand. The Java run time system does not need to know about files and file systems as this is delegated to the class loader.”

Normally, you give classloaders very little thought. Classes are just magically loaded when you need them. How? Not my problem. Most programs have a single classpath, so things are straightforward. You don’t normally think about how those classes are actually being loaded.

Turns out you can have multiple classloaders running in the same JVM, which gives you classpath isolation. I could have also achieved that by running different processes and communicating via sockets, but this was for code that needed to execute hundreds of thousands of times per second, so performance was very important. I profiled and verified that two classloaders in the same JVM were significantly faster than two JVMs.

Trevor was right: this seemed like potentially a good solution for my little problem. This was going to allow me to have multiple versions of the same dependency, one in the classloader responsible for the platform code, and one in the classloader responsible for the customer code, happily co-existing with each other.

My platform booted in the primary, default classloader. It then created a child classloader, and it loaded the customer JAR (and its dependencies) into the child classloader. It worked!
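As a rough illustration of that boot sequence, the sketch below uses the standard `java.net.URLClassLoader` to create a child classloader over a customer JAR. The class name `CustomerCodeLoader`, the JAR path, and the customer class name are all hypothetical; the original platform's code is not shown in this post.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Hypothetical sketch: the platform runs in the default classloader and
// loads customer code through a child URLClassLoader.
final class CustomerCodeLoader {

    /** Creates a child loader whose parent is the loader that loaded the platform. */
    static URLClassLoader newChildLoader(Path... jars) throws Exception {
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            urls[i] = jars[i].toUri().toURL();
        }
        return new URLClassLoader(urls, CustomerCodeLoader.class.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        // Usage: java CustomerCodeLoader.java <customer.jar> <fully.qualified.ClassName>
        try (URLClassLoader child = newChildLoader(Path.of(args[0]))) {
            Class<?> customerClass = child.loadClass(args[1]);
            System.out.println(customerClass + " loaded by " + customerClass.getClassLoader());
        }
    }
}
```

Note that a plain `URLClassLoader` still delegates to its parent first, so any class the platform already has on its classpath wins over the customer's copy.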

I did run into one little problem. Java’s default classloader behavior is a parent-first strategy. That means that whenever a class needs to be loaded, the classloader first searches in the parent context and, only if the class is not found there, searches in the child context. This generally makes sense, but in my case it meant my customers’ code could end up using my dependencies instead of theirs, which led to very subtle runtime failures that were very hard to debug and understand. So Daniel and I ended up writing our own classloader, switching to a child-first strategy (good article about parent-first vs. child-first classloader delegation). There was the occasional weird class that had its own classloader logic (log4j, I’m looking at you…), so we even had to have some custom logic in the classloader to treat such classes differently.
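A child-first loader can be sketched by overriding `loadClass` so that it tries its own URLs before delegating. This is a minimal illustration, not the classloader Daniel and I wrote: the name `ChildFirstClassLoader` is hypothetical, and the special-casing for libraries such as log4j is left out. The one exception kept here is that `java.*` classes always come from the parent chain, because the JVM does not allow an ordinary classloader to define them.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch of child-first delegation on top of URLClassLoader.
final class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);       // already defined by this loader?
            if (loaded == null) {
                if (name.startsWith("java.")) {
                    loaded = super.loadClass(name, false); // core classes: parent chain only
                } else {
                    try {
                        loaded = findClass(name);          // child first: look in our own URLs
                    } catch (ClassNotFoundException notInChild) {
                        loaded = super.loadClass(name, false); // then fall back to the parent
                    }
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs, every lookup falls through to the parent.
        try (ChildFirstClassLoader loader =
                     new ChildFirstClassLoader(new URL[0], ChildFirstClassLoader.class.getClassLoader())) {
            System.out.println(loader.loadClass("java.lang.String") == String.class);
        }
    }
}
```

In practice a loader like this usually also needs a list of packages that must stay parent-first (for example, any API shared between platform and customer code), otherwise the two sides end up with different copies of the shared types.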

I had a bigger problem though. The JVM does not consider a class loaded from classloader A to be castable to the exact same class loaded from classloader B. Sure, you, the human, know they’re the same class. But if they’re living in different classloaders, the JVM thinks they’re different.
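The effect is easy to reproduce in a small experiment. The sketch below is an illustration under assumptions: `ClassIdentityDemo`, `Payload`, and `IsolatingLoader` are hypothetical names, and the loader simply re-reads the bytes of one class and defines it a second time instead of delegating to its parent.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: the same class bytes, defined by two loaders, yield two distinct classes.
final class ClassIdentityDemo {

    /** A trivial class that ends up loaded twice, by two different loaders. */
    public static class Payload {}

    /** Defines Payload itself; delegates every other class to the parent. */
    static final class IsolatingLoader extends ClassLoader {
        IsolatingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(Payload.class.getName())) {
                return super.loadClass(name, resolve);
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> loaded = findLoadedClass(name);
                if (loaded == null) {
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) {
                            throw new ClassNotFoundException(name);
                        }
                        byte[] bytes = in.readAllBytes();
                        loaded = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                return loaded;
            }
        }
    }

    /** Loads a second copy of Payload through a fresh isolating loader. */
    static Class<?> loadIsolated() throws ClassNotFoundException {
        ClassLoader parent = ClassIdentityDemo.class.getClassLoader();
        return new IsolatingLoader(parent).loadClass(Payload.class.getName());
    }

    public static void main(String[] args) throws Exception {
        Class<?> foreign = loadIsolated();
        Object instance = foreign.getDeclaredConstructor().newInstance();
        System.out.println(foreign.getName().equals(Payload.class.getName())); // same name
        System.out.println(foreign == Payload.class);   // different Class object
        System.out.println(instance instanceof Payload); // a cast to Payload would throw ClassCastException
    }
}
```

To the JVM, a class's identity is its fully qualified name plus its defining classloader, which is why the two copies are unrelated types even though they come from identical bytes.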

Reflection to the Rescue!

I had used one obscure, esoteric feature of the JVM to solve the first problem (create a secondary classloader to load customer code with clashing dependencies). So I turned to another obscure, esoteric feature of the JVM to solve my second problem (call code from one classloader in a different classloader): reflection.

What is Java reflection? From here, “Reflection is an API which is used to examine or modify the behavior of methods, classes, interfaces at runtime.”

Reflection is ugly to look at, but incredibly powerful. Say I wanted to call a method doSomething in an object foo from my secondary class loader. Instead of doing this:

foo.doSomething();

I could do this:

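// requires: import java.lang.reflect.Method;
Method method = foo.getClass().getMethod("doSomething"); // no-arg method, so no parameter types are listed
method.invoke(foo);                                      // no arguments to pass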

This is hideous, verbose code, but it works very well across classloaders! It gets even uglier when parameters are involved in those method calls.
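To show what "uglier with parameters" looks like, here is a small self-contained sketch. The `ReflectiveCall` and `Greeter` names are made up for illustration. The parameter types have to be spelled out as `Class` objects to find the method, and the arguments are passed as an `Object` array.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: calling a method with parameters through reflection.
final class ReflectiveCall {

    /** Stand-in for a class that would normally live in another classloader. */
    public static class Greeter {
        public String greet(String word, int times) {
            return word.repeat(times);
        }
    }

    /** Looks up a public method by name and parameter types, then invokes it. */
    static Object call(Object target, String methodName, Class<?>[] parameterTypes, Object... arguments)
            throws Exception {
        Method method = target.getClass().getMethod(methodName, parameterTypes);
        return method.invoke(target, arguments);
    }

    public static void main(String[] args) throws Exception {
        Object result = call(new Greeter(), "greet",
                new Class<?>[] {String.class, int.class}, "hi", 2);
        System.out.println(result); // prints "hihi"
    }
}
```

One wrinkle when crossing classloaders: JDK types such as `String` and `int` are shared, but if a parameter type is itself a customer class, its `Class` object has to be obtained from the customer's classloader, or the method lookup will not match.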

Annotations to the Rescue!

I had one more decision to make. Traditionally, to write code that executes within a framework, the framework exposes a Java interface, and you must create a Java class that implements the interface. This is a tried-and-true approach, but I disliked a few things about it. First, sometimes customers already had test code that behaved like transactions X, Y and Z, so I didn’t want to force them to refactor it or keep multiple copies of it around. Second, I wasn’t 100% certain that I had ironed out the exact interface, and I suspected many new cases were going to surface as my platform gained more customers, so I wanted to harden the API but leave the door open for growth. Instead of Java interfaces, I turned to yet another somewhat esoteric feature of the JVM: annotations.

What are Java annotations? From here: Java annotations are used to provide metadata for your Java code. These can be runtime instructions that tell others what to do with your code.

Annotations turned out to be an elegant solution to my problem of how to evolve my platform and keep it flexible while having a reasonably hardened API. The annotations could encode all kinds of interesting metadata for my customers to tell my platform. You could annotate your initialization code, your termination code, and your transactions. TestNG and JUnit both do a nice job with this, with @BeforeClass, @BeforeTest, @Test, @DataProvider, etc., so I took a lot of my inspiration from TestNG. You could annotate pre-existing code in a couple of minutes and have a working load test!
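A minimal sketch of the idea follows. The annotation names `@Initialize` and `@Transaction`, and the classes `AnnotationScan` and `CustomerTest`, are hypothetical and are not the platform's real API. The framework side simply scans a customer class for annotated methods at runtime.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: discovering load-test transactions through annotations.
final class AnnotationScan {

    @Retention(RetentionPolicy.RUNTIME) // must be RUNTIME to be visible through reflection
    @Target(ElementType.METHOD)
    public @interface Initialize {}

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Transaction {
        String name();
    }

    /** Pre-existing customer test code, with annotations added on top. */
    public static class CustomerTest {
        public int calls;

        @Initialize
        public void setUp() { calls = 0; }

        @Transaction(name = "X")
        public void callX() { calls++; }

        @Transaction(name = "Y")
        public void callY() { calls++; }

        public void helper() {} // not annotated, so the framework ignores it
    }

    /** Returns the annotated transaction methods, keyed by transaction name. */
    static Map<String, Method> findTransactions(Class<?> type) {
        Map<String, Method> found = new TreeMap<>();
        for (Method method : type.getMethods()) {
            Transaction transaction = method.getAnnotation(Transaction.class);
            if (transaction != null) {
                found.put(transaction.name(), method);
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        CustomerTest test = new CustomerTest();
        for (Map.Entry<String, Method> entry : findTransactions(CustomerTest.class).entrySet()) {
            entry.getValue().invoke(test);
            System.out.println("ran transaction " + entry.getKey());
        }
    }
}
```

When customer code lives in a separate classloader, `getAnnotation` only finds the annotation if both sides see the same annotation `Class` object, so the annotation package would need to be loaded by a shared parent loader (or the annotations matched by name instead).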

And miraculously it all worked!

In my final solution, I vended my software via an interface package that contained annotations. Customers brought the interface package into their VS. Because it was just an interface, it brought no dependencies into their VS, just itself. And because the interface was hardened, I didn’t have to worry about customers refreshing the interface regularly. The interface offered a bunch of annotations, so it afforded me flexibility to grow it in the future. And lastly, since I loaded my closure into the primary classloader and my customer’s closure into a secondary classloader, incompatible classes could happily coexist.

To be honest, the code necessary to make all this happen is probably the ugliest code I’ve ever written. The APIs for dealing with classloaders, reflection and annotations are powerful, but they aren’t particularly elegant or beautiful, and neither was my code. But it was effective, and it has survived a decade in production: it is being executed millions of times per second right now as you’re reading this story. The tradeoff was between product complexity for me and a more complicated onboarding story for my customers, so I chose the former. Customer Experience always wins. I think that had a lot to do with the product usage growing to tens of thousands of services at Amazon using it every day to validate they can scale. The fact that it’s still running today is a testament to those design choices holding up to the passage of time.

Next time you see an esoteric feature, just think: it may end up being just what you need some day!

Kehinde Onadipe

SDE at Amazon Business

6 months ago

Nice read

Amuda Adeolu

Senior Software Engineer / Java Community Champion at Andela

6 months ago

Thanks for sharing, Carlos Arguelles

1. Some software companies, like Google, have their code in a giant monorepo. There are some notable exceptions to the use of this single widely accessible repository, particularly the two large open-source projects Chrome and Android, which use separate open-source repositories, and some high-value or security-critical pieces of code for which read access is locked down more tightly [1]

2. Reflection is ugly to look at, but incredibly powerful... Reflections comes with hidden price
i. You lose all the benefits of compile-time type checking
ii. Reflective method invocation is much slower than normal method invocation [2]

[1] Fergus Henderson, Software Engineering at Google, Revised 19 Feb 2019. Available at <https://arxiv.org/pdf/1702.01715> [Accessed: September 7, 2024]
[2] Joshua Bloch, Effective Java Programming Third Edition, pp 282-284

Rhythm Varshney

Top System Design Voice | SDE@OneCard | Backend Developer | Problem Solver | Health and Tech

6 months ago

With each line of read, my interest was increasing. However, in a VS where topologically packages get saved, is it a jar or branch reference of code? As I see some teams build RPMs over VS. Also, the classloader concept is interesting and would definitely be using it on demand. Very insightful

Albert R.

Client Technical Specialist and Chief Database Architect at Mphasis, a Blackstone company || Health AI @ DocNote.ai || GenAI Search @ MetaRAG.ai || GRC @ NIST.ai || KYC @ OFAC.ai

6 months ago

Carlos Arguelles good read: 'the code necessary to make all this happen is probably the ugliest code I’ve ever written.' That's because you are using Java. Hope you were SUPing for Labor Day.... Thanks again for your post.

Roman Fresneda Quiroga

6 months ago

The Almighty TPS Gen! I kinda loved that bit of every service launch at Amazon, It Just Worked! Testament to all those great, and shall I say, tasteful design decisions. Thanks for the deep thoughts into it!
