The Good, the Bad and the Ugly: watching my code evolve over a decade


Lessons learned in growing, maintaining, and trying to kill my product


[Cross-post of my original 11/16/21 blog post, since that post may sometimes be behind a paywall.]


During my career at Microsoft, Amazon and Google, I’ve spearheaded a number of new products. Bootstrapping something, particularly something greenfield, is cool and exciting. You’re pioneering. There’s no past chaining you. You’re not stuck with some old crappy code somebody wrote before you. There are no pre-existing customers that you can accidentally break with a bad push. You write the rules. It’s energizing.

Evolving an older codebase sounds like the exact opposite. Yet I assert that when you move from shiny object to shinier object, you miss out on an important part of maturing as a software engineer: observing the evolution of the code you wrote over many versions. How does it withstand the passage of time? By leaving, you avoid seeing the price for poor decisions you made, years down the road. You avoid watching a new generation of junior engineers analyzing every decision you made, and sometimes even thinking you’re an idiot for the choices you made. You avoid a lesson in humility.

Having to evolve a pre-existing, well-established piece of software requires an entirely different skill set than bootstrapping a shiny thing. It goes beyond “keeping the lights on.” You start seeing how little choices become amplified over time. Did you create scalable and sustainable processes? You see how small bits of technical debt accrue.

“Software engineering” is not just “writing code.” It’s also what happens next.

And lastly, if you hang around for long enough, you get to do something heart-breaking: start killing the product of your hard work. Having to set aside your feelings and objectively decide: it’s time to deprecate this thing, it cannot grow to accommodate the needs of the new generation.

I watched what I built (TPSGenerator, Amazon’s internal load and performance testing platform) evolve from 100 lines of code to 100,000 lines of code over a decade. It went from being a pet project that I worked on during weekends and nights to being a business-critical application that thousands of services at Amazon use every day, maintained by an actual team of engineers. That experience taught me a lot about software evolution.


2009: The seed

My first task at Amazon was testing two new hot features in AWS: Elastic Load Balancer (ELB) and AWS Auto Scaling. You could put EC2 instances behind a load balancer, and if the load grew beyond current capacity, auto-scaling would launch more instances of EC2.

To test ELB, I deployed a simple http-server to a bunch of EC2 instances behind that load balancer. The http-server would respond with a payload that included the name of the EC2 instance. I then wrote a test to hit the load balancer in a loop, keeping a count of how many times each EC2 instance was chosen by the ELB. At the end of the test, I could assert that all the EC2 instances behind the load balancer had been hit, and verify that the load had been spread more or less fairly among the instances.
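For illustration only, here is a minimal Python sketch of that kind of distribution check. It is not the original test: the load balancer is simulated with a random choice instead of a real HTTP call through an ELB, and every name in it (`fake_load_balancer`, `check_distribution`, the instance IDs, the 20% tolerance) is a hypothetical stand-in.

```python
import random
from collections import Counter


def fake_load_balancer(instances, rng):
    """Stand-in for one HTTP request through the load balancer.

    Returns the name of the instance that "served" the request. A real
    test would read that name from the response payload instead.
    """
    return rng.choice(instances)


def check_distribution(instances, send_request, requests=10_000, tolerance=0.2):
    """Hit the balancer in a loop and count how often each instance answers."""
    hits = Counter(send_request() for _ in range(requests))
    # Instances that never served a single request.
    missing = set(instances) - set(hits)
    # With a perfectly fair spread, every instance gets this many requests.
    expected = requests / len(instances)
    # Instances whose count deviates from the fair share by more than `tolerance`.
    unfair = {name: count for name, count in hits.items()
              if abs(count - expected) > tolerance * expected}
    return hits, missing, unfair


instances = ["i-aaa", "i-bbb", "i-ccc", "i-ddd"]  # hypothetical instance names
rng = random.Random(42)  # fixed seed so the run is repeatable
hits, missing, unfair = check_distribution(
    instances, lambda: fake_load_balancer(instances, rng))

assert not missing, f"instances never hit: {missing}"
assert not unfair, f"load spread unevenly: {unfair}"
```

The tolerance is deliberately loose: "more or less fairly" is a statistical claim, so a strict equality check on the counts would fail on a perfectly healthy balancer.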

But how to test auto-scaling? I needed to run that code at high throughput, so that auto-scaling would kick in. And I needed to generate all kinds of interesting traffic spikes to see how Auto Scaling would adapt. I didn’t know it back then, but I was writing a poor man’s load generator, much like JMeter.
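To make "traffic spikes" concrete, the following is a rough Python sketch of a toy load generator that follows a rate profile with one spike. It is an assumption-laden illustration, not how TPSGenerator worked: it is single-threaded, it ignores request latency, and the names `spike_profile` and `run_load` as well as all the numbers are invented for the example.

```python
def spike_profile(second, base_tps=10, peak_tps=100, spike_start=5, spike_len=3):
    """Target transactions per second at a given second of the test.

    Steady base load with one rectangular spike (hypothetical shape).
    """
    in_spike = spike_start <= second < spike_start + spike_len
    return peak_tps if in_spike else base_tps


def run_load(profile, duration_s, send_request, sleep):
    """For each second, send `profile(second)` requests, evenly spaced.

    `send_request` and `sleep` are injected so the loop can be dry-run
    without touching the network or actually waiting.
    """
    sent_per_second = []
    for second in range(duration_s):
        tps = profile(second)
        for _ in range(tps):
            send_request()
            sleep(1.0 / tps)  # naive pacing; ignores how long the request took
        sent_per_second.append(tps)
    return sent_per_second


# Dry run: record requests instead of sending them, and skip real sleeping.
sent = []
plan = run_load(spike_profile, 10, lambda: sent.append(1), sleep=lambda s: None)
print(plan)
```

For a real run, `sleep` would be `time.sleep` and `send_request` would issue an HTTP call. Reaching throughput high enough to trigger scaling would also take many threads or hosts, which this sketch leaves out.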


2012: The Seed Germinates…

I spent a year in AWS and then moved to the Amazon Store side of things. I kept playing around with the code, but mostly for fun.

It wasn’t until a few years later, after I was responsible for a multimillion-dollar outage due to a load bottleneck, that I gave it a shiny new coat, and used it to redeem myself by discovering and preventing bottlenecks the following peak.

That got me thinking: if it was useful for one team, could it be useful to many teams? Those early days it was all about market fit. There were other products out there, both internal and external. What could my product provide that others didn’t? What was the differentiating “wow” feature? I’ve written about these early days in other blogs, so I’ll point you to this story if you are curious about the “greenfield days”. This story today is more about what happened next.


2014: Stem coming out of the ground…

As TPSGenerator became more well known in the company and more people started adopting it, I realized that this was now an entirely different beast. The initial days were: be scrappy, iterate like crazy, throw things at the wall and see what stuck. Listen to every single customer relentlessly. I personally responded to any questions about my product within a couple of hours.

Now though, I had to balance innovation and stability. Innovation is disruptive. Sometimes when you move a thousand miles an hour you make mistakes. That’s an acceptable tradeoff with greenfield, but I now had customers that depended on my product working. A few times I pushed a brainless bug out to production in my quest for delivering new features asap. But now the stakes were higher: I broke hundreds of customers. Because my product integrated natively with Amazon’s CI/CD tooling, when my product didn’t work, teams couldn’t push their new code changes to production. That wasn’t just annoying: it was a security risk.

It felt a bit like the shoemaker’s children go barefoot… for years I had been telling others how to improve their test practices… but how about my own??? I had cut corners to achieve the desired delivery speed, but I eventually had to pay the price. I had accrued technical debt. That’s the endless tension between { Deliver Results + Bias for Action } and { Insist on the Highest Standards } in Amazon’s Leadership Principles.

So I slowed down and focused on building better gates for releasing my own product. I built better telemetry to understand when my product wasn’t working. I carried a pager for a while. I started developing more maturity around operations. I wrote more tests.

Those tests weren’t just for me, and weren’t just to validate that the code I had just written worked right now. They were an insurance policy for when a more junior engineer touched that code years from now, without having the full context that lived in my brain when I initially wrote the code. But with limited time, I ended up writing more integration tests than unit tests, because integration tests covered more code, more quickly. So instead of a test pyramid, I ended up with an ice cream cone pattern (shame shame shame). These integration tests took longer, were more brittle, and found problems too late in the cycle; most should have been unit tests.

All these poor choices weren’t decisions I was deliberately making; they were just things that I was organically gravitating towards in my haste to Deliver Results and in the craziness of exponential customer growth. When you’re driving 100mph, you don’t naturally pause to reflect on where you’re going: you’re spending your energy calculating how you’re going to take the next curve that is coming in 10 seconds, and you’re going to spend as little time at the pit stop as humanly possible.

I eventually course-corrected, but I didn’t entirely fix this: years down the road, I got to see many examples of newer engineers cursing my crappy testing practices, or some of the unexpected side effects of changing this code or that code. These were all invaluable lessons in software engineering you don’t see if you leave a product after v1.

The more customers I got, the more features they wanted. The speed of innovation needed went significantly beyond my individual ability to write code. I simply didn’t have enough hours in the day. So customers volunteered to write the code. That was one way I could scale, so I enthusiastically welcomed contribution after contribution. I still kept a pretty tight leash on the architecture, code style and code quality, painstakingly brainstorming on high level design for hours with the would-be contributor, and then scrutinizing and approving every single code review that touched my product.

I did not realize it back then, but I can see now, there is no free contribution. Even as those engineers were willing to add a cool feature to my product “for free”, I was introducing complexity to the codebase, and I was adding long term operational load to the product. I should have been a little more thoughtful about which features I was adding, and how broadly applicable they were to my general customer-base. Some ended up being niche to a small number of customers, while having outsized operational overhead.


2015: And it’s a full-grown tree!

TPSGenerator had begun as a side project, but it was clear I needed an actual team to support it. So I moved to Amazon’s Developer Tools, which was the centralized organization that owned all the internal source, build, code review, test and deploy tools for the company. TPSGenerator had been a side project back when I was in the Amazon Store, but it was an Amazon-wide tool now and it needed to live in the right org for its future. Products that aren’t aligned with the business priorities of their parent org eventually die off.

I focused on building scalable and sustainable processes.

Part of that was stepping back and empowering others to own the product. It was my baby and I was jokingly referred to as its “benevolent dictator for life,” but I asked myself in earnest: what happens to it if I leave or get hit by a bus? TPSGenerator needed to live a life independently of mine. There were a number of meetings where I bit my tongue: this was our product, not my product. I needed to let the new generation drive its destiny, even when I disagreed with some of the paths taken.

Our customer base was growing exponentially, and with that came exponential operational load. We had to focus on how to achieve linear or sub-linear operational load with exponential customer growth. Every bug that came in was an opportunity to think: how do we prevent a customer from filing a similar bug in the future? Sometimes it was better documentation or better education, sometimes it was more descriptive and clear error messaging, sometimes it was deprecating a low-value confusing or buggy feature. It was also creating a community where customers could help other customers, instead of the oncall in my team being the only one.

I also realized we had given our customers too much freedom without enough guardrails. To be fair, that’s probably one reason the customer adoption was exponential (the product was so flexible that people could make it do whatever they needed), but it also created a mess. TPSGenerator could generate millions of transactions per second with a simple command on a Unix shell, so it was like handing a loaded gun to interns or new hires. I started appreciating the value of opinionated software more and more. Yet if your product has let customers do whatever they wanted for years, and it has thousands of customers, it’s extremely difficult to start baking opinions, best practices and guardrails into it without breaking customers.


2019: It’s time to start killing the tree…

Towards 2019, the architecture started showing some cracks. I am shocked the code lasted for a decade in a place where services get rewritten every couple of years. It became harder and harder to make significant changes without breaking things. As more of the company was moving from running services inside the Amazon prod network to running them on native AWS, we faced a choice: do we evolve the old tooling (which made a lot of assumptions about running in the prod network) to work in native AWS or do we create new tooling?

I hated the idea of deprecating a thing that I had personally worked on in one way or another for a decade. It felt like a piece of my soul. But I needed to think clearly and objectively about the needs of the business and the right thing to do for the company.

Thought leaders own a domain, not a specific tool. TPSGenerator was a way for Amazonians to properly load test their services. Going a step further, proper load testing was a way to ensure Amazon services could gracefully handle peak events and massive amounts of traffic. While I owned TPSGenerator, I actually owned helping Amazon handle peak events. Up-leveling your thinking like this helps you make decisions that are better for the company.

I knew perfectly well some of the decisions I made in terms of architecture in 2010 had not aged well. I made a painful calculation on how much it would cost to evolve the platform vs. how much it would cost to create a new product, while leveraging the knowledge we had acquired by operating the old product. I took a pitch to build a new product to my Senior VP, and secured funding. It was emotional and very personal. But it was time.



2025: Well, guess what, it's still around...

Turns out deprecating something is a LOT of work, who knew?

TPSGenerator still lives today, and it’ll probably still be around for many more years. Fully deprecating it will be a daunting task. But more and more of the company is moving to the new tooling every day.

The day we turn off the lights for TPSGenerator, I will pour a generous shot of a 21-year-old Scotch. Maybe Laphroaig, Lagavulin or Ardbeg. Something rich and peaty, like campfire. I’ll sit in my office and slowly sip it. And reflect on the 15-year journey of that codebase and smile.
