Reducing cloud costs - what worked for us
Background
In the last 2 years I have worked on a number of COGS (cost of goods sold) reduction initiatives (aka reducing cloud cost) for our cloud services. In discussions with friends in the industry I have seen time and again that many find the topic interesting and intriguing.
The purpose of this post is to share the key learnings with others.
Note:
1) Given that this is a public post, I haven't gone into details specific to my organization. Instead I have tried to distill the learnings into principles and ideas that are more broadly applicable and useful to most people.
2) The work we did and the success we achieved were a team effort by a group of highly committed and capable individuals spread across multiple geos. No one "star performer" can show the stamina and achieve the results that a motivated team can.
Without further ado let's dive right in.
Two approaches to saving costs
There are two: reducing the cost of what already exists, and being cost-optimal from the beginning. We explore both categories in turn.
Principle 1: You can't change what you can't see. Getting to the source of truth.
The first and foremost thing to get in place, if you don't have it already, is dashboards that show consolidated monthly costs of services and components based on real data. It should be possible to drill down and see the top components contributing to the cost of a service. It should also be possible to filter by geo and deployment type (prod, staging, etc.).
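To make the drill-down idea concrete, here is a minimal sketch of the kind of rollup such a dashboard would sit on. The `CostRecord` fields, the function name and the sample numbers are all assumptions made for illustration; real billing exports will look different.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class CostRecord(NamedTuple):
    # Hypothetical shape of one billing line item (an assumption, not a real export format).
    month: str        # e.g. "2024-01"
    service: str
    component: str
    geo: str
    deployment: str   # e.g. "prod", "staging"
    cost_usd: float


def monthly_cost_by_component(
    records: Iterable[CostRecord],
    service: str,
    month: str,
    geo: Optional[str] = None,
    deployment: Optional[str] = None,
) -> list[tuple[str, float]]:
    """Drill into one service: total cost per component, highest first."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        if r.service != service or r.month != month:
            continue
        # Optional filters: None means "do not filter on this dimension".
        if geo is not None and r.geo != geo:
            continue
        if deployment is not None and r.deployment != deployment:
            continue
        totals[r.component] += r.cost_usd
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample data, purely for illustration.
records = [
    CostRecord("2024-01", "search", "database", "us", "prod", 400.0),
    CostRecord("2024-01", "search", "database", "eu", "prod", 250.0),
    CostRecord("2024-01", "search", "api-gateway", "us", "prod", 300.0),
    CostRecord("2024-01", "search", "api-gateway", "us", "staging", 50.0),
]

print(monthly_cost_by_component(records, "search", "2024-01"))
print(monthly_cost_by_component(records, "search", "2024-01", geo="us", deployment="prod"))
```

The first call shows the whole service; the second narrows the same view to one geo and one deployment type, which is the filtering described above.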
Note: It's common to share infra across services, in which case there needs to be a clearly defined, proportional and transparent model for attributing the cost of that infra to its consumers.
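One simple way to do proportional attribution is to split the shared bill by each consumer's share of some usage metric. The sketch below assumes a single metric (requests, CPU hours, whatever fits the infra); the function name, consumer names and numbers are hypothetical.

```python
def attribute_shared_cost(
    total_cost: float, usage_by_consumer: dict[str, float]
) -> dict[str, float]:
    """Split a shared bill in proportion to each consumer's usage."""
    total_usage = sum(usage_by_consumer.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive to attribute cost")
    return {
        consumer: total_cost * usage / total_usage
        for consumer, usage in usage_by_consumer.items()
    }


# Made-up example: a 1000 USD shared cluster used by three services.
shares = attribute_shared_cost(1000.0, {"search": 600, "billing": 300, "reports": 100})
print(shares)
```

Whatever metric you pick, the point is that the consumers can see the formula and reproduce their own number.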
Principle 2: Make the dashboard public. The power of gamification.
When everyone (within the org) can see what everyone else is contributing to the COGS, it unleashes a very powerful dynamic: gamification. A scoreboard that everyone can see motivates you to play your 'A' game. The satisfaction of seeing the results of your work on the dashboard (costs going down) is immense.
Principle 3: Choose your battles wisely. Don't sweat the small stuff.
Focus on the top 3-5 contributors to cost at any given point, based on what the dashboards show. It's easy to get sucked into areas which are exciting to pursue but yield low returns. This is where the dashboard acts as a compass and points you toward what to go after for maximum ROI.
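A small sketch of picking the top contributors, assuming you already have a cost total per component. The function name and the numbers are illustrative assumptions.

```python
def top_contributors(
    cost_by_component: dict[str, float], n: int = 5
) -> list[tuple[str, float, float]]:
    """Return the n costliest components as (name, cost, share of total cost)."""
    total = sum(cost_by_component.values())
    if total <= 0:
        return []
    ranked = sorted(cost_by_component.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, cost, cost / total) for name, cost in ranked[:n]]


# Made-up monthly costs in USD.
costs = {"database": 500.0, "api-gateway": 300.0, "logs": 150.0, "functions": 50.0}
for name, cost, share in top_contributors(costs, n=3):
    print(f"{name}: {cost:.0f} USD ({share:.0%} of total)")
```

Seeing the share next to the absolute cost makes it obvious when the long tail is not worth the effort.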
Principle 4: Data driven decisions.
Savings come primarily in two ways: down-sizing or right-sizing resources, or eliminating them altogether. Either decision is hard unless you have the right data to guide you.
To right-size you need to understand the usage of existing resources, historical and current traffic volumes, and expected growth. This is a lot of data to gather and analyze, but without it it's impossible to make meaningful decisions.
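As a rough illustration of how those inputs combine, here is a back-of-the-envelope sizing sketch. It assumes capacity scales linearly with node count, growth compounds annually, and you keep a fixed headroom on top of the projected peak. The function, its parameters and the numbers are assumptions, not the model we used.

```python
import math


def required_nodes(
    peak_rps: float,
    rps_per_node: float,
    annual_growth: float,
    months_ahead: int,
    headroom: float = 0.2,
) -> int:
    """Estimate the node count needed for the projected peak plus headroom."""
    if rps_per_node <= 0:
        raise ValueError("rps_per_node must be positive")
    # Compound the observed peak forward by the expected growth rate.
    projected_peak = peak_rps * (1 + annual_growth) ** (months_ahead / 12)
    # Add safety headroom, then round up to whole nodes (never below one).
    return max(1, math.ceil(projected_peak * (1 + headroom) / rps_per_node))


# Made-up inputs: 900 req/s observed peak, 200 req/s per node,
# 50% expected yearly growth, planning 12 months ahead, 20% headroom.
needed = required_nodes(900, 200, annual_growth=0.5, months_ahead=12)
print(needed)
```

Comparing a number like this with what is actually provisioned is usually where the first savings show up.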
To deprecate features you again need data on active use by paying customers. This can be an eye-opening exercise because it shows you how customers are using your product relative to the cost you are incurring.
Fun fact: When we did this exercise we found (for the first time) that a "cool" feature we had built was costing us $12 per API call and there was hardly any active use of it.
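The arithmetic behind a finding like this is just cost divided by usage. Below is a minimal sketch that flags features whose cost per call is above a threshold; the feature names, costs, call counts and threshold are made up for illustration and are not our real data.

```python
from typing import Optional


def cost_per_call(monthly_cost_usd: float, monthly_calls: int) -> Optional[float]:
    """Cost per API call, or None when the feature had no calls at all."""
    if monthly_calls == 0:
        return None
    return monthly_cost_usd / monthly_calls


def flag_costly_features(
    features: dict[str, tuple[float, int]], max_cost_per_call: float
) -> list[str]:
    """Names of features that are unused or exceed the cost-per-call threshold."""
    flagged = []
    for name, (cost, calls) in features.items():
        cpc = cost_per_call(cost, calls)
        if cpc is None or cpc > max_cost_per_call:
            flagged.append(name)
    return flagged


# Hypothetical features: (monthly cost in USD, monthly API calls).
features = {
    "export": (1200.0, 100),        # 12 USD per call
    "search": (3000.0, 1_500_000),  # 0.002 USD per call
}
print(flag_costly_features(features, max_cost_per_call=1.0))
```

A feature with a high bill and a tiny call count stands out immediately once the two numbers sit next to each other.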
Principle 5: No one size fits all. A horse for a course.
The topic of right-sizing and optimizing is vast and probably deserves a post of its own. Some of the things we did include: reducing nodes in clusters, changing plans, changing the type of nodes, reserving resources, negotiating pricing with vendors based on actual usage, tuning the number of always-on functions, replacing costly databases with cheaper ones, managing log volumes and retention, eliminating unused features, sharing infrastructure where it made sense for better utilization, redesigning, ...
Tip: Start with the assumption that you have over-engineered and there is resource wastage, and you will most likely find it.
Principle 6: Not always a zero sum game. It can be a win win.
It's natural to think that by reducing costs we may lose some other desirable attribute. While such tradeoffs do exist, that's not always the case. There is a chance that you can reduce COGS and get better in the process. A case in point is an API where we reduced the COGS (of the API gateway) by 90% and improved the performance of a customer script that uses the API by 6x.
So far we have looked at how to reduce the COGS of what already exists. It is equally important to learn how to be cost-optimal from the beginning.
Principle 7: Prevention is better than cure. The easiest way to fix a problem is by not creating it.
COGS review is now a critical part of our design review process. COGS estimation and justification is something we do even before writing the first line of code. In the review process, COGS gets the same weight as the functional requirements. If we can't develop a feature in a cost-effective way, we would rather not do it.
Architects debating the cost aspects of design with passion is a sight to behold and music to the ears. Especially if you have gone through the grind of controlling the costs of an existing service in production with active customers.
This brings us to the end of what I had to share and I think that's quite a lot to digest and assimilate.
I will end the post with an insightful comment one of our execs made [paraphrased]:
"Optimizing costs is being respectful of the customer because costs get passed to the customer eventually."
~S~
#workstories #cloudcost