The FinOps Firehose: Stop Drowning, Start Systems Thinking

Summary

The TL;DR is that primitive cloud financial management is no more. We now have an ever-evolving ecosystem of practitioners, frameworks, guidance, thought leadership, and new tooling from cloud service providers like AWS that is making our lives easier.

There is no excuse for an executive, cloud leader, or the like to only start considering FinOps when they're already putting out fires and drowning in unknown cloud spend.

Leaders must approach FinOps with a holistic, systems-thinking mindset. By starting early, and with the whole in mind, you will find opportunities everywhere to improve outcomes; those opportunities are available much earlier than you think and extend into areas of your business you may never have thought of. So, let's stop drowning and start systems thinking.

Introduction

It's Monday morning, your first day back for 2023 after a well-deserved break. The phone rings, and you guessed it, it's not the lottery office. It's the Chief Financial Officer (CFO).

You see, over that joyous festive break that you thoroughly enjoyed, no one was attending to their emails (rightly so), looking at logs, or monitoring application behaviour. Those on call were there for break-glass, in-case-of-emergency situations, and they certainly weren't on call to care about cost.

It seems over the break, costs have exploded...

Can you guess what went wrong?

  • If you were the receiver of this call, or the CFO, the default response may be 'everything': everything went wrong. That's certainly how it feels the first time bill shock happens.

It's akin to drinking from a firehose as teams scramble to undertake a new form of incident response and determine root cause, all while trying to decipher an AWS Cost & Usage Report (CUR) for the first time.

  • If you are a FinOps practitioner or the like, you may have jumped to suggest one of the following common pieces of low-hanging fruit we see in organisations that don't practise FinOps or are within the crawl phase of their maturity:

  1. Overprovisioned and underutilised compute;
  2. Non-production environments that were left running;
  3. Data egress costs;
  4. Forgotten instances and their backups/snapshots, such as AWS Elastic Block Store (EBS) volumes; or
  5. Other provisioned yet forgotten resources, like Elastic IPs (see the sketch after this list).
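
To make the list above concrete, here is a minimal sketch of how items 4 and 5 could be surfaced programmatically. It assumes boto3 with default credentials, a single illustrative region, and no pagination or multi-account handling; treat it as a starting point for triage, not a substitute for the proactive practices discussed below.

```python
# Minimal sketch: flag a few of the common waste items listed above.
# Assumes default AWS credentials; region, pagination and multi-account
# handling are simplified for brevity.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # region is illustrative

# Item 4: unattached EBS volumes ('available' status means nothing is using them).
orphan_volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

# Item 4 (continued): snapshots owned by this account, often left behind by deleted instances.
snapshots = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]

# Item 5: Elastic IPs not associated with anything, which typically still incur a charge.
idle_eips = [
    address for address in ec2.describe_addresses()["Addresses"]
    if "AssociationId" not in address
]

print(f"Unattached EBS volumes: {len(orphan_volumes)}")
print(f"Snapshots owned by this account: {len(snapshots)}")
print(f"Unassociated Elastic IPs: {len(idle_eips)}")
```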

However, there's something all of the above have in common: they are all reactive. Though technically correct, and though one or more is likely a 'cause' (more likely a symptom of something seemingly unrelated) of some portion of the exciting new big bill the CFO is less than chuffed about, the question I pose is: why were they allowed to happen in the first place? We will come back to this.

Reactive to Proactive

So what do I mean when I say they are all reactive? Well, it's simple.

They are all trailing indicators: a service was rendered, a cost was incurred, and collectively you are now drowning in unknowns as the team looks to turn off any of the above taps to curb costs.

Put frankly, the above list is the easy part of doing FinOps. They are, to borrow from Rumsfeld, the 'unknown knowns': things you as an individual or organisation may not know much about, but someone else (the community) does. They are universal, they apply to everyone in the cloud, and they are the most documented elements of implementing FinOps online. They are the quick fixes from which significant initial percentage savings can be achieved once employed. But they aren't forever.

In fact, those 'fixes' are exactly what most third-party tooling (for a pretty penny) helps you minimise, because they are the most commonly occurring issues across organisations as teams navigate moving from fixed on-premises infrastructure to the multitude of options available in the 'cloud'. Errors are bound to happen. But once the quick fixes are exhausted, you are left paying significant sums just to visualise data with those third-party tools, while the true root cause(s) remain undiscovered.

So I ask again: what has really been going wrong?

In my opinion, it's that the whole is not being considered, and that is the reason for this article.

Organisations, consultancies, and the like are targeting the easy wins. They aren't being bold and helping organisations (or themselves) peel back the layers of interdependencies in system processes, existing methods, and the step changes cloud brings, to uncover where processes need improvement, where change management itself needs to change, and where people need support.

Operating in the cloud (multi, hybrid, or otherwise) is complex, and change is hard. To understand where change is required, we must first understand how people, processes, and technology work as a whole across on-premises, cloud, and the related business processes that allow each to operate in unison. This requires organisations to look where they didn't expect to look in order to find opportunities, shifting left any 'reactive' lagging indicator into something that provides more clarity upfront and the data required to make business decisions.

We must switch our thinking from patching problems post mortem with expensive tooling to embedding the necessary rigour upfront to design cost-conscious solutions that are suitable for the cloud.

An example of this is data egress costs. A question organisations should ask themselves is: could we embed a process earlier in system design and architectural approvals that assesses the architecture through a networking lens, a data lens, an XYZ lens, enabling the relevant professionals to judge suitability against given criteria?

In turn, we shift from reactive bill shock to a proactive measure that enables 'best effort' upfront to minimise cost blowouts and applies balanced governance without analysis paralysis. Setting up Amazon CloudWatch alarms and other notifications is not enough; they are just one part of a well-established, holistic set of FinOps processes.
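
For context, this is the kind of reactive guardrail being referred to: a minimal sketch of a CloudWatch estimated-charges alarm, assuming boto3, billing metrics enabled on the account, and an existing SNS topic (the ARN and threshold below are placeholders). It tells you the tap is running; it doesn't tell you why.

```python
# Minimal sketch of a billing alarm: necessary, but nowhere near sufficient.
# Assumes billing metrics are enabled and the SNS topic already exists.
import boto3

# Billing metrics are only published to us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-estimated-charges",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,              # the metric is only updated every few hours
    EvaluationPeriods=1,
    Threshold=10000.0,         # illustrative monthly budget in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:billing-alerts"],  # placeholder ARN
)
```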

Uncovering these opportunities starts by looking past the quick and dirty, taking a look within your organisation, and applying a systems-thinking mindset. It means refreshing your cloud strategy and operating models to ensure they aren't just shelfware and are driving the right message. It means reconsidering the roles and responsibilities of teams and how they interact in approvals, assessments, and validation processes. And it means considering whether each team has been given the necessary context, training, and support to understand its role and responsibility in driving cost-conscious business system development and maintenance in the cloud.

So How Do You Start?

A simple way to do this analysis is to use the Ishikawa (fishbone) diagram and the Five Whys. This will allow you and your organisation to identify causes, not just symptoms, and eliminate issues in the system for good.


Example: Finance Domain Issue (Root Cause elsewhere)

  • Problem Statement: Cloud expenditure for Application A is above budgeted costs
  • Why? The workload's compute and storage are costing more than expected
  • Why? No analysis was done into what would be suitable in the cloud compared to on-premises
  • Why? No one has the necessary skills to do it
  • Why? No one has been trained or hired to support that process, and therefore it is not being undertaken

So, by undertaking this process, which sometimes doesn't require all five whys, we are able to make the following observations:

  1. A process has correctly been established to perform costing of the solution in the cloud prior to deployment as part of non-production validation - Great work!
  2. However, due to not having the necessary skills, the process is not being undertaken, so there is no proactive right-sizing of compute or assessment of overall solution architecture suitability
  3. Therefore, the cost of the overall solution exceeds what was 'allocated', because the correct amount was never allocated in the first place; no one knew what the solution should or shouldn't cost to be cost-optimised and still meet business objectives

This is just one of many examples, but it shows the simplicity and value of asking 'why?' repeatedly. Through one simple example we have determined that, though Finance has only just seen the unexpected budget expenditure, architecture teams are skirting approval processes, and the organisation is missing critical skills and hasn't established the development needed to build them. All of these are crucial to ensuring good governance and success moving forward.

Conclusion

In order to avoid being railroaded by the FinOps Firehose, we must look beyond the easy, the known, and begin to challenge ourselves and our organisations on the 'unknown unknowns': the things we don't know we don't know. And that is OK!

Guidance is plentiful online, but please, it's time to look broader, think big, have a bias for action, and begin to peel back the layers of complexity to actually solve the root causes. FinOps has come so far in the last 12-18 months that we need to demand more from each other in really architecting for success. Make 2023 that year!

Thanks for reading.

#finops #systemsthinking #aws #gcp #azure


This article represents my personal opinion only and not that of my employer.

See other articles I have written relating to my interests in Distributed Ledger Technology (DLT), Distributed Systems, and Public Value here:
