The Total Cost of Cloud
Dennis Jaheruddin
Senior Director AI & Data @ Artefact | Leading AI, Data Engineering, Cloud, and BI for BeNeDACH
Many companies are considering moving part of their Data landscape to the Cloud. However, it is not trivial to estimate the total cost of this strategic decision. This article explains the main costs of running a Big Data landscape in the Cloud. It is written with Azure, AWS and GCP in mind, but extends to any major Cloud.
The article focuses mainly on estimating the annual run rate, but also comments on the transition into the Cloud.
Annual Run Rate
There are many ways to estimate annual Cloud costs; here we will use a top-down, Cloud-agnostic approach. The main line of thought: a certain amount of work that is currently done on-premises needs to be done in the Cloud. For this we need resources (compute, storage, network, other), and regardless of the provider or form, we will always need to pay for these as long as they are used.
Compute Cost
Every compute solution comes with its own cost model, and these fall into two categories:
The first category contains solutions where you have visibility into the underlying hardware, for instance Virtual Machines or Containers. For these solutions you pay hardware-based compute costs (sometimes embedded in license costs) plus the applicable license costs.
The second category contains solutions without visibility into which servers are used under the hood, for instance Functions, or even pure Software as a Service. Providers do not offer anything for free, so the usage fees you pay indirectly cover the hardware expense.
Hardware based compute cost
One advantage of using the Cloud is that you can provision resources more flexibly. As such we should not simply take the on-premises footprint and assume this is what we need in the Cloud. Instead we can safely assume that a fraction of the waste is eliminated, for instance half, and from that derive the amount of Cloud hardware required. For example: if your on-prem servers are running at 60% utilization, we expect a smaller Cloud footprint, realistically achieving 80% effective utilization of Cloud servers.
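To make this concrete, here is a minimal sketch of the rightsizing arithmetic; the utilization figures are the illustrative ones from the example above, not measurements:

```python
# Sketch: estimate the Cloud hardware footprint from on-premises usage.
# The utilization figures are illustrative assumptions, not provider data.

def cloud_footprint(on_prem_servers: int,
                    on_prem_utilization: float,
                    cloud_utilization: float) -> float:
    """Scale the on-premises footprint by the ratio of effective utilizations."""
    useful_work = on_prem_servers * on_prem_utilization
    return useful_work / cloud_utilization

# Example from the text: 100 servers at 60% on-prem, 80% effective in the Cloud.
print(cloud_footprint(100, 0.60, 0.80))  # -> 75.0 equivalent servers
```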
To enable this flexibility, we will not fully commit the exact Cloud usage three years in advance, but we do expect to make a significant commitment. As such we can calculate with discount rates that assume a large fraction of the total spend is committed in advance, and that the average server is committed to for some time, perhaps 2 years.
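As an illustration, a blended rate can be computed as follows; the committed fraction and the discount percentage are assumptions for the sketch, not provider quotes:

```python
# Sketch: blended compute rate when most spend is committed in advance.
# The discount percentage is an illustrative assumption, not a provider quote.

def blended_rate(on_demand_rate: float,
                 committed_fraction: float,
                 commitment_discount: float) -> float:
    """Average rate across committed and on-demand usage."""
    committed = committed_fraction * on_demand_rate * (1 - commitment_discount)
    on_demand = (1 - committed_fraction) * on_demand_rate
    return committed + on_demand

# Assume 80% of spend committed (e.g. ~2-year terms) at a 40% discount.
print(blended_rate(on_demand_rate=1.00, committed_fraction=0.8,
                   commitment_discount=0.40))  # -> 0.68 per on-demand dollar
```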
A comment on comparing with on-premises hardware: on-premises hardware is often depreciated over 3 years, but tends to run for 5 in practice. The additional overhead of on-prem hardware can be of similar magnitude as the depreciation.
Usage based compute cost
If you have a consumption-based architecture, hardware costs are not directly incurred. Instead there will be a different pricing model, typically counting the number of function calls. Compared to a server which is mostly idle, this can lead to savings of up to 99%. However, compared to a VM running at full utilization, a function-based solution like AWS Lambda tends to cost around 10x as much.
This allows for the following method of calculating the cost: determine how much it would cost to run the work on a VM without waste, then multiply by 5-10x. In practice this means that a usage-based model is only cost efficient if the VM utilization for this workload would be below roughly 10%, but of course there may be other reasons to choose this architecture.
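A minimal sketch of this comparison, taking 7.5x as a midpoint of the 5-10x premium mentioned above; the VM price is made up for illustration:

```python
# Sketch: compare usage-based (function) cost against a dedicated VM.
# The 5-10x premium is the article's rule of thumb; prices are assumptions.

def function_vs_vm(vm_monthly_cost: float,
                   workload_utilization: float,
                   usage_premium: float = 7.5) -> dict:
    """Cost of the same work on a VM vs. a usage-based service."""
    work_at_full_utilization = vm_monthly_cost * workload_utilization
    usage_based_cost = work_at_full_utilization * usage_premium
    return {
        "vm_cost": vm_monthly_cost,           # you pay for idle time too
        "usage_based_cost": usage_based_cost  # you pay only per invocation
    }

# Below ~10% utilization the usage-based model tends to win:
print(function_vs_vm(vm_monthly_cost=100, workload_utilization=0.05))
# -> {'vm_cost': 100, 'usage_based_cost': 37.5}
```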
Storage Cost
Storing large volumes of data in the Cloud can be attractive and is likely cheaper than on-premises when leveraging native object storage, but storing data on disk in the Cloud can be very expensive. As such we will assume that only a fraction of data is kept on disk, the remainder in object storage, and a significant part of that perhaps even in cold object storage. For convenience, we introduce the term 'a day of data', obtained by dividing the total data volume by the average age of the data. So having 365TB of data with an average age of a year would define a day of data as 1TB.
Though older solutions may require you to keep more data on disk, modern Big Data solutions typically only require a fraction of your data to be stored on this expensive medium. Please look carefully at your solution design, but as a rule of thumb we may assume a total disk volume of 7-14 days of data. Data often needs to be backed up; a conservative estimate is that backup storage needs to be 100% of the data storage.
All data that does not need to be stored on hot disks can be stored in the much more economical object storage. In practice we should not expect a perfect separation between hot and warm storage, so a fraction of the data on disk will also live in the object store, perhaps half. If historical volumes are limited, or historical data is accessed frequently, it can make sense to only use regular object storage. Otherwise we can decide to put the first 60 or so days in regular object storage, and the remainder in cold object storage.
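Putting these storage rules of thumb together, a rough sketch could look as follows; the per-TB prices are placeholder assumptions, not provider list prices:

```python
# Sketch: split total data volume across storage tiers using the article's
# rules of thumb. Prices per TB per month are illustrative assumptions.

def monthly_storage_cost(total_tb: float, avg_age_days: float,
                         disk_days: float = 10,      # 7-14 days of data on disk
                         overlap: float = 0.5,       # half of disk data also in object store
                         warm_days: float = 60,      # first ~60 days in regular object storage
                         backup_factor: float = 1.0, # backups sized at 100% of disk
                         price_disk: float = 100,
                         price_warm: float = 20,
                         price_cold: float = 5) -> float:
    day_of_data = total_tb / avg_age_days  # e.g. 365 TB / 365 days = 1 TB
    disk_tb = disk_days * day_of_data * (1 + backup_factor)
    warm_tb = warm_days * day_of_data + overlap * disk_days * day_of_data
    cold_tb = max(total_tb - warm_days * day_of_data, 0)
    return disk_tb * price_disk + warm_tb * price_warm + cold_tb * price_cold

# 365 TB with an average age of one year -> one 'day of data' is 1 TB.
print(monthly_storage_cost(total_tb=365, avg_age_days=365))  # -> 4825.0
```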
Network Cost
Network costs have changed over time, but currently the main Cloud providers share a model where data ingestion is free and data transfer out of a region is paid. This applies both to data transferred out of the Cloud and to data transferred between regions.
The data that moves out of the Cloud is the easiest to understand: any solution that lives on-premises or in a different Cloud may need to receive data from the Cloud, and for this we can calculate the Cloud egress cost. Note that transferring a large quantity of data over the internet may also require a dedicated internet or dark fiber connection.
Data that moves within the Cloud is harder to estimate. First of all, different parts of the company may be working in different regions and end up sending data across. Secondly, a single team may have a backup or DR setup that needs to stay synchronized. Thirdly, the solution itself may be built across availability zones, for better redundancy or simply lower latency. All of these cause data to be transferred across regions or zones. Finally, it is worth noting that a bucket, for instance on S3, configured to keep copies in multiple regions adds a regional transfer per additional region. So a bucket that keeps data in three places already incurs two transfers.
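As a small sketch of the replication arithmetic; the per-TB transfer price is an assumption for illustration:

```python
# Sketch: cross-region transfer volume for a replicated bucket.
# The per-TB transfer price is an illustrative assumption.

def replication_transfer_cost(monthly_new_tb: float,
                              regions: int,
                              price_per_tb: float = 20.0) -> float:
    """A bucket replicated across N regions incurs N-1 cross-region transfers."""
    transfers = regions - 1
    return monthly_new_tb * transfers * price_per_tb

# Keeping data in three regions means every new TB is transferred twice.
print(replication_transfer_cost(monthly_new_tb=10, regions=3))  # -> 400.0
```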
License Cost
Especially for the hardware-based infrastructure, software is required. This can be software from the Cloud provider or from a different software vendor. For smaller-scale solutions the software costs can easily outweigh the hardware costs, but for Big Data solutions a reasonable estimate is that software will cost approximately half of the hardware infrastructure.
Other Cloud Cost
The above are the main costs of running in the Cloud. However, for completeness it is worth mentioning some other costs as well. Think of user licenses, network costs, API costs and all the other little things. We may expect these to add up to 10-15% for a Big Data platform, but if you operate on a smaller scale this percentage can come out higher.
It is also worth noting that this method assumes an optimized architecture that is used reasonably well. In practice the transition to an optimized architecture may never be fully completed, so it is worth adding perhaps another 15% on top of the estimated budget for expected inefficiencies.
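To tie the categories together, a rough roll-up could look like this; the input amounts are placeholders, while the ratios follow the rules of thumb above:

```python
# Sketch: roll the categories above into one annual run-rate estimate.
# All ratios follow the article's rules of thumb; input amounts are assumptions.

def annual_run_rate(compute: float, storage: float, network: float,
                    license_ratio: float = 0.5,  # licenses ~ half of hardware
                    other_ratio: float = 0.125,  # 10-15% for the little things
                    inefficiency: float = 0.15) -> float:
    licenses = license_ratio * compute
    subtotal = compute + storage + network + licenses
    subtotal *= 1 + other_ratio           # user licenses, API costs, etc.
    return subtotal * (1 + inefficiency)  # architecture rarely stays optimal

# Illustrative inputs (annual, in your currency of choice):
print(round(annual_run_rate(compute=400_000, storage=100_000,
                            network=50_000)))  # -> 970312
```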
People Cost
As you can read in any article written by a Cloud provider, significant savings can be achieved because it takes fewer people to manage a similar footprint. However, it is not realistic to simply eliminate the full people costs in a TCO analysis. Hence we will look into three categories of people (a worked sketch follows the list):
- People handling physical hardware: Here one indeed would expect 100% savings.
- People managing lifecycle, patching, automation: This still needs to be done, but their workload should be less, perhaps we can expect 50% savings.
- People developing the application and integrations: Here no change is expected.
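A minimal sketch of the resulting savings, assuming an illustrative team composition:

```python
# Sketch: expected people savings per the three categories above.
# Headcounts are illustrative assumptions; savings rates come from the text.

team = {
    "physical_hardware":  {"fte": 2,  "savings": 1.0},  # fully eliminated
    "lifecycle_patching": {"fte": 4,  "savings": 0.5},  # workload roughly halved
    "application_dev":    {"fte": 10, "savings": 0.0},  # unchanged
}

saved = sum(role["fte"] * role["savings"] for role in team.values())
total = sum(role["fte"] for role in team.values())
print(f"{saved} of {total} FTE saved ({saved / total:.0%})")  # -> 4.0 of 16 FTE (25%)
```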
It is worth noting that both in the Cloud and on-premises the integration costs can be significant, and using an integrated platform can help keep people costs under control.
Transitioning to Cloud
The annual run rate provides decent insight into the long-term costs, but this is only relevant once you have made the transition. As such we should not forget to account for the transition itself. The main driver of the transition cost is its length.
Duplicate Footprint Cost
In general we cannot phase out the existing hardware and licenses before something new has been developed. As such, duplicate budgets will be needed for the duration of the transition. It may be possible to negotiate on this, especially with the party you are moving to, but carrying a double burden for too long will still not be pleasant.
In addition, we may expect duplicate efforts from maintenance and management teams while the transition is in progress; it may be good to temporarily strengthen the team to avoid problems.
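As a rough sketch, assuming Cloud spend ramps up linearly while the on-premises bill stays unchanged until cut-over:

```python
# Sketch: extra budget carried while old and new platforms run side by side.
# The linear ramp-up of Cloud spend is an assumption for illustration.

def duplicate_footprint_cost(on_prem_monthly: float,
                             cloud_monthly_target: float,
                             transition_months: int) -> float:
    """On-prem runs at full cost; Cloud ramps linearly from 0 to target."""
    extra = 0.0
    for month in range(1, transition_months + 1):
        cloud_spend = cloud_monthly_target * month / transition_months
        extra += cloud_spend  # spend on top of the unchanged on-prem bill
    return extra

# An 18-month transition towards a 100k/month Cloud target:
print(duplicate_footprint_cost(100_000, 100_000, 18))  # -> 950000.0
```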
Development Cost
The estimate of the annual run rate assumes that the architecture is designed efficiently and implemented well. Otherwise it is very easy to run into double the annual costs, or more, for just the Cloud infrastructure.
For fully new use cases that do not have any interaction with existing data or processes, the impact is light and will mostly consist of people learning how to develop the solution right. Depending on how different your Cloud solution is from what is used on-premises, it can be worth bringing in expertise in this area.
An additional point is that existing solutions may need to be re-architected to become Cloud native. The cost of simply rewriting solutions may already exceed the cost of running the infrastructure for a year, not to mention the time it takes, which could have been spent on valuable use cases. Which brings us to the final point.
Opportunity Cost
The Cloud landscape is beautiful, but moving towards it and redesigning towards an efficient architecture is a significant effort. There are two main ways to mitigate this:
- Choose a solution that resembles what you have on-premises, ideally one that is the same in any Cloud
- Bring in resources to assist your development and automation teams, to minimize the time during which no valuable use cases can be added.
Conclusion
The costs of Cloud are not always easy to understand, but by evaluating the above categories it is possible to know what it will cost to run in the Cloud. It also becomes easy to validate whether suppliers are presenting you with a complete story. If your costs come out higher than what you currently spend on-premises, it is likely sensible to look into a Hybrid solution.
The transition will come at a cost, but by choosing a solution that does not diverge too much from the current landscape, and by bringing in resources to set a good pace, this can be minimized.
Finally, let us not forget why Big Data platforms exist in the first place: To support the business and generate value far beyond the cost of any implementation. We should of course be cost conscious, but must primarily decide on the right path to greater business value.