Data Warehousing Basics: Cost

Data Warehousing Basics: Cost

If you call yourself a data engineer you need to be aware of 2 additional things compared to a developer. The first one is NFR (Non Functional Requirements), which I wrote here. The second one is cost.?

The Reason?

The reason you need to understand cost is, as a data engineer you will be designing the solutions, based on the architecture devised by the solution architect. And you need to point out if the design is not fit for purpose. And that can be because of technical reason, or because of costs / timeframe.?

For example, you need to ingest pricing data from Bloomberg every 5 minutes. There were 3 solutions (API, Data Sharing and Iceberg Tables) and your solution architect chose solution 1 because a) it works, b) the cost is acceptable and c) the build time is acceptable. You need to point out to the solution architect if it turns out that a) it doesn’t work, b) the cost is unacceptable/too high and c) it takes way too long to build.?

Now, a) and c) requires sound technical knowledge. But b) requires understanding on cost. At least the basic understanding.?

OK so that’s the background. Now the basic of costing.?

Iceberg Analogy?

The cost of data warehousing is like an iceberg. Only 10% of iceberg is visible above the water. 90% of the iceberg is under the water, because of the density difference between ice and water. DW cost is like that. Only 10% of the cost is clearly visible above the water, whereas 90% of the cost is hidden, not? visible from the ship (unless you dive of course).?Image source: link.

Above the water?

The visible part is the cost of the storage, compute, ELT tool, analytics tool and cloud. Let’s pick an example. Say you’re building a DW on Azure using Fabric, ADF and Power BI. As a developer, you created ingestion pipeline on ADF into OneLake, then into a Kimball mart in Fabric Warehouse. And from there your colleague created Power BI dashboards.?

So the visible cost here is the storage in OneLake, the compute pool behind the Fabric Warehouse (based on capacity unit), ADF costs including data movement, pipeline cost?and orchestration costs, and Power BI (say PPU add-on).?

So, you need to learn about:?

  1. Pricing for OneLake storage ($0.023/GB/month).
  2. Fabric capacity and reservation (F64 is $5002/month with reservation).
  3. ADF cost structure ($0.25/DIU-hour for data movement, $0.005/hour for pipeline activity, $1/1000 runs for orchestration).
  4. Power BI licensing scenarios ($10/user/month for Premium Per User add-on). Here’s the pricing guide: link (not just PBI but Fabric too).?

Then take into account:?

  1. The data volume at rest: say initially 100 GB, growing 200 GB in year 1, 400 GB in year 2, etc.
  2. The data volume going through the pipeline per day: say 10 GB/day in year 1, 20 GB/day in year 2, etc.
  3. Number of developers: say 2 data analysts, 2 data engineers.
  4. Number of users: say 20 users in year 1, 50 users in year 2, etc.?

And voila you get a rough idea on the overall cost of the DW platform you are building, for the next 3 years. Simple right??

Under the water?

No. As they say in the navy: not by a long shot. As I said before, 90% of the cost is not visible to the naked eye. You need to dive into the water to see it. Those under the water costs are:?

  1. DevOps cost
  2. Testing cost
  3. Security cost
  4. Network cost
  5. Environment cost
  6. Resilience cost
  7. Data Quality cost
  8. Data Catalog cost
  9. Access control cost
  10. Administration cost
  11. Support cost
  12. Monitoring cost
  13. Skill cost
  14. Migration cost
  15. Audit cost?

There are 15 costs outside of that original cost. Fifteen! No wonder they say that 90% of the cost is under water. If it’s only one or two we missed it’s understandable. But as you can see, we missed 15 costs! Fifteen!?

So let’s go through them.?

  1. DevOps cost? The cost for setting up code repository, branch policies, deployment pipelines, and connecting those pipelines the Dev, Test, PreProd and Prod environments. In the above case it's the ADF code that we put into ADO (Azure DevOps). If you use PySpark notebooks, dbt, Lambda functions, Informatica or anything else, it’s that code that you need to put into ADO (or Jenkins). ? And yup, this includes the cost of GitHub, Maven and Gradle if you use them. And yup it includes the cost for setting up CI/CD pipelines, including the people cost. And operating all the above of course, not just setting them up. ? It is you the data engineer who are in the best position to understands this DevOps cost you see. Architects or Dev Manager don’t usually know how long it takes to do all the above, both setting up and operating it.
  2. Testing cost? The cost of testing the data warehouse business functionalities and testing the non-functional requirements (NFR). For example, the performance of the a) querying the Fabric Warehouse, b) running Power BI report, and c) loading data into OneLake and Warehousing using ADF. You also need to test the resilience requirements, such as the time to fail over to the secondary server or restoring the back up. ? You also need to test the access control (a user accessing the Power BI dashboard from their laptop/VDI, a data engineer accessing ADF, a data analyst querying the Fabric Warehouse using SQL Query Editor.? You also need to test the logging, the monitoring and the orchestration (for example using Control-M/PowerShell and DataDog).? Don’t forget that this testing will be on going,?not just until the first go live, but also for the subsequent releases. And you also need to include the planning not just the execution of the test. You also need to include the retesting effort, after the bug has been fixed. ? You need to include both the hours, and also the license cost of any testing software you use. Including setting up the test data.
  3. Security cost? In the Fabric Warehouse above, the security cost is the cost of setting up authentication using Microsoft Entra ID, storing and retrieving keys and Service Principals into the Key Vault, creating and securing the Key Vault, and the access tokens issued by the backend capacity platform.? Data stored in Microsoft Fabric is encrypted at rest, and in-transit data between ADF, OneLake, Warehouse and Power BI is encrypted with at least TLS 1.2. There are at least 10 different things with regards to Microsoft Fabric security including Power BI Security, Workspace Identity, Permission Model, Item Sharing,?WAF, NSG, and Telemetry. Please see here for the fundamentals.? Bear in mind that in addition to the cost of setting up all the above security stuff, there is also operation costs. And this includes both the man-hour cost and the software cost.
  4. Network cost? Let’s face it, you are not a network engineer. Nor a security engineer. So you need to get a network engineer to protect in the inbound traffic (link) including the Private Link, the Service Tags and opening TCP port 43 for the listed endpoints:? ? For the outbound you need managed virtual network (link), managed private endpoints and trusted workspace access.? You also need to consider data security, namely guest user sharing and customer lockbox. Here is an overview of the security stuff that you need to consider for Microsoft Fabric: link.
  5. Environment cost: cost of setting up Dev, Test, PreProd and Prod environment.
  6. Resilience cost: ...
  7. Data Quality cost: ...?
  8. ....?

I don’t think I can specify all those 15 points one by one, in detail. It will take me all night. In the above I only wrote 4 and it has taken me almost an hour! But my message is clear: you need to include those things when calculating the warehouse cost and build time. They all cost you money and time.?

Hidden Costs?

And then there are things which are called Hidden Costs or Ghost Cost. These are costs of lost business opportunities, or cost of issues caused NOT having a data warehouse. Nothing to do with the build or development. For example, while we spend a year building the warehouse: ?

  1. Cost of lost business opportunities: the business users are not able to understand the customers properly to initiate cross sales. ?

  1. The cost of customers leaving us and switching to our competitors, because we don’t have that business capability or product offering. ?

  1. The cost of using several technologies with the same capabilities, for example: several DW platforms, several BI tools, several ELT tools. And several teams to support them each.?

  1. Cost of customers data being not integrated, such as we send multiple letters to the same customers.?

  1. Cost of not knowing the total revenue, profit and costs?DAILY (they do know them monthly but not daily).?

  1. And many others.?

Summary?

So that the 3 types of costs related to building a data warehouse:?

  1. Costs which are “above the water”: cost related to development, such as storage, compute, ingestion, analytics (both people and software).?

  1. Costs which are “under the water”: cost not directly related to development, such as cost of DevOps cost, testing cost, security cost, network cost, environment cost, resilience cost, data quality cost, etc.?

  1. Hidden costs or ghost costs: cost not related to the build or development, but cost of lost business opportunities and cost of issues caused NOT having a data warehouse.?

So if you just take away one thing from this article it is this: be aware that, apart from the cost that you can directly see,?there are a lot of other costs.?

Tom Overton

Microsoft Fabric | SQL Server | Integration | Power BI Reporting | Monitoring | Data Mirroring

2 天前

Wow! Great points to consider. Thanks for sparking the conversation.

要查看或添加评论,请登录

Vincent Rainardi的更多文章

  • Data engineer becoming solution architect

    Data engineer becoming solution architect

    Are you a data engineer thinking about transitioning to a cloud solution architect? Data engineer are good with…

    2 条评论
  • Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

    Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

    If you work in banking or investment or any other sector in financial services, you might be wondering about the above.…

  • My Linkedin post & articles

    My Linkedin post & articles

    The list below goes back to Nov 2024. For older than that see here.

    9 条评论
  • Data Warehousing Basics: Single Customer View

    Data Warehousing Basics: Single Customer View

    Imagine that you work for an insurance company who sell health insurance (HI), life insurance (LI), general insurance…

    2 条评论
  • Data Warehousing Basics: NFR

    Data Warehousing Basics: NFR

    What I’m about to tell you today failed a lot of data warehousing projects which is why it’s worth paying attention so…

    1 条评论
  • ML and AI - What's the difference?

    ML and AI - What's the difference?

    Machine Learning covers about 20-30 algorithms such as Logistic Regression, Decision Tree, Gradient Boosting, Random…

    4 条评论
  • Microsoft Fabric or Synapse Analytics?

    Microsoft Fabric or Synapse Analytics?

    When it comes to Data Warehousing, Microsoft is confusing. Why? Because it has Microsoft Fabric and it also has Synapse…

    16 条评论
  • Data Warehousing Basics: Transformations

    Data Warehousing Basics: Transformations

    As Bill Inmon said, T is the most difficult thing to do in the ETL. And that is why ETL vendors swap it around - they…

    2 条评论
  • Data Warehousing Basics: Ingestion

    Data Warehousing Basics: Ingestion

    In data warehousing, ingestion is vital. It brings data from the source systems into the data warehouse.

    4 条评论
  • Data Warehousing Basics: OLAP vs OLTP

    Data Warehousing Basics: OLAP vs OLTP

    Data Warehousing is all about OLAP. It’s about analytics.

    10 条评论