Data Mesh

Vincent Rainardi

Data Architect & Data Engineer

Published: 6 November 2024

For many companies Data Mesh is not an option. It’s too expensive and takes too long. And they don’t have the departmental skills in data engineering or data modelling. Only corporate giants like HSBC, Siemens and Shell have such departmental skills. They operate in something like a hundred countries, so why would they want to centralise anything? Of course they should decentralise. But most companies and organisations are not like that. They work in one building. They don’t have thousands of IT staff, only forty or sixty. They have a small development team, probably five to ten people. Of course you would want to centralise things. You want to build a data mart: a few fact tables with their dimensions. You don’t even have the option of putting in a middle layer (silver layer) like a Data Vault or an EDW. It is too expensive and takes too long.

But then we have the mid-sized companies. Companies who can afford to have a silver layer. Companies with an annual IT budget of over a million dollars. Companies with IT departments of a few hundred people, 50-70 of whom are in the development team. They have quite a few engineers, a few data analysts, some business analysts, some analytics engineers, some data scientists. Some of the business users are “super users”. They can write SQL, they can create their own reports in Tableau and they understand data models. The business users understand their business much better than IT does. So rather than going back and forth to the IT department asking for this and that (which, by the way, always takes too long, like months), they ask IT to give them the data, and they do the analysis and reporting themselves.

Decentralise

In that situation it makes sense to create data products. It makes sense to decentralise. Decentralise the data modelling. Decentralise the reporting function. Decentralise the analysis function (both business analysis and data analysis). IT supplies the data. The data is delivered via direct access to the database, or through an API, or via a web UI. With the web UI, users go to the Intranet and browse the data catalog there to order the data (they specify some parameters, like months and product categories or account types), then click Download. Within seconds the data file arrives on the user’s desktop in CSV format. They can double-click it to open it in Excel.
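The "order the data and click Download" flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the column names (`month`, `product_category`, `amount`) and the function `order_data` are all hypothetical, and a real data catalog would read from the database rather than from an in-memory list.

```python
import csv
import io

# Hypothetical rows standing in for a data product that IT maintains.
SALES_ROWS = [
    {"month": "2024-09", "product_category": "Bikes", "amount": 1200.0},
    {"month": "2024-09", "product_category": "Helmets", "amount": 300.0},
    {"month": "2024-10", "product_category": "Bikes", "amount": 1500.0},
    {"month": "2024-10", "product_category": "Helmets", "amount": 250.0},
]


def order_data(rows, months, product_categories):
    """Filter rows by the user's parameters and return them as CSV text."""
    selected = [
        row for row in rows
        if row["month"] in months
        and row["product_category"] in product_categories
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["month", "product_category", "amount"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(selected)
    return buffer.getvalue()


# The user picks October and both categories, then "clicks Download".
csv_text = order_data(
    SALES_ROWS,
    months={"2024-10"},
    product_categories={"Bikes", "Helmets"},
)
print(csv_text)
```

The returned text is what would be saved as the CSV file that the user opens in Excel.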

Data products are also about reports. Again users go to the Intranet and open the reports they need. They set the dropdown lists to the right values (like month, product or account) and click Run. Within seconds the report opens, showing all the numbers they need. On that report there is an Export button. Users can click it to export all the data on the screen into Excel.

Data products are also about the data model. Super users can connect to a database. They can see the views there. They can join the views and query them using SQL. They can connect via ODBC from Power BI or Looker, specify the SQL and get the data straight into their reports. They can connect via ODBC from Excel and Access, and get the data into their financial models. Or their sales models. Or inventory models. Whatever business scenarios they are working on, they can get the data straight from this database.

Some companies call this database a Data Hub. Some call it Snowflake (because it’s on Snowflake). Some call it a Data Warehouse.

In some companies this database is a Kimball star schema. In other companies it consists of relational tables. They call it Third Normal Form (3NF), but some of the tables are only in first or second normal form. Still, it is quite simple to join the tables. Users know that they need to look for column names with the word “key” or “ID” in them, usually the first column. Those are the joining columns.
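As a rough illustration of joining on the “key” columns, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the real database. The table and column names (`dim_product`, `fact_sales`, `product_key`) are hypothetical, and a super user would more likely run the same SQL through ODBC against views rather than against an in-memory database.

```python
import sqlite3

# In-memory database standing in for the Data Hub / warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT
    );
    CREATE TABLE fact_sales (
        sales_id INTEGER PRIMARY KEY,
        product_key INTEGER,
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Bike'), (2, 'Helmet');
    INSERT INTO fact_sales VALUES (10, 1, 1200.0), (11, 1, 300.0), (12, 2, 50.0);
    """
)


def sales_by_product(connection):
    """Join the fact table to the dimension on the shared key column."""
    query = """
        SELECT p.product_name, SUM(f.amount) AS total_amount
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.product_name
        ORDER BY p.product_name
    """
    return connection.execute(query).fetchall()


result = sales_by_product(conn)
print(result)
```

The joining column is the one with “key” in its name on both sides, which is the convention the users rely on.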

Centralise the data engineering

In those companies, although they decentralise the analysis function, the reporting function and the data modelling function, they still centralise the data engineering function. Bringing in the data from outside the company, jumping through the hoops of SFTP and firewalls, scheduling data refreshes and running automated data quality checks: those are all done centrally within IT. Not in sales, not in finance, not on the investment desks, but in IT. IT has something called Datadog, which monitors the data platform in both AWS and Azure. IT has Fivetran (which they brought in to replace MOVEit), which does all the data movement from outside and inside the company. IT has Control-M, which monitors every single data job in the company. And there are hundreds of jobs running every day. Whenever a job fails, IT gets an alert, and they go and fix it.

Those tasks are centralised in IT. Business users don’t want to deal with them. They want to focus on the business analysis, data analysis and reporting. They are not equipped to do these tasks the way IT is. IT is the right place. IT has data engineers. IT has security engineers. IT has software engineers. And they have testers too. Business users don’t have all that.

Data Mesh

Is that Data Mesh? Strictly speaking, no. In Data Mesh, data products are developed by individual departments. Or divisions. Well, if you are a corporate giant where every division has its own 200-person IT department, yes, by all means decentralise the data engineering. Imagine that you are Siemens. You have an aerospace business, an electronics business, mining, life sciences, transportation, water, and you also make wind farms. It makes sense to decentralise the data engineering. Every division creates its own data warehouse. Every division does its own data ingestion, data modelling and data monitoring. Because you have so many divisions! Each division operates in multiple countries. You have the IT resources (and IT budget!) in every division.

You can afford to do the real Data Mesh. Data products are developed by each IT team (because you have IT teams in different divisions in different countries). The data ownership is distributed. Every division does its own data ingestion, data monitoring and data quality checks. Yep, Datadog, Fivetran and Control-M. The whole thing. You have an IT team in Singapore. You have an IT team in the US. You have an IT team in London. You can afford to distribute the data ownership to different business teams and IT teams in different countries. The aerospace business is very different from the life sciences business. The electronics business is very different from the mining business. It makes sense to decentralise.

The role of the central IT team is to do data provisioning, in a data lake. They provide all the data that every division needs in a data lake on AWS (that is, on S3). And on Azure too (ADLS). And on GCP too (GCS). Yep, they are corporate giants with IT budgets of over a hundred million dollars, perhaps even into the billions. Here’s a guide: the IT budget is 2% to 3% of the annual revenue. You can look up the annual revenue of every corporate giant in its annual report. The annual revenue is in the billions. Like 50 billion or 100 billion dollars. So 2% of that is 1 to 2 billion dollars. Compare that to your own IT budget. So you understand now that they play a different ball game from most companies. They can afford to do Data Mesh. Most companies don’t have that luxury.
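The rule of thumb above is simple arithmetic, and it can be written out as a small Python helper. The function name `it_budget_range` is made up for this sketch, and the 2% and 3% figures are just the guide quoted above, not a measured value for any particular company.

```python
def it_budget_range(annual_revenue_usd, low_percent=2, high_percent=3):
    """Return the (low, high) IT budget estimate in whole dollars.

    Uses integer arithmetic so that the results are exact.
    """
    low = annual_revenue_usd * low_percent // 100
    high = annual_revenue_usd * high_percent // 100
    return low, high


# The two revenue figures used as examples in the text.
for revenue in (50_000_000_000, 100_000_000_000):
    low, high = it_budget_range(revenue)
    print(f"revenue {revenue:,} -> IT budget {low:,} to {high:,}")
```

At 2%, the 50 billion and 100 billion dollar revenues give the 1 and 2 billion dollar budgets mentioned above; at 3% the figures rise to 1.5 and 3 billion.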

Let’s recap.

Now that we have gone through the reality on the ground, let’s end it with a bit of theory. Data Mesh is about: (thanks to Zhamak Dehghani)

1. Data as a product
2. Distributed ownership
3. Domain-oriented
4. Self-service
5. Federated governance

Data products are developed by the teams who best understand the data. And it is domain-oriented. The electronics business has its own data team (and business teams) who develop their own data products. The life sciences business has its own data team (and business teams) who develop their own data products.

The central IT team provides the data in a data lake. All divisions and all business functions have access to that data lake. Sometimes there are multiple data lakes. In multiple cloud platforms too: AWS, Azure, GCP.

And everything is self-service. Each business function shares its data products, which other business functions can access in a self-service way.

The data governance is federated. Some data governance is done centrally, some is decentralised.

But if your IT budget is less than half a million dollars and your development team is only 10-20 people, stick to data warehousing. One warehouse. One IT team. IT builds the warehouse, business users use it.

If your IT budget is over a million dollars and your development team is 50-70 people, decentralise the analysis function, the reporting function and the data modelling function, but still centralise the data engineering function.

If your IT budget is hundreds of millions of dollars and you have multiple divisions/businesses in multiple countries, decentralise everything, and adopt Data Mesh in full. With federated data governance.

Of course in reality it's not a straightforward cut into 3 distinct models like that. The boundaries are not clear cut. You might have to combine a few things and improvise, creating your own pathway in your Data Mesh journey, to suit your unique circumstances.
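Purely as a sketch, the three rules of thumb can be written as a small decision function. Several things here are assumptions rather than anything stated in the recap: the function name `recommend_model` is made up, "hundreds of millions" is taken to mean 100 million dollars or more, team size is left out for brevity, and budgets that fall between the stated bands return an explicit "in between" answer.

```python
def recommend_model(it_budget_usd, multi_division_multi_country=False):
    """Map an IT budget (in dollars) to one of the three models in the recap."""
    # Assumption: "hundreds of millions" is read as >= 100 million dollars.
    if it_budget_usd >= 100_000_000 and multi_division_multi_country:
        return "full data mesh with federated governance"
    if it_budget_usd > 1_000_000:
        return "decentralise analysis, reporting and modelling; centralise data engineering"
    if it_budget_usd < 500_000:
        return "single data warehouse built by one IT team"
    # The recap gives no rule for budgets between 0.5 and 1 million dollars.
    return "in between: combine approaches and improvise"


print(recommend_model(300_000))
print(recommend_model(5_000_000))
print(recommend_model(2_000_000_000, multi_division_multi_country=True))
```

Note that a giant budget without multiple divisions in multiple countries falls back to the middle model here, which is one possible reading of the recap rather than the only one.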

As always I welcome your comments and opinions. And corrections. Thank you.

Lili Marsh

3 months ago

Great article, Vincent! I'm curious how you see Data Area servers evolving in the future and their impact on data exchanges. Do you think they will drastically change current data management practices?

prashant dixit

Data Management & Privacy Consultant

3 months ago

Good read

Eric Marcoux

Customer Data Owner at Michelin

3 months ago

I found your article very interesting, enlightening and your point of view very pragmatic. Thank you very much.

Daniel Liu

SAP HANA | SAP Datasphere | Snowflake Data Cloud

4 months ago

Very informative

Joakim Dalby

Consultant database, BI, data warehouse, data mart, cube, ETL, SQL, analysis, design, development, documentation, test, management, SQL Server, Access, ADP+, Kimball practitioner. JOIN people ON data.

4 months ago

May I add: 6. Data Area servers and some data will be exchanged between them.
