ETL vs Eventual Connectivity
ETL has been the standard way of importing data for a long time. Why change something that works? Because it doesn’t work anymore. There is a point where ETL does not scale. I don’t mean processing- and speed-wise; I mean that ETL makes it very hard to blend data from many systems. ETL works off several premises that do not hold in reality:
- ETL assumes that data is clean enough to blend perfectly.
- ETL requires upfront planning of how systems join to one another, and it typically results in a point-to-point architecture.
- ETL assumes that identifying how systems join onto one another is a simple task.
- When adding a new system to your data hub, the architects need to go back to the drawing board to figure out how that new source blends with all the other sources.
- ETL was typically designed with the Data Warehouse, i.e. a relational database, as the end goal.
Speaking from experience, if I were pulled onto a data project to blend data from many systems and my manager told me that there were 50 systems to connect, I would be seriously worried about how many months or years it would take to work out the modelling before we could even start ingesting data. However, there is light at the end of the tunnel.
If this light is so obvious, why was everyone not doing it in the past?
Easy. For the last 30 years, we have tried to solve all our data problems with the relational database. Over the last 10 years, we have thrown document databases into the mix to help solve certain problems. But how many companies have adopted what I think is one of the more important database families, the Graph?
The Graph database was not something readily available off the shelf, ready for many to easily pull into their solution. In today's market, you do have a good choice of graph databases available. How does this relate? Although graph databases need modelling, they don’t need it “as much” as the relational database does. Joins between “tables” or “nodes” in the Graph are not expensive calls; you might join across hundreds of nodes or tables and get a response back in under 100 ms. This is important because it is a critical component of the new wave of ETL that we like to call “Eventual Connectivity”.
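To make the “cheap joins” point concrete, here is a minimal Python sketch. It uses networkx purely as a stand-in for any graph store (Neo4j, TinkerPop, etc.), and the node names are invented for illustration: each hop in a traversal is just an adjacency lookup rather than a table join, which is why chaining many hops stays fast.

```python
# Minimal sketch: multi-hop "joins" in a graph are adjacency lookups.
# networkx is used here only as a stand-in for a real graph database.
import networkx as nx

g = nx.Graph()
g.add_edge("customer:42", "order:1001")        # customer placed an order
g.add_edge("order:1001", "product:sku-77")     # order contains a product
g.add_edge("product:sku-77", "supplier:acme")  # product sourced from a supplier

# Traversing customer -> order -> product -> supplier is three adjacency
# lookups, regardless of how many other nodes exist in the graph.
path = nx.shortest_path(g, "customer:42", "supplier:acme")
print(path)  # ['customer:42', 'order:1001', 'product:sku-77', 'supplier:acme']
```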
Eventual Connectivity is the idea that we do not join and blend data on entry or load into another system. Rather, we load all of the data first, and along the way we flag each record with “edges” and metadata that capture unique or fuzzy references to that object. When two records notice that they have an “edge” with the same unique code, they merge or build a relationship in the graph.
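As a rough illustration of that flow, here is a minimal Python sketch. The record shapes and the emit_edges helper are hypothetical, not any particular product’s API: records are loaded as-is, tagged with the reference codes they carry, and only afterwards do records sharing a code get linked.

```python
# Minimal sketch of "eventual connectivity": load first, link later.
from collections import defaultdict

# Records arrive from different systems; we don't join them at load time.
records = [
    {"source": "crm",     "id": "c-1", "email": "ada@example.com", "name": "Ada"},
    {"source": "billing", "id": "b-9", "email": "ada@example.com", "plan": "pro"},
    {"source": "support", "id": "s-3", "email": "bob@example.com", "name": "Bob"},
]

def emit_edges(record):
    """Flag the record with the unique/fuzzy reference codes it carries."""
    edges = []
    if "email" in record:
        edges.append(("email", record["email"].lower()))
    return edges

# Load everything, tagging each record with its edge codes as we go.
index = defaultdict(list)
for record in records:
    for edge in emit_edges(record):
        index[edge].append(record)

# Eventually, records that share an edge code merge or build a relationship.
for edge, linked in index.items():
    if len(linked) > 1:
        print(edge, "connects", [r["id"] for r in linked])
# ('email', 'ada@example.com') connects ['c-1', 'b-9']
```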
So why is this the better approach?
(I will just preface this by saying that I am a big advocate of explaining that there is often no better or worse, only better for certain occasions. This is one of those moments where I can’t actually see a good reason to use ETL; I believe ETL is not needed as long as you have the right infrastructure to support this new Eventual Connectivity approach.)
- If we had to blend data from over 500 systems, we could easily take one system at a time and not care about the other systems. In fact, with this approach, we can take one table inside a database at a time; we don’t need to know how that table joins onto other tables at all.
- This helps with discovering denormalised references and relationships between data.
- Bringing on new systems does not require us to go back to the drawing board.
- The modelling ends up in the Graph Database, and hence we are not restricting ourselves to a single consumer of this data, e.g. a Data Warehouse, Machine Learning, etc.
- It will be very clear which data did not blend well, i.e. those records will have floating “edges”. This makes it much easier to identify why they did not blend (see the sketch after this list).
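Here is a minimal, self-contained Python sketch (illustrative names only, not a real product API) of how those floating edges surface: any edge code seen on exactly one record has nothing to connect to, so it can be flagged and inspected.

```python
# Minimal sketch: edge codes that appear on only one record are "floating",
# which makes the records that failed to blend easy to list and inspect.
from collections import defaultdict

edge_index = defaultdict(list)
tagged_records = [
    ("crm:c-1",     [("email", "ada@example.com")]),
    ("billing:b-9", [("email", "ada@example.com")]),
    ("support:s-3", [("email", "bob@example.com")]),  # no counterpart yet
]
for record_id, edges in tagged_records:
    for edge in edges:
        edge_index[edge].append(record_id)

for edge, record_ids in edge_index.items():
    if len(record_ids) == 1:
        print("floating edge", edge, "on", record_ids[0])
# floating edge ('email', 'bob@example.com') on support:s-3
```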
Let’s also list the reasons why Eventual Connectivity is bad:
- During ingestion of data, you might be wondering why there are so many duplicates! That is because the merging usually happens only after ingestion has finished.
- You might feel like you don’t have control over how things join onto each other. (In reality, that is usually because you missed it in your architecture sessions.)
- It is much more of a “spray and pray” type of method. However, time after time, we have proven that this approach yields significantly better results in blending data.
There is an analogy I like to use that helps me understand the different approaches very well.
Imagine you are reading a book. On page 14, someone asks you to tell them how all the characters in the book are related to each other. Well, knowing the twists and turns that happen in most books, you would most likely tell them “I don’t know”. This is what we are doing with ETL.
Now imagine another situation. Instead, you answer, “I am going to finish the book first, marking relationships along the way; at the end, I will put all the relationships into a big ‘network’ or ‘map’, and then I can answer this question.” You are now describing Eventual Connectivity. Then imagine you start reading a second book with similar characters. You already have a map, and most likely, when you run through the same process, the new characters will connect to existing characters.
So you might be able to see why people believe the ETL process is flawed in its design, let alone its implementation. It has, however, been a great part of organisations’ path to building a Data Foundation. I can take the learnings from that and bring them into a modern, scalable process: CluedIn.com
Please comment if you disagree or I have missed something out.
Happy to provide more information or a personalised demonstration.