ETL vs Eventual Connectivity
ETL has been the standard way of importing data for a long time. Why change something that works? Because it doesn’t work anymore. There is a point where ETL does not scale. I don’t mean processing- and speed-wise; I mean that ETL makes it very hard to blend data from many systems. ETL works off several premises that do not hold in reality:
- ETL assumes that data is clean enough to blend perfectly.
- ETL requires upfront planning of how systems join to one another, and it typically results in a point-to-point architecture.
- ETL assumes that identifying how systems join onto one another is a simple task.
- When adding a new system to your data hub, the architects need to go back to the drawing board to figure out how that new source blends with all the other sources.
- ETL was typically designed with the Data Warehouse, i.e. a relational database, as the end goal.
Speaking from experience, if I were pulled onto a data project to blend data from many systems and my manager told me that there were 50 systems to connect, I would be seriously worried about how many months or years it would take to work out the modelling before we could even start ingesting data. However, there is light at the end of the tunnel.
If this light is so obvious, why was everyone not doing it in the past?
Easy. For the last 30 years, we have tried to solve all our data problems with the relational database. Over the last 10 years, we have thrown document databases into the mix to help solve certain problems. But how many companies have adopted what I think is one of the more important database families, the Graph?
The Graph database was not something readily available off the shelf, ready for many to easily pull into their solution. In today's market, you do have a good choice of graph databases available. How does this relate? Although graph databases need modelling, they don’t need it “as much” as the relational database does. Joins between “tables” or “nodes” in the Graph are not expensive calls; you might join across hundreds of nodes or tables and get a response back in under 100 ms. This is important because it is a critical component of the new wave of ETL that we like to call “Eventual Connectivity”.
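To make the “cheap joins” point concrete, here is a minimal Python sketch. It uses networkx purely as a stand-in for any graph store (Neo4j, TinkerPop, etc.), and the node names are invented for illustration: each hop in a traversal is just an adjacency lookup rather than a table join, which is why chaining many hops stays fast.

```python
# Minimal sketch: multi-hop "joins" in a graph are adjacency lookups.
# networkx is used here only as a stand-in for a real graph database.
import networkx as nx

g = nx.Graph()
g.add_edge("customer:42", "order:1001")        # customer placed an order
g.add_edge("order:1001", "product:sku-77")     # order contains a product
g.add_edge("product:sku-77", "supplier:acme")  # product sourced from a supplier

# Traversing customer -> order -> product -> supplier is three adjacency
# lookups, regardless of how many other nodes exist in the graph.
path = nx.shortest_path(g, "customer:42", "supplier:acme")
print(path)  # ['customer:42', 'order:1001', 'product:sku-77', 'supplier:acme']
```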
Eventual Connectivity is the idea that we do not join and blend data on entry or load into another system. Rather, we load all of the data first, and along the way we flag each record with “edges” and metadata that capture unique or fuzzy references to that object. When two records notice that they have an “edge” with the same unique code, they merge or build a relationship in the graph.
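As a rough illustration of that flow, here is a minimal Python sketch. The record shapes and the emit_edges helper are hypothetical, not any particular product’s API: records are loaded as-is, tagged with the reference codes they carry, and only afterwards do records sharing a code get linked.

```python
# Minimal sketch of "eventual connectivity": load first, link later.
from collections import defaultdict

# Records arrive from different systems; we don't join them at load time.
records = [
    {"source": "crm",     "id": "c-1", "email": "ada@example.com", "name": "Ada"},
    {"source": "billing", "id": "b-9", "email": "ada@example.com", "plan": "pro"},
    {"source": "support", "id": "s-3", "email": "bob@example.com", "name": "Bob"},
]

def emit_edges(record):
    """Flag the record with the unique/fuzzy reference codes it carries."""
    edges = []
    if "email" in record:
        edges.append(("email", record["email"].lower()))
    return edges

# Load everything, tagging each record with its edge codes as we go.
index = defaultdict(list)
for record in records:
    for edge in emit_edges(record):
        index[edge].append(record)

# Eventually, records that share an edge code merge or build a relationship.
for edge, linked in index.items():
    if len(linked) > 1:
        print(edge, "connects", [r["id"] for r in linked])
# ('email', 'ada@example.com') connects ['c-1', 'b-9']
```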
So why is this the better approach?
(I will just preface this by saying that I am a big advocate of explaining that there is often no better or worse, only better for certain occasions. This is one of those moments where I can’t actually see a good reason to use ETL; I believe ETL is not needed as long as you have the right infrastructure to support this new Eventual Connectivity approach.)
- If we had to blend data from over 500 systems, we could easily take one system at a time and not care about the other systems. In fact, with this approach, we can take one table inside a database at a time; we don’t need to know how that table joins onto other tables at all.
- This helps with discovering denormalised references and relationships between data.
- Bringing on new systems does not require us to go back to the drawing board.
- The modelling ends up in the Graph Database, and hence we are not restricting ourselves to a single consumer of this data, e.g. a Data Warehouse, Machine Learning, etc.
- It will be very clear which data did not blend well, i.e. those records will have floating “edges”. This makes it much easier to identify why they did not blend (see the sketch after this list).
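Here is a minimal, self-contained Python sketch (illustrative names only, not a real product API) of how those floating edges surface: any edge code seen on exactly one record has nothing to connect to, so it can be flagged and inspected.

```python
# Minimal sketch: edge codes that appear on only one record are "floating",
# which makes the records that failed to blend easy to list and inspect.
from collections import defaultdict

edge_index = defaultdict(list)
tagged_records = [
    ("crm:c-1",     [("email", "ada@example.com")]),
    ("billing:b-9", [("email", "ada@example.com")]),
    ("support:s-3", [("email", "bob@example.com")]),  # no counterpart yet
]
for record_id, edges in tagged_records:
    for edge in edges:
        edge_index[edge].append(record_id)

for edge, record_ids in edge_index.items():
    if len(record_ids) == 1:
        print("floating edge", edge, "on", record_ids[0])
# floating edge ('email', 'bob@example.com') on support:s-3
```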
Let’s also list the reasons why Eventual Connectivity is bad:
- During ingestion of data, you might be wondering why there are so many duplicates! That is because the merging usually happens only after ingestion has finished.
- You might feel like you don’t have control over how things join onto each other. (In reality, that is usually because you missed it in your architecture sessions.)
- It is much more of a “spray and pray” type of method. However, time after time, we have proven that this approach yields significantly better results in blending data.
There is an analogy I like to use that helps me understand the different approaches very well.
Imagine you are reading a book. On page 14, someone asks you to tell them how all the characters in the book are related to each other. Well, knowing the twists and turns that happen in most books, you would most likely tell them “I don’t know”. This is what we are doing with ETL.
Now imagine another situation. Instead, you answer, “I am going to finish the book first, marking relationships along the way; at the end, I will put all the relationships into a big ‘network’ or ‘map’, and then I can answer this question.” You are now describing Eventual Connectivity. Then imagine you start reading a second book with similar characters. You already have a map, and most likely, when you run through the same process, the new characters will connect to existing characters.
So you might be able to see why people believe the ETL process is flawed in its design, let alone its implementation. It has, however, been a great part of organisations’ path to building a Data Foundation. I can take the learnings from that and bring them into a modern, scalable process: CluedIn.com
Please comment if you disagree or I have missed something out.
Happy to provide more information or a personalised demonstration.