Data Ledgers for Data Integration
This is part of a Data Mesh blog series here on the LinkedIn articles platform. I am basing this series of posts on content I developed for the YouTube playlist and the Oracle technical paper about Dynamic Data Fabric and Trusted Data Mesh.
I’ve already posted about the most significant and challenging aspect of Data Mesh, the organizational and cultural shift towards Data Product Thinking, and now we’re into what I consider to be the top 3 tech attributes of a Data Mesh: (1) Decentralization, (2) Ledger-based, and (3) Polyglot Streams.
Stop Copying/Cloning SQL Databases for Data Integration!
In my previous post about decentralization, I already pontificated (for perhaps too long!) on the historic legacy of monolithic data integration 'hubs'. In those data integration hubs (ETL tools or ETL cloud services) we are mainly copying SQL data stores around. Yes, there are nuances to this (files, XML, JSON, etc.) and I am oversimplifying, but the central point is that these ETL tools are usually used to copy and transform SQL data between OLTP and OLAP analytic data stores (e.g., data marts and data warehouses).
This pattern has worked well for 30+ years and has been used for data warehousing (Kimball, Inmon, Data Vault modeling, etc.), master data management (MDM), offline operational data integration, and more. But it has many limitations, including: (1) batch processing creates long and slow dependency chains, (2) rapid CI/CD is nearly impossible for complex enterprise ETLs, (3) dealing effectively with data drift is extraordinarily complex due to the large dependency graph, (4) legacy ETL tools struggle with newer document-based formats, and (5) newer data pipeline ETL tools aren't robust, reliable, or enterprise-ready enough to run the whole enterprise on.
But most of all, what drives the IT data execs I work with absolutely bonkers is all the copying of data that has to go on. It used to be the cost of storage that caused the most pain, although that pain point has now largely gone away. Nowadays it is the sheer amount of data that IT data execs have to take care of. It really doesn't matter if it's in the cloud or on-premises; the logistics, operations and risks that come with managing so many copies of data all over the place are daunting. What is in sync? What can be trusted? Can we delete that data? Which group uses this database? Why do five different departments need a copy of this database? The list goes on and on.
Can we stop doing ETL yet?
The unfulfilled holy grail for data integration (for my entire career of 20+ years) has been the "magical metadata layer" that serves purely virtual data views while all the physical data stays right where it is - without ever having to copy it anywhere. Unfortunately, the laws of physics (data can't move faster than light, yet) just keep getting in the way of these magical solutions. Alex Woodie over at Datanami also wondered, "Can we stop doing ETL yet?" Over the years, I've seen many unfulfilled promises from:
- Query Federation Tools - like MetaMatrix, Composite Software, Denodo Software, and BEA AquaLogic among many others
- Virtual Data Warehouse / Logical Data Warehousing - which started out decades ago with a focus on query federation technology but now more reasonably includes the broad ecosystem surrounding a physical DW, including the data lake
- Data Virtualization - which is an overloaded term sometimes referring to either of the above two topics, but can also include so-called 'copy data management', which is really just automating the lifecycle management of database clones
No time in this blog for a deep-dive into the ills and challenges of data virtualization... but it's worth pointing out here that in the long search for the holy grail of "magical metadata", the various Data Virtualization technologies were there when the myths were written.
Data integration is hard. Perhaps once we have quantum data integration in the distant future we won't need to physically move data. But for now, data must move.
Although there is no "magic," we can stop doing the brute-force SQL copy work that basic ETL tools do. ETL and pipeline tools of all kinds literally open up parallel SQL cursors and run queries to copy data: copies of OLTP schemas into an ODS (in 3rd normal form), staging copies of SQL data into data warehouses, and cloning operations from standby or staging databases that feed yet more data integration SQL copies. This is the kind of madness we can stop - by shifting to a ledger-based approach for data integration.
The idea of ledger-based integration is not new. The influential blog from LinkedIn's Jay Kreps gave rise to an entire big data ecosystem around 'Kappa-style' streaming patterns, and he famously went on to co-found the Apache Kafka vendor Confluent. Prior to Jay, there were others who used ledgers for integration - for example, the Bay Area startup GoldenGate Software, which uses a decentralized data ledger for distributed real-time database transactions (aka logical data replication).
A Narrative of Transactions vs. Copies of Data
"Oh, well... hey," you might say, "but copying data to a ledger is just as bad copying a database, right?"
Nope. There is an altogether different thing going on here. Rather than just copying tables (or files, for that matter) at a point in time, what a ledger is doing is preserving the atomic events that happened to the data. In a SQL database, this usually focuses on the committed DML - Inserts, Updates, Deletes, etc. (it's more complicated, but you get the idea...).
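To make the contrast concrete, here's a minimal sketch (with made-up table and field names, not from any particular product) of what a point-in-time snapshot captures versus what a ledger of change events preserves for the very same record:

```python
# A point-in-time snapshot only tells you the final state of the row.
snapshot_row = {"order_id": 1001, "status": "CANCELLED", "amount": 250.00}

# A ledger preserves every committed change event (the narrative), in order.
change_events = [
    {"op": "INSERT", "ts": "2021-03-01T09:15:02Z",
     "after": {"order_id": 1001, "status": "NEW", "amount": 300.00}},
    {"op": "UPDATE", "ts": "2021-03-01T09:47:11Z",
     "before": {"status": "NEW", "amount": 300.00},
     "after": {"status": "CONFIRMED", "amount": 250.00}},  # discount applied
    {"op": "UPDATE", "ts": "2021-03-02T16:05:48Z",
     "before": {"status": "CONFIRMED"},
     "after": {"status": "CANCELLED"}},                    # customer cancelled
]
```

The snapshot only tells you the order ended up cancelled; the ledger tells you it was created, discounted, confirmed, and then cancelled - and exactly when.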
To further illustrate this idea, let's consider how much we can understand about a chess match if we simply check the position of the pieces every 30 minutes vs. tracking each and every event that happens on the board or between the players.
ETL tools and SQL copies/clones are like checking the position of the pieces at some batch interval. You can certainly get an accurate read on the data at a point in time, but you know almost nothing about the narrative (the story) of how the data got into that state. Let's consider all the reasons you might want 'the whole story' rather than just snapshots:
- You are doing a data science project and you'd like to best understand user behavior
- You may be building a 'smart data integration' solution to provide data services on-the-fly (rather than instantiating copies of data sets)
- You have an event-based stream of data you are running analytics on (e.g., next-best-action, real-time marketing, fraud detection)
- You are working to provide your business sponsors with a 'no latency' solution for analytics and reporting - so that the business people can define the data SLA without regard to artificial batch processing limitations imposed by IT
The list could go on...
This metaphor of the chess match is one that I originally tuned into from Mike over at Confluent, who wrote about it here and here. The rise of Apache Kafka and the team at Confluent have had a huge impact in elevating the concept of ledger-based data integration patterns -- but they've also taken a great risk by over-stretching Kafka use cases into domains that are probably better suited for other kinds of ledger technologies.
The Hammer Called Kafka
Since 2014, Confluent has been fueled by nearly half a billion dollars of VC funding, and you can just imagine the amount of marketing power that buys you. In an all-out rush for revenues, every problem in high tech starts to look like a nail when you have a hammer called Kafka. Fundamentally, Apache Kafka is a horizontally scalable messaging platform - but it's been heavily positioned as a solution to a myriad of problems:
- Pub/Sub messaging - arguably the sweet spot for the core tech
- Blockchain ledger - run your blockchain on Kafka; boy, this one is a stretch
- Event Sourcing for Microservices - run Kafka as an MSA event ledger for communications
- Message Queue for Transactions - competing in the same lane as JMS, RabbitMQ, etc.
- Database Ledger - some even say Kafka is a database, while also adding paid CDC
Over on Y Combinator's Hacker News, developers have noticed how damaging the "use Kafka for everything" mentality is becoming:
This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today. Yes, you can turn a stream of events into a table of the present state. However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database. Kafka is a message broker.
It’s not a database and it’s not close to being a database.
and...
Recently I watched a 50-engineer startup allocate more than 50% of their engineering time for about two years to trying to cope with the consequences of using Kafka as their database, and eventually try to migrate off of it. Apparently the primary reason they went out of business was sales-related, not purely technical, but if they hadn't used Kafka, they could have had 2x the feature velocity, or better yet 2x the runway, which might have let them survive and eventually thrive.
Imagine, thinking you want a message bus as your primary database.
In this post, I am clearly a huge advocate for the power of the Ledger as a fundamental and awesome shift in data architecture, but don't confuse my enthusiasm for the core pattern with support for using one single tech (e.g., Kafka) for every. single. solution.
Choosing the Right Ledger for the Job to be Done
Back in the Data Product Thinking blog post earlier in this series, we discussed the importance of Clayton Christensen's JTBD (jobs to be done) theory - focusing on customer outcomes and the specific job that is to be done. In the context of choosing a ledger technology, it's like the old adage: choose the right tool for the job.
There is no single best technology for all the different needs around enterprise ledgers. The individual use cases for ledgers are wide-ranging and could include a focus on event sourcing for microservices, log collection for analytics, or general-purpose data services, among others. Accordingly, the exact technology selection for a given use case should often be tailored to the business needs (the job to be done). For example, data and event ledgers could be crafted from technologies such as:
- Microservices Event Stores – which include native features that align to common microservices patterns such as CQRS; tools include Event Store and Axon
- Time Series Databases – which are optimized for high-volume writes, typically from IoT devices; specialty tools include InfluxDB and TimescaleDB
- Message Queues or Service Bus – proven queues for trusted business process transactions and structured payloads; tools like RabbitMQ, Solace, and IBM MQSeries
- Event Streaming Platforms – such as Apache Kafka, Apache Pulsar and various proprietary cloud messaging services
- Data Replication Tools – which are natively integrated with database event logs for maximum data consistency; tools like Oracle GoldenGate and Debezium
- Blockchain – especially useful for multi-party event ledgers, where immutability and transparency are mandatory; Hyperledger is the most widely used for enterprise
For general-purpose, enterprise-wide solutions, a combination of technologies may provide the best fit. For example, combining a replication tool like Oracle GoldenGate (providing strong data consistency and data recoverability) with an open-source framework like Apache Kafka (for horizontal scale-out and rich polyglot data support) is a common approach used by large enterprise IT organizations (LinkedIn, Intuit, eBay, Netflix, Wells Fargo, AT&T, PayPal, etc. have all written about how they use GoldenGate and Kafka).
For enterprises that choose to use many different ledger technologies, a data catalog or schema registry may be used to govern data domains that span different ledgers or data zones.
Ledgers are the Narrative of the Truth
The ultimate power in this approach is that by using ledger technologies we can tap into the full story of what the data can tell us.
Ledgers inject the arrow of time into the semantics of what the data can tell us.
It is this 'flow of events' that illustrates and provides context to the otherwise sterile 'state of the data' we see when we are only working with snapshots, clones and copies of the data. Semantics are supposed to help us understand the meaning of the data, but static data and metadata alone are not enough; it is the contextualized flow of events and state-changes over time that can empower our software to find meaning in the data.
If you consider a typical IT data architecture, the ultimate source of the truth is typically the Application itself. Applications are where the users meet 'the glass' and where the object models of the software most closely model the intended meaning and behavior of the truth. The data stores are the state of the truth because they are ultimately the recovery layer if there is a mass failure; the data stores are the most durable records of the application state. These kinds of ledgers bring us the running narrative of the truth - with this arrow of time we can move 'position' in the ledger to evaluate application logs and database logs over points in time. This narrative contextualizes what is going on at the application, data and infrastructure tiers from one millisecond to the next.
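As a rough sketch of what 'moving position' in the ledger can mean in practice, here's a hypothetical replay (reusing the change_events list from the earlier sketch; the field names are illustrative, not from any particular product) that reconstructs the state of a record as it existed at any chosen point in time:

```python
from datetime import datetime, timezone

def state_as_of(events, point_in_time):
    """Replay ordered change events up to a timestamp to rebuild the row state."""
    state = {}
    for event in events:
        ts = datetime.fromisoformat(event["ts"].replace("Z", "+00:00"))
        if ts > point_in_time:
            break  # stop at the requested 'position' in the ledger
        if event["op"] == "DELETE":
            state = {}
        else:
            state.update(event.get("after", {}))
    return state

# What did order 1001 look like at noon (UTC) on March 1st?
noon = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
print(state_as_of(change_events, noon))
# -> {'order_id': 1001, 'status': 'CONFIRMED', 'amount': 250.0}
```

Snapshots can only answer "what is the state now?"; the ledger can answer "what was the state at any moment, and how did it get there?"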
Ledgers De-couple the Data Architecture
In the context of data integration, a very important benefit of the ledger-based data architecture is that it enables a de-coupled way of fetching raw data - typically without the need to invasively reach back to the applications or data stores.
Years ago, we called this "the sushi principle of data," and I liked that term because it highlights the value of moving and consuming the data raw. Once you adopt this mentality, it greatly simplifies the "E" part of the ETL architecture because you simply make sure that your applications and data stores can continuously emit events. The Ledger provides a proxy for the data consumers, and there's no more need for these old-fashioned ETL pipelines where the Extract step happens on the source system and the Transform step happens in a monolithic software hub.
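As an illustration of that de-coupling, here's a minimal consumer sketch - assuming a Kafka-style event streaming platform as the ledger (any of the ledger technologies listed above could play this role), the kafka-python client, and a hypothetical CDC topic named 'orders.cdc'. Notice that the consumer never connects to the source application or database at all:

```python
import json
from kafka import KafkaConsumer  # kafka-python client (one of several possible clients)

# Subscribe to the ledger topic; the source OLTP system is never touched.
consumer = KafkaConsumer(
    "orders.cdc",                          # hypothetical CDC topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",          # start from the oldest retained event
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    print(event["op"], event.get("after"))  # consume the data 'raw', as it arrives
```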
Ledgers Empower Continuous Transformation and Loading (CTL)
Obviously, not everyone can eat sushi at every meal (I've been tempted...). Sometimes you just have to 'cook the data', which essentially means doing Data Preparation and Data Transformation prior to or during Analytics. Another huge benefit of the ledger-based data architecture approach is that it empowers you to perform very low latency data transformations - I call this CTL, Continuous Transformation and Loading.
Because the ledger provides (a) the full narrative of data events in (b) a de-coupled architecture, we can now use event stream processing (ESP) and complex event processing (CEP) to continuously process, transform and analyze the data while it is flowing in real time.
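Under the same assumptions as the previous sketch (a Kafka-style ledger, the kafka-python client, and hypothetical 'orders.cdc' and 'orders.enriched' topics), a CTL job is just a long-running loop that reads, reshapes and republishes each event as it arrives rather than waiting for a batch window - a simplified sketch, not a production implementation:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders.cdc",                          # hypothetical source ledger topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    after = event.get("after", {})
    # 'Cook' the data in flight: a light transformation instead of a batch ETL job.
    enriched = {
        "order_id": after.get("order_id"),
        "status": after.get("status"),
        "amount_usd": round(float(after.get("amount", 0)), 2),
        "changed_at": event.get("ts"),
    }
    producer.send("orders.enriched", enriched)  # load continuously downstream
```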
This is a total game-changer in helping IT organizations free themselves from the chains of ETL and the batch-processing data lifecycle.
Decentralized Ledger for Data Integration
The GoldenGate technology provides a distributed, decentralized ledger for trusted (ACID) database transactions. It can be used for operational use cases (e.g., disaster recovery, DB migrations, OLTP sync) as well as analytic use cases (e.g., data lake ingest, stream analytics). Thousands of customers (including 84% of the Fortune 100) depend on GoldenGate for data integration and high availability solutions.
Along with the core ledger-based data integration capabilities, we have also included a stream processing capability (i.e., full ESP/CEP) for supporting Stream Analytics and Continuous Transformation and Loading use cases.
In Conclusion
In the Oracle technical paper about Dynamic Data Fabric and Trusted Data Mesh that this whole blog series is based on, a central feature of the Data Mesh is rooted in this fundamental architectural shift towards "the Ledger" as a dominant next-gen way of doing data integration at enterprise scale. There is not just one single Ledger technology out there; as with all tech, you should choose the right technology to get the job done.
As time marches on and core data integration tech becomes more modern, the primacy of the data ledger will become self-evident as a central defining feature of what we consider to be state of the art for data integration and data mesh.