Data Events: Trust, Transactions and ACID Properties

It’s been more than a month since my last post in the Data Mesh blog series. April went by in a flash; I think I was working every day that month. I am so proud of my team at Oracle: we shipped the OCI Database Migration service and did the official launch of OCI GoldenGate, which runs the brand-new GoldenGate 21c software. Three new product launches in one month would keep anyone busy!

So far in this series I’ve hit on 4 critical aspects: (1) shift to Data Product Thinking, (2) why Decentralization is crucial, (3) the Ledger-based integration style, and (4) the need for Polyglot Data Streams. Next up, as I’ve teased in the earlier posts, it is time to do a ‘double-click’ into how Data Events are a different and special kind of event that belongs at the heart of a modern data architecture.

A Bold Claim: Data Events are Better Events

Data events are a vastly superior point of integration for most data-centric use cases. Compared to other data integration patterns (such as ETL or Data Virtualization) that just copy snapshots of data around, or cloning techniques that take whole copies of databases or data files, ledger-driven data events are far more flexible and should be used more often.

But why stop at data integration? In many mission-critical situations, data event driven integration is also superior to Kafka and to the microservices Event Sourcing/CQRS patterns. At this moment in time, the whole of enterprise software development is deeply enamored with Kafka-style messaging for integration as well as with microservices design patterns, and both are being misused and wrong-headedly applied to data problems better suited to other kinds of tools.

Data Events vs. App Events

Within any enterprise software stack, including down in the infrastructure tiers, there are all sorts of interesting events always firing off. The GUIs produce user events, the software produces system events for communication and also for logging, the data tier produces events for data operations and committed transactions, and going even deeper, the host platforms, servers and cloud infrastructures also generate events from the OS layers on down through physical devices and networking.

For the purposes of this blog post, I’m going to call “data events” the events produced by the data tier of a software architecture. Most typically, that means database events. Database events come in many flavors, but the two most common are (illustrated in the short sketch after this list):

  • DML – data manipulation language (e.g., inserts, updates, deletes…)
  • DDL – data definition language (e.g., alter table, add column, alter constraint…)
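
To make the distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are made up purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL -- data definition language: defines or alters the structure of the data
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("ALTER TABLE orders ADD COLUMN status TEXT")

# DML -- data manipulation language: changes the rows themselves
conn.execute("INSERT INTO orders (id, amount, status) VALUES (1, 99.50, 'NEW')")
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")
conn.commit()
```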

Pretty much all databases (operational, analytic, doc stores, graph stores, etc) implement some form of DML and DDL, and I would even go further to say that pretty much any software with a SQL-like declarative interface will usually have some kind of DML and DDL operations.

In contrast, the landscape of Application Events is vast and many-layered. Events from the GUI layers tell us about user behaviors, how they use the screens and even what content they linger on. All business applications (such as ERP, HCM, CRM, etc.) support messaging events for business process integrations that can be orchestrated via REST, SOAP, or OData APIs. The applications themselves emit telemetry in the form of system logging events that provide low-level clues about what’s going on deep in the internals of the software. Application events span a vast array of different details, levels, message formats and degrees of trustworthiness.

Data Events are More Trusted

There is nothing wrong with application event streams, but data events are easily the most trustworthy events in a software stack. That is because the vast majority of applications use some kind of database as a durable and consistent store for data. The moment an application issues a “COMMIT” statement is the moment you have an iron-clad guarantee that the data is saved and recoverable (I’m not going to delve into HA/DR theory around RPO and RTO, so just roll with me on this!). This high level of trust derives from the fact that relational databases implement high levels of ACID compliance.
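
As a minimal sketch of why the COMMIT boundary matters (again using Python’s sqlite3, with made-up table names): either both writes below become durable together, or neither does.

```python
import sqlite3

conn = sqlite3.connect("ledger.db", isolation_level=None)  # manage transactions explicitly
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES (1, 100.0), (2, 0.0)")

try:
    conn.execute("BEGIN")
    # Both writes belong to one unit of work (Atomicity)
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
    conn.execute("COMMIT")   # only now is the transfer durable and recoverable
except sqlite3.Error:
    conn.execute("ROLLBACK") # on any failure, neither write survives
```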

data events happen at a precise point in time where something fleeting becomes durable

ACID means something specific. Although different RDBMS engines may implement different (in some cases tunable) levels of the ACID properties, there is a scientific basis for what ACID means, and it is testable and provable with formal algebras. The Wikipedia page for ACID is a good place for a refresher, and the page for “Consistency Model” is a remarkably good reference on the continuum of consistency rules in computer science.

Compared to application events, data events happen at a precise point in time where something fleeting becomes durable; it is the moment we can have assurance that the data event is “real” for the application and that it is fully recoverable if disaster strikes.

No other event or point in time in the lifecycle of software events is quite like a data event. Every other application or system event tempts fate: when disaster strikes, or when systems hit unplanned high-concurrency workloads, those events can be lost to the ether – or worse, corrupted and inconsistent events can be unknowingly recorded without a whisper of an alert that something bad just happened.

At the heart of “trust” is having the confidence that data won’t get messed up if something unexpected happens.

Events, Transactions and Concurrency Control

I can’t get too far into a discussion of ACID and trust without writing about transactions, and specifically Transaction Processing Systems (TPSs). Databases are a kind of TPS (both OLTP and OLAP). There are other kinds of TPSs, and I specifically want to include Message Queues (MQs) in this discussion.

Databases and MQs both excel at transactions. What do I mean? The central defining attribute of a TPS is its ability to maintain data integrity. In the face of massive read/write concurrency, users are protected from data corruption and from inadvertently working on the same data at the same time. This ‘concurrency control’ is called Isolation in ACID terms, and along with Consistency it is the other big area where pure event streaming systems are just not as trustworthy.
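
Here is a tiny sketch of what that concurrency control looks like in practice, using two connections to the same SQLite file (the file name and table are made up for illustration): while one writer holds an open transaction, a second writer is held off rather than silently stepping on shared data.

```python
import sqlite3

# Two sessions contending for the same database file
writer = sqlite3.connect("shared.db", timeout=0, isolation_level=None)
other = sqlite3.connect("shared.db", timeout=0, isolation_level=None)
writer.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, qty INTEGER)")

writer.execute("BEGIN IMMEDIATE")                        # first writer takes the write lock
writer.execute("INSERT OR REPLACE INTO items VALUES (1, 10)")

try:
    other.execute("BEGIN IMMEDIATE")                     # second writer cannot barge in
except sqlite3.OperationalError as err:
    print("isolation at work:", err)                     # 'database is locked'

writer.execute("COMMIT")                                 # the first transaction completes intact
```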

Event streaming solutions (e.g., Apache Kafka) really shine in use cases where there is a high volume of simple events, or data payloads, that need to be broadcast to many consumers. The default delivery semantics of Kafka are “at most once”, which maximizes performance. Users can also configure “at least once” semantics, which may produce duplicate events on the log. Kafka supports “exactly once” semantics and ‘transactions’ as well, but it is pretty well known that (1) using Kafka for exactly-once gives up many of the benefits of general-purpose event streaming, and (2) Kafka ‘transactions’ are not really transactions in the sense that databases and MQs mean. A kind way of saying it is that Kafka transactions put the onus on the consumer to correctly handle transactionality.
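
As a rough sketch of what those trade-offs look like from the producer side, here is a transactional producer using the confluent-kafka Python client; the broker address, topic names and payloads are placeholders, and this shows only one way to configure the semantics discussed above.

```python
from confluent_kafka import Producer

# Transactional producer: 'exactly-once' on the Kafka log, at a performance cost.
producer = Producer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "enable.idempotence": True,              # no duplicates from producer retries
    "transactional.id": "orders-outbox-1",   # required for Kafka transactions
})

producer.init_transactions(10)
producer.begin_transaction()
try:
    producer.produce("orders", key="order-1", value=b'{"amount": 99.5}')
    producer.produce("orders-audit", key="order-1", value=b'{"event": "created"}')
    producer.commit_transaction()            # both records become visible together...
except Exception:
    producer.abort_transaction()             # ...or neither does

# Consumers only see this guarantee if they set 'isolation.level' to
# 'read_committed' -- which is exactly how the burden shifts to the consumer.
```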

In comparison, databases and MQs are both specifically built for transactional workloads. Databases provide transactions on relational structures (DML/DDL), whereas MQs provide transactions for distributed messages – for example, with guaranteed delivery.

Crucially, microservices programming models, object storage buckets and messaging systems for event streams (such as Apache Kafka) are not inherently transaction-safe or ACID-capable, and they do not handle Consistency and Isolation in the same way that transaction processing systems do. There is a vast ocean of difference between tools that can (with much developer overhead) support transactions and tools that actually enforce transactions as a core part of their behavior.

There is No Magic, __________ is Not a Database!

On the analytic data engineering side of the enterprise software stack we’ve been witnessing a massive explosion in the set of cool new cloud tech at our disposal. It seems easy to forget, but just 15 years ago the only technically viable place to do large-scale data analytics was an expensive data warehouse. Nowadays anyone with a credit card can easily spin up a Hadoop cluster, an Object Storage based data lake with Apache Spark, a Kafka environment for ingesting data events, graph data stores, and search-based analytics tools like Elasticsearch or Solr.

But the perennial challenge with analytics has always been how to manipulate vast amounts of data (e.g., prepare and transform it) without losing its meaning – how to trust that your analysis is correct.

Data engineers do what data engineers do…

By far, most of the most important business data is born in a relational database (e.g., ERP systems, financial systems, customer management, telecommunications, ecommerce, and so on). This data is born as ACID-compliant operational transactions.

Yet, when data engineers move this data into non-relational systems (e.g., Object Storage, Kafka, Elasticsearch, Hadoop, etc.) what is the first thing that is lost? The transactionality, the consistency and any referential integrity that the data was born with.

So, what to do?

Data engineers do what data engineers do – they write code to try and keep the original integrity and quality of the data intact:

  • Data preparation that standardizes messy raw application data
  • ETL pipelines that flatten, join and merge data that needs to stay together
  • Kafka ‘consumers’ that try to reassemble payloads from topics back into a ‘transaction’ (see the sketch after this list)
  • Scripted fixes for ‘data drift’ as source schemas change – keeping the registries (Kafka) and metastores (for files in Object Storage) up to date
  • Esoteric discussions about how to achieve ACID-like behaviors with file formats like Parquet vs. Avro vs. JSON vs. custom formats (e.g., the Delta Lake format)
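
The Kafka-consumer bullet above is worth a closer look. A hand-rolled version of that ‘transactionality’ often looks something like the sketch below; the field names (xid, op, committed and so on) are assumptions about how an upstream capture process might decorate its payloads, not any particular product’s format.

```python
import json
from collections import defaultdict

# Hand-rolled transaction reassembly on the consumer side: group change
# records by a transaction id carried in each message, and only apply a
# group once its commit marker arrives.
open_txns = defaultdict(list)

def handle_record(raw_message: bytes):
    event = json.loads(raw_message)
    xid = event["xid"]                          # assumed transaction-id field
    open_txns[xid].append(event)
    if event.get("committed"):                  # assumed end-of-transaction marker
        apply_atomically(open_txns.pop(xid))    # only now touch the target system

def apply_atomically(events):
    # In real pipelines this is where ordering, de-duplication, rollback
    # handling and schema-drift logic pile up -- the bespoke 'ACID in code'
    # this post is describing.
    for e in events:
        print(e["op"], e.get("table"), e.get("after"))
```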

This high-minded software engineering is what the Kafka community affectionately calls “turning the database inside-out” – which is pretty much another way of saying “have fun implementing ACID properties in bespoke code!” It’s a very similar mentality for data lake developers that pass data as files through object storage buckets, often on the way back into a database anyway (!).

But the Kafka and Data Lake folks aren’t the only communities enamored with ‘going their own way’ with regards to ACID – so too are the microservices purists.

For the most hardcore of the microservices community, a central tenet of the approach is to avoid using databases at all costs. With near-religious fervor, that line of thinking holds that databases inexorably slow down the agility necessary to maintain continuous integration / continuous delivery (CI/CD) objectives and rapid iteration of the application tier.

Microservices purists often choose to follow an Event Sourcing pattern, where application (i.e., microservices) events are persisted as time-series events inside “event stores” (which are a type of event ledger; many teams also use Kafka for this purpose). This creates other challenges, including inefficient read operations (e.g., queries). So microservices developers add another pattern – Command Query Responsibility Segregation (CQRS) – to help with application-level reads/queries.
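
A deliberately tiny, in-memory sketch of that pairing is below – no real event store or broker, just enough to show the moving parts: commands append events to an append-only log, and a separate read model is projected from that log for queries, which is exactly where the eventual consistency discussed next creeps in.

```python
# Event Sourcing + CQRS in miniature (in-memory, for illustration only)
event_log = []        # the 'event store': an append-only ledger of facts
balance_view = {}     # the CQRS read model, rebuilt by projecting the log

def command_deposit(account: str, amount: float) -> None:
    # The write side never updates state directly; it records what happened.
    event_log.append({"type": "Deposited", "account": account, "amount": amount})

def project() -> None:
    # The read side is derived from the log, usually asynchronously.
    balance_view.clear()
    for e in event_log:
        if e["type"] == "Deposited":
            balance_view[e["account"]] = balance_view.get(e["account"], 0.0) + e["amount"]

command_deposit("acct-1", 50.0)
# Until project() runs, queries see stale (or missing) data: eventual consistency.
project()
print(balance_view["acct-1"])   # 50.0
```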

So, you can see where this is going.

microservices developers gonna do what microservices developers gonna do

In order for a ‘purist microservices developer’ to write an application with ACID properties (particularly, Isolation and Consistency) they’re going to have to go ahead and write ACID into their application. Yep, they’re going to write a database into their application tier.

OK, but nobody actually does that, right? That’s why the purist microservices communities will pretty much just say, “don’t bother with consistency… your application should tolerate eventual consistency by default!” – which rules out entire classes of mission-critical software systems that require strong consistency and transaction isolation.

Pay no attention to the man behind the curtain!

When your engineering teams choose non-ACID solutions like ‘purist microservices’, Apache Kafka and Object Storage based data lakes, just know that there is going to be a person (or perhaps a large team of people!) behind the curtain who has to write all that code to ensure that the data stays correct and trustworthy.

This is a fundamental fact of life any time you move vast amounts of ACID data into non-ACID systems. Maintaining the data ordering, grouping, dependencies, referential integrity, and various flavors of transaction consistency will be a foundational part of the “pipelines” that let you trust the output of your applications and analytics.

A Better Way

The global popularity of non-ACID solutions (microservices, object storage, Kafka, etc.) speaks to their power at providing other benefits that we should not ignore here. Microservices are extremely popular because of the promise of greater agility and CI/CD responsiveness with DevOps. Object Store data lakes can be very cost effective and work well for vast amounts of unstructured and semi-structured data. And Kafka is an exceptionally powerful open source platform for non-transactional event streaming solutions.

so can we make ACID data play nice in non-ACID systems?

Data events are the key to unlocking the power of trusted data alongside non-ACID systems like microservices applications, Apache Kafka and file-based data lakes. It *is* possible to provide a distributed, decentralized, event-driven way of moving data between services while preserving isolation and consistency – with data events!

For the remainder of this post, I am going to use Oracle GoldenGate as my example for working with ‘data events’. Some other data replication / change data capture tools may also be suitable for some of the examples I give, but GoldenGate is fairly unique in its class (based on its decentralized microservices architecture), so many of the examples may not be possible with other tools.

GoldenGate 101: Distributed Ledger for Data Events

At heart, what GoldenGate does is take database events (DML & DDL transactions) and other data-oriented actions (e.g., MQ, Kafka, etc.) and distribute them across a decentralized mesh of microservices. Each node in a GoldenGate mesh can capture or deliver transactions into other GoldenGate services, popular data stores, data lakes, and messaging systems.

GoldenGate has its own canonical wire protocol for transactions, called the Trail Protocol. This protocol is the distributed ledger, replicated to different nodes while preserving the ACID consistency and isolation of the base transactions, and providing >99.999% SLAs against data loss even in highly distributed networks. Each node maintains the write and read position of multiple targets and manages the lifecycle of the ledger for durability guarantees.

How GoldenGate Works with Data Events (as a Mesh)

GoldenGate is itself a set of microservices. It is C-based software without a backing database; each microservice is fully encapsulated and provides its own HTTPS access point for APIs and certificates. At the heart of GoldenGate are ‘extracts’ and ‘replicats’, which are specific to each of the hundreds of supported sources and targets that GoldenGate natively integrates with. Extracts capture data events from sources, while replicats write the data events into target technologies.

In a mesh deployment, GoldenGate distributes its ledger – the Trail. The Trail is a ledger of data events that serves as a durable point of recovery used by GoldenGate in cases of source or target data store failure, network failure, or GoldenGate process failure. The Trail is typically stored on highly durable storage and, when available and if needed, may also be regenerated from the source logs.

How GoldenGate Works with Data Events (as a Hub)

For centralized solutions that don't need to distribute data events over a WAN, GoldenGate may also be deployed as a hub-style architecture. This can simplify the administration of GoldenGate and reduce the number of ‘hops’ that data events must take across the network.

In distributed or hub mode, GoldenGate is continually moving data events while preserving the isolation and consistency of the transactions. When the target is a database, the transactions are converted into the native transaction semantics of that particular database and ACID properties are preserved all the way through the entire flow of events. But when the target is not a database (e.g., Kafka or Object Storage), the data is converted to messages or files, and if the developer is not careful, the isolation and consistency of the data can be lost as it is written into the target.

Anatomy of a GoldenGate Data Event

At the heart of every TPS (such as a database or a message queue) is a running log of operations. The TPS is responsible for keeping tabs on which operations belong to which transactions. For example, a database engine may take a massive concurrency of writes (e.g., inserts and updates) from thousands of users and will guarantee the isolation and consistency of operations happening within microseconds of each other. If you attempt to update a record that I deleted a moment ago, your update will always fail. The very first step in a GoldenGate event is reading from this database log in real time.

As these database log events are captured, they are converted into the GoldenGate canonical ledger called the Trail. The data event logs of the hundreds of combinations of supported sources/platforms are all converted into this single canonical Trail format. It is the Trail that is distributed via the GoldenGate Distribution microservice. Each Trail may have one or more ‘paths’, and Trail consumers may initiate their own access to a path, or GoldenGate can be configured to push the Trail downstream into a Receiver microservice.

GoldenGate is aware of all source operations as they hit the database logs, and each kind of database handles operations differently. As a general rule, GoldenGate emits data events once they are committed on the source system. In database systems, many operations can be grouped together in a single commit. In fact, in some long-running transactions, millions of objects might be modified as part of a single transaction that takes many hours to complete and is interleaved with thousands of smaller uncommitted operations. GoldenGate groups together, isolates and preserves the consistency of these transactions as they move across the network.

Ultimately, on the target side, GoldenGate will apply the transactions in a guaranteed, exactly-once, transactionally consistent manner.

Preservation of ACID

When the GoldenGate target is an ACID-capable store (e.g., relational databases, Apache Hive, MongoDB, etc.) the user can be certain that the data events are correctly handled as they are written and merged into the target data store. This provides the huge benefit of reliably ‘synchronized data’ among ACID-capable data stores.

But when GoldenGate is writing data events into a non-ACID store (e.g., Apache Kafka, object storage, search-based indices, etc.) it often becomes necessary for GoldenGate to take an extra step and ‘decorate’ the payloads with extra metadata so that the consumer of the data events can reconstruct them correctly, keep track of schema changes, and determine reader positions in case rollback operations are needed.

For example, the kinds of metadata that GoldenGate users often want to see in their data events include (an illustrative payload follows the list):

  • Transaction before and after images
  • Source operation type (insert, update, delete, truncate, create, alter, drop)
  • Source commit sequence number (SCN, LSN, RBA, etc.)
  • Source commit timestamp
  • Source metadata such as table and column definitions
  • GoldenGate target trail sequence number
  • GoldenGate target trail RBA
  • Metadata versions
  • Source database name & type
  • Source operation log position (if available)
  • Source transaction name (if available)
  • Source transaction user (if available)
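
Put together, a decorated change record might look roughly like the following. To be clear, this is a hypothetical payload sketched as a Python dict; the exact field names and structure vary by GoldenGate handler and output format, so treat it as an illustration of the metadata above rather than a specification.

```python
# A hypothetical, decorated data-event payload (field names are illustrative)
data_event = {
    "table": "SALES.ORDERS",                    # source metadata: schema.table
    "op_type": "U",                             # operation type: insert / update / delete ...
    "op_ts": "2021-05-10 18:02:13.000451",      # source commit timestamp
    "current_ts": "2021-05-10T18:02:14.120000", # capture/processing timestamp
    "csn": "8763214",                           # source commit sequence number (SCN/LSN/...)
    "pos": "00000000020030005321",              # position in the GoldenGate trail
    "before": {"ORDER_ID": 101, "STATUS": "NEW"},      # before image
    "after":  {"ORDER_ID": 101, "STATUS": "SHIPPED"},  # after image
}
```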

GoldenGate output formats support a variety of non-relational file types, such as CSV, fixed-length, XML, JSON, Avro, Parquet and ORC. Increasingly, advanced file formats are trying to add schema evolution and other relational-like attributes – for example, Avro with Kafka schema registries and Parquet with metastore catalogs. But none of these file formats comes anywhere near handling the full richness of ACID-capable databases.

Ultimately, developers are left to stitch together the isolation and consistency on their non-ACID targets, but the GoldenGate metadata goes a long way towards making this simpler to do. If you choose to use GoldenGate for Stream Analytics, there are built-in patterns that support GoldenGate metadata out of the box, which further simplifies and ‘bakes in’ best practices for dealing with consistency and isolation within stream processing.

Correctness and Consistency Matter

For most mission-critical systems there is no substitute for ACID-capable systems, and when data must move as events, there is no substitute for ACID-capable data events.


When Netflix migrated its billing system to the Amazon cloud, Oracle GoldenGate was the preferred tool to ensure that the migration could be trusted. Link: https://netflixtechblog.com/netflix-billing-migration-to-aws-part-iii-7d94ab9d1f59


When PayPal users see their transactions inside their app’s Activity Stream, it is GoldenGate that delivers 100% correct and trusted data to these microservices apps.

PayPal considered Event Sourcing and Kafka but decided on GoldenGate because data events have a higher trust factor and are going to be correct. Link: https://www.slideshare.net/r39132/big-data-fast-data-paypal-yow-2018


Wells Fargo runs more than 1,000 instances of GoldenGate to protect its most mission-critical banking, payments and credit card systems from planned or unplanned outages. Link: https://www.oracle.com/a/tech/docs/tip4020-wells-fargo-slides.pdf


The heart of eBay’s event-based architecture is the Kafka-based Rheos platform, and the data events from hundreds of mission-critical databases are sourced via GoldenGate. Link: https://www.slideshare.net/jtpollock/2017-openworld-keynote-for-data-integration/21?src=clipshare

Event Sourcing the Dependable Way: Transaction Outbox

Developers want the flexibility and agility of microservices, but when they need to distribute data events, is there an easier and more dependable way?

The use of Change Data Capture (aka “data events”) to replicate highly consistent events between applications has been around for many years. Now, in the brave new world of microservices application development, using data events for reliable transactions has been codified as its own pattern, called the Transaction Outbox.

For microservice applications that use a database as a durable store, there is often a need to notify other remote microservices when transactions are completed in the local microservice. The microservices community has noted that it is quite complicated to provide reliable and consistent notification of completed transactions wholly from the microservices application tier. There is simply too much room for errors and inconsistencies in the distributed event scheme.

The general shape of the Transaction Outbox solution is described here:

Link: https://microservices.io/patterns/data/transactional-outbox.html
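
The write side of the pattern is simple enough to sketch. In the minimal sketch below (Python with sqlite3; the table names and event shape are made up), the business row and the outbox event are written in the same local transaction, so the event exists if and only if the business change committed.

```python
import json
import sqlite3

conn = sqlite3.connect("service.db", isolation_level=None)
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("""CREATE TABLE IF NOT EXISTS outbox (
                  event_id INTEGER PRIMARY KEY AUTOINCREMENT,
                  aggregate TEXT, type TEXT, payload TEXT)""")

def place_order(order_id: int, amount: float) -> None:
    conn.execute("BEGIN")
    # 1. The business write...
    conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount))
    # 2. ...and the event describing it, in the SAME local transaction.
    conn.execute(
        "INSERT INTO outbox (aggregate, type, payload) VALUES (?, ?, ?)",
        ("order", "OrderPlaced", json.dumps({"id": order_id, "amount": amount})),
    )
    conn.execute("COMMIT")   # either both rows exist, or neither does

place_order(42, 99.50)
# A log-based message relay (e.g., a CDC tool like GoldenGate reading the database
# log) then publishes the committed outbox rows downstream to brokers or targets.
```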

Quite often, the Transaction Outbox is implemented with CDC technologies like Oracle GoldenGate, which materializes the solution as walked through here:

Link: https://www.youtube.com/watch?v=l6By_JEXcyw

The PayPal example above is a good illustration of a GoldenGate Transaction Outbox with a customer. In the GoldenGate instantiation, GoldenGate acts as the Message Relay (by monitoring and capturing the source database logs), and customers may optionally use Kafka as the Broker for inserting the payload into a publish-subscribe layer. As PayPal does, GoldenGate may also be used in a “brokerless” architecture by using the GoldenGate Replicat to deliver directly to the target (e.g., another data store).

Better Than the Alternatives

The time will come in many software development projects when it is necessary to move data from System A to System B. Developers have many choices on how to move data. Some techniques may be bespoke coded into the Apps, other techniques may move data through different storage tiers and formats, and other techniques will use tools that provide some assurances around data consistency.

For important and mission critical situations where data must move, the assertion I’ve been making in this blog post is that using ‘data events’ is superior to other options:

  • Turn the database inside-out with Apache Kafka – Kafka is great at many things, but maintaining transaction isolation and high consistency guarantees is not its sweet-spot. Attempts to make it so require an unhealthy amount of clever engineering on the data consumer side of things, which is risky in any situation that is not tightly controlled.
  • Purist microservice patterns for Event Sourcing & CQRS – purist microservices apps are often limited to eventual consistency models, or worse, developers try to write their own ACID-level isolation and consistency into the microservices app tier. Mission-critical systems that require isolation and consistency can just use a database plus the Transaction Outbox pattern to preserve the benefits of microservices with the strong guarantees of data events.
  • Transactions to files to data lakes and back to databases again – modern data lakes and data lakehouses benefit from cheap and durable object storage, but the impedance mismatch that arises from moving data structures from ACID-compliant transactions to non-relational files for processing in data lake tech (such as Spark or Lambda), and then ultimately back into a (cloud) data warehouse, is harmful to the integrity of the data records themselves.

Most companies just don’t have the time or resources to hire enough qualified developers who can essentially write and rewrite ACID-capable database logic into microservices applications, file-based data pipelines and Kafka consumers.

Without a doubt, databases are still the best technology for ACID-capable, highly isolated and strongly consistent transactions. Most mission-critical applications require these attributes. And when it comes to distributed event-based architectures for these mission-critical applications… or when your project must ‘bridge the gap’ between ACID and non-ACID data stores… nothing beats data events.

A Dependable, Trusted Solution for the Average Joe

For mainstream IT organizations with mainstream software/data engineers and hundreds of legacy systems that must be integrated with new applications, it just makes sense to use tools that are fit for purpose – tools that solve the difficult problems of data consistency and reliability.

Even web companies and startups whose development teams live in rarefied air – big teams of elite engineers that build applications from the ground up – need to worry about reliability and maintainability. These companies have tight deadlines and need to focus on delivering value to customers instead of hand-coding the handling of inconsistent and fragmented data into their applications. It just makes sense to use tools that are fit for purpose to the tasks at hand.

use the right tool for the job!

When you have applications that need to be reliable, data that must be trusted and you need to integrate and distribute events from these apps across clouds, microservices and data lakes – it just makes sense to use data event patterns – like Transaction Outbox – and tools like Oracle GoldenGate to distribute the most trusted data in a data mesh or a data fabric.

So, as for my bold claim that “data events are better events” – it is of course subjective. Event streaming systems like Kafka have an important role to play, MQ/business events are critical for business process integration, and modern microservices design patterns and data lake architectures are here to stay.

But I hope that I’ve convinced you, and shown you in some detail, why data events are uniquely powerful in distributed systems (for ensuring easy, correct and consistent data as it flows between operational databases and analytic data stores). Data events should be a ‘top shelf’ option when crafting modern solutions to microservices events, data movement, data integration, data lake, streaming data, data fabric and data mesh type use cases!
