Putting data in motion (as sushi)
Looking back at 2021, I spent a huge share of my working time on how to make use of data within the enterprise. Customer demand for ever "smarter" features keeps growing, and corporate decisions should be backed by evidence rather than eminence - so data is becoming crucial to almost any business goal. "Data is the new gold", "It's the age of Artificial Intelligence" - I'll spare you such off-the-shelf phrases and instead discuss some of my own challenges with data.
This article is a non-academic summary of my key takeaways so far. The internet is full of (alleged) success stories on how to build data-intensive applications on a green field. As my professional background is rooted in the hugely complex environment of a market-leading, multi-national retail bank with a history of more than 200 years, I find most of these tips not applicable, sometimes even childishly naive.
Although my write-up is rather technical, it is well-suited for everyone who works with data in a larger organization. After all, it is often the business people who have a much better understanding of data, its value, and its key properties.
Understanding the paradigm shift
It does not matter if you're a business user or an engineer: most probably your mental model of data is still a spreadsheet or a relational table. Forget this! Forget about tables, macros, CRUD, warehouses, SQL, ETL, etc. Nowadays data can be something completely different: it moves continuously and is directly tied to business logic.
For quite some time now I have been intrigued by the concept of asynchronous computing. Multitouch on our phones, serverless lambda functions in the cloud, Facebook's React, B2C giants implementing the CQRS pattern, real-time streams using Apache Kafka, corporates abandoning their batch processing (...) - these are all differently flavored manifestations of the same idea: speeding things up by making better use of waiting times.
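To make "better use of waiting times" a bit more tangible, here's a minimal Python sketch (the account IDs and the simulated latency are made up for illustration): while one coroutine waits for I/O, the event loop simply serves the next one.

```python
import asyncio
import time

async def fetch_balance(account_id: str) -> float:
    # Simulate a slow I/O call (e.g. a database query or a REST request).
    await asyncio.sleep(1)
    return 42.0

async def main() -> None:
    start = time.perf_counter()
    # All three requests wait concurrently instead of one after another,
    # so the total runtime stays close to one second rather than three.
    balances = await asyncio.gather(
        fetch_balance("AT01"), fetch_balance("AT02"), fetch_balance("AT03")
    )
    print(balances, f"took {time.perf_counter() - start:.1f}s")

asyncio.run(main())
```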
Take a look at the title image of this post and think of a Japanese chef's workload when preparing sushi. As customers enter the bar and start ordering, he fulfills each individual request. Growing customer numbers naturally put him under increasing pressure. To remedy this situation (before he becomes overloaded), he might look for an assistant (horizontal scaling), or start to cook rice and cut fish in parallel (multithreading). Yet, the most efficient solution is to better distribute his workload across his working hours.
A conveyor belt inside a sushi bar is an extremely good metaphor for how we should reason about data, and how we should work with modern data stacks.
The true challenges of modern(izing) data stacks
Let's get back from our sushi bar to the reality of data engineering and the problems I've encountered in my professional context - and how I intend to tackle them (I'm both curious and slightly nervous about how I'll look back at 2022 a year from now):
Recommending a data architecture in the 2020s is "easy": You deprecate most of your relational databases in favor of log, document, or graph storage, which often better suits your requirements. You go cloud-native to benefit from utmost flexibility in deployment. And - foremost - you push for real-time streaming to get rid of latency.
However, regardless of what specific use case you're looking at, there are two fundamental issues that will keep haunting you: First, your existing IT systems are most probably not built to push data in real-time, and second, you will stumble across a certain flavor of data which is not suited for being streamed in the first place.
Number One: Make the existing landscape push data in real-time
This first issue seems obvious. Yet it's surprising to me that most white papers, blog posts, books, or conference talks on data architecture don't tackle it sufficiently: Given any heterogeneous IT landscape, it's utterly naive to believe one could simply exchange *all* required applications to enable streaming for use cases further downstream - be it machine learning, real-time notifications, process automation, whatever. Since most companies won't dare an expensive and risky migration of their systems of record, they're doomed to keep processing data in batches. (For them, buzzwords like reverse ETL might even sound cool because they simply have no other choice when trying to bring analytical insights into their systems of engagement.)
Luckily there's a compelling solution to our problem of legacy. Rather than modernizing all applications which write data, it's much easier to integrate on the database level. Relational database management systems (RDBMSs) give their users the comforting illusion of consistency when reading and writing data (remember: ACID). They achieve that by strictly serializing all writing transactions in a so-called write-ahead log (WAL). Transactions like INSERTs, UPDATEs, and DELETEs happen concurrently, issued by different users logged in to the database, yet each of them can safely assume a consistent state. Change data capture (CDC) tools like Red Hat's Debezium hook into exactly that WAL and turn data manipulations into streamed events. The actual users of the database - a plethora of connected applications - don't have to change a single line of code; they won't even notice that their domain has been turned into an event stream as a side effect.
The screenshots taken from my demo application show how it works: A change in a relational database on the left, the change event streamed via Kafka on the right.
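To give you an idea of what actually lands in Kafka, here's a minimal Python sketch of a consumer reading such change events - the topic name is illustrative for my demo, and I assume Debezium's default JSON converter with schemas enabled:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Debezium publishes one topic per captured table; the name below is illustrative.
consumer = KafkaConsumer(
    "aiobanking.public.transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
)

for message in consumer:
    if message.value is None:
        continue  # tombstone after a delete
    envelope = message.value["payload"]
    # The envelope carries the row state before and after the change,
    # plus an operation flag: c = insert, u = update, d = delete.
    print(envelope["op"], envelope["before"], envelope["after"])
```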
In short: With a CDC tool our first problem becomes orders of magnitude easier. Instead of changing the software of all our applications, we merely configure the database.
Trivia: With the advent of Big Data, intelligence agencies were among the first big players to tackle this problem - which is why projects like Apache NiFi, originally open-sourced by the NSA, are flying around today.
Number Two: Immutable events versus evolving facts
The second problem is less technical - it revolves around the business requirements of what needs to be done with data: The evangelists of streaming architectures keep talking about events which have a clear ordering and are immutable. Luckily the financial industry is full of such "streamable" events - think of payments happening on your current account or trade orders sent to your broker.
However, in contrast to such self-contained events, there are a lot of evolving facts we have to deal with: e.g. the balance of your account or the positions within your portfolio are the latest results of - potentially streamed - events that happened in the past. The purists would argue that such stateful data shouldn't be stored at all - a balance is always just the aggregate of all transactions seen so far. Yes, such an approach makes a lot of bugs disappear (remember: functional programming and lambdas!) - yet it becomes naive once the state is the result of events exceeding your storage capacity, of external API calls which cannot easily be repeated (e.g. information from a credit bureau), of manual intervention like a review or an error correction, of slowly updated outputs from machine learning models, etc.
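To illustrate the purist view (and where it stops working), here's a tiny sketch with made-up amounts - the balance is nothing but a fold over the transaction history:

```python
from functools import reduce

# The purist view: state is never stored, it is always derived from the event log.
transactions = [100.0, -25.0, -40.0, 10.0]  # illustrative amounts

balance = reduce(lambda acc, amount: acc + amount, transactions, 0.0)
print(balance)  # 45.0

# This breaks down as soon as the "events" include things you cannot replay:
# expired external API responses, manual corrections, truncated history, ...
```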
It took me quite some time and effort to understand this problem in (hopefully) sufficient detail. I looked into the Kafka Streams API (amongst dozens of other potential solutions), yet it was an open-source Python library called faust that finally helped me untangle my brain. The library is maintained by Robinhood, who use it for order execution, fraud detection, logging, and the like. Moreover, I love Python and I'm convinced it's the right tool for the task - not because I'd be interested in dogmatic discussions about programming languages, but simply because data scientists are usually native to Python and have little exposure to JVM-based and comparable languages.
Screenshot: Bare minimum example showing the processing of a transaction stream. Whereas each financial transaction carries an amount by itself (I called this self-contained above), the running balance must be calculated from the last-known balance. Faust offers sharding of such in-memory structures across the whole cluster. The output is again written to a (derived) stream.
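For reference, here is a stripped-down sketch along the lines of that screenshot - the app name, topic names and fields are illustrative of my demo rather than an exact copy:

```python
import faust

app = faust.App("aiobanking", broker="kafka://localhost:9092")

class Transaction(faust.Record):
    account_id: str
    amount: float

transactions = app.topic("transactions", value_type=Transaction)
balance_events = app.topic("balance-events")

# A table is a changelog-backed key/value store kept in memory
# and sharded across all workers of the cluster.
balances = app.Table("balances", default=float)

@app.agent(transactions)
async def process(stream):
    # Repartition by account so every account is owned by exactly one worker.
    async for tx in stream.group_by(Transaction.account_id):
        balances[tx.account_id] += tx.amount
        # Publish the new running balance to a derived stream.
        await balance_events.send(
            key=tx.account_id, value={"balance": balances[tx.account_id]}
        )

if __name__ == "__main__":
    # Start a worker with: python this_file.py worker -l info
    app.main()
```

Because the table is backed by a changelog topic in Kafka, a restarted worker recovers its shard of the balances without replaying the full transaction history.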
Bottom line: The data stack a well-established company most probably needs is a mixture of real-time streams and stored state. Performance must scale to millions of users. Data should always be up to date, yet at the same time it is impossible to rerun the whole pipeline whenever a customer clicks a button (users are very impatient nowadays). As in our sushi bar: the conveyor belt is filled up in advance; soy sauce, napkins and chopsticks are available directly on the table; drinks are served on demand because they're specific to each customer.
My async bank // demo application
The screenshots above are artifacts from a demo application I've put together to grasp the essence of streaming to its full extent. My little contribution shows:
(Ultimately it's probably the most complicated way to update a financial balance on a terrible-looking screen :-) )
In case you're interested, have a look at https://github.com/mathiasfrey/aiobanking and get in touch. If you like what we're doing at Erste - exciting data, massive scale, honorable ethics - consider joining us!