Scaling Data Products [1/3] - Scaling an Event Platform
This article, the first of three in a series describing my thoughts on Scaling Data Products, highlights key technology strategy considerations when attempting to scale up your data ingestion, processing and distribution.
I've been meaning to put some of my thoughts in this space down on "paper" for quite some time... Hopefully, some of you who happen upon it will find it useful or insightful, but for me, writing it was simply cathartic!
Know your Audience
In today's data-driven world, the ability to effectively scale data products is crucial for businesses aiming to stay competitive and leverage the power of big data. As the volume, variety, and velocity of data continue to grow at an exponential rate, organisations must find innovative ways to process, analyse, and extract valuable insights from this vast sea of information. But what actually are data products? How do they create value and who uses them?
Traditionally, data products were only consumed by your customers. Product teams meticulously crafted the definitions of these products from the requirements of their target markets, creating linear data value chains dedicated to their end users.
Now, whilst this method of value delivery is reasonably efficient and relatively simple to implement, it leaves us with some unanswered questions.
There's an answer to these... but before we open that can of worms, let's do some grounding prep work...
Juggling Streams
Managing streams of event data is hard, and at scale it's even harder. There's a plethora of streaming technologies, formats and protocols to choose from, as well as a range of different data platform architectures.
There are major cost considerations, including platform elasticity, storage regimes and horizontal and vertical scaling challenges. The list of complexities goes on and on!
A robust and scalable infrastructure capable of handling large volumes of data is absolutely critical to scaling your data products. As a data-driven organisation, data is your lifeblood, and a healthy heart brings both longevity and success!
Traditional monolithic systems often struggle to cope with the ever-increasing demands of big data, so adopting a distributed and scalable architecture will lay the foundations for building and expanding your data products effectively. I generally separate this kind of platform into three technology layers, each of which we'll discuss in this article: platform interconnectivity, event streaming and stream processing.
Platform Interconnectivity
So you have events coming at you thick and fast... How are you going to handle these incoming streams, how will you store the events, and how will you distribute the stream data once you've done some useful work on it?
I have one word for you: flexibility. Keep your options open. No one knows what the future will bring; different customers, suppliers and internal services have different integration requirements, and today's tech of the year can be tomorrow's punch cards (apologies to the older crowd!)
By decoupling integration technologies from the core of your data platform, you aren't wedding yourself to a decision that could prevent you from expanding your reach in the future!
Most enterprise event stream platforms and frameworks - commercial or community - have their own ecosystem of connectors to facilitate interconnectivity. Some examples are object storage, databases, CRMs and ERPs.
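To make that decoupling concrete, here's a minimal sketch of registering a sink connector through the Kafka Connect REST API, taking Kafka Connect as just one example of such a connector ecosystem. It assumes a Connect worker listening on localhost:8083 with the Confluent S3 sink plugin installed; the connector, topic and bucket names are purely illustrative.

```python
# Minimal sketch: register a hypothetical S3 sink connector via the
# Kafka Connect REST API (assumes a Connect worker at localhost:8083
# and the Confluent S3 sink connector plugin installed).
import requests

connector = {
    "name": "orders-to-s3",                      # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "2",
        "topics": "orders",                      # hypothetical topic
        "s3.bucket.name": "my-data-lake",        # hypothetical bucket
        "s3.region": "eu-west-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```

The point is that swapping the destination later means swapping the connector configuration, not rebuilding the core of the platform or the processing that feeds it.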
Now, on to the Event Streaming layer and, with that, it's metaphor time!
Event Streaming = Data Plumbing
I'm a big fan of analogies to simplify otherwise complex topics. So let's invent - or rather, reuse - a great one! Water. Good old H2O.
Data is the water; it flows, it's valuable and, with the right kind of plumbing, it can be moved quickly!
Our platform interconnectivity layer is now a set of inlet pipes and taps, allowing the flow of our data water in and out of our event streaming plumbing.
It's important to note that we're storing our water data in two places in this example:
- directly, in the downstream stores (object storage, databases and the like) fed by our outlet connectors; and
- indirectly, within the event streaming layer itself, where events sit in their topics until they expire.
The latter, indirect storage corresponds to the retention policy of the event streaming layer: the longer the retention, the more data we'll be storing.
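As a concrete illustration of that retention trade-off, here's a minimal sketch using the confluent-kafka Python AdminClient to create a topic whose events are kept for seven days. The broker address, topic name, partition count and replication factor are all assumptions for the example.

```python
# Minimal sketch: create a topic with a 7-day retention policy using the
# confluent-kafka AdminClient (broker address and topic name are hypothetical).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "sensor-readings",            # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # 7 days
)

# create_topics returns a dict of topic name -> future; wait for the result.
futures = admin.create_topics([topic])
for name, future in futures.items():
    future.result()               # raises on failure
    print(f"Created topic {name}")
```

Doubling that retention roughly doubles the water sitting in the pipes, so it's a dial worth setting deliberately rather than by default.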
There are a few more "topics" to cover in this layer, but they'll be easier to follow once we've talked about...
Stream Processing
Stream processing, also known as data streaming, is a software paradigm that ingests, processes, and manages continuous streams of data while they're still in motion.
This is the final layer in our analogy; yes, you'll soon be rid of it!
The parts of our plumbing system that act on the water but remain internal correspond to our stream processing layer. In the real world, these components add value to our data by enriching, aggregating, joining and so on.
Efficiency in our stream processing layer is a critical factor in scaling data products, and as the volume of ingested data grows, processing it in real time becomes more challenging. To accommodate large throughputs and to enable future scalability, it's wise to adopt technologies built for scaled stream processing, such as Apache Spark (Structured Streaming) or Apache Flink. These frameworks enable distributed processing and parallel computing, allowing organisations to handle massive data volumes efficiently, and by leveraging in-memory computing techniques they can significantly speed up data processing, enabling faster insights, product development and decision-making.
Typically, our stream processing components read one "class" of events from our event streaming layer, and write back another. These classes are often called "topics", a semantic classification of a set of related events.
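To ground that read-enrich-write loop in something runnable, here's a minimal sketch using the confluent-kafka Python client; the broker address, topic names and the enrichment itself are all hypothetical.

```python
# Minimal sketch: read raw events from one topic, enrich them, and write the
# result to a curated topic (topic names and enrichment logic are hypothetical).
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["orders.raw"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue

        event = json.loads(msg.value())
        # "Add value" to the event - here, a trivial enrichment.
        event["total_inc_vat"] = round(event["total"] * 1.2, 2)

        producer.produce("orders.enriched", key=msg.key(), value=json.dumps(event))
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```

In practice you'd likely reach for one of the frameworks mentioned above for stateful joins and aggregations, but the shape stays the same: consume from one topic, add value, produce to another.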
Event Stream Ecosystems
There are lots of open-source and commercial event stream ecosystem offerings that can be used as solutions in the areas we've described. Each has its own strengths and drawbacks, and understanding them should help you make an informed decision.
I've captured some of the key features of Apache Kafka, Confluent Cloud and Solace, below:
Apache Kafka
Pros:
- Open source, with a very large community and a rich ecosystem (Kafka Connect, Kafka Streams, client libraries for most languages).
- Proven high throughput and horizontal scalability.
- No licence fees and no vendor lock-in.
Cons:
- Self-managed: cluster operations, upgrades, tuning and capacity planning all fall on your team.
- Requires significant in-house expertise to run reliably at scale.
Confluent Cloud
Pros:
- Fully managed Kafka service, removing most of the operational burden and offering elastic scaling.
- Adds enterprise tooling such as managed connectors, Schema Registry and ksqlDB.
Cons:
- Usage-based pricing can become significant at high volumes.
- A degree of dependence on a single vendor for the managed ecosystem.
Solace
Pros:
- Supports many open protocols and messaging patterns (MQTT, AMQP, JMS, REST and more) out of the box.
- Mature event mesh capabilities for connecting brokers across regions and clouds.
Cons:
- Enterprise features sit behind commercial licensing.
- A smaller community and third-party ecosystem than Kafka's.
Redesigning the Data Value Chain
Building a flexible, scalable event streaming platform allows us to ingest, process, store and distribute data using the same linear data pipelines we discussed earlier. But what if we could break the mould and create a multi-asset data fabric, or "data mesh", instead? One where our topics of curated product data can feed into processing workloads to create other data product topics...
In Data Mesh Architecture - covered in more detail in my next article! - product teams treat their data assets as valuable products in their own right, serving the needs not only of their customers but of other teams (internal customers). Well-catalogued topics of high-value data, served internally, can turn our linear value delivery into a value-generating, albeit complex, web of product activity!
If you've made it this far, thanks! Let me know if there's anything you'd like to read more about, less about, or just general feedback.