Scaling Data Products [1/3] - Scaling an Event Platform

This article, the first of three in a series describing my thoughts on Scaling Data Products, highlights key technology strategy considerations when attempting to scale up your data ingestion, processing and distribution.

I've been meaning to put some of my thoughts in this space down on "paper" for quite some time... Hopefully, some of you who happen upon it will find it useful or insightful, but for me, it was simply cathartic!

Know your Audience

In today's data-driven world, the ability to effectively scale data products is crucial for businesses aiming to stay competitive and leverage the power of big data. As the volume, variety, and velocity of data continue to grow at an exponential rate, organisations must find innovative ways to process, analyse, and extract valuable insights from this vast sea of information. But what actually are data products? How do they create value and who uses them?

Traditionally, data products were only consumed by your customers. Product teams meticulously crafted the definitions of these products from the requirements of their target markets, creating linear data value chains dedicated to their end users.

(Image: Open Data Watch - Data2x)

Now, whilst this method of value delivery is reasonably efficient and relatively simple to implement, it leaves us with some unanswered questions:

  1. How can we create new product value chains quickly?
  2. How can we get even more value out of our data?

There's an answer to these... but before we open that can of worms, let's do some grounding prep work...

Juggling Streams

(Image: Ghostbusters; pioneers of stream jokes - https://ghostbusters.fandom.com/)

Managing streams of event data is hard, and at scale it's even harder. There is a plethora of streaming technologies, formats and protocols to choose from, as well as a range of different data platform architectures.

There are major cost considerations, including platform elasticity, storage regimes and horizontal and vertical scaling challenges. The list of complexities goes on and on!

A robust and scalable infrastructure capable of handling large volumes of data is absolutely critical to scaling your data products. As a data-driven organisation, data is your lifeblood, and a healthy heart brings both longevity and success!

Traditional monolithic systems often struggle to cope with the ever-increasing demands of big data, so adopting a distributed and scalable architecture will lay the foundations for building and expanding your data products effectively. I generally separate this kind of platform into three technology layers, and we'll discuss each of them in this article:

  • Interconnectivity - integrations with external systems, e.g. IoT devices, queues, databases and storage
  • Event Streaming - efficiently curating events, accessible in real-time
  • Stream Processing - performing value-enhancing operations on events in-stream

Platform Interconnectivity

So you have events coming at you thick and fast... How are you going to handle these incoming streams? How will you store the events, and how will you distribute the stream data once you've done some useful work on it?

I have one word for you: flexibility. Keep your options open. No one knows what the future will bring; different customers, suppliers and internal services have different integration requirements, and today's tech of the year can be tomorrow's punch cards (apologies to the older crowd!)

By decoupling integration technologies from the core of your data platform, you aren't wedding yourself to a decision that could prevent you from expanding your reach in the future!

Most enterprise event stream platforms and frameworks - commercial or community - have their own ecosystem of connectors to facilitate interconnectivity. Some examples are object storage, databases, CRMs and ERPs.
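For instance, with Kafka Connect, a new integration is just configuration: you register a connector with the Connect worker's REST API, and the core of the platform never changes. Here's a minimal sketch in Python, assuming a Connect worker running at localhost:8083 with Confluent's S3 sink connector plugin installed (the topic and bucket names are hypothetical):

    import requests  # pip install requests

    # Hypothetical example: register an S3 sink connector via the
    # Kafka Connect REST API. Swapping this config for a JDBC, MQTT
    # or CRM connector changes the integration without touching the
    # core streaming platform.
    connector = {
        "name": "orders-s3-sink",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": "orders",                     # stream to export
            "s3.bucket.name": "example-data-lake",  # hypothetical bucket
            "s3.region": "eu-west-1",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",                   # records per S3 object
            "tasks.max": "2",                       # parallelism
        },
    }

    response = requests.post("http://localhost:8083/connectors", json=connector)
    response.raise_for_status()
    print(response.json())

The same pattern applies in reverse for source connectors pulling events in from databases, queues or IoT gateways.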

Now, on to the Event Streaming layer and, with that, it's metaphor time!

Event Streaming = Data Plumbing

I'm a big fan of analogies to simplify otherwise complex topics. So let's invent - or rather, reuse - a great one! Water. Good old H2O.

Data is the water; it flows, it's valuable and, with the right kind of plumbing, it can be moved quickly!

(Image: "Data as Water" analogy and corresponding streaming stack layers)

Our platform interconnectivity layer is now a set of inlet pipes and taps, allowing the flow of our data water in and out of our event streaming plumbing.

It's important to note that we're storing our water data in two places in the example:

  1. Our storage "sink", analogous to the hose filling up our digital hot tub
  2. The plumbing itself; water is naturally stored for as long as it takes to traverse the pipe network

The latter, indirect storage corresponds to the retention policy of the event streaming layer: the longer the retention, the more data we'll be storing.
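To make that concrete, here's a minimal sketch of setting the "length of the pipe" on Kafka, using the confluent-kafka Python client to create a topic with a seven-day retention policy (the broker address and topic name are hypothetical):

    from confluent_kafka.admin import AdminClient, NewTopic  # pip install confluent-kafka

    # Hypothetical example: the retention config controls how long
    # events sit in the streaming layer itself - a direct storage
    # cost lever, as described above.
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})

    topic = NewTopic(
        "sensor-readings",        # hypothetical topic name
        num_partitions=6,         # width of the pipe (parallelism)
        replication_factor=3,     # copies kept for fault tolerance
        config={
            "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep 7 days
            "retention.bytes": "-1",                       # no size cap
        },
    )

    # create_topics is asynchronous; block on the returned futures
    for name, future in admin.create_topics([topic]).items():
        future.result()  # raises if creation failed
        print(f"Created topic: {name}")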

There are a few more "topics" (pun intended) to cover in this layer, but they'll be easier to follow once we've talked about...

Stream Processing

Stream processing, also known as data streaming, is a software paradigm that ingests, processes, and manages continuous streams of data while they're still in motion.

Confluent: What is Stream Processing?

The final layer in our analogy; yes, you'll soon be rid of this!

Parts of our plumbing system that affect the water, but are still internal, correspond to our stream processing layer. In the real world, these components add value to our data by enriching, aggregating, joining, and so on.

Efficiency in our stream processing layer is a critical factor in scaling data products, and as the volume of ingested data grows, processing it in real-time becomes more challenging. To accommodate large throughputs and enable future scalability, it's wise to adopt technologies that support scaled stream processing, such as Apache Spark (Structured Streaming), Apache Flink or Kafka Streams. These frameworks enable distributed processing and parallel computing, allowing organisations to handle massive data volumes efficiently. By leveraging in-memory computing techniques, they can also significantly speed up data processing, enabling faster insights, product development and decision-making.

Typically, our stream processing components read one "class" of events from our event streaming layer, and write back another. These classes are often called "topics", a semantic classification of a set of related events.
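As an illustrative sketch (not a production job), here's what that read-transform-write loop looks like in Spark Structured Streaming: consuming a raw input topic, counting events per key in one-minute windows, and publishing the results back to a derived topic (the broker address and topic names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window, to_json, struct

    spark = (
        SparkSession.builder
        .appName("event-enrichment")
        # The Kafka source/sink ships as a separate package
        .config("spark.jars.packages",
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
        .getOrCreate()
    )

    # Read one "class" of events: the raw input topic
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "raw-events")  # hypothetical topic
        .load()
    )

    # Value-enhancing work in-stream: count events per key over
    # one-minute windows, tolerating 2 minutes of late data
    counts = (
        raw.selectExpr("CAST(key AS STRING) AS key", "timestamp")
        .withWatermark("timestamp", "2 minutes")
        .groupBy(window(col("timestamp"), "1 minute"), col("key"))
        .count()
    )

    # Write back another class of events: the derived topic
    query = (
        counts.select(
            col("key"),
            to_json(struct("window", "key", "count")).alias("value"),
        )
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "event-counts")  # hypothetical topic
        .option("checkpointLocation", "/tmp/checkpoints/event-counts")
        .outputMode("update")
        .start()
    )
    query.awaitTermination()

The same shape applies in Flink or Kafka Streams; the key design point is that both ends of the job are topics, so the output is immediately consumable by the next workload in the chain.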

Event Stream Ecosystems

There are lots of open-source and commercial event stream ecosystem offerings that can be used as solutions in the areas we've described. Each has its own strengths and drawbacks, and understanding these trade-offs should help you make an informed decision.

I've captured some of the key features of Apache Kafka, Confluent Cloud and Solace below:

Apache Kafka

Pros:

  • Kafka is a highly popular open-source streaming platform with a large and active community.
  • Built to handle high-throughput and high-volume data streams, making it suitable for large-scale applications.
  • Provides built-in replication and fault tolerance, ensuring data durability and system resilience.
  • Has a rich ecosystem of connectors, libraries, and tools, allowing seamless integration with various data systems.

Cons:

  • Setting up, managing, and scaling a Kafka cluster can be complex and time-consuming, requiring expertise and infrastructure.
  • Kafka primarily focuses on event streaming and lacks some advanced messaging features like point-to-point messaging, queuing, and request/reply patterns.

Confluent Cloud

Pros:

  • Confluent Cloud offers a fully managed Kafka service, handling infrastructure provisioning, scaling, and maintenance, reducing operational overhead.
  • Provides access to Confluent's extensive ecosystem, including connectors, schema registry, and KSQL, enabling easy integration and stream processing.
  • Allows effortless scaling of Kafka clusters to handle varying workloads and spikes in data traffic.
  • Integrations with popular cloud platforms like AWS, Azure and Google Cloud facilitate seamless deployment and integration.

Cons:

  • Vendor lock-in.
  • Expensive compared to open-source solutions like Kafka.
  • Limited control over the underlying infrastructure and configuration options.

Solace

Pros:

  • Solace offers a comprehensive set of messaging features, including pub/sub, queuing, request/reply, and support for different messaging patterns.
  • Known for its high-performance capabilities, enabling low-latency and high-throughput message delivery.
  • Supports various protocols like MQTT, AMQP, JMS, and REST, allowing easy integration with diverse applications and ecosystems.
  • Provides robust security features, including authentication, encryption, and message integrity, ensuring data protection.

Cons:

  • Expensive compared to open-source solutions like Kafka.
  • Mastering its full capabilities requires some learning and training.
  • Vendor lock-in.

Redesigning the Data Value Chain

Building a flexible, scalable event streaming platform allows us to ingest, process, store and distribute data using the same linear data pipelines we discussed earlier. But what if we could break the mould and create a multi-asset data fabric, or "data mesh", instead? One where our topics of curated product data can feed into processing workloads to create other data product topics...

In Data Mesh Architecture - covered in more detail in my next article! - product teams treat their data assets as valuable products in their own right, serving the needs not only of their customers but of other teams (internal customers). Well-catalogued topics of high-value data, served internally, can turn our linear value delivery into a value-generating, albeit complex, web of product activity!

If you've made it this far, thanks! Let me know if there's anything you'd like to read more about, less about, or if you have any general feedback.
