LinkedIn and Apache Kafka

Kafka was designed at LinkedIn to get around some of the limitations of traditional message brokers and to avoid the need to set up a different message broker for each point-to-point arrangement. LinkedIn's use cases were predominantly about ingesting very large volumes of data, such as page clicks and access logs, in a unidirectional way, while allowing multiple systems to consume that data without affecting the performance of producers or other consumers. LinkedIn had dozens of data systems and data repositories at the time. Connecting all of these directly would have meant building custom piping between each pair of systems, looking something like the figure below.

[Figure: custom point-to-point pipelines between every pair of data systems]

It's worth noting that data often flows in both directions, as many systems (databases and Hadoop, for example) are both sources and destinations for data transfer. This meant LinkedIn would end up building two pipelines per system: one to get data in and one to get data out. Building and operating all of that by hand would have taken an army of people. As the engineers approached full connectivity, they would have ended up with something like O(N²) pipelines: with 10 systems, that is 10 × 9 = 90 unidirectional pipelines, versus just 20 when everything connects through a central hub. Instead, they produced something generic, as shown in the figure below:

[Figure: all data systems connected through a single central pipeline]

As much as possible, they needed to isolate each consumer from the source of the data. Ideally, a consumer should integrate with just a single data repository (which we now know as Kafka) that gives it access to everything. The idea is that adding a new data system, be it a data source or a data destination, should create integration work only to connect it to a single pipeline, not to each individual consumer of data.

This experience led them to focus on building Kafka to combine what they had seen in messaging systems with the log concept popular in database and distributed-system internals. They wanted something to act as a central pipeline, first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, and so on.

Amazon has a service that is very similar to Kafka called Kinesis, and do you know which infrastructure product it took inspiration from? Kafka, which is neither a database, nor a log-file collection system, nor a traditional messaging system. The similarity between Kafka and Kinesis goes right down to the way partitioning is handled and data is retained, as well as the fairly odd split in the Kafka API between high- and low-level consumers.

Why LinkedIn named it "Kafka" (now Apache Kafka):

It refers to the German-language writer Franz Kafka, whose work was so freakish and surreal that it inspired an adjective based on his name.

In LinkedIn's case, their data infrastructure and the experience of working with it had become so nightmarish and scary that they named their solution after the author whose name best described the situation they were hoping to escape from.

"Kafkaesque", characteristic or reminiscent of the oppressive or nightmarish qualities of Franz Kafka's fictional world.

So, in effect, Kafka's reason for being is to enable the sort of messaging architecture that the Universal Data Pipeline describes.

Some of the challenges Kafka had to address:

  • Be extremely fast
  • Allow massive message throughput
  • Support publish-subscribe as well as point-to-point messaging
  • Not slow down as consumers are added; in ActiveMQ, both queue and topic performance degrade as the number of consumers on a destination rises
  • Be horizontally scalable; if a single broker that persists messages can only do so at the maximum rate of its disk, then exceeding that rate means going beyond a single broker instance
  • Provide a clean persistence model, packaging the durable, replayable input sources of the batch world in a streaming-friendly interface (see the replay sketch after this list)
  • Provide an elastic isolation layer between producers and consumers
  • Embody the relationship between streams and tables, revealing a foundational way of thinking about data processing in general while also providing a conceptual link to the rich and storied world of databases
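
To make the replayable-log idea concrete, here is a minimal sketch using the standard Java consumer API. The broker address (localhost:9092), topic name (events), and class name are assumed placeholder values, not details from the article: the consumer manually assigns itself a partition and rewinds to the beginning of the retained log, re-reading the same durable input a batch job could consume.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed local broker; adjust for your environment.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Manually assign a partition instead of subscribing, so we control the position.
                TopicPartition partition = new TopicPartition("events", 0);
                consumer.assign(List.of(partition));

                // Rewind to the start of the retained log: the same durable input
                // can be replayed as many times as needed.
                consumer.seekToBeginning(List.of(partition));

                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }

Because the log is retained on disk independently of any consumer's position, the same records can be replayed any number of times without involving the producer.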

In order to achieve all of this, Kafka adopted an architecture that redefined the roles and responsibilities of messaging clients and brokers. The JMS model is very broker-centric: the broker is responsible for the distribution of messages, and clients only have to worry about sending and receiving them. Kafka, on the other hand, is client-centric, with the client taking over many of the functions of a traditional broker, such as the fair distribution of related messages among consumers, in return for an extremely fast and scalable broker. For people coming from a traditional messaging background, working with Kafka requires a fundamental shift in perspective.
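
One concrete consequence of this client-centric design is that partition selection happens in the producing client, not in the broker. The sketch below is illustrative only; the broker address, topic name, and keys are assumed values. The producer hashes each record's key and picks the partition itself, so related messages (those sharing a key) always land on the same partition, and the broker merely appends whatever the client sends.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 5; i++) {
                    String key = "user-" + (i % 2);
                    // The client hashes the key and chooses the partition itself;
                    // the broker just appends to whatever partition the client picked.
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("events", key, "click-" + i);
                    RecordMetadata meta = producer.send(record).get();
                    System.out.printf("key=%s -> partition=%d offset=%d%n",
                            key, meta.partition(), meta.offset());
                }
            }
        }
    }

Run against a topic with several partitions, records keyed user-0 and user-1 each map consistently to their own partition, which is exactly the "fair distribution of related messages" the client now owns.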

Unified model

Kafka unified publish-subscribe and point-to-point messaging under a single destination type: the topic. This can be confusing for people coming from a messaging background, where the word topic refers to a broadcast mechanism from which consumption is nondurable. Kafka topics should instead be considered a hybrid destination type.
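
The hybrid behaviour comes from consumer groups. In this minimal sketch (broker address, topic name, and group ids are assumed placeholders), consumers sharing a group id split the topic's partitions among themselves, giving point-to-point queue semantics, while consumers with different group ids each receive every message, giving publish-subscribe semantics.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupedConsumer {
        // Pass the group id on the command line, e.g. "billing" or "analytics".
        public static void main(String[] args) {
            String groupId = args.length > 0 ? args[0] : "billing";

            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events")); // assumed topic name
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("[%s] partition=%d value=%s%n",
                                groupId, record.partition(), record.value());
                    }
                }
            }
        }
    }

Start two instances with the same argument (say, billing) and the records are divided between them like a queue; start a third with a different argument (analytics) and it receives the full stream as well, like a subscriber to a broadcast.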
