LinkedIn and Apache Kafka
Muhammad Waqas Dilawar
Staff Software Engineer at Unifonic | Microservices | Distributed Systems | Real-time Stream Processing | IAM | Cloud | Kafka | Java | Keycloak
Kafka was designed at LinkedIn to get around some of the limitations of traditional message brokers and to avoid setting up a different broker for each point-to-point arrangement. LinkedIn’s use cases predominantly involved ingesting very large volumes of data, such as page clicks and access logs, in a unidirectional way, while allowing multiple systems to consume that data without affecting the performance of producers or other consumers. At the time, LinkedIn had dozens of data systems and data repositories. Connecting all of these would have meant building custom piping between each pair of systems, looking something like the figure below.
It's worth noting that data often flows in both directions, as many systems (databases and Hadoop) are both sources and destinations for data transfer. This meant that people at LinkedIn would end up building two pipelines per system: one to get data in and one to get data out. This clearly would have taken an army of people to build and would never have been operable. As the engineers approached full connectivity, they would have ended up with something like O(N²) pipelines. Instead, they produced something generic, as shown in the figure below:
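The scaling argument above can be made concrete with a little arithmetic: with N systems and data flowing both ways, full point-to-point connectivity needs on the order of N² pipelines, while a central pipeline needs only 2N (one in, one out per system). A minimal sketch in Python:

```python
def point_to_point_pipelines(n: int) -> int:
    # Each ordered pair of distinct systems needs its own pipeline
    # (one direction each way), i.e. N * (N - 1) pipelines in total.
    return n * (n - 1)

def central_pipeline_pipelines(n: int) -> int:
    # With a single central pipeline (Kafka), each system needs at
    # most one pipeline in and one pipeline out: 2 * N in total.
    return 2 * n

for n in (5, 10, 50):
    print(n, point_to_point_pipelines(n), central_pipeline_pipelines(n))
# At N = 50 systems: 2450 custom pipelines versus 100.
```

At even a few dozen systems the quadratic approach becomes unmanageable, which is exactly the wall LinkedIn hit.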
As much as possible, they needed to isolate each consumer from the source of the data. Ideally, the consumer would integrate with just a single data repository (the repository we now know as Kafka) that would give it access to everything. The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of to each consumer of data.
This experience led them to focus on building Kafka to combine what they had seen in messaging systems with the log concept popular in databases and distributed system internals. They wanted something to act as a central pipeline, first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, and so on.
Amazon has a service that is very similar to Kafka called Kinesis, and can you guess which infrastructure product inspired it? Kafka—which is neither a database, nor a log file collection system, nor a traditional messaging system. The similarity between Kafka and Kinesis goes right down to the way partitioning is handled and data is retained, as well as the fairly odd split in the Kafka API between high- and low-level consumers.
Why LinkedIn named it "Kafka", now Apache Kafka:
It refers to the German-language writer Franz Kafka, whose work was so freakish and surreal that it inspired an adjective based on his name.
In LinkedIn's case, their data infrastructure and the work of maintaining it had become so nightmarish that they named their solution after the author whose name best described the situation they were hoping to escape from.
"Kafkaesque", characteristic or reminiscent of the oppressive or nightmarish qualities of Franz Kafka's fictional world.
So, in effect, Kafka’s reason for being is to enable the sort of messaging architecture that the Universal Data Pipeline describes.
Some challenges before Kafka:
In order to achieve all of this, Kafka adopted an architecture that redefined the roles and responsibilities of messaging clients and brokers. The JMS model is very broker-centric, where the broker is responsible for the distribution of messages, and clients only have to worry about sending and receiving messages. Kafka, on the other hand, is client-centric, with the client taking over many of the functions of a traditional broker, such as fair distribution of related messages to consumers, in return for an extremely fast and scalable broker. To people coming from a traditional messaging background, working with Kafka requires a fundamental shift in perspective.
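One concrete example of the work Kafka pushes onto the client is partition assignment: the broker does not decide which consumer receives which messages; instead, the consumers in a group agree on a split of the topic's partitions among themselves. The sketch below is a simplified Python model of a round-robin style assignment (the real client-side assignors are more elaborate, and the names here are illustrative only):

```python
def assign_partitions(partitions, consumers):
    """Distribute a topic's partitions across the consumers of one
    group, round-robin style, so each partition has exactly one owner.
    This mimics the client-side assignment Kafka consumers perform."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Six partitions of a hypothetical "page-clicks" topic, two consumers:
print(assign_partitions(list(range(6)), ["consumer-a", "consumer-b"]))
# → {'consumer-a': [0, 2, 4], 'consumer-b': [1, 3, 5]}
```

Because this logic lives in the client, the broker stays simple and fast: it only has to serve reads and writes on partitions, not track which consumer should receive each message.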
Unified model
Kafka unified both publish-subscribe and point-to-point messaging under a single destination type—the topic. This is confusing for people coming from a messaging background, where the word topic refers to a broadcast mechanism from which consumption is nondurable. Kafka topics should instead be considered a hybrid destination type: consumption is durable, every consumer group receives all of a topic's messages, and within a group each message is processed by only one member.