LinkedIn and Apache Kafka
Muhammad Waqas Dilawar
Staff Software Engineer at Unifonic | Microservices | Distributed Systems | Real-time Stream Processing | IAM | Cloud | Kafka | Java | Keycloak
Kafka was designed at LinkedIn to get around some of the limitations of traditional message brokers and to avoid setting up a different broker for each point-to-point arrangement. LinkedIn’s use cases predominantly involved ingesting very large volumes of data, such as page clicks and access logs, in a unidirectional way, while allowing multiple systems to consume that data without affecting the performance of producers or other consumers. At the time, LinkedIn had dozens of data systems and data repositories. Connecting all of these would have meant building custom piping between each pair of systems, looking something like the figure below.
It's worth noting that data often flows in both directions, as many systems (databases and Hadoop) are both sources and destinations for data transfer. This meant that people at LinkedIn would end up building two pipelines per system: one to get data in and one to get data out. This clearly would have taken an army of people to build and would never have been operable. As the engineers approached full connectivity, they would have ended up with something like O(N²) pipelines. Instead, they produced something generic, as shown in the figure below:
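The scaling argument above can be made concrete with a little arithmetic: with N systems and data flowing both ways, full point-to-point connectivity needs on the order of N² pipelines, while a central pipeline needs only 2N (one in, one out per system). A minimal sketch in Python:

```python
def point_to_point_pipelines(n: int) -> int:
    # Each ordered pair of distinct systems needs its own pipeline
    # (one direction each way), i.e. N * (N - 1) pipelines in total.
    return n * (n - 1)

def central_pipeline_pipelines(n: int) -> int:
    # With a single central pipeline (Kafka), each system needs at
    # most one pipeline in and one pipeline out: 2 * N in total.
    return 2 * n

for n in (5, 10, 50):
    print(n, point_to_point_pipelines(n), central_pipeline_pipelines(n))
# At N = 50 systems: 2450 custom pipelines versus 100.
```

At even a few dozen systems the quadratic approach becomes unmanageable, which is exactly the wall LinkedIn hit.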
As much as possible, they needed to isolate each consumer from the source of the data. Ideally, the consumer would integrate with just a single data repository (the repository we now know as Kafka) that would give it access to everything. The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of to each consumer of data.
This experience led them to focus on building Kafka to combine what they had seen in messaging systems with the log concept popular in databases and distributed system internals. They wanted something to act as a central pipeline, first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, and so on.
Amazon has a service that is very similar to Kafka called Kinesis, and can you guess which infrastructure product inspired it? Kafka—which is neither a database, nor a log file collection system, nor a traditional messaging system. The similarity between Kafka and Kinesis goes right down to the way partitioning is handled and data is retained, as well as the fairly odd split in the Kafka API between high- and low-level consumers.
Why LinkedIn named it "Kafka", now Apache Kafka:
It refers to the German-language writer Franz Kafka, whose work was so freakish and surreal that it inspired an adjective based on his name.
In LinkedIn's case, their data infrastructure and the work of maintaining it had become so nightmarish that they named their solution after the author whose name best described the situation they were hoping to escape from.
"Kafkaesque", characteristic or reminiscent of the oppressive or nightmarish qualities of Franz Kafka's fictional world.
So, in effect, Kafka’s reason for being is to enable the sort of messaging architecture that the Universal Data Pipeline describes.
Some challenges before Kafka:
In order to achieve all of this, Kafka adopted an architecture that redefined the roles and responsibilities of messaging clients and brokers. The JMS model is very broker-centric, where the broker is responsible for the distribution of messages, and clients only have to worry about sending and receiving messages. Kafka, on the other hand, is client-centric, with the client taking over many of the functions of a traditional broker, such as fair distribution of related messages to consumers, in return for an extremely fast and scalable broker. To people coming from a traditional messaging background, working with Kafka requires a fundamental shift in perspective.
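One concrete example of the work Kafka pushes onto the client is partition assignment: the broker does not decide which consumer receives which messages; instead, the consumers in a group agree on a split of the topic's partitions among themselves. The sketch below is a simplified Python model of a round-robin style assignment (the real client-side assignors are more elaborate, and the names here are illustrative only):

```python
def assign_partitions(partitions, consumers):
    """Distribute a topic's partitions across the consumers of one
    group, round-robin style, so each partition has exactly one owner.
    This mimics the client-side assignment Kafka consumers perform."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Six partitions of a hypothetical "page-clicks" topic, two consumers:
print(assign_partitions(list(range(6)), ["consumer-a", "consumer-b"]))
# → {'consumer-a': [0, 2, 4], 'consumer-b': [1, 3, 5]}
```

Because this logic lives in the client, the broker stays simple and fast: it only has to serve reads and writes on partitions, not track which consumer should receive each message.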
Unified model
Kafka unified both publish-subscribe and point-to-point messaging under a single destination type—the topic. This is confusing for people coming from a messaging background, where the word topic refers to a broadcast mechanism from which consumption is nondurable. Kafka topics should instead be considered a hybrid destination type: consumption is durable, every consumer group receives all of a topic's messages, and within a group each message is processed by only one member.