How LinkedIn Handles 7 Trillion Messages Daily With Apache Kafka
Saral Saxena
11K+ Followers | LinkedIn Top Voice | Associate Director | 15+ Years in Java, Microservices, Kafka, Spring Boot, Cloud Technologies (AWS, GCP) | Agile, K8s, DevOps & CI/CD Expert
Observed from 10,000 feet, as if from a skydiver's vantage point, Kafka has a dead-simple architecture: the brokers hold the topics, the producers are responsible for writing data, and the consumers are responsible for reading it. Even with its simplicity, Kafka has become a core part of the infrastructure for companies of all sizes.
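To make these three roles concrete, here is a minimal Java sketch of a producer writing to a topic and a consumer reading it back. The broker address, topic name, and consumer group are illustrative assumptions, not LinkedIn's actual setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaRolesSketch {
    public static void main(String[] args) {
        // Producer: writes records to a topic hosted by the brokers.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-view-events", "member-123", "viewed /feed"));
        }

        // Consumer: reads records from the same topic as part of a consumer group.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example-consumer-group");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        consumerProps.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-view-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}
```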
Overview
If you are not impressed by the statistics yet, here are some more numbers:
At LinkedIn, Kafka is leveraged for a wide range of use cases; here are some large categories:
LinkedIn needs to operate Kafka in the most reliable and scalable way to manage its vast data and support a variety of use cases. In the following sections, we’ll explore how LinkedIn achieves these goals.
Tiers and Aggregation
An internet-scale company like LinkedIn runs its infrastructure across multiple data centers.
Some applications only care about what is happening in a single data center, while others, such as building search indexes, need to operate across multiple data centers.
LinkedIn has a local cluster deployed in each data center for each message category. There is an aggregate cluster, which consolidates messages from all local clusters for a given category. With this strategy, the producer and consumer can interact with the local Kafka cluster without reaching across data centers.
Initially, they used Kafka Mirror Maker to copy data from the local to the aggregate cluster. Later, they encountered a scaling issue with this replication tool, so they switched to Brooklin, an internal solution that allows data to be streamed across different data stores.
When reading data, LinkedIn deploys consumers that consume from the brokers in the same data center. This approach simplifies the configuration and avoids cross-datacenter network issues.
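A rough sketch of what this local-only wiring can look like at the client-configuration level is shown below; the cluster hostnames and group name are hypothetical, chosen only to illustrate that every client talks to brokers inside its own data center while replication handles the cross-datacenter movement.

```java
import java.util.Properties;

// Sketch only: the DNS names below are hypothetical, not LinkedIn's real endpoints.
public class LocalClusterConfig {

    // A producer in data center "dc1" writes only to the local cluster in dc1.
    static Properties producerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-local.dc1.example.com:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    // A consumer that needs the global view reads from the aggregate cluster in its
    // own data center; replication (Mirror Maker, later Brooklin) has already copied
    // every local cluster's data into it.
    static Properties aggregateConsumerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-aggregate.dc1.example.com:9092");
        props.put("group.id", "search-index-builder");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```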
We can now see the tiers of the Kafka deployment at LinkedIn:
Operating across this many tiers raises a concern: the completeness of Kafka messages after they have passed through several tiers. LinkedIn needs a way of auditing this.
Auditing Completeness
Kafka Audit is an internal tool at LinkedIn that ensures sent messages do not disappear as they are copied through tiers.
When the producer sends messages to Kafka, it tracks the count of messages sent during the current time interval. Periodically, the producer sends this count as a message to a special auditing topic.
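Here is a minimal sketch of that producer-side counting idea; the audit topic name ("kafka-audit"), the reporting interval, and the record format are assumptions for illustration, not the actual Kafka Audit implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of the producer-side idea: count what was sent per topic during the
// current interval and periodically publish the counts to an audit topic.
public class AuditingProducer {

    private final KafkaProducer<String, String> producer;
    private final Map<String, LongAdder> sentCounts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public AuditingProducer(KafkaProducer<String, String> producer) {
        this.producer = producer;
        // Every interval, flush one count message per topic to the audit topic.
        scheduler.scheduleAtFixedRate(this::reportCounts, 10, 10, TimeUnit.MINUTES);
    }

    public void send(String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value));
        sentCounts.computeIfAbsent(topic, t -> new LongAdder()).increment();
    }

    private void reportCounts() {
        sentCounts.forEach((topic, count) -> {
            long sent = count.sumThenReset();
            // Value format is illustrative; a real audit record would carry a proper schema.
            String auditValue = String.format("{\"topic\":\"%s\",\"count\":%d,\"tier\":\"producer\"}", topic, sent);
            producer.send(new ProducerRecord<>("kafka-audit", topic, auditValue));
        });
    }
}
```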
On the consumption side, audit consumers from the Kafka Console Auditor application consume messages from all topics alongside the consumers from other applications.
Like the producer, audit consumers periodically send messages into the auditing topic, recording the number of messages they consume for each topic.
The LinkedIn engineers compare the message counts from the producers and the audit consumers to check whether all messages have landed in Kafka.
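Conceptually, the check boils down to comparing per-topic, per-interval counts reported by the two tiers. The sketch below assumes the counts have already been read from the audit topic into in-memory maps; the Key record and the maps are illustrative assumptions, while the real Kafka Audit tool persists and visualizes these numbers.

```java
import java.util.Map;

// Sketch of the completeness check: for each (topic, time bucket), compare the
// count reported by producers with the count reported by audit consumers.
public class CompletenessCheck {

    record Key(String topic, long timeBucket) {}

    static void compare(Map<Key, Long> producedCounts, Map<Key, Long> consumedCounts) {
        producedCounts.forEach((key, produced) -> {
            long consumed = consumedCounts.getOrDefault(key, 0L);
            if (consumed < produced) {
                System.out.printf("Possible loss on %s bucket %d: produced=%d consumed=%d%n",
                        key.topic(), key.timeBucket(), produced, consumed);
            }
        });
    }
}
```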
If the numbers are different, there must be a problem with the producer, and the engineers can trace the specific service and host responsible.
Tracing is possible because the Kafka message’s schema at LinkedIn contains a header that includes metadata like the timestamp, the originating physical server, and the service.
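As an illustration of that idea, the sketch below attaches similar metadata as Kafka record headers. The header names are assumptions, and LinkedIn actually carries this metadata inside its message schema's header fields rather than necessarily using Kafka record headers.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;

import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: stamp each record with the metadata needed to trace it back to its origin.
public class TracedRecordFactory {

    static ProducerRecord<String, String> withTracingHeaders(
            String topic, String key, String value, String serviceName, String hostName) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        record.headers()
              .add("timestamp", Instant.now().toString().getBytes(StandardCharsets.UTF_8))
              .add("origin-host", hostName.getBytes(StandardCharsets.UTF_8))
              .add("origin-service", serviceName.getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```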
LinkedIn Kafka release branches
LinkedIn maintains internal Kafka release branches for deploying to its production environment.
Their goal is to keep the internal branch close to the open-source Kafka release branch, which helps them leverage new features and hotfixes from the community and allows LinkedIn to contribute back to Apache Kafka.
LinkedIn engineers create an internal release branch by branching from the associated Apache Kafka branch; they call this the upstream branch.
They have two different approaches to committing Kafka patches developed at LinkedIn:
They commit changes to the upstream first, and if necessary, they issue a Kafka Improvement Proposal (KIP). Then, they cherry-pick them to their current LinkedIn release branch. This method is suitable for changes with low to medium urgency.
Keeping their release branch close to the upstream branch is a two-way process: in addition to syncing their internal patches to the upstream branch, they also need to cherry-pick patches from the upstream branch into their internal one. The LinkedIn release branch contains the following types of patches:
Here is the LinkedIn Kafka development workflow:
If there is a new issue:
If there is a new feature:
The LinkedIn engineers choose the Upstream First route or the LinkedIn First route based on the urgency of the patch. Typically, patches addressing production issues are committed as hotfixes first, while feature patches for approved KIPs should go to the upstream branch first.