How LinkedIn Handles 7 Trillion Messages Daily With Apache Kafka
Saral Saxena
11K+ Followers | LinkedIn Top Voice | Associate Director | 15+ Years in Java, Microservices, Kafka, Spring Boot, Cloud Technologies (AWS, GCP) | Agile, K8s, DevOps & CI/CD Expert
Observed from 10,000 feet, as if from a skydiver's vantage point, Kafka has a dead-simple architecture: the brokers hold the topics, the producers are responsible for writing data, and the consumers are responsible for reading it. Even with its simplicity, Kafka has become a core part of the infrastructure for companies of all sizes.
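To make these three roles concrete, here is a minimal Java sketch of a producer writing to a topic and a consumer reading it back. The broker address, topic name, and consumer group are illustrative assumptions, not LinkedIn's actual setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaRolesSketch {
    public static void main(String[] args) {
        // Producer: writes records to a topic hosted by the brokers.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-view-events", "member-123", "viewed /feed"));
        }

        // Consumer: reads records from the same topic as part of a consumer group.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example-consumer-group");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        consumerProps.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-view-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}
```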
Overview
If you are not impressed by the statistics yet, here are some more numbers:
At LinkedIn, Kafka is leveraged for a wide range of use cases; here are some large categories:
LinkedIn needs to operate Kafka in the most reliable and scalable way to manage its vast data and support a variety of use cases. In the following sections, we’ll explore how LinkedIn achieves these goals.
Tiers and Aggregation
An internet-scale company like LinkedIn runs its infrastructure across multiple data centers.
Some applications only care about what is happening in a single data center, while others, such as building search indexes, need to operate across multiple data centers.
LinkedIn has a local cluster deployed in each data center for each message category. There is an aggregate cluster, which consolidates messages from all local clusters for a given category. With this strategy, the producer and consumer can interact with the local Kafka cluster without reaching across data centers.
Initially, they used Kafka Mirror Maker to copy data from the local to the aggregate cluster. Later, they encountered a scaling issue with this replication tool, so they switched to Brooklin, an internal solution that allows data to be streamed across different data stores.
When reading data, LinkedIn deploys consumers that consume from the brokers in the same data center. This approach simplifies the configuration and avoids cross-datacenter network issues.
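A rough sketch of what this local-only wiring can look like at the client-configuration level is shown below; the cluster hostnames and group name are hypothetical, chosen only to illustrate that every client talks to brokers inside its own data center while replication handles the cross-datacenter movement.

```java
import java.util.Properties;

// Sketch only: the DNS names below are hypothetical, not LinkedIn's real endpoints.
public class LocalClusterConfig {

    // A producer in data center "dc1" writes only to the local cluster in dc1.
    static Properties producerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-local.dc1.example.com:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    // A consumer that needs the global view reads from the aggregate cluster in its
    // own data center; replication (Mirror Maker, later Brooklin) has already copied
    // every local cluster's data into it.
    static Properties aggregateConsumerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-aggregate.dc1.example.com:9092");
        props.put("group.id", "search-index-builder");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```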
We can now see the tiers of the Kafka deployment at LinkedIn:
Operating across this many tiers raises a concern: the completeness of Kafka messages after they have passed through several tiers. LinkedIn needs a way of auditing this.
Auditing Completeness
Kafka Audit is an internal tool at LinkedIn that ensures sent messages do not disappear as they are copied through tiers.
When the producer sends messages to Kafka, it tracks the count of messages sent during the current time interval. Periodically, the producer sends this count as a message to a special auditing topic.
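Here is a minimal sketch of that producer-side counting idea; the audit topic name ("kafka-audit"), the reporting interval, and the record format are assumptions for illustration, not the actual Kafka Audit implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of the producer-side idea: count what was sent per topic during the
// current interval and periodically publish the counts to an audit topic.
public class AuditingProducer {

    private final KafkaProducer<String, String> producer;
    private final Map<String, LongAdder> sentCounts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public AuditingProducer(KafkaProducer<String, String> producer) {
        this.producer = producer;
        // Every interval, flush one count message per topic to the audit topic.
        scheduler.scheduleAtFixedRate(this::reportCounts, 10, 10, TimeUnit.MINUTES);
    }

    public void send(String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value));
        sentCounts.computeIfAbsent(topic, t -> new LongAdder()).increment();
    }

    private void reportCounts() {
        sentCounts.forEach((topic, count) -> {
            long sent = count.sumThenReset();
            // Value format is illustrative; a real audit record would carry a proper schema.
            String auditValue = String.format("{\"topic\":\"%s\",\"count\":%d,\"tier\":\"producer\"}", topic, sent);
            producer.send(new ProducerRecord<>("kafka-audit", topic, auditValue));
        });
    }
}
```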
On the consumption side, audit consumers from the Kafka Console Auditor application consume messages from all topics alongside the consumers from other applications.
Like the producer, audit consumers periodically send messages into the auditing topic, recording the number of messages they consume for each topic.
The LinkedIn engineers compare the message counts from the producers and the audit consumers to check whether all messages have landed in Kafka.
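Conceptually, the check boils down to comparing per-topic, per-interval counts reported by the two tiers. The sketch below assumes the counts have already been read from the audit topic into in-memory maps; the Key record and the maps are illustrative assumptions, while the real Kafka Audit tool persists and visualizes these numbers.

```java
import java.util.Map;

// Sketch of the completeness check: for each (topic, time bucket), compare the
// count reported by producers with the count reported by audit consumers.
public class CompletenessCheck {

    record Key(String topic, long timeBucket) {}

    static void compare(Map<Key, Long> producedCounts, Map<Key, Long> consumedCounts) {
        producedCounts.forEach((key, produced) -> {
            long consumed = consumedCounts.getOrDefault(key, 0L);
            if (consumed < produced) {
                System.out.printf("Possible loss on %s bucket %d: produced=%d consumed=%d%n",
                        key.topic(), key.timeBucket(), produced, consumed);
            }
        });
    }
}
```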
If the numbers are different, there must be a problem with the producer, and the engineers can trace the specific service and host responsible.
Tracing is possible because the Kafka message’s schema at LinkedIn contains a header that includes metadata like the timestamp, the originating physical server, and the service.
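As an illustration of that idea, the sketch below attaches similar metadata as Kafka record headers. The header names are assumptions, and LinkedIn actually carries this metadata inside its message schema's header fields rather than necessarily using Kafka record headers.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;

import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: stamp each record with the metadata needed to trace it back to its origin.
public class TracedRecordFactory {

    static ProducerRecord<String, String> withTracingHeaders(
            String topic, String key, String value, String serviceName, String hostName) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        record.headers()
              .add("timestamp", Instant.now().toString().getBytes(StandardCharsets.UTF_8))
              .add("origin-host", hostName.getBytes(StandardCharsets.UTF_8))
              .add("origin-service", serviceName.getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```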
LinkedIn Kafka release branches
LinkedIn maintains internal Kafka release branches for deploying to its production environment.
Their goal is to keep the internal branch close to the open-source Kafka release branch, which helps them leverage new features and hotfixes from the community and allows LinkedIn to contribute back to Apache Kafka.
LinkedIn engineers create an internal release branch by branching from the associated Apache Kafka branch; they call this the upstream branch.
They have two different approaches to committing Kafka patches developed at LinkedIn:
They commit changes to the upstream first, and if necessary, they issue a Kafka Improvement Proposal (KIP). Then, they cherry-pick them to their current LinkedIn release branch. This method is suitable for changes with low to medium urgency.
Keeping their release branch close to the upstream branch is a two-way process: in addition to syncing their internal patches to the upstream branch, they also need to cherry-pick patches from the upstream branch into their internal one. The LinkedIn release branch contains the following types of patches:
Here is the LinkedIn Kafka development workflow:
If there is a new issue:
If there is a new feature:
The LinkedIn engineers choose the Upstream First route or the LinkedIn First route based on the urgency of the patch. Typically, patches addressing production issues are committed as hotfixes first, while feature patches for approved KIPs should go to the upstream branch first.