登录查看更多内容

Navigating Kafka and the Challenges of Asynchronous Communication

Enlin Xu

Co-Founder at Causely

发布日期: 2023年9月27日

Welcome back to our series, “One Million Ways to Slow Down Your Application.” Having previously delved into the nuances of Postgres configurations, we now journey into the world of Kafka and asynchronous communication, another critical component of scalable applications.

Kafka 101: An Introduction

Kafka is an open-source stream-processing software platform. Developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. It is designed to handle data streams and provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Top Use Cases for Kafka

Kafka's versatility allows for different application use cases, including:

Real-Time Analytics: Analyzing data in real-time can provide companies with a competitive edge. Kafka allows businesses to process massive streams of data on the fly.
Event Sourcing: This is a method of capturing changes to an application state as a series of events which can be processed, stored, and replayed.
Log Aggregation: Kafka can consolidate logs from multiple services and applications, ensuring centralized logging and ease of access.
Stream Processing: With tools like Kafka Streams and KSQL, Kafka can be used for complex stream processing tasks.

Typical Failures of Kafka

Kafka is resilient, but like any system, it can fail. Some of the most common failures include:

Broker Failures: Kafka brokers can fail due to hardware issues, lack of resources or misconfigurations.
Zookeeper Outages: Kafka relies on Zookeeper for distributed coordination. If Zookeeper faces issues, Kafka can be adversely impacted.
Network Issues: Kafka relies heavily on networking. Network partitions or latencies can cause data delays or loss.
Disk Failures: Kafka persists data on disk. Any disk-related issues can impact its performance or cause data loss.

Typical Manifestations of Kafka Failures

Broker Metrics

Brokers are pivotal in the Kafka ecosystem, acting as the central hub for data transfer. Monitoring these metrics can help you catch early signs of failures:

Under Replicated Partitions: A higher than usual count can indicate issues with data replication, possibly due to node failures.
Offline Partitions Count: If this is non-zero, it signifies that some partitions are not being served by any broker, which is a severe issue.
Active Controller Count: There should only ever be one active controller. A deviation from this norm suggests issues.
Log Flush Latency: An increase in this metric can indicate disk issues or high I/O wait, affecting Kafka's performance.
Request Handler Average Idle Percent: A decrease can indicate that the broker is getting overwhelmed.

Consumer Metrics

Consumers pull data from brokers. Ensuring they function correctly is vital for any application depending on Kafka:

Consumer Lag: Indicates how much data the consumer is behind in reading from Kafka. A consistently increasing lag may denote a slow or stuck consumer.
Commit Rate: A drop in the commit rate can suggest that consumers aren't processing messages as they should.
Fetch Rate: A decline in this metric indicates the consumer isn't fetching data from the broker at the expected rate, potentially pointing to networking or broker issues.
Rebalance Rate: Frequent rebalances can negatively affect the throughput of the consumer. Monitoring this can help identify instability in the consumer group.

领英推荐

Why You Should Consider Event-Driven Architecture And…

Vintage 7 个月前

Introduction to Apache Kafka

Brij kishore Pandey 8 个月前

Kafka Streams vs. Apache Flink: Choosing the Right…

VARAISYS PVT. LTD. 9 个月前

Producer Metrics

Producers push data into Kafka. Their health directly affects the timeliness and integrity of data in the Kafka ecosystem:

Message Send Rate: A sudden drop can denote issues with the producer's ability to send messages, possibly due to network issues.
Record Error Rate: An uptick in errors can signify that messages are not being accepted by brokers, perhaps due to topic misconfigurations or broker overloads.
Request Latency: A surge in latency can indicate network delays or issues with the broker handling requests.
Byte Rate: A drop can suggest potential issues in the pipeline leading to the producer or within the producer itself.

The Criticality of Causality in Kafka

Understanding causality between failures and how they are manifested in Kafka is vital. Failures, be they from broker disruptions, Zookeeper outages, or network inconsistencies, send ripples across the Kafka ecosystem, impacting various components. For instance, a spike in consumer lag could be traced back to a broker handling under-replicated partitions, and an increase in producer latency might indicate network issues or an overloaded broker.

Furthermore, applications using asynchronous communications are much more difficult to troubleshoot than those using synchronous communications. As seen in the examples below, it’s pretty straightforward to troubleshoot using distributed tracing if the communication is synchronous. But with asynchronous communication, there are gaps in the spans that make it harder to understand what’s happening.

Figure 1: Example of distributed tracing with sync communication

Figure 2: Example of distributed tracing for async communication

This isn’t about drawing a straight line from failure to manifestation; it's about unraveling a complex network of events and repercussions. For every failure that occurs, the developer must first manually determine where the failure happened—was it the Broker? The Zookeeper? The Consumer? Following this, they need to zoom in and figure out the specific problem. Is it a broker misconfiguration or a lack of resources? A misconfigured Zookeeper? Or is the consumer application not consuming messages quickly enough, resulting in disk full?

Software automation that captures causality can help get to the correct answer!

Figure 3: A Broker failure causes Producer failure

Signing Off

Delving into Kafka highlights the complexities of asynchronous communication in today's apps. Just like our previous exploration of Postgres, getting the configuration right and understanding causality are key.

By understanding the role of each component and what could go wrong, developer teams can focus on developing applications instead of troubleshooting what happened in Kafka.

Keep an eye out for more insights as we navigate the diverse challenges of managing resilient applications. Remember, it’s not only about avoiding slowdowns, but also about building a system that excels in any situation.

Andrew Mallaband

Growth Engineering | Enabling Tech Leaders & Innovators Around The Globe To Achieve Exceptional Results

1 年

Organisations running Kafka clearly have rigorous Service Level Objectives because of the nature of the applications it is used for. Your article provides some very good examples that highlight the importance of understanding causality because of the complexity involved as organisations scale out.

1 次回应

Jim Holland

Ex-IBM Software Development Manager - Turbonomic

1 年

Thanks Enlin. This is very informative.

1 次回应

Jason Knight

Rethinking Talent Acquistion to impact business & elevate people

1 年

Great stuff, Enlin! I am currently searching for a Sr Data Platforms Admin w/ Kafka expertise. This will help me greatly in better understanding my candidates and the problems they may be working through. James and Chad, Enlin may be a beneficial follow for you two.

1 次回应

Joe Benincasa

Product Leader, Cloud Optimization

1 年

Please keep this series going. This is the best thing I've read on LinkedIn in the past 6 months. I've got my eye on you!

6 次回应

Yotam Yemini

CEO at Causely (causely.ai)

1 年

Love to see you doing some writing, and really enjoyed the read as well!

2 次回应

查看更多评论

要查看或添加评论，请登录

Enlin Xu的更多文章

One Million Ways to Slow Down Your Application Response Time and Throughput

2023年6月13日

One Million Ways to Slow Down Your Application Response Time and Throughput

Navigating the Perilous Waters of Misconfigured MaxOpenConnection in Postgres Applications Welcome to the inaugural…

5 条评论

Navigating Kafka and the Challenges of Asynchronous Communication

Enlin Xu

Co-Founder at Causely

Kafka 101: An Introduction

Top Use Cases for Kafka

Typical Failures of Kafka

Typical Manifestations of Kafka Failures

领英推荐

The Criticality of Causality in Kafka

Signing Off

Enlin Xu的更多文章

社区洞察

其他会员也浏览了

Understanding Apache Kafka's KRaft Mode and Its Real-World Applications

Kafka: Navigating the World of Data Streaming

Understanding Apache Kafka and RabbitMQ: An Overview Written By David Ayobami-George

Running Kafka on a Single Node in K8s Cluster

RavenDB 5.4 LTS Released, Disaster Series: Article #2, Case Study: Beatman’s Decade of Success with RavenDB… All of RavenDB’s news.

Chaos! When Zookeeper accidentally had multiple leaders

Kafka vs. RabbitMQ: Which Message Queue Should You Choose? ??

Addressing Kafka Partition Imbalance: Strategies for Ensuring Even Distribution Across Brokers

002 – March 2023

Kafka and ZooKeeper a short introduction

Kafka 101: An Introduction

Top Use Cases for Kafka

Typical Failures of Kafka

Typical Manifestations of Kafka Failures

领英推荐

The Criticality of Causality in Kafka

Signing Off

Enlin Xu的更多文章

One Million Ways to Slow Down Your Application Response Time and Throughput

社区洞察

其他会员也浏览了

Understanding Apache Kafka's KRaft Mode and Its Real-World Applications

Kafka: Navigating the World of Data Streaming

Understanding Apache Kafka and RabbitMQ: An Overview Written By David Ayobami-George

Running Kafka on a Single Node in K8s Cluster

RavenDB 5.4 LTS Released, Disaster Series: Article #2, Case Study: Beatman’s Decade of Success with RavenDB… All of RavenDB’s news.

Chaos! When Zookeeper accidentally had multiple leaders

Kafka vs. RabbitMQ: Which Message Queue Should You Choose? ??

Addressing Kafka Partition Imbalance: Strategies for Ensuring Even Distribution Across Brokers

002 – March 2023

Kafka and ZooKeeper a short introduction