Kafka and AI
Dhiraj Patra
Cloud-Native (AWS, GCP & Azure) Software & AI Architect | Leading Machine Learning, Artificial Intelligence and MLOps Programs | Generative AI | Coding and Mentoring
Overview
Apache Kafka is a hot technology amongst application developers and architects looking to build the latest generation of real-time, web-scale applications. According to the official Apache Kafka website, “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”
Why Use a Queuing or Streaming Engine?
Kafka is part of a general family of technologies known as queuing, messaging, or streaming engines. Other examples in this broad technology family include traditional message queue technologies such as RabbitMQ, IBM MQ, and Microsoft Message Queue. It can be said that Kafka is to traditional queuing technologies as NoSQL technology is to traditional relational databases: these newer technologies break through the scalability and performance limitations of the traditional solutions while meeting similar needs. Apache Kafka can also be compared to proprietary solutions offered by the big cloud providers, such as AWS Kinesis, Google Cloud Dataflow, and Azure Stream Analytics.
Why Use Kafka?
The objectives we’ve mentioned above can be achieved with a range of technologies. So why would you use Kafka rather than one of those other technologies for your use case?
Looking Under the Hood
Let’s take a look at how Kafka achieves all this. We’ll start with PRODUCERS. Producers are the applications that generate events and publish them to Kafka. Of course, they don’t randomly generate events; they create them based on interactions with people, things, or systems. For example, a mobile app could generate an event when someone clicks on a button, an IoT device could generate an event when a reading occurs, or an API application could generate an event when called by another application (in fact, it is likely an API application would sit between a mobile app or IoT device and Kafka). These producer applications use a Kafka producer library (similar in concept to a database driver) to send events to Kafka, with libraries available for Java, C/C++, Python, Go, and .NET.
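As a minimal sketch, here is what a producer might look like in Python, assuming the kafka-python client library, a broker reachable at localhost:9092, and a hypothetical "button_clicks" topic:

    # Minimal producer sketch (assumes the kafka-python package and a local broker).
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        # Serialize event dictionaries to JSON bytes before sending.
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Publish a click event to a hypothetical "button_clicks" topic.
    event = {"user_id": "u-123", "button": "checkout", "ts": 1718000000}
    producer.send("button_clicks", value=event)
    producer.flush()  # Block until the event has been delivered to the broker.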
The next component to understand is the CONSUMERS. Consumers are applications that read events from Kafka and perform some processing on them. Like producers, they can be written in various languages using the Kafka client libraries.
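A matching consumer sketch, under the same assumptions (kafka-python, a local broker, and the hypothetical "button_clicks" topic):

    # Minimal consumer sketch (assumes the kafka-python package and a local broker).
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "button_clicks",                      # hypothetical topic from the producer sketch
        bootstrap_servers="localhost:9092",
        group_id="click-processors",          # consumers sharing this id split the partitions
        auto_offset_reset="earliest",         # start from the oldest available events
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    for message in consumer:
        # Each message carries the topic, partition, offset, and deserialized value.
        print(message.partition, message.offset, message.value)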
The core of the system is the Kafka BROKERS. When people talk about a Kafka cluster, they are typically talking about the cluster of brokers. The brokers receive events from producers and reliably store them so they can be read by consumers.
The brokers are configured with TOPICS. Topics are a bit like tables in a database, separating different types of data. Each topic is split into PARTITIONS. When an event is received, a record is appended to the log file for the topic and partition that the event belongs to (as determined by the metadata provided by the producer). Each of the partitions that make up a topic is allocated to the brokers in the cluster. This allows each broker to share the processing of a topic. When a topic is created, it can be configured to be replicated multiple times across the cluster so that the data is still available even if a server fails. For each partition, there is a single leader broker at any point in time that serves all reads and writes. The leader is responsible for synchronizing with the replicas. If the leader fails, Kafka will automatically transfer leader responsibility for its partitions to one of the replicas.
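As a sketch of how a partitioned, replicated topic might be created programmatically (again assuming kafka-python and, for the replication factor, a three-broker cluster; the topic name and counts are illustrative):

    # Create a replicated, partitioned topic (sketch; assumes kafka-python and a 3-broker cluster).
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

    topic = NewTopic(
        name="button_clicks",
        num_partitions=6,      # partitions are spread across the brokers for parallelism
        replication_factor=3,  # each partition is copied to 3 brokers for fault tolerance
    )
    admin.create_topics(new_topics=[topic])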
In some instances, guaranteed ordering of message delivery is important so that events are consumed in the same order they are produced. Kafka can support this guarantee at the partition level (and at the topic level if the topic has a single partition). To facilitate this, consumer applications are placed in CONSUMER GROUPS, and within a consumer group each partition is consumed by only a single consumer instance.
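A common way to get this per-entity ordering is to key each event, since Kafka sends all messages with the same key to the same partition. A small sketch, assuming kafka-python, a local broker, and a hypothetical "camera_events" topic keyed by camera ID:

    # Keyed events: all events from one camera land in one partition, preserving their order.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Hypothetical camera identifier used as the partitioning key.
    producer.send("camera_events", key="camera-42", value={"frame_id": 1001, "motion": True})
    producer.send("camera_events", key="camera-42", value={"frame_id": 1002, "motion": False})
    producer.flush()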
The following diagram illustrates all these Kafka concepts and their relationships:
A Kafka cluster is a complex distributed system with many configuration properties and possible interactions between components in the system. Operated well, Kafka can run at the highest levels of reliability even in relatively unreliable infrastructure environments such as the cloud.
A simple use case is an AI/ML application for a smart camera system built on video streaming, where the AI application detects people and objects for further analysis.
Here are the simple steps for how Kafka can be used in an AI/ML application where objects must be detected from streaming video images:
Here is a diagram of the steps involved:
Video Streaming Data --> Producer --> Kafka --> Consumer --> Machine Learning Model
In detail, here is a step-by-step guide on how Kafka can be used for an AI/ML application that involves video streaming and image detection:
1. Set up a Kafka cluster: Install and configure Apache Kafka on your infrastructure. A Kafka cluster typically consists of multiple brokers (servers) that handle message storage and distribution.
2. Define Kafka topics: Create Kafka topics to represent different stages of your AI/ML pipeline. For example, you might have topics like "raw_video_frames" to receive video frames and "processed_images" to store the results of image detection.
3. Video ingestion: Develop a video ingestion component that reads video streams or video files and extracts individual frames. This component should publish each frame as a message to the "raw_video_frames" topic in Kafka.
4. Implement image detection: Build an image detection module that takes each frame from the "raw_video_frames" topic, processes it using AI/ML algorithms, and generates a detection result. This component should consume messages from the "raw_video_frames" topic, perform image analysis, and produce the detected results (a minimal end-to-end sketch of steps 3 to 5 follows this list).
5. Publish results: Once the image detection is complete, the results can be published to the "processed_images" topic in Kafka. Each message published to this topic would contain the corresponding image frame and its associated detection results.
6. Subscribers or consumers: Create subscriber applications that consume messages from the "processed_images" topic. These subscribers can be used for various purposes such as real-time visualization, storage, or triggering downstream actions based on the detection results.
7. Scalability and parallel processing: Kafka's partitioning feature allows you to process video frames and image detection results in parallel. You can have multiple instances of the image detection module running in parallel, each consuming frames from a specific partition of the "raw_video_frames" topic. Similarly, subscribers of the "processed_images" topic can scale horizontally to handle the processing load efficiently.
8. Monitoring and management: Implement monitoring and management mechanisms to track the progress of your video processing pipeline. Kafka provides various tools and metrics to monitor the throughput, lag, and health of your Kafka cluster.
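To make steps 3 to 5 concrete, here is a minimal end-to-end sketch, assuming the kafka-python and opencv-python packages, a local broker, the topic names used above, and a placeholder detect_objects() function standing in for whatever real detection model you deploy:

    # Sketch of the video pipeline: ingest frames, detect objects, publish results.
    # Assumes kafka-python, opencv-python, a local broker, and a placeholder model.
    import json
    import cv2
    from kafka import KafkaProducer, KafkaConsumer


    def ingest_video(path: str) -> None:
        """Step 3: read a video file and publish each frame to 'raw_video_frames'."""
        producer = KafkaProducer(bootstrap_servers="localhost:9092")
        capture = cv2.VideoCapture(path)
        frame_id = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            # Compress the frame to JPEG bytes so each message stays reasonably small.
            _, jpeg = cv2.imencode(".jpg", frame)
            producer.send("raw_video_frames",
                          key=str(frame_id).encode("utf-8"),
                          value=jpeg.tobytes())
            frame_id += 1
        producer.flush()


    def detect_objects(jpeg_bytes: bytes) -> list:
        """Placeholder for a real detection model (e.g. a YOLO or SSD network)."""
        return [{"label": "person", "confidence": 0.9}]  # illustrative output only


    def run_detector() -> None:
        """Steps 4 and 5: consume raw frames, run detection, publish results."""
        consumer = KafkaConsumer(
            "raw_video_frames",
            bootstrap_servers="localhost:9092",
            group_id="image-detectors",       # scale out by starting more instances
            auto_offset_reset="earliest",
        )
        producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        for message in consumer:
            detections = detect_objects(message.value)
            producer.send("processed_images", value={
                "frame_key": message.key.decode("utf-8"),
                "detections": detections,
            })


    if __name__ == "__main__":
        ingest_video("sample.mp4")  # hypothetical input file
        run_detector()

In practice, the ingestion and detection pieces would run as separate processes, and you would start several detector instances in the same consumer group so that the partitions of "raw_video_frames" are processed in parallel, as described in step 7.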
By following these steps, you can leverage Kafka's distributed messaging system to build a scalable and efficient AI/ML application that can handle video streaming and image detection.
You can find out more at https://kafka.apache.org/. Most of the information used here was collected from the Apache Kafka site and its documentation.
You can find many examples on the internet, including this one: https://www.researchgate.net/figure/Overview-of-Kafka-ML-architecture_fig3_353707721
Thank you.