Introduction to Streaming - Apache Kafka
Filipe Balseiro
What is a streaming data pipeline?
A data pipeline is a series of data processing steps in which each step delivers an output that is the input to the next step. This continues until the pipeline is complete.
A streaming data pipeline, by extension, is a data pipeline architecture that handles millions of events at scale, in real time. As a result, you can collect, analyze, and store large amounts of information. This capability allows for applications, analytics, and reporting in real time.
What is Kafka?
Apache Kafka is a message broker and stream processor. Kafka is used to handle real-time data feeds.
Kafka is used to turn a system architecture like this...
into an architecture like this one:
In a data project we can differentiate between two main actors: consumers and producers:
Connecting consumers to producers directly, as shown in the first image, can lead to an unstructured and hard-to-maintain architecture in complex projects. Kafka comes to the rescue by becoming an intermediary that all the other components connect to.
Kafka works by allowing producers to send messages which are then pushed in real time by Kafka to consumers.
Basic Kafka Components
Message
The basic communication abstraction used by producers and consumers to share information in Kafka is called a message.
Messages have 3 main components:
- Key: used to identify the message and for additional Kafka features such as partition assignment.
- Value: the actual information that producers push and consumers are interested in.
- Timestamp: used for logging, typically the time the message was created or appended to the log.
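The three components can be pictured with a small sketch. This is an illustrative model only, not Kafka's actual wire format, and the key and value shown are made-up examples:

```python
from dataclasses import dataclass, field
import time

@dataclass
class Message:
    """Simplified view of a Kafka message: key, value, timestamp."""
    key: bytes          # used to identify the message and assign a partition
    value: bytes        # the payload consumers are interested in
    timestamp: float = field(default_factory=time.time)  # creation time

# Hypothetical example message
msg = Message(key=b"user-42", value=b'{"action": "click"}')
```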
Topic
A topic is an abstraction of a concept. Concepts can be anything that makes sense in the context of the project, such as "sales data", "new members", "clicks on banner", etc.
A producer pushes a message to a topic, which is then consumed by a consumer subscribed to that topic.
Logs
In Kafka, logs are data segments present on a storage disk. In other words, they're physical representations of data.
Logs store messages in an ordered fashion. Kafka assigns each new message a sequential ID (its offset) and then stores it in a log.
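The "ordered, sequential ID" idea can be sketched as a toy append-only log. This is a minimal model of the concept, not Kafka's actual storage code:

```python
class PartitionLog:
    """Toy append-only log: each appended message gets the next sequential offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        offset = len(self._messages)   # sequential ID assigned in arrival order
        self._messages.append(message)
        return offset

    def read(self, offset):
        return self._messages[offset]

log = PartitionLog()
first = log.append("sale: item A")    # offset 0
second = log.append("sale: item B")   # offset 1
```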
Consumer Groups
A consumer group is composed of multiple consumers.
With regard to tracking read messages, Kafka treats all the consumers inside a consumer group as a single entity: when a consumer inside a group reads a message, that message will NOT be delivered to any other consumer in the group.
Consumer groups allow consumer apps to scale independently: a consumer app made of multiple consumer nodes will not have to deal with duplicated or redundant messages.
Consumer groups have IDs and all consumers within a group have IDs as well.
By default, a consumer group contains a single consumer. Every consumer belongs to a consumer group.
Partitions
Topic logs in Kafka can be partitioned. A topic is essentially a wrapper around at least one partition.
Partitions are assigned to consumers inside consumer groups:
Partitions in Kafka, along with consumer groups, are a scalability feature. Increasing the amount of partitions allows the consumer group to increase the amount of consumers in order to read messages at a faster rate. Partitions for one topic may be stored in separate Kafka brokers in our cluster as well.
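Spreading partitions over the consumers in a group can be sketched as follows. This is a simplified round-robin assignment; Kafka's real assignors (range, round-robin, sticky) are more involved, and the consumer names are made up:

```python
def assign_partitions(partitions, consumers):
    """Toy assignment: distribute partitions evenly over a group's consumers."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        consumer = consumers[i % len(consumers)]  # round-robin over consumers
        assignment[consumer].append(partition)
    return assignment

# 4 partitions over 2 consumers: each consumer reads from 2 partitions,
# so the group processes the topic roughly twice as fast as a single consumer.
assignment = assign_partitions([0, 1, 2, 3], ["consumer-1", "consumer-2"])
```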
Messages are assigned to partitions inside a topic by means of their key: the message key is hashed and the hash is divided by the number of partitions for that topic; the remainder of that division determines the partition. In other words: hash modulo partition count.
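The "hash modulo partition count" rule can be shown in a few lines. Note that Kafka's default partitioner actually uses the murmur2 hash; CRC32 stands in here only to keep the sketch standard-library-only, and the key is a made-up example:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Hash the message key, then take the remainder modulo the partition count.
    (Kafka's default partitioner uses murmur2; CRC32 is a stand-in here.)"""
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition, which preserves
# per-key ordering within that partition.
p1 = choose_partition(b"user-42", 4)
p2 = choose_partition(b"user-42", 4)
```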
Replication
Partitions are replicated across multiple brokers in the Kafka cluster as a fault tolerance precaution.
When a partition is replicated across multiple brokers, one of the brokers becomes the leader for that specific partition. The leader handles the message and writes it to its partition log. The partition log is then replicated to other brokers, which contain replicas for that partition. Replica partitions should contain the same messages as leader partitions.
If a broker which contains a leader partition dies, another broker becomes the leader and picks up where the dead broker left off, thus guaranteeing that both producers and consumers can keep posting and reading messages.
We can define the?replication factor?of partitions at topic level. A replication factor of 1 (no replicas) is undesirable, because if the leader broker dies, then the partition becomes unavailable to the whole system, which could be catastrophic in certain applications.
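The failover behaviour described above can be modelled with a small simulation. The broker names are placeholders, and real Kafka leader election (via the controller and in-sync replica set) is far more sophisticated:

```python
class ReplicatedPartition:
    """Toy model of leader failover for a single partition."""

    def __init__(self, brokers):
        self.replicas = list(brokers)    # brokers holding a copy of this partition
        self.leader = self.replicas[0]   # one replica acts as leader

    def fail(self, broker):
        """Simulate a broker dying; promote a replica if the leader is lost."""
        self.replicas.remove(broker)
        if not self.replicas:
            # replication factor 1 and the leader died: partition is unavailable
            raise RuntimeError("partition unavailable: no replicas left")
        if broker == self.leader:
            self.leader = self.replicas[0]  # another broker takes over

# Replication factor 3: the partition survives the leader's failure.
partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.fail("broker-1")
```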
Kafka configurations
Topic configurations
Consumer configurations
Producer configurations
Acks: behaviour policy for handling acknowledgement signals. It may be set to 0, 1 or all:
- 0: the producer does not wait for any acknowledgement ("fire and forget"); fastest, but messages may be lost.
- 1: the producer waits for the leader broker to acknowledge the write; messages can still be lost if the leader fails before replication.
- all: the producer waits until the leader and all in-sync replicas have acknowledged the write; slowest, but safest.
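As an illustration, producer settings like these are typically passed to a Kafka client as key-value properties. The broker address below is a placeholder, not something from the article, and only the `acks` values shown are discussed here:

```python
# Sketch of producer settings using Kafka's property names
# ("bootstrap.servers" and "acks" are standard Kafka producer properties;
# the broker address is a placeholder).
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "acks": "all",   # wait for leader + all in-sync replicas (safest)
    # "acks": "1",   # wait for the leader broker only
    # "acks": "0",   # fire and forget (fastest, may lose messages)
}
```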