Introduction to Streaming - Apache Kafka

References

Alvaro Navas Notes

Data Engineering Zoomcamp Repository

What is a streaming data pipeline?

A data pipeline is a series of data processing steps in which each step delivers an output that is the input to the next step. This continues until the pipeline is complete.

A streaming data pipeline, by extension, is a data pipeline architecture that handles millions of events at scale, in real time. As a result, you can collect, analyze, and store large amounts of information, and this capability enables real-time applications, analytics, and reporting.

What is Kafka?

Apache Kafka is a message broker and stream processor. Kafka is used to handle real-time data feeds.

Kafka is used to turn a system architecture like this...

[Image: architecture with producers connected directly to consumers]

into an architecture like this one:

[Image: architecture with Kafka as the intermediary between producers and consumers]

In a data project we can differentiate between two main actors, consumers and producers:

  • Consumers are the components that consume the data: web pages, microservices, apps, etc.
  • Producers are the components that supply the data to the consumers.

Connecting consumers to producers directly, as shown in the first image, can lead to an unstructured and hard-to-maintain architecture in complex projects. Kafka comes to the rescue by becoming an intermediary that all the other components connect to.

Kafka works by allowing producers to send messages which are then pushed in real time by Kafka to consumers.

Basic Kafka Components

Message

The basic communication abstraction used by producers and consumers in order to share information in Kafka is called a message.

Messages have 3 main components (see the producer sketch after this list):

  • Key: used to identify the message and for additional Kafka mechanisms such as partitioning (covered later in this article).
  • Value: the actual information that producers push and consumers are interested in.
  • Timestamp: used for logging.
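A minimal producer sketch putting these three components together, assuming the kafka-python client, a broker at localhost:9092, and a hypothetical topic called rides_topic (none of these names come from the original article):

```python
# Minimal producer sketch (assumes the kafka-python package and a broker at localhost:9092).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=str.encode,                                  # key: identifies the message, used for partitioning
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),   # value: the actual payload
)

producer.send(
    'rides_topic',                          # hypothetical topic name
    key='client_42',
    value={'pickup': 'A', 'dropoff': 'B'},
    timestamp_ms=1672531200000,             # optional explicit timestamp; omitted, the current time is used
)
producer.flush()
```

The key and value are serialized to bytes before being sent; if timestamp_ms is omitted, the current time is attached to the message.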

Topic

A topic is an abstraction of a concept. Concepts can be anything that makes sense in the context of the project, such as "sales data", "new members", "clicks on banner", etc.

A producer pushes a message to a topic, which is then consumed by a consumer subscribed to that topic.
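Topics can be created explicitly before producing to them. A hedged sketch of creating the hypothetical rides_topic used in the producer example above, assuming the kafka-python admin client and a single local broker:

```python
# Sketch: creating a topic with the kafka-python admin client (names and sizes are illustrative).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([
    NewTopic(name='rides_topic', num_partitions=1, replication_factor=1)
])
```

Once the topic exists, any producer can push messages to it and any consumer subscribed to it will receive them.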

Logs

In Kafka, logs are data segments present on a storage disk. In other words, they're physical representations of data.

Logs store messages in an ordered fashion. Kafka assigns a sequential ID (the offset) to each new message and then stores it in the log.
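Each consumed record exposes the partition it came from and its position in that partition's log. A small sketch, again assuming kafka-python and the hypothetical rides_topic:

```python
# Sketch: inspecting the ordered log position (offset) of each consumed record.
from kafka import KafkaConsumer

consumer = KafkaConsumer('rides_topic', bootstrap_servers='localhost:9092')
for record in consumer:
    # Offsets are sequential within a partition, mirroring the ordered log on disk.
    print(record.partition, record.offset, record.value)
```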

Consumer Groups

A?consumer group?is composed of multiple consumers.

When it comes to tracking which messages have been read, Kafka treats all the consumers inside a consumer group as a single entity: when a consumer inside a group reads a message, that message will NOT be delivered to any other consumer in the group.

Consumer groups allow consumer apps to scale independently: a consumer app made of multiple consumer nodes will not have to deal with duplicated or redundant messages.

Consumer groups have IDs and all consumers within a group have IDs as well.

By default, a consumer group contains a single consumer. All consumers belong to a consumer group.
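A consumer sketch that joins a group, assuming kafka-python and a hypothetical group name; running two copies of this script with the same group_id forms a consumer group of two, and each message is delivered to only one of them:

```python
# Sketch: consumers sharing a group_id form a consumer group and split the messages between them.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'rides_topic',
    bootstrap_servers='localhost:9092',
    group_id='rides_dashboard',   # hypothetical consumer group ID shared by all members
)
for record in consumer:
    print(f'partition={record.partition} offset={record.offset}')
```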

Partitions

Topic logs in Kafka can be partitioned. A topic is essentially a wrapper around at least 1 partition.

Partitions are assigned to consumers inside consumer groups:

  • A partition is assigned to one consumer only.
  • One consumer may have multiple partitions assigned to it.
  • If a consumer dies, the partition is reassigned to another consumer.
  • Ideally there should be as many partitions as consumers in the consumer group.
  • If there are more partitions than consumers, some consumers will receive messages from multiple partitions.
  • If there are more consumers than partitions, the extra consumers will be idle with nothing to do.

[Image: partitions of a topic assigned to consumers within consumer groups]

Partitions in Kafka, along with consumer groups, are a scalability feature. Increasing the number of partitions allows the consumer group to increase the number of consumers in order to read messages at a faster rate. Partitions for one topic may also be stored in separate Kafka brokers in our cluster.

Messages are assigned to partitions inside a topic by means of their key: message keys are hashed and the hashes are then divided by the number of partitions for that topic; the remainder of the division determines which partition the message is assigned to. In other words: hash modulo partition count (a small sketch follows the list below).

  • While it would be possible to store messages in different partitions in a round-robin way, this method would not keep track of the message order.
  • Using keys for assigning messages to partitions has the risk of making some partitions bigger than others. For example, if the topic client makes use of client IDs as message keys and one client is much more active than the others, then the partition assigned to that client will grow more than the others. In practice, however, this is not an issue and the advantages outweigh the cons.
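A toy sketch of this hash-modulo assignment (Kafka's default partitioner actually uses a murmur2 hash of the key bytes; the MD5 hash below is only for illustration):

```python
# Toy illustration of key-based partition assignment: hash(key) modulo number of partitions.
# Kafka's Java client uses murmur2 on the key bytes; hashlib.md5 here is only illustrative.
import hashlib

NUM_PARTITIONS = 3

def assign_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    key_hash = int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)
    return key_hash % num_partitions

for client_id in ['client_1', 'client_2', 'client_3', 'client_1']:
    # The same key always lands on the same partition, which preserves per-key ordering.
    print(client_id, '-> partition', assign_partition(client_id))
```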

Replication

Partitions are replicated across multiple brokers in the Kafka cluster as a fault tolerance precaution.

When a partition is replicated across multiple brokers, one of the brokers becomes the leader for that specific partition. The leader handles the message and writes it to its partition log. The partition log is then replicated to other brokers, which contain replicas for that partition. Replica partitions should contain the same messages as leader partitions.

[Image: a leader partition on one broker replicated to follower brokers]

If a broker which contains a leader partition dies, another broker becomes the leader and picks up where the dead broker left off, thus guaranteeing that both producers and consumers can keep posting and reading messages.

We can define the replication factor of partitions at topic level. A replication factor of 1 (no replicas) is undesirable, because if the leader broker dies, then the partition becomes unavailable to the whole system, which could be catastrophic in certain applications.

Kafka configurations

Topic configurations

  • retention.ms: due to storage space limitations, messages can't be kept indefinitely. This setting specifies the amount of time (in milliseconds) that a specific topic log will be available before being deleted.
  • cleanup.policy: when the retention.ms time is up, we may choose to delete or compact a topic log. Compaction does not happen instantly; it's a batch job that takes time.
  • partition: number of partitions. The higher the amount, the more resources Kafka requires to handle them. Remember that partitions will be replicated across brokers; if a broker dies we could easily overload the cluster.
  • replication: replication factor; number of times a partition will be replicated.
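A sketch applying some of these topic-level settings, assuming the kafka-python admin client and the same hypothetical topic as above (all numbers are arbitrary examples):

```python
# Sketch: creating a topic with explicit retention, cleanup policy, partition count and replication factor.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([
    NewTopic(
        name='rides_topic',
        num_partitions=3,          # partitions
        replication_factor=2,      # replication (needs at least 2 brokers in the cluster)
        topic_configs={
            'retention.ms': '604800000',  # keep messages for 7 days
            'cleanup.policy': 'delete',   # delete (rather than compact) expired log segments
        },
    )
])
```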

Consumer configurations

  • offset: sequence of message IDs which have been read by the consumer.
  • consumer.group.id: ID for the consumer group. All consumers belonging to the same group share the same consumer.group.id.
  • auto_offset_reset: when a consumer subscribes to a pre-existing topic for the first time, Kafka needs to figure out which messages to send to the consumer. If auto_offset_reset is set to earliest, all of the existing messages in the topic log will be sent to the consumer. If auto_offset_reset is set to latest, existing old messages will be ignored and only new messages will be sent to the consumer (see the sketch below).
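A consumer configuration sketch with kafka-python (note that this client spells the settings group_id and auto_offset_reset; the group name is a hypothetical example):

```python
# Sketch: consumer configuration — consumer group ID plus offset reset behaviour.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'rides_topic',
    bootstrap_servers='localhost:9092',
    group_id='rides_dashboard',      # all members of the group share this ID
    auto_offset_reset='earliest',    # 'earliest' replays existing messages; 'latest' reads only new ones
)
```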

Producer configurations

acks: behaviour policy for handling acknowledgement signals. It may be set to 0, 1 or all.

  • 0: "fire and forget". The producer will not wait for the leader or replica brokers to write messages to disk. Fastest policy for the producer. Useful for time-sensitive applications which are not affected by missing a couple of messages every so often, such as log messages or monitoring messages.
  • 1: the producer waits for the leader broker to write the message to disk. If the message is processed by the leader broker but the broker immediately dies and the message has not been replicated, the message is lost.
  • all: the producer waits for the leader and all replica brokers to write the message to disk. Safest but slowest policy. Useful for data-sensitive applications which cannot afford to lose messages, but speed will have to be taken into account.
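A producer sketch showing where this setting lives in kafka-python, which accepts acks values of 0, 1 or 'all':

```python
# Sketch: setting the acknowledgement policy on the producer.
from kafka import KafkaProducer

# acks=0 -> fire and forget; acks=1 -> wait for the leader only; acks='all' -> wait for leader and replicas.
producer = KafkaProducer(bootstrap_servers='localhost:9092', acks='all')
producer.send('rides_topic', b'payment received')   # hypothetical topic and payload
producer.flush()
```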
