"Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Streaming"
Abhishek Singh
Technical Lead Data Engineer (Azure) at Publicis Sapient. Expertise in SQL, PySpark, and Scala with Spark, Kafka with Spark Streaming, Databricks, and tuning Spark applications at petabyte scale. Cloud: AWS, Azure, and GCP.
Structured Streaming APIs enable building end-to-end streaming applications, called continuous applications, in a consistent, fault-tolerant manner that handles all of the complexities of writing such applications. They do so without requiring you to reason about the nitty-gritty details of streaming itself, and they let you work with familiar Spark SQL concepts such as DataFrames and Datasets. All of this has led to strong interest in use cases that want to tap into it. From introductions to ETL to complex data formats, this topic has been covered widely. Structured Streaming also integrates with third-party components such as Kafka, HDFS, S3, RDBMS, etc.
Connecting to a Kafka Topic
Let’s assume you have a Kafka cluster that you can connect to, and you are looking to use Spark’s Structured Streaming to ingest and process messages from a topic. The Databricks platform already includes an Apache Kafka 0.10 connector for Structured Streaming, so it is easy to set up a stream to read messages:
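A minimal sketch of such a reader is shown below. The broker addresses and topic name are placeholders you would replace with your own; in a Databricks notebook the SparkSession is already available as spark.

```scala
import org.apache.spark.sql.SparkSession

// In a Databricks notebook `spark` already exists; it is created here only
// to keep the sketch self-contained.
val spark = SparkSession.builder.appName("KafkaStructuredStreaming").getOrCreate()

val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<broker1:9092,broker2:9092>")  // placeholder brokers
  .option("subscribe", "<topic>")                                    // placeholder topic
  .option("startingOffsets", "latest")
  .load()
```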
Streaming ETL
Now that the stream is set up, we can start doing the required ETL on it to extract meaningful insights. Notice that streamingInputDF is a DataFrame. Since DataFrames are essentially an untyped Dataset of rows, we can perform operations on it much as we would on a static DataFrame.
Let’s say that generic ISP hit data in JSON format is being pushed to the Kafka <topic> above. An example value might look like this:
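For illustration, a record might look like the following. The field names here are hypothetical; the important part is that each message carries a zipcode (and, for the windowing example later, a hit timestamp):

```json
{
  "isp": "ExampleNet",
  "city": "San Francisco",
  "region": "CA",
  "zipcode": "94107",
  "hittime": "2017-02-08 10:12:03"
}
```

With data in that shape, a sketch of the parsing and counting step could look like this (the JSON path assumes the sample record above):

```scala
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._

// Kafka delivers the payload in the binary `value` column: cast it to a string,
// pull the zipcode out with a JSON path, then count hits per zipcode.
val streamingSelectDF = streamingInputDF
  .select(get_json_object($"value".cast("string"), "$.zipcode").alias("zipcode"))
  .groupBy($"zipcode")
  .count()

// In a Databricks notebook, display() starts the streaming query in the
// background and keeps refreshing the counts as new messages arrive.
display(streamingSelectDF)
```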
Notice that in the command above we are able to parse the zipcode out of the incoming JSON messages, group by it, and compute a count, all in real time as we read data from the Kafka topic. Once we have the count, we can display it, which starts the streaming job in the background and continuously updates the counts as new messages arrive.
Windowing
Now that we have parse, select, groupBy, and count operations executing continuously, what if we want to find the traffic per zipcode over a 10-minute window that slides every 5 minutes, starting 2 minutes past the hour?
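A sketch of such a sliding-window aggregation is shown below, assuming the hittime field from the sample record above:

```scala
import org.apache.spark.sql.functions.{get_json_object, window}
import spark.implicits._

// Group by zipcode and a 10-minute window that slides every 5 minutes,
// offset 2 minutes past the hour.
val streamingSelectDF = streamingInputDF
  .select(
    get_json_object($"value".cast("string"), "$.zipcode").alias("zipcode"),
    get_json_object($"value".cast("string"), "$.hittime").alias("hittime"))
  .groupBy(
    $"zipcode",
    window($"hittime".cast("timestamp"), "10 minutes", "5 minutes", "2 minutes"))
  .count()
```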
Output Options
So far, we have seen the end results being displayed automatically. If we want more control over the output, there are a variety of output options available. For instance, if we need to debug, we may wish to use the console output. If we need to query the dataset interactively as data is being consumed, the memory output would be an ideal choice. Similarly, the output can be written to files, external databases, or even streamed back to Kafka.
Memory
In this scenario, data is stored as an in-memory table. From there, users are able to query the dataset using SQL. The name of the table is specified via the queryName option. Note that we continue to use streamingSelectDF from the windowing example above.
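A sketch of the memory sink follows; the table name (isphits) and the trigger interval are illustrative choices:

```scala
import org.apache.spark.sql.streaming.Trigger

val memoryQuery = streamingSelectDF
  .writeStream
  .format("memory")
  .queryName("isphits")                  // in-memory table name to query with SQL
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()

// While the stream runs, the table can be queried interactively, e.g.:
// spark.sql("SELECT zipcode, window, count FROM isphits ORDER BY count DESC").show()
```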
Console
In this scenario, the output is printed to the console/stdout log.
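A sketch of the console sink for the same aggregation:

```scala
// Print each batch of windowed counts to stdout; useful for debugging.
val consoleQuery = streamingSelectDF
  .writeStream
  .format("console")
  .outputMode("complete")
  .start()
```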
Databases
Oftentimes we want to be able to write the output of streams to external databases such as MySQL. At the time of writing, the Structured Streaming API does not support external databases as sinks; however, when it does, the API option will be as simple as .format("jdbc").start("jdbc:mysql/.."). In the meantime, we can use the foreach sink to accomplish this. Let’s create a custom JDBC sink that extends ForeachWriter and implements its methods.
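A sketch of such a sink is shown below. The MySQL driver class is the standard one, while the target table name (zip_counts) and its two-column layout (zipcode, count) are assumptions you would adapt to your own schema:

```scala
import java.sql.{Connection, DriverManager, Statement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Writes each (zipcode, count) row of a batch to a MySQL table.
class JDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    // Open one connection per partition and epoch.
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement()
    true
  }

  def process(value: Row): Unit = {
    // Assumes a table zip_counts(zipcode VARCHAR, count BIGINT).
    statement.executeUpdate(
      "INSERT INTO zip_counts VALUES ('" + value.getString(0) + "', " + value.getLong(1) + ")")
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close()
  }
}
```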
We can now use the JDBCSink:
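The wiring below is a sketch; the JDBC URL, credentials, and trigger interval are placeholders:

```scala
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Placeholder connection details for the target MySQL instance.
val url  = "jdbc:mysql://<mysqlserver>:3306/analytics"
val user = "<username>"
val pwd  = "<password>"

val writer = new JDBCSink(url, user, pwd)

val jdbcQuery = streamingSelectDF
  .select($"zipcode", $"count")          // keep only the columns the sink expects
  .writeStream
  .foreach(writer)
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()
```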
As batches complete, the counts by zipcode can be INSERTed or UPSERTed into MySQL as needed.
Kafka
Similar to writing to databases, the current Structured Streaming API does not support the “kafka” output format, but this will be available in the next version. In the meantime, we can create a custom class named KafkaSink that extends ForeachWriter. Let’s see how that looks:
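The class below is a sketch; it assumes each incoming row is a (zipcode, count) pair and publishes it as a string key/value message:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Publishes each (zipcode, count) row back to a Kafka topic.
class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    // Create the producer on the executor; it is not serializable.
    producer = new KafkaProducer[String, String](kafkaProperties)
    true
  }

  def process(value: Row): Unit = {
    // Key is the zipcode, value is its latest count.
    producer.send(new ProducerRecord(topic, value.getString(0), value.getLong(1).toString))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
```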
Now we can use the writer:
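Again a sketch, with placeholder topic and broker values:

```scala
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Placeholder output topic and broker list.
val kafkaWriter = new KafkaSink("<topic2>", "<broker1:9092,broker2:9092>")

val kafkaQuery = streamingSelectDF
  .select($"zipcode", $"count")
  .writeStream
  .foreach(kafkaWriter)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()
```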
You can now see that we are pumping messages back to the Kafka topic <topic2>. In this case, we are pushing the updated zipcode:count pairs at the end of each batch. The other thing to note is that the streaming dashboard provides insight into the incoming message rate versus the processing rate, the batch duration, and the raw data used to generate it. This comes in very handy when debugging issues and monitoring the system.
On the Kafka consumer side, we can then watch the updated zipcode:count pairs arrive with each batch.
In this case, we are running in the “update” output mode. As messages are consumed, zipcodes whose counts were updated during that batch are pushed back to Kafka, while zipcodes that did not change are not sent. You can also run in “complete” mode, as we did in the database sink above, in which all of the zipcodes with their latest counts are sent, even if some of the counts have not changed since the last batch.
I hope this helps you learn Kafka and Spark Structured Streaming ETL with the different sink formats.
Thank you