SPARK AND KAFKA INTEGRATION
This write-up describes the integration between Kafka and Spark. I will briefly touch upon the basics of Kafka and Spark and then show how they can be integrated.
Kafka: - Kafka is a pub-sub solution, where a producer publishes data to a topic and a consumer subscribes to that topic to receive the data. Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and very fast.
Spark: - Spark is an ecosystem that provides many components such as Spark Core, Spark Streaming, Spark SQL, Spark MLlib, etc. These components compute results by building RDDs and keeping them in RAM, which makes processing very fast.
Basic integration between Kafka and Spark is described in many places. This blog shows an integration where a customized Kafka producer feeds data to Spark Streaming, which acts as the consumer.
This integration will be reused in multiple use cases in the blogs that follow. The data used here is dummy data, intended only to illustrate the integration between Kafka and Spark.
In the follow-up blogs, we will work on real data.
Sample data
Schema of data –
Name – Name of the employee
Age – Age of the employee
Place – Work location of the employee
Salary – Salary of the employee
Department – The department in which he/she works
Sample rows/records of data look like: -
("Ayaansh",20,"Kolkata",40000,"IT")
("Bahubali",30,"Hyderabad",60000,"Analytics")
("Curious",40,"Bangalore",90000,"Data science")
("Dear",38,"Mumbai",100000,"Business consultant")
Step 1: -
Create a Kafka topic: - kafka/bin/kafka-topics.sh --zookeeper <IP:port> --create --topic <topic_name> --partitions <numeric value> --replication-factor <numeric value>
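For example, assuming ZooKeeper is running locally on the default port, the topic used later in this blog could be created as follows (all values are placeholders to adapt to your setup):
kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic myTopic --partitions 1 --replication-factor 1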
Step 2: -
Verify whether the created topic exists: - kafka/bin/kafka-topics.sh --zookeeper <IP:port> --list
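For example: kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list. This should print myTopic among the existing topics.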
Step 3: -
Start the Kafka server: - /kafka/bin/kafka-server-start.sh /kafka/config/server.properties
Step 4: -
We will write a customized Kafka producer class that accepts data and acts as the producer; Spark can then fetch the data published by this producer.
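A minimal sketch of what such a producer class (test.kafka.KafkaProducerSrvc, referenced in Step 5) might look like is shown below, using the standard org.apache.kafka.clients producer API. The broker address, serializer choices, and the sendMessege method name are assumptions made to match the call in Step 5, not the blog's actual code.
package test.kafka;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
/* Minimal sketch of a customized Kafka producer; configuration values are assumptions */
public class KafkaProducerSrvc {
    private final Producer<String, String> producer;
    private final String topic;
    public KafkaProducerSrvc(String topic) {
        this.topic = topic;
        Properties props = new Properties();
        /* Assumed broker address; replace with your Kafka broker(s) */
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }
    /* Publishes one line of input to the topic (method name matches the call in Step 5) */
    public void sendMessege(String message) {
        producer.send(new ProducerRecord<>(topic, message));
    }
    public void close() {
        producer.close();
    }
}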
Step 5: -
We will write a sample Java program to feed this data into the Kafka topic. In fact, the logic for making data available to Kafka can be written in any program; our main focus is getting the data from Kafka into Spark. After this step, build a jar. If you are using Eclipse/Maven, use the command "mvn package".
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import test.kafka.KafkaProducerSrvc;

public class FileParserSrvc {

    public static void main(String[] args) {
        /* Reference to the imported Kafka producer class */
        KafkaProducerSrvc kafkaProducer = null;
        try {
            kafkaProducer = new KafkaProducerSrvc("myTopic");
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        try {
            File file = new File("test.txt");
            FileReader fileReader = new FileReader(file);
            BufferedReader bufferedReader = new BufferedReader(fileReader);
            String line;
            /* Send the file contents to the Kafka topic line by line */
            while ((line = bufferedReader.readLine()) != null) {
                kafkaProducer.sendMessege(line);
            }
            bufferedReader.close();
            fileReader.close();
            System.out.println("Contents of file sent to Kafka.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Step 6: -
We have already seen how the Kafka producer sends the input line by line to the consumer; here, the consumer is Spark Streaming. Spark Streaming accepts input in batch intervals (for example, 10 seconds) and groups the input received in each interval into a batch. The Spark engine then processes each batch and sends the output further down the pipeline.
The Spark Streaming consumer can be written in any language such as Scala, Java, Python, etc.; we write our program in Scala.
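Our Scala program is not reproduced here, but as an illustration of the same consumer logic, below is a minimal sketch using Spark Streaming's Java API (spark-streaming-kafka-0-10). The class name, topic, group id, and broker address are assumptions and should be adapted to your setup.
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SparkStreamingConsumer {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("KafkaSparkIntegration");
        /* 10-second batch interval, as described in Step 6 */
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); /* assumed broker address */
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-consumer-group");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        Collection<String> topics = Arrays.asList("myTopic");

        /* Subscribe Spark Streaming directly to the Kafka topic */
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        /* Print each incoming line of every batch */
        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}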
Step 7: -
To test it out, build a jar for the complete project. If you are using sbt, go to the project root directory and run the command "sbt assembly", which will create the jar file.
Transfer the jar file to the cluster/server where the Kafka server, ZooKeeper server, and Spark services are running, and then run the following command to launch it: -
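The exact spark-submit invocation depends on your project; a typical command might look like the following, where the class name, package version, and jar path are placeholders:
spark-submit --class com.example.SparkStreamingConsumer --master yarn --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 <path-to-assembly-jar>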
After this step is executed, we can see each input line arriving at the Spark Streaming end and getting printed line by line.
Hope this tutorial helps.