SPARK AND KAFKA INTEGRATION

This write-up describes the integration between Kafka and Spark. I will briefly touch upon the basics of Kafka and Spark and then show how they can be integrated.

Kafka: - Kafka is a pub-sub solution, where a producer publishes data to a topic and a consumer subscribes to that topic to receive the data. Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and very fast.

Spark: - Spark is an ecosystem that provides many components such as Spark Core, Spark Streaming, Spark SQL, Spark MLlib, etc. These components compute results by building RDDs and keeping them in RAM, which makes them very fast.

The basic integration between Kafka and Spark is covered all over the digital universe. This blog, however, shows an integration where the Kafka producer is customized to feed its output to Spark Streaming, which works as the consumer.

This integration will be used in multiple use cases in the blogs that follow. The data used in this blog is dummy data, meant only to depict the integration between Kafka and Spark.

In the follow-up blogs, we will work on real data.

Sample data

Schema of data –

Name – Name of the employee

Age – Age of the employee

Place – Working location of the employee

Salary – Salary of the employee

Department – The department in which he/she works


Sample rows/records of data look like: -

("Ayaansh",20,"Kolkata",40000,"IT")

("Bahubali",30,"Hyderabad",60000,"Analytics")

("Curious",40,"Bangalore",90000,"Data science")

("Dear",38,"Mumbai",100000,"Business consultant")


Step 1: -

Create a Kafka topic: - kafka/bin/kafka-topics.sh --zookeeper <IP:port> --create --topic <topic_name> --partitions <numeric value> --replication-factor <numeric value>
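For example, assuming ZooKeeper is running locally on port 2181 and we want a single-partition topic named myTopic (the topic used by the producer code later in this post), the command would look like:

kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic myTopic --partitions 1 --replication-factor 1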



Step 2: -

Verify whether the created topic exists: - kafka/bin/kafka-topics.sh --zookeeper <IP:port> --list
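For example, with the same local ZooKeeper as above:

kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list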



Step 3: -

Start the Kafka server: - /kafka/bin/kafka-server-start.sh /kafka/config/server.properties

Step 4: -

We will write a customized Kafka producer class that accepts the data and acts as the producer. Spark can then fetch data from this producer. A sketch of such a class is shown below.

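The original screenshot is not available here, so the following is a minimal sketch of what such a KafkaProducerSrvc class might look like, assuming the standard kafka-clients producer API and a broker at localhost:9092 (an assumption; replace it with your cluster's <IP:port>). The package, class name, and sendMessege method follow the Java program in the next step.

package test.kafka;

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerSrvc {

    private final Producer<String, String> producer;
    private final String topic;

    public KafkaProducerSrvc(String topic) {
        this.topic = topic;

        Properties props = new Properties();
        // Broker address is an assumption; point this at your own cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        this.producer = new KafkaProducer<>(props);
    }

    // Publish one line of input to the topic (spelling matches the caller below)
    public void sendMessege(String message) {
        producer.send(new ProducerRecord<>(topic, message));
    }

    // Flush pending records and release the producer
    public void close() {
        producer.close();
    }
}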



Step 5: -

We will write a sample Java program to feed this data into the Kafka topic. In fact, we could write this logic in any program that makes the data available to Kafka; our main focus is to get the data from Kafka into Spark. After this step, build a jar. If you are using Eclipse/Maven, use the command "mvn package".

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import test.kafka.KafkaProducerSrvc;

public class FileParserSrvc {

    public static void main(String[] args) {

        /* Reference to the Kafka producer class, which is imported */
        KafkaProducerSrvc kafkaProducer = null;

        try {
            kafkaProducer = new KafkaProducerSrvc("myTopic");
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        try {
            // Read the sample data file and send it to Kafka line by line
            File file = new File("test.txt");
            FileReader fileReader = new FileReader(file);
            BufferedReader bufferedReader = new BufferedReader(fileReader);
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                kafkaProducer.sendMessege(line);
            }
            bufferedReader.close();
            System.out.println("Contents of file sent to Kafka.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Step 6: -

We have already seen the Kafka producer that sends the input line by line to the consumer. Here, the consumer is Spark Streaming. Spark Streaming accepts the input in batch intervals (e.g., 10 seconds) and groups the input received during each interval into a batch. The Spark engine then works on this batch of data and sends the output further down the pipeline.

The Spark Streaming consumer can be written in any supported language such as Scala, Java, or Python; here we write our program in Scala. A sketch is shown below.

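The consumer code was shown as a screenshot that is not available here, so the following is a minimal sketch of what the test.FromKafkaToSparkStreaming object could look like. It assumes the spark-streaming-kafka-0-10 integration and a broker at localhost:9092 (both assumptions), subscribes to the myTopic topic used by the producer, and simply prints each received line of every 10-second batch.

package test

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object FromKafkaToSparkStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FromKafkaToSparkStreaming")
    // 10-second batch interval, as described above
    val ssc = new StreamingContext(conf, Seconds(10))

    // Kafka consumer settings; the broker address is an assumption
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-consumer",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Subscribe to the topic the producer writes to
    val topics = Array("myTopic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    // Print each line of every batch
    stream.map(record => record.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}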


Step 7: -

To test it out, build a jar for the complete project. If you are using sbt, go to the project root directory and run the command "sbt assembly", which will create the jar file.

Transfer the jar files to the cluster/server where the Kafka server, ZooKeeper server, and Spark services are running, and then run the following commands:

  • To run the Kafka producer (Java program): java -jar KafkaProducerSrvc-1.0.jar
  • To run the Spark Streaming program: /spark/bin/spark-submit --class test.FromKafkaToSparkStreaming --master <node_address> FromKafkaToSparkStreaming-1.0.jar

After this step is executed, we can see each input line arriving at the Spark Streaming end and getting printed line by line.


Hope this tutorial helps.
