SPARK AND KAFKA INTEGRATION
This write-up describes the integration between Kafka and Spark. I will briefly touch upon the basics of Kafka and Spark and then show how they can be integrated.
Kafka: - Kafka is a pub-sub solution, where a producer publishes data to a topic and a consumer subscribes to that topic to receive the data. Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and very fast.
Spark: - Spark is an ecosystem that provides many components such as Spark Core, Spark Streaming, Spark SQL, Spark MLlib, etc. These components compute results by building RDDs and keeping them in RAM, which makes processing very fast.
Basic integration between Kafka and Spark is described in many places. This blog shows an integration where a customized Kafka producer feeds data to Spark Streaming, which acts as the consumer.
This integration will be reused in multiple use cases in the blogs that follow. The data used here is dummy data, intended only to illustrate the integration between Kafka and Spark.
In the follow-up blogs, we will work on real data.
Sample data
Schema of data –
Name – Name of the employee
Age – Age of the employee
Place – Work location of the employee
Salary – Salary of the employee
Department – The department in which he/she works
Sample rows/records of data look like: -
("Ayaansh",20,"Kolkata",40000,"IT")
("Bahubali",30,"Hyderabad",60000,"Analytics")
("Curious",40,"Bangalore",90000,"Data science")
("Dear",38,"Mumbai",100000,"Business consultant")
Step 1: -
Create a Kafka topic: - kafka/bin/kafka-topics.sh --zookeeper <IP:port> --create --topic <topic_name> --partitions <numeric value> --replication-factor <numeric value>
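For example, assuming ZooKeeper is running locally on the default port, the topic used later in this blog could be created as follows (all values are placeholders to adapt to your setup):
kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic myTopic --partitions 1 --replication-factor 1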
Step 2: -
Verify whether the created topic exists: - kafka/bin/kafka-topics.sh --zookeeper <IP:port> --list
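For example: kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list. This should print myTopic among the existing topics.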
Step 3: -
Start the Kafka server: - /kafka/bin/kafka-server-start.sh /kafka/config/server.properties
Step 4: -
We will write a customized Kafka producer class that accepts data and acts as the producer; Spark can then fetch the data published by this producer.
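A minimal sketch of what such a producer class (test.kafka.KafkaProducerSrvc, referenced in Step 5) might look like is shown below, using the standard org.apache.kafka.clients producer API. The broker address, serializer choices, and the sendMessege method name are assumptions made to match the call in Step 5, not the blog's actual code.
package test.kafka;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
/* Minimal sketch of a customized Kafka producer; configuration values are assumptions */
public class KafkaProducerSrvc {
    private final Producer<String, String> producer;
    private final String topic;
    public KafkaProducerSrvc(String topic) {
        this.topic = topic;
        Properties props = new Properties();
        /* Assumed broker address; replace with your Kafka broker(s) */
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }
    /* Publishes one line of input to the topic (method name matches the call in Step 5) */
    public void sendMessege(String message) {
        producer.send(new ProducerRecord<>(topic, message));
    }
    public void close() {
        producer.close();
    }
}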
Step 5: -
We will write a sample Java program to feed this data into the Kafka topic. In fact, the logic for making data available to Kafka can be written in any program; our main focus is getting the data from Kafka into Spark. After this step, build a jar. If you are using Eclipse/Maven, use the command "mvn package".
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import test.kafka.KafkaProducerSrvc;

public class FileParserSrvc {

    public static void main(String[] args) {
        /* Reference to the imported Kafka producer class */
        KafkaProducerSrvc kafkaProducer = null;
        try {
            kafkaProducer = new KafkaProducerSrvc("myTopic");
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        try {
            File file = new File("test.txt");
            FileReader fileReader = new FileReader(file);
            BufferedReader bufferedReader = new BufferedReader(fileReader);
            String line;
            /* Send the file contents to the Kafka topic line by line */
            while ((line = bufferedReader.readLine()) != null) {
                kafkaProducer.sendMessege(line);
            }
            bufferedReader.close();
            fileReader.close();
            System.out.println("Contents of file sent to Kafka.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Step 6: -
We have already seen how the Kafka producer sends the input line by line to the consumer; here, the consumer is Spark Streaming. Spark Streaming accepts input in batch intervals (for example, 10 seconds) and groups the input received in each interval into a batch. The Spark engine then processes each batch and sends the output further down the pipeline.
The Spark Streaming consumer can be written in any language such as Scala, Java, Python, etc.; we write our program in Scala.
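Our Scala program is not reproduced here, but as an illustration of the same consumer logic, below is a minimal sketch using Spark Streaming's Java API (spark-streaming-kafka-0-10). The class name, topic, group id, and broker address are assumptions and should be adapted to your setup.
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SparkStreamingConsumer {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("KafkaSparkIntegration");
        /* 10-second batch interval, as described in Step 6 */
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); /* assumed broker address */
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-consumer-group");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        Collection<String> topics = Arrays.asList("myTopic");

        /* Subscribe Spark Streaming directly to the Kafka topic */
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        /* Print each incoming line of every batch */
        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}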
Step 7: -
To test it out, build a jar for the complete project. If you are using sbt, go to the project root directory and run the command "sbt assembly", which will create the jar file.
Transfer the jar file to the cluster/server where the Kafka server, ZooKeeper server, and Spark services are running, and then run the following command to launch it: -
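The exact spark-submit invocation depends on your project; a typical command might look like the following, where the class name, package version, and jar path are placeholders:
spark-submit --class com.example.SparkStreamingConsumer --master yarn --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 <path-to-assembly-jar>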
After this step is executed, we can see each input line arriving at the Spark Streaming end and getting printed line by line.
Hope this tutorial helps.