Spark Streaming - Part 1
Ankur Ranjan
Building DatosYard | YouTube (100k) - The Big Data Show | Software Engineer by heart, Data Engineer by mind
A few months back, I was handed a codebase that used Spark Streaming and was written in Scala. We were supposed to make major changes and move to a new version of the project. I had worked with Spark Structured Streaming before, but not with Spark Streaming. I found that, since Spark Streaming is a legacy project, the Apache Software Foundation has stopped adding new features to it. As a result, there are very few good articles on Spark Streaming that explain its functionality, problems and use cases in layman's terms and in good depth.
I think understanding Spark Streaming is a good way to enter the streaming world, as it is very easy to learn. It is also written at a lower level of abstraction, which makes it easier to visualize the concepts of streaming. So let's start learning Spark Streaming in very plain terms through this series of articles.
This article assumes that you know the basic terms of Spark and have written some batch-processing code using Spark.
What is Spark Streaming?
Spark Streaming is an extension of the core Spark API that processes live data streams by dividing the incoming data into small batches and processing each batch with the familiar RDD API.
But Ankur, how do we understand this flow of live data and the Spark Streaming concept in simpler terms?
There are a few terms one has to understand to follow the flow of live data with Spark Streaming: batch duration, messages (records), RDDs and DStreams.
Let's try to understand these using one of my experiences. I started my career at a start-up in Mumbai (India), where I used to see people collecting water from the public water system every day.
Suppose the Bombay Municipal Corporation (BMC) allowed the tap to run for the whole day; we could then consider it a producer producing data continuously. This is like a streaming use case where data flows continuously.
Now suppose we consider a tap running time of 1 hour for our calculation, and a scenario where every person is given 2 minutes to fill their bucket.
Let's visualize this situation as a streaming use case. Here the 2 minutes will be treated as the batch duration and every bucket as one RDD. Droplets of water will be considered messages, or records.
As we know, the flow of water might not be the same every time, so some buckets may contain more water and some less. In the same way, in streaming, some RDDs might have more messages and some fewer: in one batch duration we might receive many messages, whereas in the next batch duration we might receive fewer.
Now, as the tap runs for 1 hour and the batch duration is 2 minutes, 30 people can fill their 30 buckets.
Here the group of buckets can be considered a DStream and each bucket one RDD.
So basically,
DStream -> RDD -> Messages (entities)
So here the DStream has 30 RDDs, and each RDD can hold many messages. When we use Spark Streaming, operations are expressed on the DStream, and any operation on a DStream is applied to all the underlying individual RDDs.
So underneath, Spark still works in a batch style; the batch duration is usually so small that we get the feeling of real-time streaming.
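To make this concrete, here is a minimal sketch, not the full example we build below: it assumes a StreamingContext named ssc already exists, and shows that a transformation is written once against the DStream while Spark re-applies it to each underlying RDD as batches arrive.

// Illustration only -- assumes `ssc` (a StreamingContext) already exists.
val lines = ssc.socketTextStream("localhost", 9000)   // DStream[String]
val upper = lines.map(_.toUpperCase)                   // applied to every RDD

// foreachRDD exposes the individual RDDs, one per batch duration
upper.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} messages")
}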
I hope my enthusiastic readers now understand the meaning of batch duration, messages, RDD and DStream.
A Quick Example
Before we dig deeper into the concepts of Spark Streaming, let's look at one quick piece of code. I think a word count example is a really good way to start with any computing engine.
For our word count example, we will simulate the producer using our console.
Socket => IP Address + Port number => { localhost + 9000}
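If you want to follow along, one simple way to open such a socket is netcat: running nc -lk 9000 in a terminal (-l listens on the port, -k keeps it open across connections) turns that terminal into our producer, and every line you type becomes a record in the stream. The exact flags can differ slightly between netcat variants, so check yours if it complains.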
Let's try to look at the code, and the comments in the snippet, to understand the flow.
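One practical note before the code: to compile and run it yourself, you only need the Spark SQL and Spark Streaming artifacts on the classpath. A minimal build.sbt sketch might look like the following; the Scala and Spark versions here are illustrative, so use whatever matches your environment.

// build.sbt -- versions are illustrative, pick the ones matching your setup
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"       % "3.3.2",
  "org.apache.spark" %% "spark-streaming" % "3.3.2"
)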
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

// Wrapping the snippet in an object so it compiles as a standalone application
object WordCount {

  val BATCH_INTERVAL_TIME = 5

  def main(args: Array[String]): Unit = {
    // Initiating the Spark session, which will further help us to start
    // the Spark Streaming context
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("theBigDataShow.com")
      .getOrCreate()

    // Setting the Spark log level to ERROR. This helps to debug in the console.
    spark.sparkContext.setLogLevel("ERROR")

    // Start the streaming context. This is the entry point for Spark
    // Streaming. Here the batch interval is 5 seconds, which means
    // a new RDD will be created every 5 seconds.
    val sc: SparkContext = spark.sparkContext
    val ssc = new StreamingContext(
      sc,
      batchDuration = Seconds(BATCH_INTERVAL_TIME)
    )

    // One can use checkpointing, but as this is a very basic example I
    // have just commented it out.
    // ssc.checkpoint("./checkpointing")

    // Using the streaming context, we can create a DStream that
    // represents streaming data from a TCP source,
    // specified as a hostname (e.g. localhost) and port (e.g. 9000).
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream(
      "localhost", 9000
    )

    // This `lines` DStream represents the stream of data that
    // will be received from the data server.
    // Each record in this DStream is a line of text.
    // Next, we want to split the lines by space characters into words.
    val words: DStream[String] = lines.flatMap(x => x.split(" "))

    // flatMap is a one-to-many DStream operation that creates a
    // new DStream by generating multiple new records from each
    // record in the source DStream. In this case, each line will
    // be split into multiple words and the stream of words is
    // represented as the `words` DStream. Next, we want to count
    // these words.
    val wordsFrequencyMap: DStream[(String, Int)] = words.map(w => (w, 1))

    // Count each word in each batch
    val wordsCounts: DStream[(String, Int)] = wordsFrequencyMap
      .reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this
    // DStream to the console
    wordsCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
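To run it, first start the netcat producer (nc -lk 9000), then launch the application. Typing a few words into the netcat terminal should produce output roughly like the following for each 5-second batch; the timestamp is, of course, just illustrative.

-------------------------------------------
Time: 1696000000000 ms
-------------------------------------------
(hello,2)
(spark,1)
(streaming,1)

Note that the counts are per batch: to keep a running total across batches you would need stateful operations such as updateStateByKey, which in turn require checkpointing to be enabled.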
In this article, we have tried to build the foundation of Spark Streaming, i.e. DStreams, RDDs and messages/entities/records.
In the next article of this series, we will try to understand some more concepts.
I hope this small article sparks some interesting thoughts with respect to Spark Streaming. Let's meet in the next article to discuss these topics.
Feel free to subscribe to my YouTube channel, i.e. The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.
More so, thank you for the most precious gift you can give me as a writer, i.e. your time.
Next Article Link: https://www.dhirubhai.net/pulse/spark-streaming-session-2-ankur-ranjan/