Spark Streaming - Part 1

A few months back, I was given a codebase that used Spark Streaming and was written in Scala. We were supposed to make major changes and build the new version of the project. I had worked with Spark Structured Streaming before, but not with Spark Streaming. I found that, as Spark Streaming is a legacy project, the Apache Software Foundation has stopped releasing further updates for it. Hence there are very few good articles on Spark Streaming that talk about its functionality, problems and use cases in layman's terms and in good depth.

I think that understanding Spark Streaming is a good way to enter the streaming world, as it is very easy to learn. It also sits at a lower level of abstraction, which makes it much easier to visualize the concepts of streaming. So let's start learning Spark Streaming in very simple terms through this series of articles.

This article assumes that you know the basic terms of Spark and have written some code for batch processing using Spark.

What is Spark Streaming?

  • Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Data can be ingested from different sources like sockets, file sources, Kafka, Kinesis etc. and can be processed with high-level functions like map, reduce, join etc. of Spark Streaming.


  • Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
  • Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
  • A DStream is nothing but a continuous stream of RDDs (Spark's core abstraction). One can think of it as a Seq[RDD]. Every RDD in a DStream contains data from a certain interval.
  • Any operation on a DStream applies to all the underlying RDDs. The DStream abstraction hides these details and provides the developer with a convenient high-level API.

But Ankur, how do we understand this flow of live data and the Spark Streaming concept in simpler terms?

There are a few terms that one has to understand in order to follow the flow of live data with Spark Streaming.

  1. Batch duration
  2. DStream
  3. RDD
  4. Records/ Messages

Let's try to understand this using one of my experiences. I started my career at a start-up in Mumbai (India), and I used to see people collecting water from the public water system every day.

Let's suppose the Bombay Municipal Corporation (BMC) allowed the tap to run for the whole day. Then we can consider it a producer producing data continuously. This is like a streaming use case where data flows continuously.

Now let's suppose we consider the tap's running time to be 1 hour for our calculation, and consider a scenario where every person is given 2 minutes to fill their bucket.

Let's visualize the above situation as a streaming use case. Here, 2 minutes will be treated as the batch duration and every bucket as one RDD. The droplets of water will be considered the messages or records.

As we know, the flow of water might not be the same every time, so some buckets may contain more water and some might contain less. In the same way, in streaming some RDDs might have more messages and some might have fewer. In one batch duration we might receive more messages, whereas in the next batch duration we might receive fewer.

Now, as the tap runs for 1 hour and the batch duration is 2 minutes, it means that 30 people can fill their 30 buckets.

Here the group of buckets can be considered the DStream and each bucket one RDD.

So basically,

DStream -> RDD -> Messages (entities)

So here we have 30 RDDs, and each RDD can have lots of messages, but when we use Spark Streaming, operations are expressed on the DStream. Any operation on a DStream applies to all the underlying individual RDDs.

So underneath, Spark still works in a batch style; usually the batch duration is so small that we get the feeling of real-time streaming.
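
If you would like to see this "sequence of RDDs" idea in code before we touch a live source, below is a minimal sketch. It uses queueStream, an input DStream that Spark Streaming provides mainly for testing, to hand-build a DStream out of two RDDs. The object and variable names (DStreamAsRddSequence, rddQueue, numbers) are just my own choices for illustration, and the snippet assumes the usual spark-sql and spark-streaming dependencies are on the classpath.

import scala.collection.mutable

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative example: a DStream seen as a hand-built sequence of RDDs.
object DStreamAsRddSequence {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dstream-as-a-sequence-of-rdds")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Batch duration of 2 seconds: every 2 seconds, one RDD is taken
    // from the queue and treated as "the data of that batch".
    val ssc = new StreamingContext(spark.sparkContext, Seconds(2))

    // A hand-built queue of RDDs. Each RDD plays the role of one
    // "bucket" from the water analogy above.
    val rddQueue = mutable.Queue[RDD[Int]](
      spark.sparkContext.parallelize(1 to 5),
      spark.sparkContext.parallelize(6 to 10)
    )
    val numbers = ssc.queueStream(rddQueue)

    // One transformation written against the DStream is applied to
    // every underlying RDD, batch after batch.
    numbers.map(_ * 10).print()

    ssc.start()
    // Stop automatically after ~10 seconds so the demo terminates.
    ssc.awaitTerminationOrTimeout(10000)
    ssc.stop(stopSparkContext = true)
  }
}

When run locally, each 2-second batch should print the contents of one hand-built RDD multiplied by 10; once the queue is exhausted, the later batches are simply empty. This is exactly the DStream -> RDD -> messages hierarchy from the bucket analogy, just with hand-made RDDs instead of a live tap.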

I hope my enthusiastic readers now understand the meaning of batch duration, messages, RDD and DStream.

A Quick Example

Before we discuss the concepts of Spark Streaming any further, let's look at one quick piece of code. I think a word count example is a really good way to start with any computing engine.

For our word count example, we will simulate the producer from our console.

Socket => IP Address + Port number => { localhost + 9000}

  • Step 1: Open your console or terminal and use the nc -lk 9000 command to simulate producing data.
  • Step 2: Now that our producer is ready, we have to start our consumer application to listen for the data. Here, Spark Streaming will act as the consumer.
  • Step 3: In Spark Streaming, one has to start a StreamingContext. StreamingContext is the main entry point for all streaming functionality. Here, we will create a local StreamingContext (the example code runs with master local[*], i.e. all available local cores) and a batch interval of 5 seconds.
  • Step 4: Then we will create a DStream that connects to hostname:port, i.e. localhost:9000.

Let's try to look into the code and comments in the code snippet for understanding the flow.


import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object WordCountStreaming {

  val BATCH_INTERVAL_TIME = 5

  def main(args: Array[String]): Unit = {

    // Initiating the Spark session, which will further help us to
    // create the Spark Streaming context.
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("theBigDataShow.com")
      .getOrCreate()

    // Setting the Spark log level to ERROR. This keeps the console
    // output clean and makes debugging easier.
    spark.sparkContext.setLogLevel("ERROR")

    // Start the streaming context. This is the entry point for Spark
    // Streaming. In our code the batch interval is 5 seconds, which
    // means a new RDD will be created every 5 seconds.
    val sc: SparkContext = spark.sparkContext

    val ssc = new StreamingContext(
      sc,
      batchDuration = Seconds(BATCH_INTERVAL_TIME)
    )

    // One can use checkpointing, but as this is a very basic example
    // I have just left it commented out.
    // ssc.checkpoint("./checkpointing")

    // Using the streaming context, we can create a DStream that
    // represents streaming data from a TCP source, specified as a
    // hostname (e.g. localhost) and a port (e.g. 9000).
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream(
      "localhost", 9000
    )

    // This lines DStream represents the stream of data that will be
    // received from the data server. Each record in this DStream is a
    // line of text. Next, we want to split the lines by spaces into
    // words.
    val words: DStream[String] = lines.flatMap(x => x.split(" "))

    // flatMap is a one-to-many DStream operation that creates a new
    // DStream by generating multiple new records from each record in
    // the source DStream. In this case, each line is split into
    // multiple words and the stream of words is represented as the
    // words DStream. Next, we want to count these words.
    val wordsFrequencyMap: DStream[(String, Int)] = words.map(w => (w, 1))

    // Count each word in each batch.
    val wordsCounts: DStream[(String, Int)] = wordsFrequencyMap
      .reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this
    // DStream to the console.
    wordsCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
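
To try this out (assuming you package the snippet above as a small sbt or Maven project with the spark-core, spark-sql and spark-streaming dependencies), keep nc -lk 9000 running in one terminal and start the application in another. Every line you type into the netcat terminal is picked up in the next 5-second batch, and the Spark console prints the word counts for that batch; for example, typing "hello world hello" within one batch should produce pairs roughly like (hello,2) and (world,1).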


In this article, we have tried to build the foundation of Spark Streaming, i.e. DStream, RDD and messages/entities/records.

In the next article of this series, we will try to understand some more concepts like:

  • Stateless and stateful transformations
  • We will understand the difference between stateless and stateful transformations by using the updateStateByKey and reduceByKey transformations.
  • The concepts of window size and sliding interval
  • We will use reduceByKeyAndWindow and reduceByWindow to understand the window size and sliding interval concepts.
  • Joins in streaming, like stream-stream joins and stream-dataset joins
  • Output operations in Spark Streaming

I hope this small article sparks some interesting thoughts w.r.t. Spark Streaming. Let's meet in the next article to discuss the above topics.


Feel free to subscribe to my YouTube channel, The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.

More so, thank you for giving me, as a writer, that most precious gift: your time.

Next Article Link: https://www.dhirubhai.net/pulse/spark-streaming-session-2-ankur-ranjan/