An interview with Apache Spark

Interviewer: So, Mr. Spark, let's cut right to the chase. What would you say are the two most important things that anyone wanting to know you MUST familiarize themselves with?

Apache Spark: Well, I guess that would be the RDD, or Resilient Distributed Dataset, and the DAG, or Directed Acyclic Graph.

Interviewer: Hmm, and would you like to explain to our audience what these two actually are and why they are so important?

Spark (clearing his throat): Well, an RDD is actually what you would call a Data Structure, which means that, just like Arrays, Linked Lists, Stacks or Queues, an RDD also holds a collection of members which represent user data. It is split into chunks called partitions, and these partitions are spread out over a cluster of machines. This is why the RDD is called Distributed. On top of this, the RDD has been conceptually designed to be fault-tolerant. So, if something goes wrong somewhere while working with an RDD, it should be possible to recover from such failures. Essentially, we are talking about an auto-healing mechanism, and that is what builds Resilience into it.

Lastly, an RDD lives inside the memory, or RAM, of the machines in a cluster. It is an in-memory data structure. I was telling you that an RDD is chunked up into partitions. All these partitions reside inside the memory of the cluster machines in a distributed manner.
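(Spark grabs a napkin and jots down a quick PySpark sketch; a minimal illustration assuming a local setup, with the app name, data and partition count made up purely for demonstration.)

```python
from pyspark.sql import SparkSession

# Illustrative local session; on a real cluster the partitions would live
# in the memory of many worker machines instead of one laptop.
spark = SparkSession.builder.appName("rdd-intro").master("local[*]").getOrCreate()
sc = spark.sparkContext

# parallelize() turns a local collection into an RDD split into 4 partitions.
numbers = sc.parallelize(range(1, 101), numSlices=4)

print(numbers.getNumPartitions())  # 4 -> the chunks spread across the cluster
print(numbers.take(5))             # [1, 2, 3, 4, 5]

spark.stop()
```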

Interviewer: Why are you stressing the in-memory bit so much?

Spark: For a couple of important reasons. I am faster than MapReduce precisely because I am an in-memory computational framework. MapReduce stores a lot of things on HDFS during data processing. This increases chatter with the file system, and too much chatter with a disk-based file system is never a good thing. It hampers performance and speed of execution. I bypass this chatter to a great extent, because I essentially do most of my stuff inside memory and talk with the file system only occasionally, not as frequently as MapReduce does. Secondly, because I am mostly in-memory, my memory consumption is also going to be huge. One needs to be aware of this while planning the infrastructure of a cluster.

Interviewer: Ok, noted. And what's this that I hear about RDDs being immutable or something like that?

Spark: Well yeah. This immutability ties into the RDD's fault-tolerant nature. Imagine you have to get from point A to point B and you have a map to guide you along the way. Now suppose you cross 5 or 6 landmarks on that map and suddenly lose your way. What would you do then?

Interviewer: I suppose I'd retrace my steps back to the last visited landmark and try to figure out where to go from there.

Spark: And then, you would want the landmarks to remain unchanging and constant, would you not? It wouldn't be helpful if a park you passed through suddenly no longer existed and in its place there was a parking lot which didn't allow entry to anybody without an entry sticker.

Interviewer: Of course not. That'd make the map practically useless.

Spark: Exactly. Remember I was telling you about the DAG, or Directed Acyclic Graph. A DAG is like this map. Because it is a graph, it has nodes and edges. The nodes are my RDDs. Your landmarks, so to speak. And the edges represent the various operations or changes I am trying to apply to them during my data processing journey. If something bad happens on this journey, like some machine fails or some part of the workflow errors out, how would one recover from it or retrace one's steps? The map, or the DAG, is only useful in this case if the nodes don't change after they have been created. So the node, or in other words the RDD, must not change after it has been created, and that is exactly what makes it immutable.

Interviewer (stroking his chin): Interesting. So what you're essentially saying is that the DAG is like a chain of operations on my data, and each link in the chain is a new RDD which comes into existence because of some operation which takes place on some previous RDD?

Spark: I'd prefer to think of the DAG more as a picture of the chain, rather than the actual chain itself. An actual chain is going to have a physical form. It has to be created from physical materials. Things can go wrong while creating that chain. And whenever things go wrong, one can always look at the picture, go back to the last link which is working perfectly, and start over from there. That's why I had earlier referred to the DAG as a map. It tells you how to go from point A to B, but it is not the actual journey. It's just a diagram of what the journey would look like.
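(Spark reaches for the napkin again. A hedged PySpark sketch of peeking at that map; the variable names and data are illustrative.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-peek").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd_a = sc.parallelize(range(10))           # landmark 1
rdd_b = rdd_a.map(lambda x: x * 2)          # landmark 2: a brand new RDD; rdd_a is untouched
rdd_c = rdd_b.filter(lambda x: x > 5)       # landmark 3: again a new, immutable RDD

# toDebugString() prints the lineage, i.e. the "map" the driver keeps so that
# rdd_c can be recomputed from scratch if a partition is lost.
lineage = rdd_c.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

spark.stop()
```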

Interviewer: Hmm, well I guess I now understand at least conceptually what an RDD is and what a DAG represents. But who creates this DAG?

Spark: Before I answer that, let me just attempt to make you understand the complexity and scale of the work at hand for us. Say you need a 1000 by 1000 foot painting drawn up, and you have 50 or so artists at your disposal to do this for you. How would you go about doing this? How will you divide the responsibility among the artists?

Interviewer: Well, I...well, this is a tough one. Let me ponder on this for a minute.

(long pause)

I guess, I will first list down the exact set of steps needed to draw the painting. This itself is not going to be that easy, but it must be done as a first step, I guess. I have to figure out which steps can be broken up into even smaller pieces and how these pieces can be done in parallel by multiple artists. I'll also need to look into which steps have a dependency among them, as in, Step P has got to complete before Step Q and that sort of thing. I shall have to factor in which part of the painting has to be situated where with respect to other parts of the painting. I'll always have to keep in mind the big picture, so to speak. Otherwise, it is going to be a complete mess and waste of effort. By the way, that pun on the word 'picture' was purely accidental.

(takes a pause again, his forehead furrows and eyebrows knit together in concentration)

Then I shall assign the steps to the individual artists. I shall keep track of how the steps are progressing. As and when the artists complete some steps and are freed up, I shall assign them more steps. I can already imagine that the amount of co-ordination this is gonna take is going to be ginormous. Also, there is a fair amount of scheduling involved. I have to make a lot of decisions about who gets to work on what and when. I possibly can't do all of it on my own. It'd be wonderful to have someone who tells me when my artists are occupied and when they are free. That way, allocation and deallocation becomes a lot easier to manage.

And eventually, fingers crossed, I hope the painting gets done.

Spark: Nice. Nice. And what if one or more of your artists fall sick or just desert you midway? How do you handle that?

Interviewer: What?! Are you serious? I am supposed to take even those kinds of things into account?

Spark (smiling): Absolutely.

Interviewer (holding his head): I guess I am going to come down with a migraine after going through with this. That is, if I actually do make it through.

Spark: I think you're beginning to see that handling things in a distributed manner while ensuring fault-tolerance isn't exactly a walk in the park.

Interviewer: Oh hell, no!

Spark: But let's come back to our scenario. So, we have a ringmaster, that is, you, directing others to complete the task in a distributed way, correct? And we have a master-slave kind of arrangement, don't we?

Interviewer: Yes, I guess you could say so.

Spark: I have such a ringmaster too. I call it the Driver. So, when a Spark developer develops a program and submits it to me for processing, it is my Driver who actually takes this program and runs it. Now, I'll get a bit technical. This Driver is actually a process which runs inside a JVM, and it is the process which takes charge of executing the Spark program written by the developer.

Interviewer: Ok. So the Driver is the master. What about the slaves? Where do they come into the picture? Pun not intended, of course. (chuckles)

Spark: There is a short answer to that and a long answer. Which one do you prefer?

Interviewer: The short one, for starters?

Spark: The Driver splits a Spark application into tasks and schedules them to run on the workers. The workers are the other machines in the cluster, the slaves in our master-slave configuration. The Driver coordinates the workers and the overall execution of tasks. Remember how the Driver runs inside a JVM? Similarly, on the workers, processes called Executors are spawned which also run inside JVMs. It is the Executors who actually execute the tasks on the worker nodes. So, in essence, the Driver coordinates the workers by coordinating the Executors running on them.
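(He scribbles one more napkin sketch: how a developer might ask for those Executors. A hedged example; the resource numbers are illustrative, and settings like spark.executor.instances only take effect on a real cluster manager such as YARN or Kubernetes, not in local mode.)

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-and-executors")
    .master("local[*]")                        # swap for a real cluster manager URL in production
    .config("spark.executor.instances", "4")   # how many Executor JVMs to request
    .config("spark.executor.cores", "2")       # CPU cores per Executor
    .config("spark.executor.memory", "4g")     # heap per Executor JVM
    .getOrCreate()
)

# The Driver is this very process: it owns the SparkContext and schedules the tasks.
print(spark.sparkContext.appName)

spark.stop()
```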

Interviewer: That was easy. And the long one?

Spark (smiling): You might want to grab a cup of coffee before hearing that one out.

Interviewer (rolling his eyes): THAT bad, huh? This is going to be one of those migraine inducing explanations which never seem to end, isn't it?

Spark (still smiling): Worse. Way worse.

(Fifteen minutes and a cup of coffee later)

Spark: Imagine you are in the kitchen preparing a recipe as your wife keeps issuing instructions to you. Now, say you are a lazy bloke like me and want to get this over with while spending the least effort possible. What would you do?

Interviewer: I...uh..

Spark: Instead of doing everything the moment your wife instructs you to do it, you might like to wait until her instructions have accumulated to a point where you have no choice but to take an actual physical action, for instance, when she wishes to taste what you have managed to prepare so far. Till she wishes to see or taste something, you can just keep noting down her instructions and planning how to go about executing them, instead of actually troubling your lazy self to physically do something about the cooking.

Interviewer: That sounds like a great idea. Not only is it allowing me to be lazy, it also helps me to plan things better. I mean if I know ahead of time what's coming my way, I can just tweak my plan and save effort when I'm actually executing the plan.

Spark: And that's Lazy Evaluation for you. Any operation you perform on an RDD can be of one of two kinds: Transformations and Actions. Transformations are like the instructions of your wife which you can just note down and plan around without actually needing to physically do anything with the actual data. Actions are the...

Interviewer: Calls to action from my wife when I actually need to show her what I've been cooking so far.

Spark (smiling): Summed it up perfectly! Now I'll get technical. If you come right down to the level of coding, then filter(), join(), map(), union(), intersection() - these are a few examples of transformations. As Transformations are being applied to the RDD, the Driver is mentally noting everything down and creating something called the Operator Graph. The moment an Action is invoked, it's Lights, Camera, Action and everything starts rolling. A few examples of actions include count(), collect(), take(), aggregate(). The operator graph is submitted to a part of the Driver called the DAGScheduler. The DAGScheduler then...

(Spark suddenly stops speaking)

Interviewer: What happened?

Spark (smiling sympathetically): I just realized I'll have to hit you with a tsunami of technical terms from this point onwards.

Interviewer (groans): Oh boy! Let's hope the effects of the coffee don't desert me (sits upright in his chair and takes a few deep breaths). All right, bring it on!
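(Spark slides a napkin across the table first: a small PySpark sketch of the kitchen scenario. The data and names are purely illustrative; the point is that the transformations are merely noted down and nothing runs until the actions at the end.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval").master("local[*]").getOrCreate()
sc = spark.sparkContext

orders = sc.parallelize([("tea", 2), ("coffee", 5), ("tea", 3), ("cake", 1)])

# Transformations: the wife's instructions, only written down, nothing cooked yet.
hot_drinks = orders.filter(lambda kv: kv[0] in ("tea", "coffee"))
doubled = hot_drinks.map(lambda kv: (kv[0], kv[1] * 2))

# Actions: she wants a taste, so the whole recorded plan finally runs.
print(doubled.collect())   # [('tea', 4), ('coffee', 10), ('tea', 6)]
print(doubled.count())     # 3

spark.stop()
```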

Spark: See, my data is spread out in pockets across multiple machines or worker nodes. These pockets, as I mentioned previously, are called partitions. Now, when transformations are being applied to the pockets, there can be two scenarios. One kind of transformation can produce its output using nothing but the data in the very pocket it is being applied to. Think of a filter transformation. I have a data pocket P1. Whatever data I need to produce a filtered data pocket P2, which is essentially just a subset of P1, I can find in the original data pocket P1 itself. I don't need to bring in data from another pocket residing on some other machine to produce the output pocket. This is called a Narrow Transformation.

On the other hand, there is the second kind of transformation where the output cannot be produced until and unless I bring in data from multiple pockets spread out over multiple machines. Think of a group by or join transformation. As you can imagine, in this type of transformation, data needs to be moved around the network between several machines. This type of transformation is called a Wide Transformation. And this movement of data across the network is called shuffling. Shuffling is costly. If I have just a series of narrow transformations, I can do the whole execution entirely in the memory of the worker nodes where the data resides. There is also more parallelism, because each worker node can work independently with the data residing within itself. So several worker nodes can now work in parallel. But the moment shuffling is involved, I lose that luxury. I now have to write the intermediate results from memory out to disk before I proceed to shuffle the data around the network. Which means I am also slowing down because of all this overhead.
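(Another napkin: a hedged sketch contrasting a narrow transformation with a wide one. The sales figures and key names are invented for illustration.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").master("local[*]").getOrCreate()
sc = spark.sparkContext

sales = sc.parallelize(
    [("north", 10), ("south", 7), ("north", 3), ("east", 5)], numSlices=2
)

# Narrow: each partition can be filtered on its own; no data crosses the network.
big_sales = sales.filter(lambda kv: kv[1] > 4)
print(big_sales.collect())        # [('north', 10), ('south', 7), ('east', 5)]

# Wide: summing by key needs all records with the same key in one place,
# so data is shuffled across partitions (and, on a cluster, across machines).
totals = sales.reduceByKey(lambda a, b: a + b)
print(sorted(totals.collect()))   # [('east', 5), ('north', 13), ('south', 7)]

spark.stop()
```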

Interviewer: Ok, but you were talking about DAGScheduler. How come we suddenly jumped to talking about Narrow and Wide transformations and shuffling etcetera?

Spark: Getting to that in a minute. Knowing what I know about narrow transformations and how good they are for parallelism and knowing how costly shuffle operations are, I take a certain approach to sequencing or scheduling the execution flow. I try to glue together as many narrow transformations as possible in a series or pipeline and go for a shuffle only when I must. The moment I go for a shuffle, a series gets interrupted. After the shuffle happens, I again try to start a new series of narrow transformations. And then even that might get interrupted by another shuffle, and so on and so forth.

Interviewer: It seems a shuffle acts as some kind of boundary to your series or pipeline. It goes like this - you have a series, then a shuffle acts as a boundary closing off that series, then another series starts, and another shuffle boundary closes it off, a third series comes into play, ended by a third shuffle, and so on. This way, you are actually optimizing your execution plan, because you are minimizing how much data gets shuffled around.

Spark: Aptly summed up. We actually call this boundary a shuffle boundary in our circles. And this series of narrow transformations we refer to as a Stage. A shuffle signals the end of a stage. The DAGScheduler takes a DAG, breaks it down into stages and makes the stages line up one after another. A stage is where things start getting physical from a logical planning level. A stage is subdivided into Tasks. A task operates on one data pocket worth of data. Or, a task operates on one partition worth of data. Which means there will be as many tasks as there are partitions. Each task within a stage does the same set of activities. They do this in parallel and they do this on different partitions of the data.
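(One more napkin sketch, a hedged illustration of the one-task-per-partition idea and of a shuffle drawing a stage boundary; the words and counts are made up.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stages-and-tasks").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=3)
pairs = words.map(lambda w: (w, 1))              # narrow: stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide: draws a shuffle boundary

# One task per partition per stage: the map stage here runs 3 tasks,
# because the input RDD has 3 partitions.
print(words.getNumPartitions())   # 3

# Triggering the job; the Spark UI (http://localhost:4040 while this runs)
# shows it split into two stages around the shuffle.
print(sorted(counts.collect()))   # [('a', 3), ('b', 2), ('c', 1)]

spark.stop()
```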

Interviewer: Interesting. And all this starts happening when an action is invoked?

Spark: Yes. When an action is invoked, internally one or more Spark jobs are created. One Spark job splits into stages and stages split into tasks, courtesy of the DAGScheduler.

Interviewer: Action. Job. Job breaks down into Stages. Stages further break down into Tasks. The DAGScheduler is behind all this. Got it. What happens next?

Spark: The DAGScheduler hands over the stages and tasks to something called the TaskScheduler. The TaskScheduler is another component of the Driver itself. The TaskScheduler looks at each task and tries to determine where the data needed for that task resides in the cluster. For this, it reaches out to something called the Cluster Manager. The Cluster Manager informs the TaskScheduler which worker nodes the data is on, and also about the availability of resources, that is, RAM and CPU cores, on those nodes. The TaskScheduler then starts dispatching tasks to the appropriate worker nodes as informed by the Cluster Manager. If there are task failures, the TaskScheduler retries them. The TaskScheduler notifies the DAGScheduler of task start and task completion events. These events enable the DAGScheduler to keep its books updated about the stages and to track the completion status of each stage's execution, which could turn out to be a success, a partial success, or a failure. In the case of a successful stage execution, the DAGScheduler plans the execution of further dependent stages. And in the case of a partial success or a complete failure, the DAGScheduler may, in certain scenarios, plan another attempt at executing the corresponding stage.

Interviewer: Woah!

Spark (smiling): Want a pause to catch your breath?

Interviewer: So THAT is what it takes to build a 1000 by 1000 foot picture using 50 artists in a fault-tolerant manner, huh? So much planning, optimization, scheduling and tight co-ordination among so many different actors!

Spark: No kidding. But we're in the end game now. Let's wrap this up quick. The Cluster Manager resides outside of me, as in, it is not a part of me. I can work with different Cluster Managers - YARN, Mesos, Spark standalone and Kubernetes. The Cluster Manager keeps track of the resources in a cluster and allocates and deallocates them as per requests from the Driver and based on the availability of the resources themselves.
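(A last quick napkin note on how the Driver gets pointed at a Cluster Manager, through the master URL. The hostnames and ports below are placeholders, not real endpoints; only the local-mode line is actually run here.)

```python
from pyspark.sql import SparkSession

# Spark standalone:  .master("spark://master-host:7077")
# YARN:              .master("yarn")   (cluster details come from the Hadoop configuration)
# Kubernetes:        .master("k8s://https://k8s-apiserver:6443")

# Local mode needs no external cluster manager at all, which is handy for development.
spark = (
    SparkSession.builder
    .appName("pick-a-cluster-manager")
    .master("local[*]")
    .getOrCreate()
)
print(spark.sparkContext.master)   # local[*]
spark.stop()
```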

Interviewer: Okay...anything else?

Spark: All of these actors - the DAGScheduler, the TaskScheduler, etcetera - if I were to consider them the stars of a movie, then there is something called the SparkSession which is like the movie ticket that unlocks the whole experience. It is the entry point to my world, the world of Spark, and the developer must instantiate it in the Spark program which they submit to the Driver to execute. The Driver owns this SparkSession throughout the lifecycle of a Spark program. The SparkSession is actually a wrapper over other important objects like the SparkContext, SQLContext, SparkConf, HiveContext and StreamingContext which existed from the Apache Spark 1.x era. The SparkSession unifies all of the above and reduces the developer's overhead of managing separate context objects for doing different things.
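(He sketches the movie ticket itself before the interviewer can object; a minimal, illustrative example with a made-up app name and config value.)

```python
from pyspark.sql import SparkSession

# One SparkSession as the single entry point.
spark = (
    SparkSession.builder
    .appName("spark-session-demo")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "8")   # illustrative config value
    .getOrCreate()
)

# The older entry points are reachable through it instead of being built separately.
sc = spark.sparkContext                            # the SparkContext of the 1.x era
print(spark.version)
print(spark.conf.get("spark.sql.shuffle.partitions"))

df = spark.createDataFrame([(1, "tea"), (2, "coffee")], ["id", "drink"])
df.show()

spark.stop()
```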

Interviewer: Stop! Stop! I beg you. Please stop. One more technical detail and I'm sure I'll be sick!

Spark (laughing): Ah well, the ordeal's over. Lucky you. I'm done.

Interviewer (sighing in relief): Oh, thank god! Good heavens! My head's spinning now. But thanks for the explanation. I really appreciate it.

Spark (grins): Hey, my pleasure. Just hope the audience learns something from it. Fingers crossed.

Interviewer: I guess we shall see you next week then? What was the topic you said you'd be discussing then? I forgot.

Spark (smiles wider and winks): Streaming.... Structured streaming.

------------------------------------------------The End---------------------------------------------------

Or, even better....

-------------------------------------------Shuffle Boundary----------------------------------------------

[Next Stage loading.....]
