Data Locality In Spark
Spark also relies on the principle of Data Locality, which refers to how close the data sits to the code that processes it.
So what happens if you are running Spark with YARN on a Hadoop cluster? Hadoop's principle of rack-based locality comes into play.
There are several levels of data locality in a Hadoop/YARN cluster: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL and ANY, ordered from closest to farthest from the data.
Spark ideally likes to process data locally (PROCESS_LOCAL or NODE_LOCAL), which you can see from the image above (here I ran the Spark job on my own machine as a standalone application) or programmatically, as in the sketch below, but this is not the case every time.
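If you do not have the Spark UI handy, you can also watch the locality level of each task from code. Here is a minimal sketch using Spark's SparkListener API; the object name, app name, and the toy job are my own illustrative choices, while taskLocality comes from Spark's TaskInfo:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

object LocalityProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locality-probe")   // hypothetical name, not from the article
      .master("local[*]")          // standalone run on one machine, as in the article
      .getOrCreate()

    // Print the locality level (PROCESS_LOCAL, NODE_LOCAL, ...) of every finished task
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        println(s"task ${info.taskId} finished at locality ${info.taskLocality}")
      }
    })

    // Any small job will do; on a single machine tasks typically report PROCESS_LOCAL
    spark.sparkContext.parallelize(1 to 1000, numSlices = 4).count()
    spark.stop()
  }
}
```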
Spark then has a decision to make: either wait for a busy executor to free up so it can process the data locally on the same node, or immediately start the task on an idle executor elsewhere that holds no data, and move the data there for processing.
Now, when we say "wait", there is actually a timeout, which is 3 seconds by default.
The catch is that after the timeout Spark might settle for a poor location, which is a performance hit: shipping data to a remote node for processing cannot be faster than processing it locally.
Meaning PROCESS_LOCAL is faster than NODE_LOCAL, and NODE_LOCAL is faster than RACK_LOCAL.
How to overcome this situation?
Configure the spark.locality.wait parameter (default value 3 seconds) - with a longer wait, Spark keeps looking for a slot in the vicinity of the data before deciding to process on a more remote node (PROCESS_LOCAL >>> NODE_LOCAL >>> RACK_LOCAL).
Configure spark.locality.wait.node to customize the locality wait specifically for node locality. Setting it to 0 makes Spark skip node locality and fall straight through to rack locality (a configuration sketch follows below).
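Putting both knobs together, here is a minimal sketch of how these settings could look when building a session. The 10s value is illustrative, not a recommendation, and in practice you would tune one lever or the other depending on whether you want more locality or less waiting:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-tuning")  // hypothetical name
  // Wait longer than the default 3s for a closer slot before the scheduler
  // falls back a level (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY)
  .config("spark.locality.wait", "10s")
  // Zero out the node-level wait so the scheduler skips NODE_LOCAL
  // and falls straight through to RACK_LOCAL
  .config("spark.locality.wait.node", "0")
  .getOrCreate()
```

The same knobs can be passed at launch time, e.g. spark-submit --conf spark.locality.wait=10s --conf spark.locality.wait.node=0. There are also sibling settings, spark.locality.wait.process and spark.locality.wait.rack, for the other levels; all of them default to the value of spark.locality.wait.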