Data Locality In Spark

Spark also relies on the principle of Data Locality, which refers to how close the data is to the code that processes it.

So what happens if you are running Spark on YARN on a Hadoop cluster? Hadoop's principle of rack-based locality comes into play.

There are several levels of data locality in a Hadoop/YARN cluster.

  • PROCESS_LOCAL - data co-located with the code in the same JVM (ideal)
  • NODE_LOCAL - data located on the same node where processing occurs
  • NO_PREF - data with no preference for locality
  • RACK_LOCAL - data on the same rack but on a different server
  • ANY - data located on other racks

[Image: Spark UI showing the locality level of tasks for a job run locally]

Spark ideally likes to process data locally (PROCESS_LOCAL or NODE_LOCAL), which you can see in the image above (here I ran a Spark job on my system as a standalone application), but this is not the case every time.
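As a rough illustration, here is a minimal Scala sketch of the kind of standalone job I ran locally (assuming the Spark dependencies are on the classpath; the object name and values are just placeholders). After it runs, the Tasks table in the Spark UI shows the "Locality Level" column for each task.

```scala
import org.apache.spark.sql.SparkSession

// Minimal standalone job to observe task locality levels in the Spark UI.
// Running with master = local[*] keeps data and code in the same JVM,
// so tasks typically show up as PROCESS_LOCAL.
object LocalityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocalityDemo")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext

    // A simple job: parallelize some numbers and aggregate them.
    val sum = sc.parallelize(1 to 1000000, numSlices = 8)
      .map(_ * 2L)
      .reduce(_ + _)
    println(s"Sum: $sum")

    // Keep the application alive briefly so the Spark UI (http://localhost:4040)
    // can be inspected for the locality level of the completed tasks.
    Thread.sleep(60000)
    spark.stop()
  }
}
```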

Spark has a decision to make: either wait for a busy executor to free up so the data can be processed locally on the same node, or immediately start the task on an idle executor elsewhere that holds no copy of the data, moving the data there for processing.

Now when we say "wait", there is actually a timeout, which is 3 seconds by default.

After the timeout expires, Spark may fall back to a less local level, which is a performance hit, since shipping data to a remote node and processing it there cannot be faster than processing it locally.

Meaning PROCESS_LOCAL is faster than NODE_LOCAL, and NODE_LOCAL is faster than RACK_LOCAL.

How to overcome this situation?

Configure the spark.locality.wait parameter (default 3 seconds) - it controls how long Spark keeps looking for a slot in the vicinity of the data before deciding to process on a more remote node (PROCESS_LOCAL >>> NODE_LOCAL >>> RACK_LOCAL).

Configure spark.locality.wait.node to customize the wait specifically for node locality. Setting it to 0 makes Spark skip node locality and fall back to rack locality straight away.
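For example, here is a sketch of setting these properties when building a SparkSession in Scala; the numeric values are purely illustrative, not recommendations, and should be tuned for your own cluster and workload.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only - tune them for your own cluster and workload.
val spark = SparkSession.builder()
  .appName("LocalityTuningDemo")
  // Global wait before dropping to a less local level (default: 3s).
  .config("spark.locality.wait", "10s")
  // Per-level override: a node-level wait of 0 makes Spark skip node
  // locality and fall back to rack locality straight away.
  .config("spark.locality.wait.node", "0s")
  .config("spark.locality.wait.rack", "5s")
  .getOrCreate()
```

The same properties can also be passed on the command line, e.g. spark-submit --conf spark.locality.wait=10s.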


