Data Locality In Spark

Spark also relies on the principle of Data Locality, which refers to how close the data is to the code that processes it.

So what happens if you are running Spark on YARN on a Hadoop cluster? Hadoop's principle of rack-based locality comes into play.

There are several levels of data locality in a Hadoop/YARN cluster.

  • PROCESS_LOCAL - data co-located with the code in the same JVM (ideal)
  • NODE_LOCAL - data located on the same node where processing occurs
  • NO_PREF - data with no preference for locality
  • RACK_LOCAL - data on the same rack but on a different server
  • ANY - data located on other racks

[Image: Spark UI showing the locality level of tasks for a job run locally]

Spark ideally likes to process data locally (PROCESS_LOCAL or NODE_LOCAL), which you can see in the image above (here I ran a Spark job on my system as a standalone application), but this is not the case every time.
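As a rough illustration, here is a minimal Scala sketch of the kind of standalone job I ran locally (assuming the Spark dependencies are on the classpath; the object name and values are just placeholders). After it runs, the Tasks table in the Spark UI shows the "Locality Level" column for each task.

```scala
import org.apache.spark.sql.SparkSession

// Minimal standalone job to observe task locality levels in the Spark UI.
// Running with master = local[*] keeps data and code in the same JVM,
// so tasks typically show up as PROCESS_LOCAL.
object LocalityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocalityDemo")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext

    // A simple job: parallelize some numbers and aggregate them.
    val sum = sc.parallelize(1 to 1000000, numSlices = 8)
      .map(_ * 2L)
      .reduce(_ + _)
    println(s"Sum: $sum")

    // Keep the application alive briefly so the Spark UI (http://localhost:4040)
    // can be inspected for the locality level of the completed tasks.
    Thread.sleep(60000)
    spark.stop()
  }
}
```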

Spark has a decision to make: either wait for a busy executor to free up so the data can be processed locally on the same node, or immediately start the task on an idle executor elsewhere that holds no copy of the data, moving the data there for processing.

Now when we say "wait", there is actually a timeout, which is 3 seconds by default.

After the timeout expires, Spark may fall back to a less local level, which is a performance hit, since shipping data to a remote node and processing it there cannot be faster than processing it locally.

Meaning PROCESS_LOCAL is faster than NODE_LOCAL, and NODE_LOCAL is faster than RACK_LOCAL.

How to overcome this situation?

Configure the spark.locality.wait parameter (default 3 seconds) - it controls how long Spark keeps looking for a slot in the vicinity of the data before deciding to process on a more remote node (PROCESS_LOCAL >>> NODE_LOCAL >>> RACK_LOCAL).

Configure spark.locality.wait.node to customize the wait specifically for node locality. Setting it to 0 makes Spark skip node locality and fall back to rack locality straight away.
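For example, here is a sketch of setting these properties when building a SparkSession in Scala; the numeric values are purely illustrative, not recommendations, and should be tuned for your own cluster and workload.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only - tune them for your own cluster and workload.
val spark = SparkSession.builder()
  .appName("LocalityTuningDemo")
  // Global wait before dropping to a less local level (default: 3s).
  .config("spark.locality.wait", "10s")
  // Per-level override: a node-level wait of 0 makes Spark skip node
  // locality and fall back to rack locality straight away.
  .config("spark.locality.wait.node", "0s")
  .config("spark.locality.wait.rack", "5s")
  .getOrCreate()
```

The same properties can also be passed on the command line, e.g. spark-submit --conf spark.locality.wait=10s.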


