OOZIE-PYSPARK Integration in Hortonworks Data Platform (hdp/2.6.5.1153-2/hadoop/hadoop-common-2.7.3.2.6.5.1153-2)

I am writing here to explain how to work with OOZIE-PYSPARK on Hortonworks Data Platform.

I am not an expert in this, but having gone through a lot of trouble over the past few days, and with some advice from a friend, I decided to write it up here.

The most common errors that I was getting were:

ImportError: No module named pandas

ImportError: No module named pyspark

To solve this, first make sure that you are running the Oozie Spark action in 'cluster' mode with a relocatable virtualenv that is available at the same path for both the executors and the ApplicationMaster:

          --conf spark.executorEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python

          --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python  
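A quick way to sanity-check that these settings actually took effect is to print which interpreter the driver and the executors are running on. This is just a throwaway snippet of mine, not part of the workflow below, and the app name is arbitrary:

  # Sanity check: which Python binary are the driver and executors using?
  import sys
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("python_env_check").getOrCreate()
  sc = spark.sparkContext

  print("Driver python:", sys.executable)

  # Every partition should report /data/miniconda/envs/venv1/bin/python
  # if spark.executorEnv.PYSPARK_PYTHON was picked up.
  executor_pythons = (
      sc.parallelize(range(4), 2)
        .map(lambda _: __import__("sys").executable)
        .distinct()
        .collect()
  )
  print("Executor pythons:", executor_pythons)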

Secondly, in your PySpark script, add the following lines at the very beginning:

  import os
  os.environ['SPARK_MAJOR_VERSION'] = '2'
  os.environ['PYSPARK_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
  os.environ['PYSPARK_DRIVER_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'

Note: The variable that controls which Python environment Spark uses for Python applications is 'PYSPARK_PYTHON', so don't forget to set it.
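Putting both steps together, the top of test_oozie_pyspark.py ends up looking roughly like this. It is a minimal sketch assuming the same /data/miniconda/envs/venv1 environment; the tiny DataFrame at the end is only there to prove that pandas and pyspark now import cleanly:

  import os

  # Point Spark 2 and both Python processes at the virtualenv interpreter.
  os.environ['SPARK_MAJOR_VERSION'] = '2'
  os.environ['PYSPARK_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
  os.environ['PYSPARK_DRIVER_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'

  import pandas as pd                   # this import used to fail with "No module named pandas"
  from pyspark.sql import SparkSession  # and this one with "No module named pyspark"

  spark = SparkSession.builder.appName("test_oozie_pyspark").getOrCreate()

  # Regular pandas/Spark code goes here, e.g. turning a pandas DataFrame into a Spark one.
  df = spark.createDataFrame(pd.DataFrame({"id": [1, 2, 3]}))
  df.show()

  spark.stop()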

After doing everything mentioned above, my issue was fixed.

The Oozie version I am working with is 4.2.0.2.6.5.1153-2 (you can check yours with the 'oozie version' command).

The two most important files are 'test.properties' and 'test_workflow.xml' (the properties file and the workflow.xml file).

If you search Google and Stack Overflow, links like the following will pop up, and I would definitely recommend going through them for your own understanding:

Running pyspark with virtualenv

Adding pyspark_python path in oozie

how to run pyspark script with oozie

set python path for spark worker

These links also help in understanding the Oozie Spark action.

In the end, just like any other developer, you need to understand your environment and then make things work for you.

If you go to https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html, you will find that regarding PySpark in the Oozie Spark action there is just one statement: "The jar element indicates a comma separated list of jars or python files".

You should also understand this one: "To submit PySpark scripts with Spark Action, pyspark dependencies must be available in sharelib or in workflow's lib/ directory".
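If you are not sure whether your cluster meets that requirement, you can list the contents of the spark2 sharelib with 'oozie admin -shareliblist spark2'; it should include the pyspark and py4j zip files, otherwise the 'No module named pyspark' error will come back even in cluster mode.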

Sample test_workflow.xml file:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="test_workflow">

  <start to="start_spark"/>

  <action name="start_spark">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn-cluster</master>
      <mode>cluster</mode>
      <name>${spark_name}</name>
      <jar>${wfAppPath}/test_oozie_pyspark.py</jar>
      <spark-opts>--conf spark.driver.maxResultSize=3g
          --executor-memory 2G --num-executors 2 --executor-cores 2 --driver-memory 3g
          --conf spark.executorEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
          --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
      </spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Spark action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>

  <end name="end"/>

</workflow-app>

Sample test.properties file:

nameNode=hdfs://DP01
jobTracker=server4.port4
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark2
oozie.libpath=/user/oozie/share/lib/abcdef4896543
codeBaseDir=${nameNode}/tmp/test_oozie/
jdbcUrl=jdbc:hive2://server1.port1,server2.port2,server3.port3;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
jdbcPrincipal=hive/[email protected]
queueName=default
edlRecordSource='weekly_load'
edlRunId=concat('people_data',current_timestamp)
containerSize=8192
heapSize=-Xmx6553m
wfAppPath=${codeBaseDir}
workflow_dir=${nameNode}/tmp/test_oozie/
oozie.wf.application.path=${workflow_dir}/test_workflow.xml
[email protected]
metaStoreUri=thrift://server5.port5
master=yarn
mode=cluster
spark_name=test_oozie_pyspark
oozieExampleRoot=OozieE
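With test_workflow.xml uploaded to ${workflow_dir} on HDFS and the properties file kept locally, the job is submitted the usual way, for example 'oozie job -oozie <your-oozie-url> -config test.properties -run' (or export OOZIE_URL and drop the -oozie option).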

Note: we are using the Miniconda distribution, which is essentially a minimal installer for conda, containing only conda, its dependencies, and Python.

