Oozie-PySpark Integration in Hortonworks Data Platform (hdp/2.6.5.1153-2/hadoop/hadoop-common-2.7.3.2.6.5.1153-2)
I am writing this to explain how to get Oozie and PySpark working together on Hortonworks Data Platform.
I am not claiming to be an expert; I just went through a lot of trouble with this over the past few days, and a friend's advice brought me here to write it up.
The most common errors I was getting were:
ImportError: No module named pandas
ImportError: No module named pyspark
To solve this, first make sure you are running the Oozie Spark action in 'cluster' mode with a relocatable virtualenv that is the same for the executors and the ApplicationMaster (a quick way to verify this is sketched after the two options below):
--conf spark.executorEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
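To confirm that both settings actually took effect, you can run a quick diagnostic like the sketch below as the PySpark job itself. This is a minimal sketch, assuming the same venv1 interpreter path used throughout this article; every path it prints should be that interpreter.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('check_pyspark_python').getOrCreate()
sc = spark.sparkContext

# The interpreter the driver is running on.
print('Driver python    : %s' % sys.executable)

# Run a tiny job so every executor reports its own sys.executable.
executor_pythons = (sc.parallelize(range(8), 4)
                    .map(lambda _: __import__('sys').executable)
                    .distinct()
                    .collect())
print('Executor pythons : %s' % executor_pythons)

spark.stop()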
Secondly, in your Python/PySpark script, add the lines below at the very start (a complete minimal script is sketched after the note below):
- import os
- os.environ['SPARK_MAJOR_VERSION'] = '2'
- os.environ['PYSPARK_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
- os.environ['PYSPARK_DRIVER_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
Note: The variable controlling the Python environment for Python applications in Spark is 'PYSPARK_PYTHON', so don't forget to set it.
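For reference, here is a minimal sketch of what such a script might look like (for example, the test_oozie_pyspark.py referenced in the workflow below); the interpreter paths are assumptions that must match the environment on your cluster:
import os

# Set the interpreter before importing anything from pyspark, so the
# driver and the executors agree on the Python environment.
os.environ['SPARK_MAJOR_VERSION'] = '2'
os.environ['PYSPARK_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'

import pandas as pd  # raises ImportError if the virtualenv was not picked up
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test_oozie_pyspark').getOrCreate()

# Trivial sanity check: build a Spark DataFrame and pull it back as pandas.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])
print(df.toPandas())

spark.stop()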
After doing everything mentioned above, my issue was fixed.
The Oozie version I am working with is 4.2.0.2.6.5.1153-2 (you can check yours with the 'oozie version' command).
The two most important files are 'test.properties' and 'test_workflow.xml' (I will call them the properties file and the workflow.xml file); once both are in place, the job is submitted with the standard Oozie CLI, for example: oozie job -config test.properties -run
If you search Google and Stack Overflow, the links below will pop up, and I definitely recommend going through them for your own understanding:
Running pyspark with virtualenv
Adding pyspark_python path in oozie
how to run pyspark script with oozie
set python path for spark worker
The links above are also useful for understanding the Oozie Spark action.
In the end, just like any other developer, you need to understand your own environment and then make things work for you.
If you go to https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html, you will find that, regarding PySpark, the Spark action documentation has just one statement: "The jar element indicates a comma separated list of jars or python files".
You should also understand this line: "To submit PySpark scripts with Spark Action, pyspark dependencies must be available in sharelib or in workflow's lib/ directory"; a typical workflow application layout is sketched below.
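For illustration, assuming the workflow application lives under /tmp/test_oozie/ on HDFS (matching the properties file below), the layout might look like this; pyspark.zip and the py4j zip come from the python/lib/ directory of your Spark installation, and the exact py4j version depends on your Spark build:
/tmp/test_oozie/
├── test_workflow.xml
├── test_oozie_pyspark.py
└── lib/
    ├── pyspark.zip
    └── py4j-<version>-src.zip
If you rely on the sharelib instead (as this article does, via oozie.action.sharelib.for.spark=spark2 in the properties file), the lib/ directory can be omitted as long as the spark2 sharelib already contains these zips.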
Sample test_workflow.xml file:
<workflow-app name="test_oozie_pyspark" xmlns="uri:oozie:workflow:0.5">
    <start to="start_spark"/>
    <action name="start_spark">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>${spark_name}</name>
            <jar>${wfAppPath}/test_oozie_pyspark.py</jar>
            <spark-opts>--conf spark.driver.maxResultSize=3g
                --executor-memory 2G --num-executors 2 --executor-cores 2 --driver-memory 3g
                --conf spark.executorEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
                --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
            </spark-opts>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Sample test.properties file:
nameNode=hdfs://DP01
jobTracker=server4.port4
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark2
oozie.libpath=/user/oozie/share/lib/abcdef4896543
codeBaseDir=${nameNode}/tmp/test_oozie/
jdbcUrl=jdbc:hive2://server1.port1,server2.port2,server3.port3;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
jdbcPrincipal=hive/[email protected]
queueName=default
edlRecordSource='weekly_load'
edlRunId=concat('people_data',current_timestamp)
containerSize=8192
heapSize=-Xmx6553m
wfAppPath=${codeBaseDir}
# referenced by the <name> element in the workflow
spark_name=test_oozie_pyspark
workflow_dir=${nameNode}/tmp/test_oozie/
oozie.wf.application.path=${workflow_dir}/test_workflow.xml
metaStoreUri=thrift://server5.port5
master=yarn
mode=cluster
oozieExampleRoot=OozieE
Note: we are using the Miniconda distribution, which is essentially a minimal installer for conda, containing only conda, its dependencies, and Python. The venv1 environment above was created from it (for example, with conda create -p /data/miniconda/envs/venv1 python pandas) and must exist at the same path on every node that can run the driver or an executor.