OOZIE-PYSPARK Integration in Hortonworks Data Platform (hdp/2.6.5.1153-2/hadoop/hadoop-common-2.7.3.2.6.5.1153-2)

I am writing here to explain how to work with OOZIE-PYSPARK on Hortonworks Data Platform.

I am not an expert in this, but having gone through a lot of trouble over the past few days, and with some advice from a friend, I decided to write it up here.

The most common errors that I was getting were:

ImportError: No module named pandas

ImportError: No module named pyspark

To solve this, first make sure that you are running the Oozie Spark action in 'cluster' mode with a relocatable virtualenv that is available at the same path for both the executors and the ApplicationMaster:

          --conf spark.executorEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python

          --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python  
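A quick way to sanity-check that these settings actually took effect is to print which interpreter the driver and the executors are running on. This is just a throwaway snippet of mine, not part of the workflow below, and the app name is arbitrary:

  # Sanity check: which Python binary are the driver and executors using?
  import sys
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("python_env_check").getOrCreate()
  sc = spark.sparkContext

  print("Driver python:", sys.executable)

  # Every partition should report /data/miniconda/envs/venv1/bin/python
  # if spark.executorEnv.PYSPARK_PYTHON was picked up.
  executor_pythons = (
      sc.parallelize(range(4), 2)
        .map(lambda _: __import__("sys").executable)
        .distinct()
        .collect()
  )
  print("Executor pythons:", executor_pythons)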

Secondly, in your PySpark script, add the following lines at the very beginning:

  import os
  os.environ['SPARK_MAJOR_VERSION'] = '2'
  os.environ['PYSPARK_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
  os.environ['PYSPARK_DRIVER_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'

Note: The variable that controls which Python environment Spark uses for Python applications is 'PYSPARK_PYTHON', so don't forget to set it.
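Putting both steps together, the top of test_oozie_pyspark.py ends up looking roughly like this. It is a minimal sketch assuming the same /data/miniconda/envs/venv1 environment; the tiny DataFrame at the end is only there to prove that pandas and pyspark now import cleanly:

  import os

  # Point Spark 2 and both Python processes at the virtualenv interpreter.
  os.environ['SPARK_MAJOR_VERSION'] = '2'
  os.environ['PYSPARK_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'
  os.environ['PYSPARK_DRIVER_PYTHON'] = '/data/miniconda/envs/venv1/bin/python'

  import pandas as pd                   # this import used to fail with "No module named pandas"
  from pyspark.sql import SparkSession  # and this one with "No module named pyspark"

  spark = SparkSession.builder.appName("test_oozie_pyspark").getOrCreate()

  # Regular pandas/Spark code goes here, e.g. turning a pandas DataFrame into a Spark one.
  df = spark.createDataFrame(pd.DataFrame({"id": [1, 2, 3]}))
  df.show()

  spark.stop()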

After doing everything mentioned above, my issue was fixed.

The Oozie version I am working with is 4.2.0.2.6.5.1153-2 (you can check yours with the 'oozie version' command).

The two most important files are 'test.properties' and 'test_workflow.xml' (the properties file and the workflow.xml file).

If you search Google and Stack Overflow, links like the following will pop up, and I would definitely recommend going through them for your own understanding:

Running pyspark with virtualenv

Adding pyspark_python path in oozie

how to run pyspark script with oozie

set python path for spark worker

These links also help in understanding the Oozie Spark action.

In the end, just like any other developer, you need to understand your environment and then make things work for you.

If you go to https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html, you will find that regarding PySpark in the Oozie Spark action there is just one statement: "The jar element indicates a comma separated list of jars or python files".

You should also understand this one: "To submit PySpark scripts with Spark Action, pyspark dependencies must be available in sharelib or in workflow's lib/ directory".
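If you are not sure whether your cluster meets that requirement, you can list the contents of the spark2 sharelib with 'oozie admin -shareliblist spark2'; it should include the pyspark and py4j zip files, otherwise the 'No module named pyspark' error will come back even in cluster mode.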

Sample test_workflow.xml file:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="test_workflow">

  <start to="start_spark"/>

  <action name="start_spark">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn-cluster</master>
      <mode>cluster</mode>
      <name>${spark_name}</name>
      <jar>${wfAppPath}/test_oozie_pyspark.py</jar>
      <spark-opts>--conf spark.driver.maxResultSize=3g
          --executor-memory 2G --num-executors 2 --executor-cores 2 --driver-memory 3g
          --conf spark.executorEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
          --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/data/miniconda/envs/venv1/bin/python
      </spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Spark action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>

  <end name="end"/>

</workflow-app>

Sample test.properties file:

nameNode=hdfs://DP01
jobTracker=server4.port4
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark2
oozie.libpath=/user/oozie/share/lib/abcdef4896543
codeBaseDir=${nameNode}/tmp/test_oozie/
jdbcUrl=jdbc:hive2://server1.port1,server2.port2,server3.port3;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
jdbcPrincipal=hive/[email protected]
queueName=default
edlRecordSource='weekly_load'
edlRunId=concat('people_data',current_timestamp)
containerSize=8192
heapSize=-Xmx6553m
wfAppPath=${codeBaseDir}
workflow_dir=${nameNode}/tmp/test_oozie/
oozie.wf.application.path=${workflow_dir}/test_workflow.xml
[email protected]
metaStoreUri=thrift://server5.port5
master=yarn
mode=cluster
spark_name=test_oozie_pyspark
oozieExampleRoot=OozieE
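With test_workflow.xml uploaded to ${workflow_dir} on HDFS and the properties file kept locally, the job is submitted the usual way, for example 'oozie job -oozie <your-oozie-url> -config test.properties -run' (or export OOZIE_URL and drop the -oozie option).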

Note: we are using the Miniconda distribution, which is essentially a minimal installer for conda, containing only conda, its dependencies, and Python.

