Spark with Kubernetes
(K8 = Kubernetes throughout this article.)
Two of the finest technologies in the enterprise right now, Spark and K8, are trending together. The main motive behind integrating them is to build a centralised cluster where Spark jobs are executed and terminated, irrespective of project. There are no dedicated physical nodes: rather than running Spark on physical machines, it runs inside K8 pods, so Spark gets to leverage the benefits of K8.
Over the last decade we have seen Spark perform in almost every field on top of Hadoop; YARN and Spark ruled Data Engineering for nearly ten years.
We have seen Spark cluster managers such as Standalone, Mesos and YARN, and every cluster manager has its own unique requirements and differences.
Now we also have Kubernetes as a cluster manager. Kubernetes is one of the best orchestration tools in the IT world, and when two best-in-class tools work together, the result tends to be excellent.
Personally, I refer to these two technologies as Neem & Karela. The reason: they are not easy to understand, they are big and vast in scope, and the names may sound odd, but the effect is medicinal.
The question is: WHY K8 for Spark?
---------------------
Kubernetes allows centralized infrastructure management under a single type of cluster for multiple workload types. Before Kubernetes, we would have different clusters for different workloads. In a typical enterprise you see different sets of clusters for applications: Hadoop for data, a DB cluster, and so on, some on K8 and some not. Each cluster has its own administrators, its own DevOps team and, most importantly, its own security.
Kubernetes makes the platform common for all kinds of applications in the enterprise. It also gives better utilization of resources, where individual clusters might otherwise sit under-utilized or over-utilized.
The same applies to data science and data engineering projects, where each project has its own dependencies and its own conda environment, with many open-source packages in use. A typical enterprise has 30-50 such projects, each with its own dependencies. If we use K8 as the environment, there is no need to worry about dependencies and environments, because K8 works on pods: all the dependencies can be supplied along with the spark-submit, as sketched below. This saves huge cost.
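As a rough illustration of the idea (the base image name and the requirements.txt below are hypothetical, not from this article), each project's dependencies can be baked into its own Spark container image, so every driver and executor pod starts with an identical environment:
##Dockerfile (hypothetical sketch)
# Base image: any Spark image that already ships Python and pip
FROM my-registry/spark-py:3.3.0
# Project-specific packages, pinned in a requirements.txt kept next to this Dockerfile
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
The resulting image is then referenced at submit time via spark.kubernetes.container.image, so no shared, cluster-wide environment has to be maintained.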
How does it work?
--------------
There are three major parts of the Spark architecture:
Driver — Responsible for breaking an incoming Spark job into smaller chunks and governing the execution of those chunks.
Executor — Responsible for running an individual chunk of the overall job and reporting the result back to the driver.
Cluster Manager — The K8 infrastructure on which the Spark driver and executors run in pods. Earlier, Spark was used with the YARN, Mesos and Standalone cluster managers, but this setup uses K8.
Spark creates a Spark driver running within a K8 pod; the driver creates executors, which also run within K8 pods, connects to them, and executes the application code.
In traditional Spark-on-YARN we need a dedicated Hadoop cluster for Spark processing, and the same again for Python, R, etc.
So, once a job is submitted to the cluster, the driver pod is created and then the executor pods, as required by the job. In some cases we can even define a node pool for autoscaling (see the node-selector example below). Once a job is finished, all its resources are killed automatically and the cluster is ready for other jobs. This can save a lot of cost because there is no need to run dedicated servers.
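As a hedged example (the label key and value here are hypothetical), Spark pods can be pinned to such a dedicated, autoscaled node pool with Spark's node-selector configuration:
# Label the nodes of the pool (node name is a placeholder)
kubectl label nodes <node-name> pool=spark-jobs
# Ask Spark to schedule driver and executor pods only onto those nodes
--conf spark.kubernetes.node.selector.pool=spark-jobs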
Readiness of the K8 cluster:
------
You need a robust K8 cluster. There are plenty of ways to create one, but I have opted for kind (Kubernetes in Docker): it is simple to install and less time-consuming. I found Minikube a little more complex.
For more details, check the official kind website.
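For example, a local cluster can be created with kind along these lines (the cluster name is arbitrary):
kind create cluster --name spark-demo
kubectl cluster-info --context kind-spark-demo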
Create the service account and role binding below before deploying Spark:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
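A quick, optional way to verify the account and its permissions:
kubectl get serviceaccount spark
kubectl auth can-i create pods --as=system:serviceaccount:default:spark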
Software required and versions:
- Apache Spark 3.3.0
- Docker 1.41 (API version)
- kubectl v1.25.0
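The installed versions can be confirmed with commands along these lines:
$SPARK_HOME/bin/spark-submit --version   # Spark version banner
docker version                           # client/server details, including the API version
kubectl version --client                 # kubectl client version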
Deployment:
---------
Way 1.
----------
Use the Bitnami Spark image (or any other) and follow the instructions from its website.
Here I found two common errors:
--org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
Solution: don't use the default spark user in the image (ARG spark_uid=185); give another user id such as root or 65534 instead (see the example after this list).
-- SPARK-MASTER not found
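For the first error, if you build the image yourself from the official Dockerfile (see Way 2 below), the UID can also be overridden at build time instead of editing the Dockerfile; 65534 here is just an example value:
# Override the spark_uid build argument declared in the official Dockerfile
./docker-image-tool.sh -r test-spark -t 3.3.0 -b spark_uid=65534 build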
Way 2 (preferred)
-----------
Create the Spark image from the official Spark build. The steps are below.
Note: be careful with the Linux tini package; either use it everywhere or not at all. The official image is inconsistent about this. Here I have not used it.
Download the official Spark distribution from the site:
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
OR
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
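For example, assuming the Hadoop 3 build was downloaded:
tar -xzf spark-3.3.0-bin-hadoop3.tgz    # unpack the distribution
cd spark-3.3.0-bin-hadoop3
export SPARK_HOME=$(pwd)                # the SPARK_HOME referenced below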
After extracting the tgz, the files below inside the Spark folder need to be checked or created. Either change them in place or replace them with the attached files. If you want to use the tini package, check the official documentation; we have not used it here.
SPARK_HOME/bin/docker-image-tool.sh
SPARK_HOME/kubernetes/dockerfiles/spark/Dockerfile
SPARK_HOME/kubernetes/dockerfiles/spark/entrypoint.sh
SPARK_HOME/kubernetes/dockerfiles/spark/start-spark.sh
##docker-image-tool.sh
-----------------------
No change
##entrypoint.sh (tini package commented out; see the last three lines)
# Execute the container CMD under tini for better hygiene
#exec /usr/bin/tini -s -- "${CMD[@]}"
exec "${CMD[@]}"
##start-spark.sh
- Create this file, or copy and paste the attached one into the designated folder.
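For orientation only, a start-spark.sh of this kind typically switches between a standalone master and a worker based on an environment variable. The sketch below is hypothetical (SPARK_WORKLOAD, SPARK_MASTER_HOST and SPARK_MASTER_URL are assumed variables); the attached file is the one to use.
#!/bin/bash
# Hypothetical sketch only. Load Spark environment settings shipped with the distribution.
. /opt/spark/bin/load-spark-env.sh
if [ "$SPARK_WORKLOAD" == "master" ]; then
  # Run the standalone master in the foreground
  exec /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --host "$SPARK_MASTER_HOST" --port 7077 --webui-port 8080
elif [ "$SPARK_WORKLOAD" == "worker" ]; then
  # Run a worker and register it with the master at SPARK_MASTER_URL
  exec /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
    --webui-port 8081 "$SPARK_MASTER_URL"
fi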
##Building the image
Once the build is successful, tag the image and push it to your repository.
./docker-image-tool.sh -r test-spark -t 3.3.0 build
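Assuming the build produced an image named test-spark/spark:3.3.0, it can be retagged and pushed to your registry, or loaded straight into a kind cluster (the registry and cluster names are placeholders):
docker tag test-spark/spark:3.3.0 <your-registry>/test-spark:3.3.0
docker push <your-registry>/test-spark:3.3.0
# For a kind cluster, the image can instead be loaded onto the nodes directly:
kind load docker-image test-spark/spark:3.3.0 --name <your-kind-cluster>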
Test spark jobs:
1. K8 Yaml
A sample Pod YAML (test.yaml) is attached, or create your own job accordingly. We have used the sample Pi Java program that ships with the Spark download; a hypothetical sketch is shown below.
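For orientation, a hypothetical sketch of such a pod spec (not the attached file) that runs SparkPi in local mode inside the built image:
##test.yaml (hypothetical sketch)
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-test
spec:
  serviceAccountName: spark
  restartPolicy: Never
  containers:
    - name: spark-pi
      image: cnsnoida/test-spark:3.3.0
      command: ["/opt/spark/bin/spark-submit"]
      args:
        - "--master"
        - "local[2]"
        - "--class"
        - "org.apache.spark.examples.SparkPi"
        - "/opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"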
kubectl apply -f test.yaml      # pod
kubectl apply -f job_java.yaml  # job
Open another terminal and run the K8 watcher to see the pods being created live:
kubectl get pods -w
Way 3: Helm
--------------
Use the Bitnami Helm chart. I have not tested it yet; for more, visit the official site.
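For completeness (untested here, as noted), the usual Bitnami workflow looks like this; the release name is arbitrary:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install my-spark bitnami/spark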
## Submit Spark job
A Spark job can be submitted in several ways:
- Java program
- Spark Operator
- CMD (spark-submit)
- API
---------
Sample Spark job (via spark-submit; the Pi example jar is referenced inside the image using the local:// scheme):
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://127.0.0.1:44343 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=cnsnoida/test-spark:3.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar
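Once the job finishes, the result can be read from the driver pod's logs (the pod name below is a placeholder; Spark labels driver pods with spark-role=driver):
kubectl get pods -l spark-role=driver
kubectl logs <spark-pi-driver-pod> | grep "Pi is roughly"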
Reference Docker image:
cnsnoida/test-spark:3.3.0
The relevant files are kept in this GitHub repository:
https://github.com/chandranitu/Tutorial-youtube/tree/master/spark_k8
Challenges: nothing in this world is 100% foolproof. This integration has an issue of jobs waiting on each other for resources, which is addressed by gang scheduling and schedulers such as Apache YuniKorn. That deserves a separate article.
So do try Spark with Kubernetes. The topics are big, so I have tried to give a high-level overview.
Thanks for reading. For any queries, write to [email protected]