EMR on EKS by Example
EMR on EKS provides a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon EKS. While a wide range of open-source big data components are available in EMR on EC2, only Apache Spark is available in EMR on EKS. It is more flexible, however, in that applications of different EMR versions can run across multiple availability zones on either EC2 or Fargate. Other types of containerized applications can also be deployed on the same EKS cluster. Therefore, if you have or plan to have, for example, Apache Airflow, Apache Superset or Kubeflow in your analytics toolkit, it can be an effective way to manage big data (as well as non-big data) workloads. While Glue is geared more toward ETL, EMR on EKS can also be used for other types of tasks such as machine learning. Moreover, it allows you to build a Spark application, not a Gluish Spark application. For example, while you have to use custom connectors for Hudi or Iceberg with Glue, you can use their native libraries with EMR on EKS. In this post, we’ll discuss EMR on EKS with simple and elaborate examples.
Set up Amazon EMR on EKS
As described in the Amazon EMR on EKS development guide, Amazon EKS uses Kubernetes namespaces to divide cluster resources between multiple users and applications. A virtual cluster is a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. As illustrated further below, we need to take the following steps to set up EMR on EKS.