EMR on EKS by Example

Jaehyeon Kim

Data Engineer | Blogger

Published: January 17, 2022

EMR on EKS provides a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon EKS. While a wide range of open-source big data components are available in EMR on EC2, only Apache Spark is available in EMR on EKS. It is more flexible, however, in that applications of different EMR versions can be run in multiple availability zones on either EC2 or Fargate. Other types of containerized applications can also be deployed on the same EKS cluster. Therefore, if you have or plan to have, for example, Apache Airflow, Apache Superset or Kubeflow in your analytics toolkit, it can be an effective way to manage big data (as well as non-big data) workloads. While Glue is geared toward ETL, EMR on EKS can also be used for other types of tasks such as machine learning. Moreover, it allows you to build a Spark application, not a Gluish Spark application. For example, while you have to use custom connectors for Hudi or Iceberg with Glue, you can use their native libraries with EMR on EKS. In this post, we’ll discuss EMR on EKS with simple and elaborated examples.

Set up Amazon EMR on EKS

As described in the Amazon EMR on EKS development guide, Amazon EKS uses Kubernetes namespaces to divide cluster resources between multiple users and applications. A virtual cluster is a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. As illustrated further below, we need to take the following steps to set up EMR on EKS.

Enable cluster access for Amazon EMR on EKS
Create an IAM OIDC identity provider for the EKS cluster
Create a job execution role
Update the trust policy of the job execution role
Register Amazon EKS Cluster with Amazon EMR
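
As a rough sketch, the five steps above map onto one eksctl or AWS CLI command each. The Python snippet below only builds and prints those commands and does not run anything against AWS. The function name setup_commands and the values demo-eks, spark, emr-job-execution and demo-vc are hypothetical placeholders, and the initial trust policy of the job execution role is an assumption for illustration. The role would also need a permissions policy (e.g. for S3 and CloudWatch access), which is not shown here.

```python
import json
import shlex


def setup_commands(cluster, namespace, role_name, virtual_cluster_name):
    """Build (but do not run) one CLI command per setup step, in order."""
    # Assumed initial trust policy for the job execution role; step 4
    # later extends it for the EKS OIDC provider.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    # A virtual cluster is registered against one namespace of the EKS cluster.
    container_provider = {
        "id": cluster,
        "type": "EKS",
        "info": {"eksInfo": {"namespace": namespace}},
    }
    return [
        # 1. Enable cluster access for Amazon EMR on EKS
        ["eksctl", "create", "iamidentitymapping", "--cluster", cluster,
         "--namespace", namespace, "--service-name", "emr-containers"],
        # 2. Create an IAM OIDC identity provider for the EKS cluster
        ["eksctl", "utils", "associate-iam-oidc-provider",
         "--cluster", cluster, "--approve"],
        # 3. Create a job execution role
        ["aws", "iam", "create-role", "--role-name", role_name,
         "--assume-role-policy-document", json.dumps(trust_policy)],
        # 4. Update the trust policy of the job execution role
        ["aws", "emr-containers", "update-role-trust-policy",
         "--cluster-name", cluster, "--namespace", namespace,
         "--role-name", role_name],
        # 5. Register Amazon EKS Cluster with Amazon EMR
        ["aws", "emr-containers", "create-virtual-cluster",
         "--name", virtual_cluster_name,
         "--container-provider", json.dumps(container_provider)],
    ]


if __name__ == "__main__":
    # Placeholder names only; replace them with your own resources.
    for cmd in setup_commands("demo-eks", "spark", "emr-job-execution", "demo-vc"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running each one by hand, which matters here because the steps are order-dependent: the namespace must already exist before step 1, and the role from step 3 must exist before step 4.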

Continue...
