EMR Serverless
As we all know, EMR is a managed cluster platform that handles big data frameworks like Apache Hadoop and Apache Spark. EMR involves a lot of cluster and configuration management, which is tough to maintain. One of the main reasons is slow cluster creation and termination (taking 12-15 minutes), along with problems while creating the cluster.
As a solution, EMR Serverless came into existence; it reduces a lot of the management work and lets teams concentrate on data processing. Some of the advantages are optimized configuration, improved application security and efficiency, and automatic/manual application start and stop when jobs are triggered. Developers don't face any difficulty in managing the cluster; the only remaining difficulties are development ones, which are faced all the time and are entirely different from EMR launching difficulties.
Basic requirements for EMR Serverless:
1. EMR Studio
2. AWS IAM Execution Role
3. AWS S3 for storing EMR serverless generated logs.
4. Default Spark Properties
   a. --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
   b. --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
   c. --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
5. As EMR clusters run within a VPC, security groups and a VPC must also be created/assigned when creating an EMR Serverless application.
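The requirements above can be sketched as the request payload that boto3's `emr-serverless` client expects for `create_application`. The application name, subnet/security-group IDs, and release label below are hypothetical placeholders:

```python
# Sketch: assemble the pieces from the requirements list into the payload
# for boto3.client("emr-serverless").create_application().
# Subnet IDs, security group IDs, and the release label are placeholders.

DEFAULT_SPARK_CONFS = [
    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
]

def build_create_application_request(name, subnet_ids, security_group_ids):
    """Build kwargs for the EMR Serverless CreateApplication API call."""
    return {
        "name": name,
        "type": "SPARK",
        "releaseLabel": "emr-6.9.0",  # placeholder release label
        # EMR Serverless applications still need VPC networking (requirement 5)
        "networkConfiguration": {
            "subnetIds": subnet_ids,
            "securityGroupIds": security_group_ids,
        },
    }

request = build_create_application_request(
    "demo-app", ["subnet-0abc"], ["sg-0abc"]
)
# A real call would then be:
# import boto3
# client = boto3.client("emr-serverless")
# response = client.create_application(**request)
```

Building the payload separately keeps the configuration testable without touching AWS; the commented-out call shows where it would be sent.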
Implementation:
1. Creation/deletion of the EMR Serverless application, which is managed by AWS.
2. Submit the jobs.
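The job-submission step can be sketched as the `start_job_run` payload for the same boto3 client. The application ID, role ARN, script path, and log bucket are hypothetical placeholders; the Spark submit parameters reuse the default properties from the requirements list:

```python
# Sketch of submitting a job to an existing EMR Serverless application.
# All IDs, ARNs, and S3 paths below are placeholders.

SPARK_SUBMIT_PARAMS = " ".join([
    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
])

def build_start_job_run_request(application_id, execution_role_arn,
                                entry_point, log_uri):
    """Build kwargs for boto3.client('emr-serverless').start_job_run()."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,  # IAM execution role (requirement 2)
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": SPARK_SUBMIT_PARAMS,
            }
        },
        # EMR Serverless writes its logs to S3 (requirement 3)
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        },
    }

job_request = build_start_job_run_request(
    "00example123",
    "arn:aws:iam::111122223333:role/emr-serverless-exec",
    "s3://my-bucket/scripts/job.py",
    "s3://my-bucket/emr-serverless-logs/",
)
# Real calls would be client.start_job_run(**job_request) to submit,
# and client.delete_application(applicationId=...) to clean up.
```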
Like an EMR cluster, EMR Serverless can also be orchestrated via AWS MWAA or AWS Step Functions.
Orchestration from AWS MWAA works as mentioned below:
1. AWS provides operators which are used for creation/deletion of the applications and submission of the jobs.
2. MWAA version 2.2.2 is required for EMR Serverless; it doesn't work with version 1.12.0.
3. The below input parameters are to be entered for the MWAA DAG:
   a. Application ID
   b. Execution Role
   c. Entry Point
   d. Spark Submit Parameters
   e. S3 Logs storage Path
4. The below Spark property has to be used to make Glue the metastore: --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
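The DAG input parameters (a-e) above, plus the Glue metastore property, can be sketched as the keyword arguments for the EMR Serverless operator in Airflow's Amazon provider package (`EmrServerlessStartJobOperator`). All IDs, ARNs, and paths below are hypothetical placeholders:

```python
# Sketch: map the MWAA DAG input parameters (a-e) and the Glue metastore
# conf onto EmrServerlessStartJobOperator kwargs. Placeholders throughout.

GLUE_METASTORE_CONF = (
    "--conf spark.hadoop.hive.metastore.client.factory.class="
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)

spark_submit_parameters = " ".join([
    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
    GLUE_METASTORE_CONF,  # item 4: use Glue as the Hive metastore
])

operator_kwargs = {
    "application_id": "00example123",                             # a. Application ID
    "execution_role_arn": "arn:aws:iam::111122223333:role/exec",  # b. Execution Role
    "job_driver": {
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/job.py",        # c. Entry Point
            "sparkSubmitParameters": spark_submit_parameters,     # d. Spark Submit Parameters
        }
    },
    "configuration_overrides": {                                  # e. S3 Logs storage Path
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/logs/"}
        }
    },
}

# Inside a DAG this would become:
# from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator
# job = EmrServerlessStartJobOperator(task_id="run_spark_job", **operator_kwargs)
```

Keeping the kwargs in a plain dict makes the mapping from the article's parameter list (a-e) explicit and lets the same values be reused across tasks.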