AWS Glue: Based on a Data Engineer's Real-Life Experience
There is a lot of buzz around cloud technologies, and many organizations are moving to the cloud. The main reason is to move away from building and maintaining infrastructure and to focus on solution delivery. The move also brings significant cost benefits, the ability to scale up quickly, and the ability to deliver more in less time.
I have been working as a Senior Data Engineer using AWS services for three years. I decided to write a LinkedIn article to highlight the major pros and cons of the AWS ETL service offering, Glue. This article is based on my own experience, and the views expressed are personal.
Glue is a serverless, fully managed service for ETL. I am personally a big fan of it. With it, we can run Spark jobs written in Python or Scala. I used it to run PySpark jobs (PySpark is the Python API for Spark). Glue also has a visual interface where you can map source fields to target fields and apply transformations.
To use Glue, first provision an AWS dev endpoint (an environment provided by AWS, similar to an EMR machine, for testing your Spark code). Next, write your Spark code and test it on the dev endpoint. Once the code is tested, deploy it to AWS (you can put the script in S3). Finally, create a Glue job and schedule it directly or through AWS Step Functions. That is all. Provisioning a dev endpoint is optional, since you can run your code directly in Glue. But poorly written code can take forever to fix that way, so it is better to test it on a dev endpoint first. Glue code can also be generated through the UI, but I have never used that.
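The "create a Glue job and run it" step can be sketched with boto3 as follows. This is a minimal sketch, not the setup used in this article: the job name, IAM role ARN, S3 path, worker settings, and the helper names `build_job_definition` and `create_and_start` are all assumptions made for illustration. The Glue client is passed in as a parameter so the helpers can be exercised without AWS credentials.

```python
def build_job_definition(name, role_arn, script_s3_path, workers=2):
    """Return keyword arguments for the Glue CreateJob call.

    All values here are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                 # Spark ETL job type
            "ScriptLocation": script_s3_path,  # PySpark script stored in S3
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


def create_and_start(glue_client, definition, arguments=None):
    """Create the job, start one run, and return the run id."""
    glue_client.create_job(**definition)
    response = glue_client.start_job_run(
        JobName=definition["Name"],
        Arguments=arguments or {},
    )
    return response["JobRunId"]


# Usage (needs boto3 and AWS credentials; names below are hypothetical):
#   import boto3
#   glue = boto3.client("glue")
#   definition = build_job_definition(
#       "demo-etl-job",
#       "arn:aws:iam::123456789012:role/DemoGlueRole",
#       "s3://demo-bucket/scripts/demo_job.py",
#   )
#   run_id = create_and_start(glue, definition)
```

Keeping the job definition as a plain dictionary makes it easy to review or store in version control before anything is created in AWS.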
The main advantage of Glue is that it is serverless, so you don't have to worry about provisioning servers and clusters. Develop your Spark code, attach any additional libraries or JAR files, and run it. In my experience the cost is also very low. AWS also offers the dev endpoint (it is like an EMR machine) where you can test and debug your code. This is one of the good service offerings from AWS.
The main drawback of Glue is that it takes some time to provision servers. Sometimes, when demand is high, it may take up to 15-20 minutes to provision a server and start executing your job. If you are fixing production issues, then this can be painful. It looks like AWS has reduced the server provisioning wait time with Glue 2.0, but I have not tested this yet.
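One way to see how long a run spends waiting before it starts executing is to poll its state. The sketch below uses the Glue `get_job_run` call and assumes that a run reports `STARTING` (or `WAITING`) while capacity is being provisioned; that mapping, the helper name `wait_until_started`, and the polling interval are assumptions for illustration. The `sleep` and `clock` parameters exist only so the helper can be tried without real waiting.

```python
import time

# States treated as "not executing yet" (an assumption for this sketch).
PENDING_STATES = {"STARTING", "WAITING"}


def wait_until_started(glue_client, job_name, run_id,
                       poll_seconds=30, timeout_seconds=1800,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll a job run until it leaves the pending states.

    Returns (state, seconds_waited). Raises TimeoutError if the run is
    still pending after timeout_seconds.
    """
    start = clock()
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        waited = clock() - start
        if state not in PENDING_STATES:
            return state, waited
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"run {run_id} still {state} after {waited:.0f} seconds"
            )
        sleep(poll_seconds)


# Usage (needs boto3 and AWS credentials; names below are hypothetical):
#   import boto3
#   glue = boto3.client("glue")
#   state, waited = wait_until_started(glue, "demo-etl-job", run_id)
#   print(f"run reached {state} after {waited:.0f} seconds")
```

Logging the measured wait for each run gives a simple way to compare start-up times across Glue versions.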
Overall, Glue is a good AWS service for running batch Spark jobs. There is still scope for improvement. For example, AWS could add a Spark UI to monitor job stages, tasks, environment, storage, and executors in real time without provisioning a history server. Also, at the time of writing, a maximum of 20 jobs can be submitted to the Glue API at a time. This is the Glue API rate limit. The limit is reasonable, but increasing it to 50 may be a good option.
I have tried to put this together based on my personal experience. If you have any specific questions, please feel free to message me. I will be happy to answer them.