AWS Glue: Based on a Data Engineer's Real-Life Experience

There is a lot of buzz around cloud technologies, and many organizations are moving to the cloud. The main reason is to move away from building and maintaining infrastructure and to focus on solution delivery. The move also brings significant cost benefits, the ability to scale up quickly, and the ability to deliver more in less time.

I have been working as a Senior Data Engineer using AWS services for three years. I decided to write a LinkedIn article to highlight the major pros and cons of Glue, the AWS ETL service offering. This article is based on my own experience, and the views expressed are personal.

Glue is a serverless, fully managed service for ETL, and I am personally a big fan of it. With it, we can run Spark jobs written in Python or Scala. I used it for running PySpark jobs (PySpark is the Python API for Spark). Glue also has a UI where you can map source fields to target fields and apply transformations.

To use Glue, first provision an AWS dev endpoint (an EMR-like machine provided by AWS for testing your Spark code). Next, write your Spark code and test it on the dev endpoint. Once the code is tested, deploy it to AWS (you can put the code in S3). Finally, create a Glue job and schedule it, either directly or through AWS Step Functions. That is all. Provisioning a dev endpoint is not mandatory, since you can run your code directly in Glue, but poorly written code can take forever to fix that way, so it is better to test on a dev endpoint first. Glue code can also be generated through the UI, but I have never used that.
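As a rough illustration of the last two steps (create a Glue job that points at a script in S3, then start it), here is a minimal sketch using boto3. The job name, role ARN, bucket path, worker settings, and both helper function names are assumptions made up for this example, not values from my projects.

```python
def build_glue_job_definition(job_name, role_arn, script_s3_path,
                              glue_version="2.0", workers=2):
    """Return keyword arguments for boto3's glue.create_job call.

    All values passed in are placeholders chosen by the caller.
    """
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                 # Spark ETL job type
            "ScriptLocation": script_s3_path,  # tested script uploaded to S3
            "PythonVersion": "3",
        },
        "GlueVersion": glue_version,
        "WorkerType": "G.1X",                  # assumed worker size
        "NumberOfWorkers": workers,
    }


def create_and_start(job_definition):
    """Create the job and start one run; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_job(**job_definition)
    run = glue.start_job_run(JobName=job_definition["Name"])
    return run["JobRunId"]


if __name__ == "__main__":
    # Hypothetical names; replace with your own before calling create_and_start.
    definition = build_glue_job_definition(
        job_name="sample-pyspark-job",
        role_arn="arn:aws:iam::123456789012:role/SampleGlueRole",
        script_s3_path="s3://sample-bucket/scripts/sample_job.py",
    )
    print(definition)
```

Keeping the job definition in a plain function makes it easy to review or version the settings before anything is sent to AWS. Scheduling through Step Functions or a Glue trigger would sit on top of this and is not shown here.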

The main pro of Glue is that it is serverless, so you don't have to worry about provisioning servers and clusters. Develop your Spark code, attach additional libraries or JAR files, and run it. The cost is also very low. AWS also offers a dev endpoint (it is like an EMR machine) where you can test and debug your code. This is one of the good service offerings from AWS.

The main con of Glue is that it takes some time to provision servers. Sometimes, when demand is high, it may take up to 15-20 minutes to provision a server and start executing your job. If you are working on fixing production issues, this can be painful. It looks like AWS has reduced the server provisioning wait time with Glue 2.0, but I have not tested this yet.

Overall, Glue is a good AWS service for running batch Spark jobs, but there is still scope for improvement. For example, AWS could add a Spark UI to monitor job stages, tasks, environment, storage, and executors in real time without provisioning a history server. Also, at most 20 jobs can currently be submitted to the Glue API at a time; this is the Glue API rate limit. That is workable, but raising it to 50 might be a good option.
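One way to live with a cap like this is to throttle on the client side. The sketch below is only an illustration: `run_all` and the `run_job` callable are hypothetical names, `run_job` is assumed to start one Glue job and block until it finishes, and the default of 20 simply mirrors the limit mentioned above (check the quota on your own account).

```python
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 20  # limit quoted in this article; verify for your account


def run_all(job_names, run_job, max_in_flight=MAX_IN_FLIGHT):
    """Run every job with at most max_in_flight active at the same time.

    run_job is a caller-supplied function (hypothetical here) that takes a
    job name, starts the job, waits for it, and returns some result.
    Returns a dict mapping each job name to its result.
    """
    names = list(job_names)
    # The pool size is the throttle: a worker thread picks up the next job
    # only after its previous job has finished.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        results = list(pool.map(run_job, names))
    return dict(zip(names, results))


if __name__ == "__main__":
    # Stand-in for a real Glue call so the sketch runs anywhere.
    def fake_run_job(name):
        return name + ": done"

    print(run_all(["job_a", "job_b", "job_c"], fake_run_job, max_in_flight=2))
```

If any job raises an exception, `run_all` re-raises it when the results are collected, so a real version would probably want per-job error handling.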

I have tried to put this together based on my personal experience. If you have any specific questions, please feel free to PM me. I will be happy to answer them.

Gopal Kumar Roy

Solution Architect

