AWS Glue in 2023
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
Features of AWS Glue
The core features of Glue are as follows:
So how does AWS Glue fit into the modern data stack?
The main reasons to choose AWS Glue:
You only need to focus on your data: the infrastructure is managed by AWS, and you pay only for what you use, which avoids licensing costs and idle infrastructure. Glue is also built on powerful open-source technology and supports multiple open-source engines.
Glue's core capability is data integration. Because the infrastructure is serverless, Glue spins up clusters for you in a matter of seconds, so you are no longer bound by the capacity you own. According to the Glue documentation, it can spin up 300 nodes in 5 to 10 seconds. Another key benefit is per-second billing: you pay only for what you actually use, which is a strong point when it comes to cost.
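To make per-second billing concrete, here is a small arithmetic sketch. The rate of 0.44 USD per DPU-hour, the one-DPU-per-worker mapping, and the 60-second minimum are illustrative assumptions, not figures from this article, so check the current pricing for your region and worker type.

```python
def glue_job_cost(workers: int, runtime_seconds: float,
                  rate_per_dpu_hour: float = 0.44,
                  min_billed_seconds: int = 60) -> float:
    """Estimate the cost of one job run that is billed per second.

    Assumptions (not from the article): each worker counts as one DPU,
    the rate is 0.44 USD per DPU-hour, and a run is billed for at
    least 60 seconds.
    """
    billed_seconds = max(runtime_seconds, min_billed_seconds)
    dpu_hours = workers * billed_seconds / 3600
    return round(dpu_hours * rate_per_dpu_hour, 4)


# A 10-worker job that runs for 7.5 minutes is billed for 450 seconds,
# not for a full hour.
short_run_cost = glue_job_cost(workers=10, runtime_seconds=450)

# The same job running for a full hour, for comparison.
full_hour_cost = glue_job_cost(workers=10, runtime_seconds=3600)
```

Under these assumptions the short run costs one eighth of the full-hour run, which is the practical effect of paying per second instead of per provisioned hour.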
To be even more cost-effective, Glue offers auto scaling, which is especially useful when you do not know exactly how much capacity a job needs.
Glue also has its own counterpart to Spot Instances (the Flex execution class), which runs jobs on spare capacity. In many use cases, customers have critical jobs that need standard execution and must not fail. Customers also have non-critical jobs that can run on spare capacity and benefit from a 34% discount.
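As a rough sketch of how these two cost controls might be set on a job, the helper below builds keyword arguments in the shape that boto3's Glue `create_job` call accepts. It turns on auto scaling through the `--enable-auto-scaling` job argument and picks the `FLEX` execution class (Glue's spare-capacity option) for non-critical jobs. The job name, role ARN, script path, Glue version, and worker type are hypothetical placeholders, and the helper itself is illustrative; nothing here calls AWS.

```python
def build_job_definition(name: str, role_arn: str, script_s3_path: str,
                         critical: bool, max_workers: int = 10) -> dict:
    """Build keyword arguments for a Glue create_job request.

    All concrete values passed in are placeholders chosen by the caller.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",   # assumed version for this sketch
        "WorkerType": "G.1X",   # assumed worker type for this sketch
        # With auto scaling enabled, NumberOfWorkers acts as the upper bound.
        "NumberOfWorkers": max_workers,
        "DefaultArguments": {"--enable-auto-scaling": "true"},
        # Critical jobs keep standard execution; others use spare capacity.
        "ExecutionClass": "STANDARD" if critical else "FLEX",
    }


nightly_report = build_job_definition(
    name="nightly-report",                                   # hypothetical
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # hypothetical
    script_s3_path="s3://example-bucket/scripts/report.py",  # hypothetical
    critical=False,
)
```

With real credentials, the dictionary could be passed along as `boto3.client("glue").create_job(**nightly_report)`.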
The powerful thing about serverless is that you do not have to care about idle time. You have all the capacity that you need.
Glue layers
Glue is built on five layers:
Connectors
AWS Glue has more than 80 connectors, and most of them are bidirectional. Native connectors come built in with Glue. Custom connectors let customers with custom systems write and wrap a connector themselves. Many more connectors are available in AWS Marketplace, covering newer systems, industry-specific systems, and SaaS applications. The main idea is to give you the flexibility to choose the right set of connectors for your sources and targets.
Engines
AWS Glue supports three engines.
The choice of engine depends on your use case and workloads.
Author
This layer is how AWS Glue abstracts the engines above. Glue offers five different ways to author jobs. The following interfaces are available to create a Glue job:
1. Glue Studio (drag-and-drop components)
2. Glue Studio notebooks (code)
3. Glue DataBrew (low code, mostly used for data preparation in a spreadsheet-like interface)
4. API, SDK, and local notebooks (Jupyter Notebook)
5. Amazon SageMaker Studio notebooks
Whichever you choose, you get the benefits of all the platform innovation in Glue. The right choice depends on the user's skills and needs.
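For the API and SDK route, a minimal sketch of starting a job run and waiting for it to finish might look like the following. It relies on the `start_job_run` and `get_job_run` operations of the boto3 Glue client. The `FakeGlue` class is a stand-in so the example runs offline, and the job name and the `--target_date` argument are hypothetical.

```python
import time
from typing import Any, Dict, Optional

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job(glue: Any, job_name: str,
            arguments: Optional[Dict[str, str]] = None,
            poll_seconds: float = 30.0) -> str:
    """Start a Glue job run and block until it reaches a terminal state."""
    response = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


class FakeGlue:
    """Offline stand-in for boto3.client("glue"), for illustration only."""

    def __init__(self) -> None:
        self._states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])

    def start_job_run(self, JobName: str, Arguments: Dict[str, str]) -> dict:
        return {"JobRunId": "jr_example"}

    def get_job_run(self, JobName: str, RunId: str) -> dict:
        return {"JobRun": {"JobRunState": next(self._states)}}


final_state = run_job(FakeGlue(), "nightly-orders-etl",
                      {"--target_date": "2023-01-31"}, poll_seconds=0)
```

Passing the client in as a parameter keeps the function easy to test; in a real environment `boto3.client("glue")` would take the place of `FakeGlue()`.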
Operationalize
Operationalization is the phase after the data pipelines are built: you have to build a workflow, schedule it, and then monitor it. Data engineers reportedly spend about 50% of their time operationalizing data pipelines. Glue Studio offers several options to help, such as Git integration, job monitoring, and workflow orchestration.
Git integration: This is a native way to integrate with source control. It helps you parameterize your jobs and move them between environments such as dev and prod. It also helps you apply software development best practices such as code review.
Job Monitoring: Glue also has job monitoring. No matter how you wrote your job (visually, in code, or through the API), you can monitor it in one centralized place. This gives you great visibility into what is going on in your environment.
Workflow and Orchestration: Glue has native workflows and also integrates with Step Functions and Airflow. Depending on your requirements, you can either schedule a job or trigger it by an event.
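The two ways of starting a job can be sketched as trigger definitions in the shape that boto3's Glue `create_trigger` call accepts. The first runs a job on a cron schedule. The second uses a conditional trigger, taken here as one simple form of event-driven start, which fires when an upstream job succeeds. The helper functions, trigger names, and job names are hypothetical, and nothing here calls AWS.

```python
def scheduled_trigger(name: str, job_name: str, cron: str) -> dict:
    """Trigger definition that starts a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name: str, job_name: str, upstream_job: str) -> dict:
    """Trigger definition that starts a job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Run the (hypothetical) ingest job every day at 02:00 UTC ...
nightly = scheduled_trigger("nightly-ingest", "ingest-orders", "0 2 * * ? *")
# ... and start the (hypothetical) reporting job once ingest has succeeded.
after_ingest = conditional_trigger("report-after-ingest", "build-report",
                                   upstream_job="ingest-orders")
```

Either dictionary could then be passed as `boto3.client("glue").create_trigger(**nightly)`.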
Data Management
Data Catalog: Data management is about how you manage your data using the Data Catalog, the most popular metadata store when we talk about data lakes or Delta Lake. Essentially, it stores metadata for files and tables in sources such as S3, Hadoop, and RDS. This metadata helps you keep track of all the information about your data. The crawler, which is also part of the Data Catalog, keeps format and schema information up to date, which mitigates most data format and schema issues automatically.
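As an illustration of how a crawler might be set up to keep the catalog in sync with the data, the helper below builds keyword arguments in the shape that boto3's Glue `create_crawler` call accepts. The crawler name, role ARN, database name, S3 path, and schedule are hypothetical, and the schema change policy shown is only one reasonable choice.

```python
from typing import Optional


def build_crawler_definition(name: str, role_arn: str, database: str,
                             s3_path: str, cron: Optional[str] = None) -> dict:
    """Build keyword arguments for a Glue create_crawler request."""
    definition = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            # Apply detected schema changes to the catalog table ...
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # ... but only log objects that have disappeared.
            "DeleteBehavior": "LOG",
        },
    }
    if cron is not None:
        definition["Schedule"] = f"cron({cron})"
    return definition


orders_crawler = build_crawler_definition(
    name="orders-crawler",                                      # hypothetical
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # hypothetical
    database="sales",                                           # hypothetical
    s3_path="s3://example-bucket/raw/orders/",                  # hypothetical
    cron="0 1 * * ? *",
)
```

With real credentials this could be submitted as `boto3.client("glue").create_crawler(**orders_crawler)`.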
Sensitive Data Detection: Nowadays there is more regulation and compliance, and data privacy has become key for a lot of customers. Glue provides sensitive data detection in the data pipeline: it identifies PII and other sensitive data that flows through your pipelines using pattern matching and machine learning.
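To give a feel for the pattern-matching half of this idea, here is a toy sketch in plain Python. It is not Glue's implementation and does not use any Glue API: the two regular expressions, the entity labels, and the sample rows are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real detectors cover many more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_sensitive_columns(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each column name to the sensitive entity labels found in it."""
    findings: Dict[str, List[str]] = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    labels = findings.setdefault(column, [])
                    if label not in labels:
                        labels.append(label)
    return findings


sample_rows = [
    {"name": "Ana", "contact": "ana@example.com", "tax_id": "123-45-6789"},
    {"name": "Bo", "contact": "n/a", "tax_id": "987-65-4321"},
]
findings = detect_sensitive_columns(sample_rows)
```

Columns flagged this way could then be masked, hashed, or dropped before the data moves further down the pipeline.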
References:
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html