AWS Glue In 2023

Huzefa Khan

Data Engineer | Big Data Processing | Spark,Python,SQL | AWS

Published: January 14, 2023

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

Features of AWS Glue

The core features of Glue are as follows:

Automatic schema discovery. Glue lets developers automate crawlers that obtain schema-related information and store it in the Data Catalog, which can then be used to manage jobs.
Job scheduler. Glue jobs can be run on a flexible schedule, by event-based triggers, or on demand. Several jobs can be started in parallel, and users can specify dependencies between jobs.
Developer endpoints. Developers can use these to debug Glue and to create custom readers, writers, and transformations, which can then be imported into custom libraries.
Automatic code generation. Glue automatically generates the ETL code; the only input necessary is a location/path for the data to be stored. The code is in either Scala or Python.
Integrated data catalog. Acts as a single metadata store for data from disparate sources in the AWS pipeline. Each AWS account has one catalog per AWS Region.
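
As a rough illustration of automatic schema discovery, the sketch below builds the arguments for a crawler that scans an S3 path and writes the discovered schema into a Data Catalog database. The keys follow the boto3 Glue create_crawler call; the crawler name, role ARN, database name, and bucket path are hypothetical placeholders, and build_crawler_request is a helper invented for this example, not part of any AWS library.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the boto3 Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# All values below are hypothetical placeholders.
crawler_request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
print(crawler_request)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_request)
#   glue.start_crawler(Name=crawler_request["Name"])
```

Keeping the request as a plain dictionary makes it easy to review or store in source control before anything is created in the account.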

So how does AWS Glue fit in the modern data stack?

Purpose-built data services
Seamless data movement and sharing
Unified data governance
Scalable, performant, and cost-effective

[Figure: Glue for the modern data stack]

The main reasons to choose AWS Glue:

Serverless
Fast to market
Spark based
Cost efficient

That is why it is so powerful: you only need to focus on your data, while the infrastructure is managed by AWS. You pay only per use, which avoids licensing costs and idle infrastructure time. Glue is also built on powerful open-source technology and supports multiple open-source engines.

Glue is a data integration service. Because it is serverless, it spins up clusters for you in a matter of seconds, so you are no longer bound by the capacity you have. As per the Glue documentation, it can spin up 300 nodes in 5 to 10 seconds. Another key benefit is per-second billing: you pay only for what you actually use, which is a great point when it comes to cost.
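
To make the per-second billing point concrete, here is a small worked calculation. The price of 0.44 USD per DPU-hour and the 60-second minimum are assumptions used only for illustration; actual rates vary by Region and Glue version, so check the AWS Glue pricing page. The helper glue_job_cost is hypothetical.

```python
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate the cost of one job run under per-second billing.

    price_per_dpu_hour and minimum_seconds are assumed example values.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = dpus * billed_seconds / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# A 5-minute run on 10 DPUs is billed for 300 seconds only.
five_minute_cost = glue_job_cost(dpus=10, runtime_seconds=300)
# The same 10 DPUs kept busy for a full hour, for comparison.
full_hour_cost = glue_job_cost(dpus=10, runtime_seconds=3600)

print(five_minute_cost)  # 0.3667
print(full_hour_cost)    # 4.4
```

Under these assumed numbers, a short job costs a fraction of what an hour of the same capacity would cost, which is the benefit of not paying for idle time.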

To be even more cost-effective, Glue offers auto scaling, which is especially useful when you do not know exactly how much capacity a job needs.

On the other side, Glue has its own counterpart to Spot Instances that runs jobs on spare capacity (the Flex execution class). In many use cases, customers have critical jobs that need standard execution because they must not fail. Customers also have non-critical jobs that can benefit from a 34% discount by running on spare capacity.
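
The auto scaling and spare-capacity options described above can be sketched as a job definition. The keys follow the boto3 Glue create_job call, where ExecutionClass accepts STANDARD or FLEX and the --enable-auto-scaling job argument turns on auto scaling; the job names, role ARN, script locations, Glue version, and worker settings are hypothetical, and build_job_request is a helper made up for this example.

```python
def build_job_request(name, role_arn, script_location, critical, max_workers=10):
    """Build keyword arguments for the boto3 Glue create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",                   # assumed version for this sketch
        "WorkerType": "G.1X",
        "NumberOfWorkers": max_workers,         # upper bound when auto scaling is on
        # Critical jobs use standard execution; others run on spare capacity.
        "ExecutionClass": "STANDARD" if critical else "FLEX",
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


# Hypothetical jobs: one critical, one that can tolerate delays.
billing_job = build_job_request(
    "billing-etl", "arn:aws:iam::123456789012:role/GlueJobRole",
    "s3://example-bucket/scripts/billing.py", critical=True,
)
backfill_job = build_job_request(
    "history-backfill", "arn:aws:iam::123456789012:role/GlueJobRole",
    "s3://example-bucket/scripts/backfill.py", critical=False,
)
print(billing_job["ExecutionClass"], backfill_job["ExecutionClass"])

# Submitting would look like: boto3.client("glue").create_job(**billing_job)
```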

The powerful thing about serverless is that you do not have to care about idle time. You have all the capacity that you need.

Glue layers

Glue is based on five layers:

Connectors

AWS Glue offers more than 80 connectors, most of them bidirectional. Native connectors ship with Glue. Custom connectors let customers with bespoke systems build and wrap a connector themselves. Many more connectors are available in the AWS Marketplace, covering newer systems, industry-specific systems, and SaaS applications. The main idea is to have the flexibility to choose the right set of connectors for your sources and targets.

Engines

AWS Glue supports three engines:

Spark (a very popular open-source processing engine)
Python shell (for smaller Python workloads)
Ray (an open-source engine based on the Ray project, ray.io; it lets you run Python scripts in a distributed way, which is an easy way to scale Python)

The choice of Glue engine depends on your use case and workloads.
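
One way to picture the engine choice is as a simple rule of thumb mapped onto the job command names used by the Glue API (glueetl for Spark, pythonshell for Python shell, glueray for Ray). The decision logic below is only an illustrative assumption, not AWS guidance, and pick_engine is a hypothetical helper.

```python
# Job command names used by the Glue API for each engine.
ENGINE_COMMANDS = {
    "spark": "glueetl",
    "python_shell": "pythonshell",
    "ray": "glueray",
}


def pick_engine(needs_distribution, prefers_plain_python):
    """Illustrative rule of thumb for choosing an engine (an assumption)."""
    if not needs_distribution:
        return "python_shell"   # smaller workloads that fit on one node
    if prefers_plain_python:
        return "ray"            # distributed execution of Python scripts
    return "spark"              # large-scale distributed processing


small_job = pick_engine(needs_distribution=False, prefers_plain_python=True)
python_at_scale = pick_engine(needs_distribution=True, prefers_plain_python=True)
big_etl = pick_engine(needs_distribution=True, prefers_plain_python=False)

for engine in (small_job, python_at_scale, big_etl):
    print(engine, "->", ENGINE_COMMANDS[engine])
```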

Author

This is how AWS Glue abstracts the engines above. Glue offers five different ways to author jobs. The following interfaces are available for creating a Glue job:

1. Glue Studio (drag-and-drop components)

2. Glue Studio notebooks (code-based)

3. Glue DataBrew (low-code, mostly used for data preparation, much like an Excel spreadsheet)

4. API, SDK, and local notebooks (Jupyter Notebook)

5. Amazon SageMaker Studio notebooks

Whichever interface you choose, based on your skills and needs, you get the benefits of all the platform innovation in Glue.

Operationalize

Operationalization is the phase after the data pipelines are built: you have to build workflows, schedule them, and then monitor them. Data engineers spend about 50% of their time operationalizing data pipelines. Glue Studio offers several options for this, such as Git integration, job monitoring, and workflow orchestration.

Git integration: This is a native way to integrate with source control. It helps you parameterize your jobs and move them between environments such as dev and prod. It also helps apply software development best practices such as code review.

Job monitoring: Glue also has job monitoring. No matter how you authored your job (visually, in code, or through the API), you can monitor it in a centralized place. This gives you great visibility into what is going on in your environment.
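
Monitoring is also possible programmatically. The sketch below polls a job run until it reaches a terminal state, using the response shape of the boto3 Glue get_job_run call (JobRun.JobRunState). The helper wait_for_job_run, the job name, and the run ID are hypothetical, and FakeGlueClient is a stand-in so the example runs without an AWS account.

```python
import time

# Job run states after which no further change is expected.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(glue_client, job_name, run_id, poll_seconds=30, sleep=time.sleep):
    """Poll get_job_run until the run reaches a terminal state, then return it."""
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


class FakeGlueClient:
    """Stand-in for boto3.client("glue") that replays a fixed list of states."""

    def __init__(self, states):
        self._states = iter(states)

    def get_job_run(self, JobName, RunId):
        return {"JobRun": {"Id": RunId, "JobName": JobName,
                           "JobRunState": next(self._states)}}


fake_client = FakeGlueClient(["STARTING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(
    fake_client, "nightly-sales-etl", "jr_example", sleep=lambda seconds: None
)
print(final_state)  # SUCCEEDED
```

With real credentials, passing boto3.client("glue") in place of the fake client would poll an actual job run in the same way.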

Workflow and orchestration: Glue has native workflows and also integrates with Step Functions and Airflow. Depending on your requirements, you can either schedule a job or trigger it by an event.
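
The two options, scheduling a job or chaining it to another job's outcome, can be sketched as trigger definitions. The keys follow the boto3 Glue create_trigger call (Type, Schedule, Predicate, Actions); the trigger names, job names, and cron expression are hypothetical, and both helper functions are made up for this example.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Arguments for a trigger that starts a job on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, upstream_job, downstream_job):
    """Arguments for a trigger that starts a job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": upstream_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical pipeline: ingest daily at 02:00 UTC, then transform once ingest succeeds.
nightly = scheduled_trigger("nightly-ingest", "ingest-job", "cron(0 2 * * ? *)")
chained = conditional_trigger("after-ingest", "ingest-job", "transform-job")
print(nightly["Type"], chained["Type"])

# Submitting would look like: boto3.client("glue").create_trigger(**nightly)
```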

Data Management

Data Catalog: Data management is how you manage your data using the Data Catalog, the most popular metadata store when we talk about data lakes or Delta Lake. Essentially, it stores metadata for files and tables in sources such as S3, Hadoop, and RDS, keeping all the information about your data in one place. Crawlers, which are also part of the Data Catalog, keep format and schema information up to date, which automatically mitigates most format- and schema-related issues.
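
To show what this metadata looks like in practice, the sketch below extracts column names and types from a response shaped like the boto3 Glue get_table result (Table.StorageDescriptor.Columns). The sample response is a trimmed, made-up example, and table_schema is a hypothetical helper.

```python
def table_schema(get_table_response):
    """Return {column name: type} from a Glue get_table style response."""
    columns = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return {column["Name"]: column["Type"] for column in columns}


# Trimmed, hypothetical response; a real one carries many more fields.
sample_response = {
    "Table": {
        "Name": "orders",
        "DatabaseName": "sales_db",
        "StorageDescriptor": {
            "Location": "s3://example-bucket/sales/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "customer_email", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
        },
    }
}

schema = table_schema(sample_response)
print(schema)

# Against a real catalog, the response would come from:
#   boto3.client("glue").get_table(DatabaseName="sales_db", Name="orders")
```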

Sensitive data detection: Nowadays there is more regulation and compliance, and data privacy has become key for many customers. Glue provides sensitive data detection in the data pipeline: it identifies PII and other sensitive data passing through your pipelines using pattern matching and machine learning.
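
The pattern-matching half of this idea can be illustrated in plain Python. This is not Glue's implementation: the two regular expressions, the record, and the detect_sensitive_fields helper are simplified assumptions meant only to show how fields can be flagged by pattern.

```python
import re

# Deliberately simple example patterns; real detectors are far more thorough.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_sensitive_fields(record):
    """Return {field: [matched pattern names]} for fields that look sensitive."""
    findings = {}
    for field, value in record.items():
        matches = sorted(
            name for name, pattern in PATTERNS.items() if pattern.search(str(value))
        )
        if matches:
            findings[field] = matches
    return findings


# Made-up record for illustration.
record = {
    "id": 42,
    "contact": "jane.doe@example.com",
    "note": "SSN 123-45-6789 on file",
}
findings = detect_sensitive_fields(record)
print(findings)  # {'contact': ['email'], 'note': ['us_ssn']}
```

Flagged fields could then be masked, dropped, or routed for review before the data moves further down the pipeline.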

References:

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
