AWS Glue In 2023

AWS Glue is a serverless data integration service that makes it easy to discover, prepare and combine data for analytics, machine learning and application development.

Features of AWS Glue

The core features of Glue are as follows:

  • Automatic schema discovery. Glue lets developers automate crawlers that obtain schema-related information and store it in the Data Catalog, which can then be used to manage jobs.
  • Job scheduler. Glue jobs can run on a flexible schedule, by event-based triggers or on demand. Several jobs can be started in parallel, and users can specify dependencies between jobs.
  • Developer endpoints. Developers can use these to debug Glue and to create custom readers, writers and transformations, which can then be imported into custom libraries.
  • Automatic code generation. Glue generates the ETL code automatically; the only input necessary is a location/path for the data to be stored. The code is in either Scala or Python.
  • Integrated data catalog. Acts as a single metadata store for data from disparate sources in the AWS pipeline. Each AWS account has one catalog per AWS Region.

So how does AWS Glue fit into the modern data stack?

  • Purpose-built data services
  • Seamless data movement and sharing
  • Unified data governance
  • Scalable, performant and cost-effective

Figure: Glue for the modern data stack

The main reasons to choose AWS Glue:

  • Serverless
  • Fast to market
  • Spark based
  • Cost efficient

That is why it is so powerful: you only need to focus on your data, while the infrastructure is managed by AWS. You pay only per use, which avoids licensing costs and idle infrastructure time. Glue is also built on powerful open-source technology and supports multiple open-source engines.

Glue is a data integration service. Because it is serverless, it spins up clusters for you in a matter of seconds, so you are no longer bound by the capacity that you have. As per the Glue documentation, it can spin up 300 nodes in 5 to 10 seconds. Another key benefit is per-second billing: you pay only for what you actually use, which is a strong point when it comes to cost.

To be even more cost-effective, Glue offers auto scaling, which is especially useful when you do not know exactly how much capacity a job needs.

On the other hand, Glue has its own counterpart to Spot Instances that runs jobs on spare capacity (the Flex execution class). In many use cases customers have critical jobs that need standard execution because they must not fail. Customers also have non-critical jobs that can run on spare capacity and benefit from a 34% discount.
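
The effect of per-second billing and the 34% spare-capacity discount can be sketched with a little arithmetic. This is only an illustration: the rate of 0.44 per DPU-hour is an assumed placeholder, not a quoted price, and the sketch ignores any minimum billing duration that may apply.

```python
# Rough cost sketch for per-second billing.
# ASSUMED_RATE_PER_DPU_HOUR is a hypothetical placeholder; check current pricing.
ASSUMED_RATE_PER_DPU_HOUR = 0.44


def job_cost(dpus: int, runtime_seconds: float,
             rate_per_dpu_hour: float, discount: float = 0.0) -> float:
    """Cost of one job run, billed per second of actual runtime."""
    dpu_hours = dpus * runtime_seconds / 3600
    return dpu_hours * rate_per_dpu_hour * (1 - discount)


# A 15-minute run on 10 DPUs, once on standard and once on spare capacity.
standard_cost = job_cost(10, 900, ASSUMED_RATE_PER_DPU_HOUR)
spare_cost = job_cost(10, 900, ASSUMED_RATE_PER_DPU_HOUR, discount=0.34)

print(f"standard: {standard_cost:.3f}, spare capacity: {spare_cost:.3f}")
```

Because billing follows runtime, halving the runtime halves the cost, and the discount applies on top of that for jobs that can tolerate spare capacity.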

The powerful thing about serverless is that you do not have to care about idle time, and you have all the capacity that you need.

Glue layers

Glue is built on five layers: connectors, engines, authoring, operationalization and data management.

Figure: AWS Glue overview

Connectors

AWS Glue has more than 80 connectors, and most of them are bidirectional. Native connectors come with Glue. In addition, you can build custom connectors when you have custom systems and wrap them yourself. There are also many connectors in the Marketplace, covering various newer systems, industry-specific systems and SaaS applications. The main idea is to give you the flexibility to choose the right set of connectors for your sources and targets.

Figure: AWS Glue connectors

Engines

AWS Glue supports three engines:

  • Spark (a very popular open-source processing engine)
  • Python shell (for smaller Python workloads)
  • Ray (based on the open-source Ray project, ray.io; it lets you run Python scripts in a distributed way, which is an easy way to scale Python)

The choice of Glue engine depends on your use case and workloads.
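
As a rough sketch of how the engine choice shows up when a job is defined through the API, the helper below builds a job request as a plain dictionary. The command names (`glueetl`, `pythonshell`, `glueray`) are the values I understand the Glue CreateJob API to use for the three engines; verify them against the API reference. The helper `build_job_request`, the bucket path and the role ARN are hypothetical, and a real job usually needs more settings (Glue version, worker type and so on).

```python
from typing import Any, Dict

# Engine name (as used in this article) -> assumed Glue job command name.
ENGINE_COMMANDS = {
    "spark": "glueetl",
    "python_shell": "pythonshell",
    "ray": "glueray",
}


def build_job_request(name: str, engine: str,
                      script_location: str, role_arn: str) -> Dict[str, Any]:
    """Build a minimal job definition; only the command name differs per engine."""
    if engine not in ENGINE_COMMANDS:
        raise ValueError(f"unknown engine: {engine!r}")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": ENGINE_COMMANDS[engine],
            "ScriptLocation": script_location,
        },
    }


# Hypothetical names and locations, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/ExampleGlueRole"
spark_job = build_job_request("nightly-etl", "spark",
                              "s3://example-bucket/scripts/etl.py", ROLE)
ray_job = build_job_request("feature-prep", "ray",
                            "s3://example-bucket/scripts/prep.py", ROLE)
print(spark_job["Command"]["Name"], ray_job["Command"]["Name"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as keyword arguments to the Glue client's `create_job` call; the sketch stops short of that so it runs without an AWS account.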

Author

This is how AWS Glue abstracts the engines above. Glue offers five different ways to author jobs. The interfaces for creating a Glue job are:

1. Glue Studio (drag-and-drop components)

2. Glue Studio notebook (code-based)

3. Glue DataBrew (low-code, mostly used for data preparation, much like an Excel spreadsheet)

4. API, SDK and local notebook (Jupyter notebook)

5. Amazon SageMaker Studio notebook

Whichever interface you use, you get the benefit of all of Glue's platform innovation. The choice depends on the user's skills and needs.

Figure: Glue interfaces

Operationalize

Operationalization is the phase after building data pipelines: you have to build a workflow, schedule it and then monitor it. Data engineers are said to spend about 50% of their time operationalizing data pipelines. Glue Studio offers several options for this, such as Git integration, job monitoring, and workflow and orchestration.

Git integration: This is a native way to integrate with source control. It helps you parameterize your jobs and move them between environments such as dev and prod. It also helps you apply software development best practices such as code review.

Job monitoring: Glue also has job monitoring. No matter how you wrote your job (visually, in code or through the API), you can still monitor it in a centralized place. This gives you great visibility into what is going on in your environment.

Workflow and orchestration: Glue has native workflows and also integrates with Step Functions and Airflow. Depending on your requirements, you can either schedule a job or trigger it by an event.
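
The two options, scheduling a job or triggering it by an event, can be sketched as trigger definitions. The dictionaries below follow the field names I understand the Glue CreateTrigger API to accept (`Type`, `Schedule`, `Predicate`, `Actions`); treat them as assumptions to verify against the API reference. The helper functions, trigger names and job names are hypothetical.

```python
from typing import Any, Dict


def scheduled_trigger(name: str, job_name: str, cron_expression: str) -> Dict[str, Any]:
    """Start a job on a fixed schedule, given a cron expression."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
    }


def conditional_trigger(name: str, upstream_job: str, downstream_job: str) -> Dict[str, Any]:
    """Start a job when an upstream job finishes successfully (a job dependency)."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": downstream_job}],
    }


# Hypothetical pipeline: ingest every night at 02:00 UTC, then transform on success.
nightly = scheduled_trigger("nightly-ingest", "ingest-job", "0 2 * * ? *")
chained = conditional_trigger("after-ingest", "ingest-job", "transform-job")
print(nightly["Schedule"], "->", chained["Actions"][0]["JobName"])
```

Keeping the definitions as plain data also fits the Git integration described earlier: the same trigger definitions can be reviewed and promoted between environments.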

Data Management

Data Catalog: Data management is about how you manage your data using the Data Catalog, one of the most popular metadata stores when we talk about data lakes or Delta Lake. Essentially, it stores metadata for files or tables in sources such as S3, Hadoop and RDS. This metadata helps you keep all the information about your data in one place. Crawlers, which are also part of the Data Catalog, help keep information about format and schema up to date, which automatically mitigates most format- and schema-related issues.
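
To give a feel for what schema discovery means, here is a toy sketch that infers column types from a few sample records. It is not how Glue crawlers are implemented; the type names and the merge rule (integer plus float becomes double, any other conflict becomes string) are assumptions chosen for illustration.

```python
from typing import Any, Dict, Iterable


def infer_type(value: Any) -> str:
    """Map a Python value to a simple, catalog-style type name."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def merge_types(current: str, observed: str) -> str:
    """Widen the column type when records disagree."""
    if current == observed:
        return current
    if {current, observed} == {"bigint", "double"}:
        return "double"
    return "string"


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Scan records and build a column -> type mapping, skipping missing values."""
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue
            observed = infer_type(value)
            schema[column] = (merge_types(schema[column], observed)
                              if column in schema else observed)
    return schema


sample = [
    {"id": 1, "amount": 10, "country": "PK"},
    {"id": 2, "amount": 12.5, "country": "AE", "vip": True},
]
inferred = infer_schema(sample)
print(inferred)
```

Note how the second record both widens `amount` and introduces a new `vip` column; picking up this kind of change on each run is what keeps catalog metadata current.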

Sensitive data detection: Nowadays there is more regulation and compliance, and data privacy has become key for many customers. Glue provides sensitive data detection in the data pipeline: it identifies PII and other sensitive data passing through your pipelines using pattern matching and machine learning.
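
The pattern-matching half of this idea can be illustrated with a small sketch that flags columns containing email addresses or phone-like numbers. This is a simplified stand-in, not Glue's detection logic: the two regular expressions, the `detect_sensitive_columns` helper and the sample rows are all made up for illustration, and real detectors cover many more entity types.

```python
import re
from typing import Dict, Iterable, Set

# Deliberately simple, assumed patterns; production detectors are far stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def detect_sensitive_columns(rows: Iterable[Dict[str, object]]) -> Dict[str, Set[str]]:
    """Return, per column, the kinds of sensitive data seen in any row."""
    findings: Dict[str, Set[str]] = {}
    for row in rows:
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for label, pattern in PATTERNS.items():
                if pattern.search(value):
                    findings.setdefault(column, set()).add(label)
    return findings


rows = [
    {"name": "A. Customer", "contact": "a.customer@example.com",
     "note": "call +92 300 1234567"},
    {"name": "B. Customer", "contact": "b@example.org",
     "note": "no follow-up needed"},
]
findings = detect_sensitive_columns(rows)
print(findings)
```

Once columns are flagged like this, a pipeline can decide whether to mask, drop or restrict them before the data moves downstream.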

Figure: Glue sensitive data detection


References:

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
