AWS GLUE

AWS Glue is a serverless data integration service that simplifies the entire process of discovering, preparing, and combining data for application development, machine learning, and analytics.

AWS Glue provides the following capabilities:

  • Run ETL jobs as newly collected data arrives – For example, AWS Glue can automatically run your ETL jobs as new data arrives in your Amazon Simple Storage Service (Amazon S3) buckets.
  • Data Catalog – Use it to rapidly browse and search multiple AWS datasets without moving the data. Once cataloged, your data is immediately searchable and queryable with Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR.
  • AWS Glue Studio – Supports no-code ETL jobs: it lets you visually build, run, and monitor AWS Glue ETL jobs. You compose jobs that move and transform data in a drag-and-drop editor, and AWS Glue auto-generates the code.
  • Multi-method support – Supports a variety of data processing approaches and workloads, such as ETL, ELT, batch, and streaming. You can also work the way you prefer, whether that is drag and drop, writing code, or connecting a notebook.
  • AWS Glue Data Quality – Automatically creates, manages, and monitors data quality rules, helping to ensure high-quality data throughout your data lakes and pipelines.
  • AWS Glue DataBrew – Lets you discover and interact with data directly from your data lakes, data warehouses, and databases, including Amazon S3, Amazon Redshift, AWS Lake Formation, Amazon Aurora, and Amazon Relational Database Service (Amazon RDS).
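The first capability above, running ETL as new data arrives, is commonly wired up by having an S3 event notification invoke an AWS Lambda function that calls Glue's `StartJobRun` API. The sketch below assumes that pattern; the job name and the `--input_path` argument key are hypothetical, not fixed by AWS Glue.

```python
def build_job_args(record):
    """Extract the new object's location from an S3 event record and
    shape it into Glue job arguments (the argument name is illustrative)."""
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"--input_path": f"s3://{bucket}/{key}"}

def handler(event, context):
    """Lambda entry point: start the Glue job once per newly arrived object."""
    import boto3  # AWS SDK; assumed available in the Lambda runtime
    glue = boto3.client("glue")
    for record in event["Records"]:
        glue.start_job_run(
            JobName="my-etl-job",            # hypothetical Glue job name
            Arguments=build_job_args(record),
        )
```

Keeping the argument-building logic in its own function makes it easy to unit test without AWS credentials; only `handler` touches the SDK.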

Components of AWS Glue

  • The console – The AWS Glue console is where you define and orchestrate your workflows. From here you can call API operations to perform tasks such as defining AWS Glue objects, editing transformation scripts, and defining the events or schedules that trigger jobs.
  • AWS Glue Data Catalog – A central repository for your metadata, organized into metadata tables, with each table pointing to a single data store. In other words, it acts as an index to your data's schema, location, and runtime metrics, which AWS Glue uses to identify the sources and targets of your ETL (Extract, Transform, Load) jobs.
  • Job scheduling system – A flexible scheduler that helps you automate and chain your ETL pipelines by setting up job execution schedules and event-based triggers.
  • Script – AWS Glue generates a script to transform your data; alternatively, you can upload your own via the AWS Glue console or API. Scripts extract data from a data source, transform it, and load it into a data target. In AWS Glue, scripts run in an Apache Spark environment.
  • Connection – A Data Catalog object that contains the properties required to connect to a given data store.
  • Data store – Where your data is stored persistently, such as a relational database or Amazon S3.
  • Data source – A data store that serves as input to a transform or process.
  • Data target – The data store that a process writes to.
  • Transform – The code logic used to change the format of your data.
  • ETL engine – The component that handles ETL code generation. It automatically generates the code in Python or Scala and then gives you the option of customizing it.
  • Crawler and classifier – A crawler scans a data source and uses built-in or custom classifiers to infer its schema, then creates or updates metadata tables in the Data Catalog.
  • Job – The business logic that performs an ETL task in AWS Glue. Internally, it is a script written for Apache Spark in Python or Scala.
  • Trigger – Starts an ETL job at a scheduled time or on demand.
  • Development endpoint – Provides an environment in which your ETL job scripts can be developed, tested, and debugged.
  • Database – Used to create or access the source and target databases.
  • Table – You can create one or more tables in the database for use by the source and target.
  • Notebook server – A web-based environment for running PySpark statements (PySpark, a Python dialect for ETL programming). With AWS Glue extensions, you can run PySpark statements on a notebook server.
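To make the Transform component concrete, here is a minimal plain-Python sketch in the spirit of Glue's ApplyMapping transform, which renames fields and casts their types. In an actual Glue script this logic runs on Spark DynamicFrames via the `awsglue` library; the field names and mappings below are invented for illustration.

```python
# Each mapping is (source_field, target_field, cast). Hypothetical fields:
MAPPINGS = [
    ("user_id", "id", int),
    ("signup_date", "joined", str),
]

def apply_mapping(records, mappings):
    """Rename and cast the fields of each record per the mappings;
    fields not mentioned in the mappings are dropped."""
    return [
        {tgt: cast(rec[src]) for src, tgt, cast in mappings if src in rec}
        for rec in records
    ]
```

For example, `apply_mapping([{"user_id": "7", "signup_date": "2021-01-01"}], MAPPINGS)` would yield records shaped like `{"id": 7, "joined": "2021-01-01"}`, which is the essence of what a Glue transform step does at scale.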


