AWS GLUE

AWS Glue is a serverless data integration service that simplifies the entire process of discovering, preparing, and combining data for application development, machine learning, and analytics.

AWS Glue provides the following capabilities:

  • Run ETL jobs as newly collected data arrives – For example, AWS Glue can automatically run your ETL jobs as new data arrives in your Amazon Simple Storage Service (Amazon S3) buckets.
  • Data Catalog – Use it to rapidly browse and search multiple AWS datasets without moving the data. Once cataloged, your data is immediately searchable and queryable with Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR.
  • AWS Glue Studio – Supports no-code ETL jobs: it lets you visually build, run, and monitor AWS Glue ETL jobs. You compose jobs that move and transform data in a drag-and-drop editor, and AWS Glue auto-generates the code.
  • Multi-method support – Supports a variety of data processing approaches and workloads, such as ETL, ELT, batch, and streaming. You can also work the way you prefer, whether that is drag and drop, writing code, or connecting a notebook.
  • AWS Glue Data Quality – Automatically creates, manages, and monitors data quality rules, helping to ensure high-quality data throughout your data lakes and pipelines.
  • AWS Glue DataBrew – Lets you discover and interact with data directly from your data lakes, data warehouses, and databases, including Amazon S3, Amazon Redshift, AWS Lake Formation, Amazon Aurora, and Amazon Relational Database Service (Amazon RDS).
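The first capability above, running ETL as new data arrives, is commonly wired up by having an S3 event notification invoke an AWS Lambda function that calls Glue's `StartJobRun` API. The sketch below assumes that pattern; the job name and the `--input_path` argument key are hypothetical, not fixed by AWS Glue.

```python
def build_job_args(record):
    """Extract the new object's location from an S3 event record and
    shape it into Glue job arguments (the argument name is illustrative)."""
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"--input_path": f"s3://{bucket}/{key}"}

def handler(event, context):
    """Lambda entry point: start the Glue job once per newly arrived object."""
    import boto3  # AWS SDK; assumed available in the Lambda runtime
    glue = boto3.client("glue")
    for record in event["Records"]:
        glue.start_job_run(
            JobName="my-etl-job",            # hypothetical Glue job name
            Arguments=build_job_args(record),
        )
```

Keeping the argument-building logic in its own function makes it easy to unit test without AWS credentials; only `handler` touches the SDK.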

Components of AWS Glue

  • The console – The AWS Glue console is where you define and orchestrate your workflows. From here you can call API operations to perform tasks such as defining AWS Glue objects, editing transformation scripts, and defining the events or schedules that trigger jobs.
  • AWS Glue Data Catalog – A central repository for your metadata, organized into metadata tables, with each table pointing to a single data store. In other words, it acts as an index to your data's schema, location, and runtime metrics, which AWS Glue uses to identify the sources and targets of your ETL (Extract, Transform, Load) jobs.
  • Job scheduling system – A flexible scheduler that helps you automate and chain your ETL pipelines by setting up job execution schedules and event-based triggers.
  • Script – AWS Glue generates a script to transform your data; alternatively, you can upload your own via the AWS Glue console or API. Scripts extract data from a data source, transform it, and load it into a data target. In AWS Glue, scripts run in an Apache Spark environment.
  • Connection – A Data Catalog object that contains the properties required to connect to a given data store.
  • Data store – Where your data is stored persistently, such as a relational database or Amazon S3.
  • Data source – A data store that serves as input to a transform or process.
  • Data target – The data store that a process writes to.
  • Transform – The code logic used to change the format of your data.
  • ETL engine – The component that handles ETL code generation. It automatically generates the code in Python or Scala and then gives you the option of customizing it.
  • Crawler and classifier – A crawler scans a data source and uses built-in or custom classifiers to infer its schema, then creates or updates metadata tables in the Data Catalog.
  • Job – The business logic that performs an ETL task in AWS Glue. Internally, it is a script written for Apache Spark in Python or Scala.
  • Trigger – Starts an ETL job at a scheduled time or on demand.
  • Development endpoint – Provides an environment in which your ETL job scripts can be developed, tested, and debugged.
  • Database – Used to create or access the source and target databases.
  • Table – You can create one or more tables in the database for use by the source and target.
  • Notebook server – A web-based environment for running PySpark statements (PySpark, a Python dialect for ETL programming). With AWS Glue extensions, you can run PySpark statements on a notebook server.
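To make the Transform component concrete, here is a minimal plain-Python sketch in the spirit of Glue's ApplyMapping transform, which renames fields and casts their types. In an actual Glue script this logic runs on Spark DynamicFrames via the `awsglue` library; the field names and mappings below are invented for illustration.

```python
# Each mapping is (source_field, target_field, cast). Hypothetical fields:
MAPPINGS = [
    ("user_id", "id", int),
    ("signup_date", "joined", str),
]

def apply_mapping(records, mappings):
    """Rename and cast the fields of each record per the mappings;
    fields not mentioned in the mappings are dropped."""
    return [
        {tgt: cast(rec[src]) for src, tgt, cast in mappings if src in rec}
        for rec in records
    ]
```

For example, `apply_mapping([{"user_id": "7", "signup_date": "2021-01-01"}], MAPPINGS)` would yield records shaped like `{"id": 7, "joined": "2021-01-01"}`, which is the essence of what a Glue transform step does at scale.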


