Building Data Pipelines with No-Code ETL Using AWS Glue Studio


This article was written by Neil Rico. Neil, fueled by his passion for technology, specializes in designing and optimizing cloud solutions, particularly in the rapidly evolving landscape of Amazon Web Services (AWS). He is committed to continuous learning, viewing life as a journey of discovery and personal growth.

Welcome to the exciting realm of AWS Data Engineering! This beginner-friendly guide offers an introduction to the fundamentals of data staging and transformation within the AWS ecosystem, all without requiring any coding skills. By focusing on the basics of Amazon S3 and AWS Glue, this guide provides a hands-on starting point for understanding how data is managed and processed on AWS. Whether you're preparing for certification or seeking to apply these skills in real-world situations, this guide lays the foundation for your journey in data engineering.

Preparation: Navigating Through Sample Datasets

In this article, we’ll explore three key datasets related to a fictional ride-hailing service. These datasets include essential information such as trip details, driver and vehicle specifics, and location-based fare data. Review the tables below and then download them in CSV format using the provided links. This format allows for seamless uploading and management of the data within Amazon S3, preparing us for no-code data integration and transformation.

Datasets Overview

  • Trip Data: Below is the detailed trip record table, including pickup and dropoff locations, timings, distances, fares, and ratings. You can download it here.

  • Driver & Vehicle Information: The table below gives insights into drivers and their vehicles, covering aspects like age, vehicle type, and registration details. You can download it here.

  • Location Base Fare Details: Lastly, here is the table with the base fare prices and surge multipliers based on pickup and dropoff locations. You can download it here.
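If you want a quick local look at the data before staging it in S3, you can preview the CSVs with a few lines of pandas. This is an optional sketch only; the filenames below are placeholders, so substitute whatever names your downloaded files actually have.

```python
import pandas as pd

# Filenames are placeholders -- use the names of the CSVs you downloaded.
trips = pd.read_csv("trip_data.csv")
drivers = pd.read_csv("driver_vehicle_info.csv")
fares = pd.read_csv("location_base_fare.csv")

print(trips.head())    # pickup/dropoff locations, timings, distances, fares, ratings
print(drivers.head())  # driver age, vehicle type, registration details
print(fares.head())    # base fare prices and surge multipliers per location pair
```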

Your Mission as a Data Engineer

Imagine yourself as a Data Engineer, responsible for merging these datasets to prepare them for thorough analysis—all without getting into complex coding. Here's how you can achieve this:

Data Staging with Amazon S3

Amazon S3 offers a versatile and secure object storage solution that scales effortlessly, ensuring your data remains accessible and protected. It's ideal for users across various industries, providing limitless storage options for diverse use cases, such as data lakes, websites, mobile apps, backups and recovery, archiving, and big data analytics. Below, we provide a practical example to help you get started with organizing your data in S3:

Step-by-Step Guide

1. Log in to your AWS account.

2. Navigate to Amazon S3 Console.

3. Create a new bucket.

• Click on Create bucket.

• Provide a unique name for your bucket (e.g., ride-hailing-data-<number or your name>).

• Choose the AWS Region that is closest to you or your customers.

• Leave the default settings or adjust them according to your security and compliance needs.

• Click on Create bucket.

4. Upload the datasets.

• Navigate into your newly created bucket.

• Create a folder for the raw data. Name it “raw data.”

  ◦ Enter that folder and click on Upload.

  ◦ Click on Add files to select the CSV datasets you downloaded earlier.

  ◦ Click on Upload to start the process.

• Return to the bucket overview and create another folder that will hold the transformed data files later. Name it “transformed data.”
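If you’d rather script these staging steps than click through the console, the boto3 sketch below mirrors the same flow. It assumes the three CSVs are in your working directory; the bucket name, Region, and filenames are placeholders.

```python
import boto3

s3 = boto3.client("s3")

bucket = "ride-hailing-data-yourname"  # placeholder: must be globally unique
region = "us-east-1"                   # placeholder: pick the Region closest to you

# Create the bucket (us-east-1 is the only Region that omits LocationConstraint).
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Upload the three CSVs into the "raw data" folder.
for filename in ["trip_data.csv", "driver_vehicle_info.csv", "location_base_fare.csv"]:
    s3.upload_file(filename, bucket, f"raw data/{filename}")

# Create an empty "transformed data" folder marker for the ETL output later.
s3.put_object(Bucket=bucket, Key="transformed data/")
```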

Well done! You’ve successfully uploaded your datasets to S3. The next step is to create a Glue Crawler to populate the AWS Glue Data Catalog, setting the stage for further data analysis, including transformation and enrichment.


Data Transformation with AWS Glue

AWS Glue is a fully managed ETL (Extract, Transform, and Load) service that streamlines the process of categorizing, cleaning, enriching, and transferring data across different data stores. It allows you to efficiently prepare and load data for analytics with minimal effort and reduced costs.

AWS Glue Studio is a visual interface designed to simplify the creation, execution, and monitoring of ETL (Extract, Transform, and Load) jobs within AWS Glue. This tool enables you to visually design data transformation workflows, effortlessly run them on AWS Glue's serverless ETL engine based on Apache Spark, and review the schema and data results at each step of the process.

The AWS Glue Data Catalog is a centralized metadata repository; Glue crawlers discover your data automatically and populate the catalog with table definitions.

Step-by-Step Guide

CREATE A GLUE CRAWLER

1. Navigate to AWS Glue from the AWS Management Console.

2. Add a crawler.

• In the AWS Glue console, expand Data Catalog in the left sidebar, click Crawlers, then click Create crawler.

3. Specify crawler details.

• For Crawler name, enter a name like ride-hailing-data-crawler.

• Click Next.

4. Choose a data store.

• Select S3 as the data store.

• Choose Specified path in my account, entering the path to your S3 bucket where the datasets are uploaded.

• Click Next.

5. Choose an IAM role.

• Select Create an IAM role, providing a name such as AWSGlueServiceRole-RideHailingData.

  ◦ Click View. You will be redirected to a new tab.

  ◦ Locate the policy with the "Customer Managed" type. Click the + to expand its details, then click Edit. You will be redirected to a new tab once more.

  ◦ Under Edit statement, find Add resource and click Add.

  ◦ Click the Resource type dropdown and select object.

  ◦ In Resource ARN, type arn:aws:s3:::ride-hailing-data-<number or your name>/transformed data*

  ◦ Click Add resource, click Next, then click Save changes.

• Go back to the previous tab (the AWS Glue tab), and click Next.

6. Configure the crawler’s output.

• For Database, select Add database, and name it (e.g., ride_hailing_service_data).

• Click Create database to create the database.

• Return to the previous tab, click the refresh button, and select ride_hailing_service_data as the database.

• Leave the Schedule set to On demand and click Next.

7. Review the summary, then click Create crawler.

8. Run the crawler by selecting it and clicking Run crawler. The crawler will classify your data and create tables in the Glue Data Catalog.

9. Review the metadata in the Glue Data Catalog. You should see three tables, one for each CSV file.
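For reference, steps 5 through 9 above can also be scripted with boto3. The sketch below assumes the IAM role created in step 5 already exists; the bucket name and the inline policy name are placeholders, and the added statement mirrors the console policy edit for the “transformed data” prefix.

```python
import json
import boto3

glue = boto3.client("glue")
iam = boto3.client("iam")

bucket = "ride-hailing-data-yourname"            # placeholder
role_name = "AWSGlueServiceRole-RideHailingData"
database = "ride_hailing_service_data"
crawler = "ride-hailing-data-crawler"

# Step 5 equivalent: allow the role to read/write the "transformed data" prefix
# (the console wizard already grants access to the raw data path it crawls).
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="ride-hailing-transformed-data-access",  # hypothetical policy name
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/transformed data*",
        }],
    }),
)

# Steps 6-8 equivalent: create the database and the on-demand crawler, then run it.
glue.create_database(DatabaseInput={"Name": database})
glue.create_crawler(
    Name=crawler,
    Role=role_name,
    DatabaseName=database,
    Targets={"S3Targets": [{"Path": f"s3://{bucket}/raw data/"}]},
)
glue.start_crawler(Name=crawler)

# Step 9 equivalent: once the crawler finishes, the three tables show up here.
tables = glue.get_tables(DatabaseName=database)
print([t["Name"] for t in tables["TableList"]])
```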


PREPARE YOUR AWS GLUE ENVIRONMENT

An AWS Glue Visual ETL Job is an ETL (Extract, Transform, and Load) job that is created and managed using the AWS Glue Studio's visual interface, enabling users to design and execute ETL processes without writing any code.

The key components of a visual ETL job in AWS Glue Studio are:

1. Sources refer to the input data that will be extracted for processing. Common sources include data stored in Amazon S3, Amazon DynamoDB, Amazon Redshift, and other data storage services.

2. Transforms refer to the data transformation operations applied to the source data to prepare it for loading into the target destination. Examples of transforms include:

  • Join – Joining connects related data from different sources using a common key. For example, you can join trip records with driver details on a driver ID so that each trip in the destination system also carries its driver’s information.
  • Drop fields – This refers to the process of removing or "dropping" specific fields or columns from the data during transformation. This is typically done to eliminate unused or unnecessary fields, helping to streamline and clean up the data.
  • Change schema – Changing the schema involves modifying the structure of the data, such as renaming or reordering fields, adding or removing fields, or altering the data types of fields. This ensures that the transformed data aligns with the schema requirements of the target system.
  • Filtering – Filtering involves extracting only the relevant data from the source by applying specific rules or conditions. For example, you might filter customer records to include only those where the country is "US."

3. Targets – Targets refer to the destinations where the transformed data will be loaded or written after the transformation process. Common targets include Amazon S3, Amazon Redshift, and other data storage services.

Some key things you can do with targets in AWS Glue include:

  • Changing the file format – You can save the transformed data in various file formats such as Parquet, JSON, ORC, and others, depending on your needs. Parquet and JSON are among the most commonly used formats.
  • Configuring compression – You can compress the target files for efficient storage using compression codecs such as GZIP and BZIP2. This can be configured within the S3 connection properties.


CREATE AN AWS GLUE ETL JOB

1. Navigate to AWS Glue and access the Visual ETL section.

2. Add a new job by clicking Visual ETL.

3. Configure the job properties, selecting a name (e.g., ride-hailing-data-transformation) and the IAM role you created (e.g., AWSGlueServiceRole-RideHailingData).

4. Choose a data source from your Glue Data Catalog, such as Amazon S3. In the Data Source properties, follow these steps:

  • S3 source type: Data Catalog table
  • Database: ride_hailing_service_data
  • Table: select the table the crawler created from the Trip Data CSV file.
  • Review the data in the Data Preview.
  • Add two more data sources for the tables created from the Location Base Fare Details and Driver & Vehicle Information CSV files.

5. Select a transform type by clicking the + sign (Add nodes) in the upper left corner. Our objective is to merge and prepare these datasets for analysis. While there are many transform types available, using JOIN, DROP FIELDS, and CHANGE SCHEMA will be sufficient to achieve our goal. Feel free to explore other transform types and add them as needed.

6. Map the source columns to target columns, specifying any transformations.

7. Choose a data target, select Amazon S3, and choose the format for your output data (e.g., CSV, Parquet).

8. Review and save your job configuration.
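Behind the visual editor, Glue Studio generates a PySpark script that runs on its serverless Spark engine. The sketch below is a hand-written approximation of what the job above amounts to, not the exact generated code: the table names, join keys, dropped column, and mappings are placeholders, so adjust them to match the schemas your crawler actually produced.

```python
import sys
from awsglue.transforms import Join, DropFields, ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

database = "ride_hailing_service_data"

# Sources: the three Data Catalog tables created by the crawler
# (table names are placeholders -- check your catalog for the exact names).
trips = glueContext.create_dynamic_frame.from_catalog(database=database, table_name="trip_data")
drivers = glueContext.create_dynamic_frame.from_catalog(database=database, table_name="driver_vehicle_information")
fares = glueContext.create_dynamic_frame.from_catalog(database=database, table_name="location_base_fare_details")

# JOIN: attach driver and vehicle details to each trip (join keys are assumptions).
trips_with_drivers = Join.apply(trips, drivers, "driver_id", "driver_id")

# JOIN: attach base fare and surge multiplier by pickup location.
enriched = Join.apply(trips_with_drivers, fares, "pickup_location", "pickup_location")

# DROP FIELDS: remove columns that aren't needed downstream (example column).
cleaned = DropFields.apply(frame=enriched, paths=["vehicle_registration_number"])

# CHANGE SCHEMA: rename fields and set data types; only the mapped fields are kept.
final = ApplyMapping.apply(
    frame=cleaned,
    mappings=[
        ("trip_id", "string", "trip_id", "string"),
        ("fare", "string", "fare_amount", "double"),
        ("driver_age", "string", "driver_age", "int"),
    ],
)

# TARGET: write to the "transformed data" folder; the format (and optional
# compression) corresponds to the options you pick in the visual target node.
glueContext.write_dynamic_frame.from_options(
    frame=final,
    connection_type="s3",
    connection_options={"path": "s3://ride-hailing-data-yourname/transformed data/"},
    format="parquet",  # or "csv", "json", "orc"
)

job.commit()
```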


EXPECTED OUTPUT

  • The image below is an example of the final visual ETL job after following the step-by-step guide above.

  • The table below is an example of a Change Schema transform result after joining, dropping fields (columns), and renaming fields:


RUN THE ETL JOB

Navigate back to your job in the AWS Glue console, and select Run job to start the transformation process.

Your AWS Glue ETL job will now process the ride-hailing data according to your specifications, transforming and loading the cleaned and structured data back into S3, ready for analysis and enrichment.
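If you’d rather trigger the run from code than from the console, here is a small boto3 sketch that starts the job created above and polls its state until it finishes (the job name matches the example used earlier):

```python
import time
import boto3

glue = boto3.client("glue")
job_name = "ride-hailing-data-transformation"  # the example job name used earlier

# Start a run (equivalent to clicking Run job) and poll until it completes.
run_id = glue.start_job_run(JobName=job_name)["JobRunId"]

while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    print("Job run state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)
```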

EXPECTED OUTPUT

  • The image below is an example of the data delivered to your S3 bucket folder in Parquet format. Remember that you can change the file format in the target of your ETL job.

By following these steps, you've successfully staged and transformed your ride-hailing service data, making it ready for in-depth analysis. This process is essential for extracting valuable insights that can guide strategic decisions and enhance service quality.

CONCLUSION

This guide provided an introduction to the fundamentals of data engineering, emphasizing data staging using Amazon S3 and simplifying data transformation with AWS Glue Studio, all without the need for coding. By adopting these methods, you've established a solid foundation for your data engineering tasks, setting the stage for more advanced data analysis and visualization endeavors.

Stay tuned for more guides that will further enhance your knowledge and skills in AWS Data Engineering. Remember, mastering AWS Data Engineering is a marathon, not a sprint. Continue practicing, exploring AWS documentation, and leveraging resources like Tutorials Dojo's practice exams to build a comprehensive understanding and prepare for the AWS Data Engineering Associate exam.


* This newsletter was sourced from this Tutorials Dojo article.
