Building Data Pipelines with No-Code ETL Using AWS Glue Studio
This article was written by Neil Rico. Neil, fueled by his passion for technology, specializes in designing and optimizing cloud solutions, particularly in the rapidly evolving landscape of Amazon Web Services (AWS). He is committed to continuous learning, viewing life as a journey of discovery and personal growth.
Welcome to the exciting realm of AWS Data Engineering! This beginner-friendly guide offers an introduction to the fundamentals of data staging and transformation within the AWS ecosystem, all without requiring any coding skills. By focusing on the basics of Amazon S3 and AWS Glue, this guide provides a hands-on starting point for understanding how data is managed and processed on AWS. Whether you're preparing for certification or seeking to apply these skills in real-world situations, this guide lays the foundation for your journey in data engineering.
Preparation: Navigating Through Sample Datasets
In this article, we’ll explore three key datasets related to a fictional ride-hailing service. These datasets include essential information such as trip details, driver and vehicle specifics, and location-based fare data. Review the tables below and then download them in CSV format using the provided link. This format will allow for seamless uploading and management of the data within Amazon S3, preparing us for no-code data integration and transformation.
Datasets Overview
Your Mission as a Data Engineer
Imagine yourself as a Data Engineer, responsible for merging these datasets to prepare them for thorough analysis—all without getting into complex coding. Here's how you can achieve this:
Data Staging with Amazon S3
Amazon S3 offers a versatile and secure object storage solution that scales effortlessly, ensuring your data remains accessible and protected. It's ideal for users across various industries, providing limitless storage options for diverse use cases, such as data lakes, websites, mobile apps, backups and recovery, archiving, and big data analytics. Below, we provide a practical example to help you get started with organizing your data in S3:
Step-by-Step Guide
1. Log in to your AWS account.
2. Navigate to Amazon S3 Console.
3. Create a new bucket.
- Click on Create bucket.
- Provide a unique name for your bucket (e.g., ride-hailing-data-<number or your name>).
- Choose the AWS Region that is closest to you or your customers.
- Leave the default settings or adjust them according to your security and compliance needs.
- Click on Create bucket.
4. Upload the datasets.
- Navigate into your newly created bucket.
- Create a folder for the raw data. Name it “raw data.”
  - Enter that folder and click on Upload.
  - Click on Add files and select the CSV datasets you downloaded earlier.
  - Click on Upload to start the process.
- Return to the bucket's main menu and create another folder to hold the transformed data files later in this guide. Name it “transformed data.”
Well done! You’ve successfully uploaded your datasets to S3. The next step is to create a Glue Crawler to populate the AWS Glue Data Catalog, setting the stage for further data analysis, including transformation and enrichment.
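If you prefer to script this staging step rather than click through the console, the same result can be achieved with the AWS SDK. The sketch below is a minimal example using boto3; the bucket name, Region, and CSV file names are placeholders you should replace with your own.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # use your own Region

bucket = "ride-hailing-data-<number or your name>"  # must be globally unique

# Create the bucket (outside us-east-1 you must also pass a
# CreateBucketConfiguration with a LocationConstraint)
s3.create_bucket(Bucket=bucket)

# Upload each CSV dataset into the "raw data" folder (prefix)
for filename in ["trips.csv", "drivers.csv", "fares.csv"]:  # placeholder file names
    s3.upload_file(filename, bucket, f"raw data/{filename}")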
Data Transformation with AWS Glue
AWS Glue is a fully managed ETL (Extract, Transform, and Load) service that streamlines the process of categorizing, cleaning, enriching, and transferring data across different data stores. It allows you to efficiently prepare and load data for analytics with minimal effort and reduced costs.
AWS Glue Studio is a visual interface designed to simplify the creation, execution, and monitoring of ETL (Extract, Transform, and Load) jobs within AWS Glue. This tool enables you to visually design data transformation workflows, effortlessly run them on AWS Glue's serverless ETL engine based on Apache Spark, and review the schema and data results at each step of the process.
The AWS Glue Data Catalog is a centralized metadata repository that stores table definitions, schemas, and other metadata about your data; crawlers populate it automatically by scanning your data stores.
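To get a feel for what the Data Catalog actually stores, here is a small boto3 sketch that lists the tables and columns in a catalog database. The database name matches the one you will create later in this guide; everything else is standard Glue API usage.

import boto3

glue = boto3.client("glue")

# List the tables registered in the catalog database
response = glue.get_tables(DatabaseName="ride_hailing_service_data")

for table in response["TableList"]:
    print(table["Name"])
    # Each table entry carries the schema the crawler inferred
    for column in table["StorageDescriptor"]["Columns"]:
        print(f'  {column["Name"]}: {column["Type"]}')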
Step-by-Step Guide
CREATE A GLUE CRAWLER
1. Navigate to AWS Glue from the AWS Management Console.
2. Add a crawler.
- In the AWS Glue console, expand Data Catalog in the left sidebar, click Crawlers, then click Create crawler.
3. Specify crawler details.
- For Crawler name, enter a name like ride-hailing-data-crawler.
- Click Next.
4. Choose a data store.
- Select S3 as the data store.
- Choose Specified path in my account and enter the path to the S3 folder where you uploaded the raw datasets (e.g., s3://ride-hailing-data-<number or your name>/raw data/).
- Click Next.
5. Choose an IAM role.
- Select Create an IAM role and provide a name such as AWSGlueServiceRole-RideHailingData.
  - Click View. You will be redirected to a new tab.
  - Locate the policy with the "Customer Managed" type. Click the + to expand its details, then click Edit. You will be redirected to a new tab once more.
  - Under Edit statement, find Add resource and click Add.
  - Click the Resource type dropdown and select object.
  - In the Resource ARN field, type arn:aws:s3:::ride-hailing-data-<number or your name>/transformed data*
  - Click Add resource, then click Next, and finally click Save changes.
- Go back to the previous tab (the AWS Glue tab) and click Next.
6. Configure the crawler’s output.
- For Database, select Add database. This opens a new tab where you can name the database (e.g., ride_hailing_service_data).
- Click Create database.
- Return to the previous tab, click the refresh button, and select ride_hailing_service_data as the database.
- Leave the Schedule as On demand and click Next.
7. Review the detail summary, then click Create crawler.
8. Run the crawler by selecting it and clicking Run crawler. The crawler will classify your data and create tables in the Glue Data Catalog.
9. Review the metadata in the Glue Data Catalog. You should see three tables, one for each dataset.
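The console steps above can also be reproduced with the Glue API. The following boto3 sketch assumes the IAM role and its S3 permissions already exist; the names mirror the ones used in this walkthrough.

import boto3

glue = boto3.client("glue")

# Define a crawler that scans the raw data prefix and writes table
# definitions into the ride_hailing_service_data database
glue.create_crawler(
    Name="ride-hailing-data-crawler",
    Role="AWSGlueServiceRole-RideHailingData",
    DatabaseName="ride_hailing_service_data",
    Targets={
        "S3Targets": [
            {"Path": "s3://ride-hailing-data-<number or your name>/raw data/"}
        ]
    },
)

# Run it on demand, just like clicking Run crawler in the console
glue.start_crawler(Name="ride-hailing-data-crawler")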
PREPARE YOUR AWS GLUE ENVIRONMENT
An AWS Glue Visual ETL Job is an ETL (Extract, Transform, and Load) job that is created and managed through AWS Glue Studio's visual interface, enabling users to design and execute ETL processes without writing any code.
The key components of a visual ETL job in AWS Glue Studio are:
1. Sources refer to the input data that will be extracted for processing. Common sources include data stored in Amazon S3, Amazon DynamoDB, Amazon Redshift, and other data storage services.
2. Transforms refer to the data transformation operations applied to the source data to prepare it for loading into the target destination. Examples of transforms include Join, Change Schema, Drop Fields, Filter, and SQL Query.
3. Targets refer to the destinations where the transformed data is loaded or written after the transformation process. Common targets include Amazon S3, Amazon Redshift, and other data storage services.
Some key things you can do with targets in AWS Glue Studio include choosing the output format (such as CSV, Parquet, or JSON), applying compression, partitioning the output by key columns, and updating the Glue Data Catalog with the resulting table.
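Although we will build the job visually, it helps to know that Glue Studio generates a PySpark script behind the scenes. The sketch below shows roughly what the generated logic for this exercise could look like; the table names, join keys, dropped column, and mappings are assumptions for illustration, since the real names come from your crawled CSV files.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, DropFields, Join
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Sources: the three Data Catalog tables created by the crawler
# (table names are placeholders for whatever the crawler actually produced)
trips = glue_context.create_dynamic_frame.from_catalog(
    database="ride_hailing_service_data", table_name="trips")
drivers = glue_context.create_dynamic_frame.from_catalog(
    database="ride_hailing_service_data", table_name="drivers")
fares = glue_context.create_dynamic_frame.from_catalog(
    database="ride_hailing_service_data", table_name="fares")

# Join transform: merge trips with drivers, then with location-based fares
# (the join keys are assumed column names)
joined = Join.apply(trips, drivers, "driver_id", "driver_id")
joined = Join.apply(joined, fares, "location_id", "location_id")

# Drop Fields transform: remove a column not needed for analysis (assumed)
cleaned = DropFields.apply(frame=joined, paths=["vehicle_vin"])

# Change Schema transform: rename and retype columns; only mapped columns
# are kept, so in practice you would list every column you need (assumed)
renamed = ApplyMapping.apply(frame=cleaned, mappings=[
    ("trip_id", "string", "trip_id", "string"),
    ("fare", "string", "fare_amount", "double"),
])

# Target: write the result to the "transformed data" folder as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=renamed,
    connection_type="s3",
    connection_options={"path": "s3://ride-hailing-data-<number or your name>/transformed data/"},
    format="parquet",
)

job.commit()

When you build the job visually, Glue Studio writes a script like this for you on the job's Script tab, so you never have to edit it by hand.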
CREATE AN AWS GLUE ETL JOB
1. Navigate to AWS Glue and access the Visual ETL section.
2. Add a new job by clicking Visual ETL.
3. Configure the job properties: give the job a name (e.g., ride-hailing-data-transformation) and select the IAM role you created earlier (e.g., AWSGlueServiceRole-RideHailingData).
4. Choose a data source. Add an Amazon S3 source node and, in the Data source properties, select the Data Catalog database (ride_hailing_service_data) and one of the tables created by the crawler. Repeat this so each of the three tables has its own source node.
5. Select a transform type by clicking the + sign (Add nodes) in the upper left corner. Our objective is to merge and prepare these datasets for analysis. While there are many transform types available, using JOIN, DROP FIELDS, and CHANGE SCHEMA will be sufficient to achieve our goal. Feel free to explore other transform types and add them as needed.
6. Map the source columns to target columns, specifying transformations.
7. Choose a data target: select Amazon S3, point it to the “transformed data” folder you created earlier, and choose the format for your output data (e.g., CSV or Parquet).
8. Review and save your job configuration.
EXPECTED OUTPUT
Your visual canvas should show the three source nodes flowing through the Join, Drop Fields, and Change Schema transforms into the Amazon S3 target node.
RUN THE ETL JOB
Navigate back to your job in the AWS Glue console, and select Run job to start the transformation process.
Your AWS Glue ETL job will now process the ride-hailing data according to your specifications, transforming and loading the cleaned and structured data back into S3, ready for analysis and enrichment.
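The same job can also be started and monitored programmatically. Here is a minimal boto3 sketch, assuming the job name used in this walkthrough.

import boto3

glue = boto3.client("glue")

# Start the visual job built in Glue Studio
run = glue.start_job_run(JobName="ride-hailing-data-transformation")

# Check the run state; it moves from RUNNING to SUCCEEDED (or FAILED)
status = glue.get_job_run(
    JobName="ride-hailing-data-transformation",
    RunId=run["JobRunId"],
)
print(status["JobRun"]["JobRunState"])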
EXPECTED OUTPUT
The “transformed data” folder in your S3 bucket should now contain the output files produced by the job, in the format you selected.
By following these steps, you've successfully staged and transformed your ride-hailing service data, making it ready for in-depth analysis. This process is essential for extracting valuable insights that can guide strategic decisions and enhance service quality.
CONCLUSION
This guide provided an introduction to the fundamentals of data engineering, emphasizing data staging using Amazon S3 and simplifying data transformation with AWS Glue Studio, all without the need for coding. By adopting these methods, you've established a solid foundation for your data engineering tasks, setting the stage for more advanced data analysis and visualization endeavors.
Stay tuned for more guides that will further enhance your knowledge and skills in AWS Data Engineering. Remember, mastering AWS Data Engineering is a marathon, not a sprint. Continue practicing, exploring AWS documentation, and leveraging resources like Tutorials Dojo's practice exams to build a comprehensive understanding and prepare for the AWS Data Engineering Associate exam.
* This newsletter was sourced from this Tutorials Dojo article.