AWS SageMaker based Feature Engineering in Jupyter Notebook

NARAYANAN PALANI

Platform Engineering Lead | AWS & Google Cloud Certified Architect | Cloud Solutions Expert | Driving Innovation in Retail, Commercial & Investment Banking | CI/CD | DevOps | Cloud Transformation

Published: March 18, 2025

I have recently been exploring the basics of streamlining data with cloud AI tools and ran a few experiments with AWS SageMaker, so I am sharing the interesting steps in this newsletter.

Introduction

Data encoding, scaling, and binning are three critical steps in cleansing data for AI and machine learning use, including deep learning. I selected a sample employee list from Pluralsight/ACloudGuru (for training and learning purposes); it contains dummy employee records with made-up names and details.

What Does Data Preprocessing Mean?

Loading the existing data into a Jupyter Notebook was helpful for analysing its complexity and for pre-processing selected parts of it. Let's take a look.

Navigate to Amazon SageMaker AI and open a Jupyter Notebook; any CSV data file can be used for this preprocessing experiment.

For this sample data, I opened the Jupyter Notebook (after the notebook instance status turned 'InService') and imported the libraries with the following Python code:

import numpy as np
import pandas as pd

# Required for encoding purposes
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Required for scaling purposes
from sklearn.preprocessing import MinMaxScaler

# Required for binning purposes
from sklearn.preprocessing import KBinsDiscretizer

# Required for plotting charts
import matplotlib.pyplot as plt

How Does It Work?

After I selected a code cell and clicked the Run button a few times, the In [ ] prompt updated with the execution count:

Next, read the data from the CSV file and call head() to display the top of the data in the notebook:

This displayed the top five rows of data, which are not in the order I am interested in seeing:
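The original code cell and the CSV file are not reproduced in this article, so the following is only a minimal sketch of the load-and-preview step. The column names (name, title, gender, salary), the sample rows, and the file name mentioned in the comment are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the employee CSV; the real file and its
# columns are not shown in the article.
csv_text = """name,title,gender,salary
Ana Diaz,Manager,F,95000
Ben Cole,Analyst,M,42000
Cara Lim,Director,F,120000
Dev Rao,Engineer,M,65000
Eli Ward,Analyst,M,30000
Fay Chen,Engineer,F,78000
"""

# pd.read_csv accepts a file path or a file-like object; in the notebook
# this would typically be a path such as "employees.csv" (assumed name).
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default, in file order.
top_rows = df.head()
print(top_rows)
```

Because head() simply returns rows in file order, any sorting has to be done explicitly, for example with df.sort_values("title").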

Now perform a fit operation with ordinal encoding to see the array of titles in this data:

This listed the titles, though not in the order I would like to see:
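A minimal sketch of the fit step, assuming a hypothetical title column with made-up job titles. It uses scikit-learn's OrdinalEncoder with default settings, which is likely why the titles appear in alphabetical order rather than a business order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical job titles standing in for the article's employee data.
df = pd.DataFrame(
    {"title": ["Manager", "Analyst", "Director", "Engineer", "Analyst"]}
)

# fit() only learns the categories; it does not transform anything yet.
encoder = OrdinalEncoder()
encoder.fit(df[["title"]])

# With the default categories="auto", the learned categories are sorted,
# so the order is alphabetical rather than by seniority.
titles = list(encoder.categories_[0])
print(titles)
```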

What Is the Ordinal Encoder in AWS SageMaker?

The OrdinalEncoder imported above comes from scikit-learn, which runs inside the SageMaker notebook; SageMaker Data Wrangler offers a comparable ordinal encoding option for categorical columns. Ordinal encoding preprocesses categorical data by assigning a unique integer to each category in a specific column. This is useful when working with machine learning algorithms that require numeric input or when the categorical data has an inherent order or rank.

Key Features of Ordinal Encoder in SageMaker

Assigns Unique Integers: Each unique category in the column is assigned a distinct integer value.
Custom Ordering: If the categories have a specific order (e.g., "Low" < "Medium" < "High"), the encoder respects this order if provided by the user.
Automatic Handling: If no explicit order is specified, the encoder assigns integers based on the lexicographical order of the categories.
Handles Missing Values: Options to handle missing or unseen values during the transformation, ensuring robustness.
Integration with Data Wrangler: Ordinal Encoder is available in SageMaker Data Wrangler, which is a visual interface for data preparation and transformation.

When to Use Ordinal Encoder

Ordered Categories: Ideal when the categorical data has a meaningful order (e.g., "Very Poor" < "Poor" < "Good" < "Very Good").
Model Requirements: When using algorithms that require numerical input, such as linear regression or decision trees.
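The custom ordering and unseen-value handling described above can be sketched with scikit-learn's OrdinalEncoder. The priority column and its values are hypothetical, and -1 is just one possible placeholder for unknown categories.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical priority column with a natural order.
train = pd.DataFrame({"priority": ["Low", "High", "Medium", "Low"]})

# Passing categories fixes the order: Low -> 0, Medium -> 1, High -> 2.
# handle_unknown="use_encoded_value" maps unseen categories to
# unknown_value instead of raising an error.
encoder = OrdinalEncoder(
    categories=[["Low", "Medium", "High"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoded_train = encoder.fit_transform(train[["priority"]])

# "Critical" was never seen during fit, so it is encoded as -1.
new_rows = pd.DataFrame({"priority": ["Medium", "Critical"]})
encoded_new = encoder.transform(new_rows[["priority"]])

print(encoded_train.ravel())
print(encoded_new.ravel())
```

Without the explicit categories list, the same column would be encoded alphabetically (High, Low, Medium), which would not reflect the intended ranking.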

Now I want to perform one-hot encoding on the data I have at the moment!

What Is OneHotEncoder in AWS SageMaker?

The OneHotEncoder used in the notebook is scikit-learn's class; SageMaker Data Wrangler provides a comparable one-hot encoding transform. It transforms categorical data into a one-hot encoded format by creating a binary (0 or 1) column for each category in a given categorical column, making the data suitable for machine learning models that require numerical inputs and cannot process categorical data directly.

Key Features of OneHotEncoder in SageMaker

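Binary Representation: Each category becomes its own column holding 0 or 1.
Automatic Handling: If no categories are specified, they are determined from the data during fitting.
Sparse Representation (optional): The encoded output can be kept as a sparse matrix, which saves memory when a column has many categories.
Handles Missing Values: Options control how missing or unseen values are treated during the transformation.
Integration with Data Wrangler: One-hot encoding is available in SageMaker Data Wrangler, the visual interface for data preparation and transformation.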

When to Use OneHotEncoder

No Inherent Order: Ideal when categorical variables have no meaningful order (e.g., "Red", "Blue", "Green").
Compatibility: When using machine learning models that require numeric inputs but do not support ordinal relationships between categories (e.g., Logistic Regression, Neural Networks).
Multiple Categories: Works well for columns with multiple distinct categories, where each category needs its binary representation.

Example Use Case

Suppose you have a dataset with a Color column containing values: ["Red", "Green", "Blue"]. Using the OneHotEncoder in SageMaker:

The Color column will be transformed into three new columns: Color_Red, Color_Green, Color_Blue.
For a row with the value Green, the encoding would look like: Color_Red = 0, Color_Green = 1, Color_Blue = 0.

How to Use OneHotEncoder in SageMaker Data Wrangler

Open your dataset in SageMaker Data Wrangler.
Add a Transform step.
Select the Encode categorical transform and choose the one-hot encode option.
Choose the column(s) to encode.
Apply the transformation to preview the output.

Benefits of OneHotEncoder in SageMaker

Ease of Use: No need to write custom code; encoding is done with a few clicks.
Visual Workflow: Integrated into SageMaker Data Wrangler, enabling seamless end-to-end data preparation.
Model Compatibility: Ensures categorical data is compatible with algorithms requiring numeric features.

This listed the gender alongside the gender_encoder value (a newly created column) in the output.

Next I need to list the employee records, so I execute the code to display the head of the table:

This produced the list, but each row now has two new columns created by this method:

Now let us explore how to do Label Encoding in this same dataset!

What Is LabelEncoder in AWS SageMaker?

LabelEncoder is a scikit-learn class, available in SageMaker notebooks, that converts categorical labels into numerical values. It is primarily intended for encoding target variables (labels) in supervised machine learning tasks, where the labels need to be numerical for model training.

The code then labelled the titles in order:
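A minimal sketch of label encoding with scikit-learn's LabelEncoder, using hypothetical job titles. Unlike OrdinalEncoder, it takes a one-dimensional sequence of labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical job titles used as labels.
titles = ["Manager", "Analyst", "Director", "Engineer", "Analyst"]

encoder = LabelEncoder()
labels = encoder.fit_transform(titles)

# classes_ holds the sorted unique labels; each label's position in
# classes_ is the integer code it receives.
classes = list(encoder.classes_)
print(classes)
print(labels.tolist())

# inverse_transform maps integer codes back to the original labels.
restored = list(encoder.inverse_transform(labels))
```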

Now let us use MinMax Scaler!

What Is MinMaxScaler in AWS SageMaker?

MinMaxScaler, used here from scikit-learn in the SageMaker notebook, is a scaling technique that normalizes numerical data to a specified range, typically between 0 and 1. A comparable min-max scaler is also available among the preprocessing tools in SageMaker Data Wrangler. Scaling gives features a uniform range, which can improve the performance and convergence speed of many machine learning algorithms.

This scaled the data to values between 0 and 1:

Now the data is displayed on a scale of 100:
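The two scaling steps can be sketched as follows. The salary values are hypothetical, and feature_range=(0, 100) is an assumption about how the 100 scale was produced, since the original cell is not shown.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salary column as a 2-D array (one feature).
salaries = np.array([[30000.0], [42000.0], [65000.0], [78000.0], [120000.0]])

# The default feature_range is (0, 1), which applies
# (x - min) / (max - min) to each value in the column.
unit_scaled = MinMaxScaler().fit_transform(salaries)

# feature_range=(0, 100) stretches the same result onto a 0-100 scale.
percent_scaled = MinMaxScaler(feature_range=(0, 100)).fit_transform(salaries)

print(unit_scaled.ravel())
print(percent_scaled.ravel())
```

The smallest salary maps to the bottom of the range and the largest to the top; every other value keeps its relative position between them.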

Then I want to use KBinsDiscretizer for converting continuous data to categorical data!

What Is KBinsDiscretizer in AWS SageMaker Binning?

KBinsDiscretizer is a scikit-learn preprocessing class, usable in SageMaker notebooks, that transforms continuous numerical features into discrete bins or intervals. It divides the range of a numerical column into a specified number of bins and assigns each value to a corresponding bin label, effectively converting continuous data into categorical data. This is useful for simplifying models, handling non-linear relationships, or when certain machine learning algorithms perform better with discretized inputs. It supports multiple binning strategies: uniform (equal-width bins), quantile (bins holding roughly the same number of samples), and kmeans (bins formed using k-means clustering).
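A minimal binning sketch with scikit-learn's KBinsDiscretizer. The salary values and the choice of three uniform bins are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical salary column as a 2-D array (one feature).
salaries = np.array(
    [[30000.0], [42000.0], [65000.0], [78000.0], [95000.0], [120000.0]]
)

# encode="ordinal" returns the bin index for each value;
# strategy="uniform" splits the value range into equal-width bins.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = binner.fit_transform(salaries)

# bin_edges_ holds the learned edges for each feature.
edges = binner.bin_edges_[0]
print(edges)
print(bins.ravel())
```

With a range of 30000 to 120000, each of the three bins is 30000 wide, so the values fall into bins 0, 0, 1, 1, 2, 2.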


Then I use plt to explore this data visually:

What Is plt in AWS SageMaker?

In AWS SageMaker, plt typically refers to the pyplot module of Matplotlib, a popular Python library for creating static, animated, and interactive visualizations. It is widely used in SageMaker Jupyter notebooks to plot data during exploratory data analysis (EDA), model evaluation, or result visualization. By importing Matplotlib's pyplot module as plt (e.g., import matplotlib.pyplot as plt), users can create various plots such as line charts, bar graphs, scatter plots, and histograms to understand datasets, analyze model performance, or visualize predictions.
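The chart from the original notebook is not reproduced here, so this is only a sketch of one plausible plot: a histogram of hypothetical salary values. The data, bin count, and labels are assumptions.

```python
import matplotlib.pyplot as plt

# Hypothetical salary values for illustration.
salaries = [30000, 42000, 65000, 78000, 95000, 120000]

fig, ax = plt.subplots(figsize=(6, 4))

# hist returns the per-bin counts, the bin edges, and the drawn bars.
counts, edges, bars = ax.hist(salaries, bins=3, edgecolor="black")

ax.set_title("Salary distribution")
ax.set_xlabel("Salary")
ax.set_ylabel("Number of employees")
```

In a Jupyter notebook the figure is normally rendered inline when the cell finishes; in a plain script, plt.show() or fig.savefig() would be needed to see it.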

Watch the YouTube video for the steps at Link

Summary

The following are the high-level steps for streamlining data with AWS SageMaker in this example:

Launch a SageMaker Notebook
Load and Explore Data
Data Cleaning
Feature Engineering
Scaling and Transformation
Save and Export Preprocessed Data
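The final export step is not shown in the walkthrough, so here is a hedged sketch of one way to save preprocessed data from a notebook. The columns, the derived column names, and the output file name are hypothetical, and writing to a temporary local directory is an assumption; uploading to S3 is not covered.

```python
import os
import tempfile

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical employee data standing in for the article's CSV.
df = pd.DataFrame(
    {
        "title": ["Manager", "Analyst", "Director", "Engineer"],
        "salary": [95000.0, 42000.0, 120000.0, 65000.0],
    }
)

# Feature engineering: encode the title and scale the salary.
df["title_encoded"] = OrdinalEncoder().fit_transform(df[["title"]]).ravel()
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()

# Save the preprocessed table as CSV (assumed local path and file name).
out_path = os.path.join(tempfile.mkdtemp(), "employees_preprocessed.csv")
df.to_csv(out_path, index=False)

# Reload to confirm that the export round-trips.
reloaded = pd.read_csv(out_path)
print(reloaded)
```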

Please feel free to share your views in the comments section on any simplified steps in AWS SageMaker based ML usage.

