AWS SageMaker-Based Feature Engineering in a Jupyter Notebook
NARAYANAN PALANI
Platform Engineering Lead | AWS & Google Cloud Certified Architect | Cloud Solutions Expert | Driving Innovation in Retail, Commercial & Investment Banking | CI/CD | DevOps | Cloud Transformation
I have recently been exploring the basics of streamlining data with cloud AI tools and ran a few experiments with AWS SageMaker, so I am sharing the interesting steps in this newsletter.
Introduction
Data encoding, scaling and binning are three critical steps in cleansing data for AI and machine learning use, including deep learning. I selected a sample employee list from Pluralsight/ACloudGuru (for training and learning purposes); it contains dummy employee records with dummy names and details.
What Does Data Preprocessing Mean?
Working with existing data in a Jupyter Notebook helped me analyse its complexity and pre-process selected rows. Let's take a look.
Navigate to AWS SageMaker AI and open a Jupyter Notebook; any CSV-based data file can be used for this preprocessing experiment.
For this sample data, I opened the Jupyter Notebook (after the notebook instance status turned 'InService') and imported the libraries with the following Python code:
import numpy as np
import pandas as pd
# Required for encoding
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
# Required for scaling
from sklearn.preprocessing import MinMaxScaler
# Required for binning
from sklearn.preprocessing import KBinsDiscretizer
# Required for plotting charts
import matplotlib.pyplot as plt
How Does It Work?
After I selected a section of code and clicked the Run button a few times, the In [ ] marker next to each cell updated with its execution number.
Next, read the data from the CSV file and display the top rows (head) in the notebook.
This displayed the top five rows of data, which are not in the order I am interested in seeing.
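The read-and-preview step above can be sketched roughly as follows. The article does not show the real file or its columns, so the inline CSV text and the column names (name, gender, title, salary) are assumptions made purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the employee CSV; the real file name and
# column names are not given in the article.
csv_text = """name,gender,title,salary
Alice,F,Manager,72000
Bob,M,Engineer,58000
Carol,F,Director,95000
Dan,M,Engineer,61000
Eve,F,Analyst,49000
Frank,M,Manager,70000
"""

# In the notebook this would read the uploaded file instead,
# for example pd.read_csv("<your-file>.csv").
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows in the order they appear in the file.
top_rows = df.head()
print(top_rows)
```

Because head() keeps the file order, any reordering has to be done explicitly, for example with sort_values.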
Now perform a fit operation with ordinal encoding to see the array of titles in this data.
This listed the titles, but not in the order I would like to see.
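A minimal sketch of that fit step, assuming a hypothetical title column with made-up job titles. It shows why the listed order can be surprising: after fitting, scikit-learn's OrdinalEncoder stores the categories in sorted (alphabetical) order, not in the order they appear in the data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical titles; the article's real values are not shown.
df = pd.DataFrame(
    {"title": ["Manager", "Engineer", "Director", "Engineer", "Analyst"]}
)

# fit() only learns the distinct categories; it does not change the data.
title_encoder = OrdinalEncoder()
title_encoder.fit(df[["title"]])

# categories_ holds one array per input column, sorted alphabetically.
learned_titles = list(title_encoder.categories_[0])
print(learned_titles)
```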
What Is the OrdinalEncoder Used for Encoding in AWS SageMaker?
The OrdinalEncoder used in this SageMaker notebook is the scikit-learn class imported above (SageMaker Data Wrangler offers a comparable ordinal encoding transform). It preprocesses categorical data by encoding the categories as ordinal numbers (integer values), assigning a unique integer to each category in a column. By default the integers follow the sorted order of the category values, so an explicit category order should be supplied when the data has an inherent rank. This encoding is useful when working with machine learning algorithms that require numeric input or when the categorical data has an inherent order or rank.
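When the categories do have a rank, the order can be passed in explicitly. The sketch below assumes a hypothetical seniority ranking of made-up titles; the ranking is an illustration, not something taken from the article's data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical titles; the article's real values are not shown.
df = pd.DataFrame(
    {"title": ["Manager", "Engineer", "Director", "Engineer", "Analyst"]}
)

# Assumed seniority ranking, lowest to highest (illustrative only).
title_order = ["Analyst", "Engineer", "Manager", "Director"]

# categories takes one list per encoded column.
ordered_encoder = OrdinalEncoder(categories=[title_order])

# fit_transform returns a 2-D float array; ravel() flattens it to one column.
df["title_encoded"] = ordered_encoder.fit_transform(df[["title"]]).ravel()
print(df)
```

With this ordering, Analyst maps to 0 and Director to 3, so the integers carry the intended rank.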
Key Features of Ordinal Encoder in SageMaker
When to Use Ordinal Encoder
Now I want to apply one-hot encoding to the data I have at the moment!
What Is the OneHotEncoder Used for Encoding in AWS SageMaker?
The OneHotEncoder used in this SageMaker notebook is the scikit-learn class imported above (SageMaker Data Wrangler offers a comparable one-hot encoding transform). It transforms categorical data into a one-hot encoded format: it creates one binary (0 or 1) column for each category in a given categorical column, making the data suitable for machine learning models that require numerical inputs and cannot process categorical data directly.
Key Features of OneHotEncoder in SageMaker
When to Use OneHotEncoder
Example Use Case
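Suppose you have a dataset with a Color column containing the values ["Red", "Green", "Blue"]. One-hot encoding replaces that column with three binary columns, one per color, and each row has a 1 only in the column that matches its color.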
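The Color example can be tried with scikit-learn's OneHotEncoder, as in the hedged sketch below. The DataFrame is built inline for illustration, and the encoder's default sparse result is converted with toarray() so it can be displayed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

color_encoder = OneHotEncoder()

# The default result is a sparse matrix; toarray() makes it dense.
encoded = color_encoder.fit_transform(colors[["Color"]]).toarray()

# The new columns follow the sorted category order: Blue, Green, Red.
one_hot = pd.DataFrame(encoded, columns=color_encoder.categories_[0]).astype(int)
print(one_hot)
```

Each row contains exactly one 1, placed in the column of that row's original color.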
How to Use OneHotEncoder in SageMaker Data Wrangler
Benefits of OneHotEncoder in SageMaker
This listed each gender alongside its gender_encoder value (a newly created column) in the output.
Next I need to list the employee records, so I executed the code to display the head of the table.
This produced the list, but with this method every row gained two newly created columns.
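The two extra columns are what one-hot encoding produces for a column with two distinct values. Here is a minimal sketch, assuming a hypothetical gender column with values F and M; the output column names gender_F and gender_M are chosen here for illustration and are not the article's.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical records; the article's real data is not shown.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol", "Dan"],
        "gender": ["F", "M", "F", "M"],
    }
)

gender_ohe = OneHotEncoder()
encoded = gender_ohe.fit_transform(df[["gender"]]).toarray()

# One new column per distinct gender value, named gender_<value>.
new_columns = [f"gender_{value}" for value in gender_ohe.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=new_columns, index=df.index).astype(int)

# Append the encoded columns to the original records.
df = pd.concat([df, encoded_df], axis=1)
print(df.head())
```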
Now let us explore how to do label encoding on this same dataset!
What Is the LabelEncoder Used for Encoding in AWS SageMaker?
The LabelEncoder used in this SageMaker notebook is the scikit-learn class imported above. It converts categorical labels into integer values and is primarily intended for encoding target variables (labels) in supervised machine learning tasks, where the labels need to be numerical for model training.
The code then labelled the titles in order.
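A rough sketch of the label encoding step, again with a hypothetical title column. LabelEncoder works on a single one-dimensional column and numbers the classes in sorted order, which is likely the "order" seen in the output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical titles; the article's real values are not shown.
df = pd.DataFrame(
    {"title": ["Manager", "Engineer", "Director", "Engineer", "Analyst"]}
)

label_encoder = LabelEncoder()

# LabelEncoder expects a 1-D input, so a single Series is passed.
df["title_label"] = label_encoder.fit_transform(df["title"])

# classes_ lists the labels in the order of their integer codes.
print(list(label_encoder.classes_))
print(df)
```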
Now let us use MinMaxScaler!
What Is the MinMaxScaler Used for Scaling in AWS SageMaker?
The MinMaxScaler used in this SageMaker notebook is the scikit-learn class imported above (SageMaker Data Wrangler offers a comparable min-max scaler). It normalizes numerical data to a specified range, by default 0 to 1, by mapping each value x to (x - min) / (max - min) for its column. Putting features on a uniform scale can improve the performance and convergence speed of many machine learning algorithms.
This produced data between 0 and 1.
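The scaling step might look like the sketch below. The salary column and its values are assumptions for illustration; any numeric column would be handled the same way.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric column; the article's real column is not named.
df = pd.DataFrame({"salary": [40000, 50000, 70000, 90000]})

# The default feature_range is (0, 1).
scaler = MinMaxScaler()

# fit_transform expects 2-D input, hence the double brackets.
df["salary_scaled"] = scaler.fit_transform(df[["salary"]]).ravel()
print(df)
```

The smallest salary maps to 0, the largest to 1, and the rest fall proportionally in between.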
Now the data is displayed on a scale of 0 to 100.
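Assuming the "scale of 100" refers to changing the target range, scikit-learn's MinMaxScaler accepts a feature_range argument for this. The sketch below uses a hypothetical salary column; it is a guess at the step, not the article's exact code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric column; the article's real column is not named.
df = pd.DataFrame({"salary": [40000, 50000, 70000, 90000]})

# feature_range sets the lower and upper bound of the scaled values.
scaler_100 = MinMaxScaler(feature_range=(0, 100))
df["salary_0_100"] = scaler_100.fit_transform(df[["salary"]]).ravel()
print(df)
```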
Next I want to use KBinsDiscretizer to convert continuous data to categorical data!
What Is the KBinsDiscretizer Used for Binning in AWS SageMaker?
The KBinsDiscretizer used in this SageMaker notebook is the scikit-learn class imported above. It transforms continuous numerical features into discrete bins or intervals: it divides the range of a numerical column into a specified number of bins and assigns each value to the corresponding bin, effectively converting continuous data into categorical data. This is useful for simplifying models, handling non-linear relationships, or when certain machine learning algorithms perform better with discretized inputs. It supports three binning strategies: uniform (equal-width bins), quantile (bins holding roughly the same number of samples) and kmeans (bin edges derived from one-dimensional k-means clustering).
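A minimal binning sketch with the uniform strategy follows. The salary values and the choice of three bins are assumptions for illustration; encode="ordinal" is used so that each value is replaced by its bin index instead of a one-hot vector.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical numeric column; the article's real column is not named.
df = pd.DataFrame({"salary": [40000, 45000, 60000, 65000, 85000, 90000]})

# Three equal-width bins, each value replaced by its bin index (0, 1 or 2).
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
df["salary_bin"] = binner.fit_transform(df[["salary"]]).ravel().astype(int)

# bin_edges_ holds the boundaries learned for each input column.
print(binner.bin_edges_[0])
print(df)
```

Switching strategy to "quantile" or "kmeans" changes only how the bin edges are chosen; the rest of the code stays the same.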
Next, I use plt to explore this data visually.
What Is plt Used for Plotting in AWS SageMaker?
In AWS SageMaker notebooks, plt conventionally refers to pyplot, the plotting module of Matplotlib, a popular Python library for creating static, animated, and interactive visualizations. It is widely used in SageMaker Jupyter notebooks to plot data during exploratory data analysis (EDA), model evaluation, or result visualization. By importing Matplotlib's pyplot module as plt (e.g., import matplotlib.pyplot as plt), users can create various plots such as line charts, bar graphs, scatter plots, and histograms to understand datasets, analyze model performance, or visualize predictions.
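As one possible illustration, a histogram of a numeric column can be drawn as below. The article does not say which chart or column was used, so the salary column, the labels, and the choice of a histogram are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric column; the article's real column is not named.
df = pd.DataFrame({"salary": [40000, 45000, 60000, 65000, 85000, 90000]})

fig, ax = plt.subplots()

# hist() returns the count per bin and the bin edges it used.
counts, bin_edges, _ = ax.hist(df["salary"], bins=3)

ax.set_xlabel("salary")
ax.set_ylabel("number of employees")
ax.set_title("Salary distribution (dummy data)")

# In a notebook the figure renders below the cell;
# call plt.show() when running this as a script.
```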
Watch the YouTube video for the steps at Link
Summary
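The high-level steps for streamlining data with AWS SageMaker in this example are: import the libraries, read the CSV file, apply ordinal, one-hot and label encoding, scale the numeric data with MinMaxScaler, bin it with KBinsDiscretizer, and explore the result with Matplotlib.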
Please feel free to share your views in the comments section on any simplified steps in AWS SageMaker-based ML usage.