AWS SageMaker based Feature Engineering in Jupyter Notebook

I have recently been exploring the basics of streamlining data with cloud AI tools and ran a few experiments with AWS SageMaker, so I am sharing the interesting steps in this newsletter.

Introduction

Data encoding, scaling, and binning are three critical steps in preparing data for AI and machine learning use, including deep learning. I selected a sample employee list from Pluralsight/ACloudGuru (intended for training and learning purposes); it contains dummy employee records with made-up names and details.

What Does Data Preprocessing Mean?

Working with the existing data in a Jupyter Notebook helped me analyse its complexity and preprocess selected parts of it. Let's take a look.

Navigate to AWS SageMaker AI and open a Jupyter Notebook; any CSV-based data file can be used for this preprocessing experiment.


For this sample data, I opened the Jupyter Notebook (after the notebook status turned 'InService') and imported the libraries with the following Python code:

import numpy as np
import pandas as pd

# Required for encoding purposes
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Required for scaling purposes
from sklearn.preprocessing import MinMaxScaler

# Required for binning purposes
from sklearn.preprocessing import KBinsDiscretizer

# Required for plotting charts
import matplotlib.pyplot as plt

How Does It Work?

After I selected a code cell and clicked the Run button a few times, the In [ ] marker next to the cell updated with the execution count:

Next, read the data from the CSV file and display the top of it (head) in the notebook:
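The original screenshot of this step is not reproduced here, so the following is only a minimal sketch of loading a CSV and showing its first rows. The column names (name, title, gender, salary) and the inline sample rows are assumptions made for illustration; the real employee file may look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for the employee CSV; the real file and its columns may differ.
csv_text = """name,title,gender,salary
Alice,Manager,F,72000
Bob,Engineer,M,65000
Chitra,Director,F,98000
Dev,Engineer,M,61000
Eve,Analyst,F,54000
Farid,Manager,M,75000
"""

# In the notebook this would usually be pd.read_csv() pointed at the uploaded file.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
top_rows = df.head()
print(top_rows)
```

Passing a number, for example df.head(10), changes how many rows are shown.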


This displayed the top five rows of the data, which are not in the order I am interested in seeing:


Next, perform a fit operation with ordinal encoding to see the array of titles in this data:


This listed the titles, but not in the order I would like to see:
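As a rough sketch of the fit step, assuming a title column with a handful of made-up job titles, scikit-learn's OrdinalEncoder learns the categories in sorted (alphabetical) order, which is why the listing does not follow seniority.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up titles for illustration; the real dataset's values may differ.
df = pd.DataFrame({"title": ["Manager", "Engineer", "Director", "Engineer", "Analyst"]})

# fit() only learns the categories; it does not transform the data yet.
encoder = OrdinalEncoder()
encoder.fit(df[["title"]])

# categories_ holds one array per encoded column, sorted alphabetically by default.
learned_titles = encoder.categories_[0].tolist()
print(learned_titles)
```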


What is the Ordinal Encoder in Encoding with AWS SageMaker?

The OrdinalEncoder imported in the notebook above comes from scikit-learn, and SageMaker Data Wrangler offers a comparable ordinal encoding transform for categorical columns. In both cases, ordinal encoding preprocesses categorical data by assigning a unique integer to each category in a specific column. This encoding is useful when working with machine learning algorithms that require numeric input or when the categorical data has an inherent order or rank.

Key Features of Ordinal Encoder in SageMaker

  1. Assigns Unique Integers: Each unique category in the column is assigned a distinct integer value.
  2. Custom Ordering: If the categories have a specific order (e.g., "Low" < "Medium" < "High"), the encoder respects this order if provided by the user.
  3. Automatic Handling: If no explicit order is specified, the encoder assigns integers based on the lexicographical order of the categories.
  4. Handles Missing Values: Options to handle missing or unseen values during the transformation, ensuring robustness.
  5. Integration with Data Wrangler: Ordinal Encoder is available in SageMaker Data Wrangler, which is a visual interface for data preparation and transformation.
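The custom ordering and unseen-value handling described in the list above can be sketched with scikit-learn's OrdinalEncoder. The seniority order and the "Intern" value are assumptions chosen for illustration, not taken from the original dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"title": ["Manager", "Engineer", "Director", "Analyst"]})

# Assumed seniority order, lowest to highest; adjust to the real business meaning.
seniority = ["Analyst", "Engineer", "Manager", "Director"]

encoder = OrdinalEncoder(
    categories=[seniority],              # explicit order instead of alphabetical
    handle_unknown="use_encoded_value",  # do not fail on categories unseen during fit
    unknown_value=-1,                    # code assigned to those unseen categories
)
df["title_encoded"] = encoder.fit_transform(df[["title"]]).ravel()
print(df)

# A title that was not part of the fitted categories receives the unknown_value.
unseen = encoder.transform(pd.DataFrame({"title": ["Intern"]}))
print(unseen)
```

With the explicit list, the integer codes follow the position of each title in seniority rather than its alphabetical rank.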

When to Use Ordinal Encoder

  • Ordered Categories: Ideal when the categorical data has a meaningful order (e.g., "Very Poor" < "Poor" < "Good" < "Very Good").
  • Model Requirements: When using algorithms that require numerical input, such as linear regression or decision trees.

Now I want to apply One-Hot Encoding to the data I have at the moment!

What is OneHotEncoder in Encoding with AWS SageMaker?

The OneHotEncoder imported in the notebook is a scikit-learn class, and SageMaker Data Wrangler offers a comparable one-hot encoding transform. It transforms categorical data into a one-hot encoded format by creating a binary (0 or 1) column for each category in a given categorical column, making the data suitable for machine learning models that require numerical inputs and cannot process categorical data directly.

Key Features of OneHotEncoder in SageMaker

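  1. Binary Representation: Each category becomes its own column holding 0 or 1.
  2. Automatic Handling: The categories are detected from the data when no explicit list is given.
  3. Sparse Representation (optional): The output can be stored as a sparse matrix to save memory when there are many categories.
  4. Handles Missing Values: Options control how missing or previously unseen categories are treated.
  5. Integration with Data Wrangler: One-hot encoding is available as a transform in SageMaker Data Wrangler.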


When to Use OneHotEncoder

  • No Inherent Order: Ideal when categorical variables have no meaningful order (e.g., "Red", "Blue", "Green").
  • Compatibility: When using machine learning models that require numeric inputs and would otherwise misread integer codes as an ordering between categories (e.g., Logistic Regression, Neural Networks).
  • Multiple Categories: Works well for columns with multiple distinct categories, where each category needs its binary representation.


Example Use Case

Suppose you have a dataset with a Color column containing values: ["Red", "Green", "Blue"]. Using the OneHotEncoder in SageMaker:

  • The Color column will be transformed into three new columns: Color_Red, Color_Green, Color_Blue.
  • For a row with the value Green, the encoding would look like:

      Color_Red  Color_Green  Color_Blue
      0          1            0


How to Use OneHotEncoder in SageMaker Data Wrangler

  1. Open your dataset in SageMaker Data Wrangler.
  2. Add a Transform step.
  3. Select the OneHotEncoder transformation.
  4. Choose the column(s) to encode.
  5. Apply the transformation to preview the output.

Benefits of OneHotEncoder in SageMaker

  • Ease of Use: No need to write custom code; encoding is done with a few clicks.
  • Visual Workflow: Integrated into SageMaker Data Wrangler, enabling seamless end-to-end data preparation.
  • Model Compatibility: Ensures categorical data is compatible with algorithms requiring numeric features.


This listed the gender alongside a newly created gender_encoder column in the output.


Next I need to list the employee records, so I run the code to display the head of the table:


This produced the list, but with this method two new columns were created for every row:


Now let us explore how to do Label Encoding on this same dataset!

What is LabelEncoder in Encoding with AWS SageMaker?

The LabelEncoder used in the notebook is a scikit-learn class that converts categorical labels into integer values from 0 to the number of classes minus one. It is primarily used to encode target variables (labels) in supervised machine learning tasks, where the labels need to be numerical for model training.


The code then labelled the titles in order:
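A short sketch of label encoding on made-up titles follows. LabelEncoder takes a one-dimensional sequence, learns the classes in sorted order, and can map the integers back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up titles for illustration.
titles = ["Manager", "Engineer", "Director", "Engineer", "Analyst"]

label_encoder = LabelEncoder()
title_labels = label_encoder.fit_transform(titles)

# classes_ lists the learned labels; each label's position is its integer code.
classes = label_encoder.classes_.tolist()
print(classes)
print(title_labels)

# inverse_transform recovers the original labels from the codes.
decoded = label_encoder.inverse_transform(title_labels).tolist()
```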


Now let us use the MinMaxScaler!

What is MinMaxScaler in Scaling with AWS SageMaker?

The MinMaxScaler in AWS SageMaker is a scaling technique used to normalize numerical data to a specified range, typically between 0 and 1. This scaler is available as part of the preprocessing tools in SageMaker Data Wrangler. By scaling the data, it ensures that features have a uniform scale, which can improve the performance and convergence speed of many machine learning algorithms.

This produced data scaled between 0 and 1:


Now the data is displayed on a scale of 100:
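Both scaling steps can be sketched with scikit-learn's MinMaxScaler. The salary column and its values are hypothetical, and the 0 to 100 range is assumed to have been set through the feature_range parameter, which the original screenshots would have confirmed.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric column; the original notebook may have scaled a different one.
df = pd.DataFrame({"salary": [50000, 60000, 75000, 100000]})

# Default feature_range is (0, 1): the minimum maps to 0 and the maximum to 1.
unit_scaler = MinMaxScaler()
df["salary_0_1"] = unit_scaler.fit_transform(df[["salary"]]).ravel()

# The same scaler with a wider target range.
percent_scaler = MinMaxScaler(feature_range=(0, 100))
df["salary_0_100"] = percent_scaler.fit_transform(df[["salary"]]).ravel()

print(df)
```

Each value is mapped as (x - min) / (max - min) and then stretched to the chosen range, so the relative spacing between values is preserved.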


Next I want to use KBinsDiscretizer to convert continuous data into categorical data!


What is KBinsDiscretizer in Binning with AWS SageMaker?

The KBinsDiscretizer imported in the notebook is a scikit-learn preprocessing class that transforms continuous numerical features into discrete bins or intervals. It divides the range of a numerical column into a specified number of bins and assigns each value to a corresponding bin label, effectively converting continuous data into categorical data. This is useful for simplifying models, handling non-linear relationships, or when certain machine learning algorithms perform better with discretized inputs. It supports multiple binning strategies: uniform (equal-width bins), quantile (bins holding roughly equal numbers of values), and kmeans (bins formed using one-dimensional k-means clustering).
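As a minimal sketch of binning, assuming a hypothetical age column, the uniform strategy splits the range between the smallest and largest value into equal-width intervals and returns the bin index for each row.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages; the original notebook may have binned a different column.
ages = np.array([[22.0], [25.0], [31.0], [38.0], [45.0], [52.0], [60.0], [64.0]])

# encode="ordinal" returns the bin index instead of one-hot columns.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(ages).ravel().astype(int)

# bin_edges_ holds the learned interval boundaries for each feature.
edges = binner.bin_edges_[0]
print(edges)
print(age_bins)
```

Switching strategy to "quantile" or "kmeans" changes only how the edges are chosen; the transform step stays the same.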


Next I use plt to explore this data visually:


What is plt in Plotting with AWS SageMaker?

In AWS SageMaker, plt typically refers to Matplotlib, a popular Python library for creating static, animated, and interactive visualizations. It is widely used in SageMaker Jupyter notebooks to plot data during exploratory data analysis (EDA), model evaluation, or result visualization. By importing Matplotlib's pyplot module as plt (e.g., import matplotlib.pyplot as plt), users can create various plots such as line charts, bar graphs, scatter plots, and histograms to understand datasets, analyze model performance, or visualize predictions.
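A small sketch of a histogram with plt follows, using made-up salary values. The Agg backend and the in-memory PNG are assumptions added so the snippet also runs outside a notebook; inside Jupyter the backend line is usually unnecessary and plt.show() displays the chart inline.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend; usually not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up salaries for illustration.
salaries = [54000, 61000, 65000, 72000, 75000, 98000]

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(salaries, bins=3)
ax.set_xlabel("Salary")
ax.set_ylabel("Number of employees")
ax.set_title("Salary distribution (sample data)")

# Render to an in-memory PNG instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```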

Watch the YouTube video for the steps at Link

Summary

The following are the high-level steps for streamlining data using AWS SageMaker in this example:

  1. Launch a SageMaker Notebook
  2. Load and Explore Data
  3. Data Cleaning
  4. Feature Engineering
  5. Scaling and Transformation
  6. Save and Export Preprocessed Data

Please feel free to share your views in the comments section on any simplified steps in AWS SageMaker based ML usage.


Follow me on LinkedIn: Link

Like this article? Subscribe to the Engineering Leadership, Digital Accessibility, Digital Payments Hub and Motivation newsletters to enjoy reading useful articles. Press the SHARE and REPOST buttons to help share the content with your network.

#LinkedInNewsUK #FinanceLeadership

