Feature Engineering – Data Cleansing, Transformation and Selection - my notes
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Data pre-processing
All machine learning models require data pre-processing to improve training. The way the data is represented can have a strong influence on how a machine learning model learns from it. For example, models tend to converge faster and more reliably when numerical data is scaled appropriately. The techniques used to select and transform the data are key to increasing the predictive quality of the models.
The art of feature engineering tries to improve the model’s ability to learn while reducing the compute resources required. It does so by transforming and projecting (e.g. dimensionality reduction), eliminating (feature selection methods) or combining the features in the raw data to form a new version of the dataset.
Important: Feature Engineering should be consistent in training and serving
During training you have the entire data set available to you. So, one can use the global properties of individual features in the feature engineering transformation.
- For example, you can compute the standard deviation of a feature and use it to perform normalization. It should be underscored that when you serve the model you must apply the same feature engineering, so that the model receives the same kind of data it was trained on. So, if you normalized the data using the standard deviation, such global constants should be saved and reused during serving. Failing to do so is a very common source of problems in production systems, and such errors can be difficult to debug. A minimal sketch of this pattern follows after this list.
- Or, if you created a one-hot vector for a categorical feature during training, you also need to create a one-hot vector when you serve the model.
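Here is that sketch, assuming a simple standard-deviation-based normalization; the feature values and the file name are illustrative, not from the original notes:

```python
# Minimal sketch: compute the scaling statistics once at training time,
# persist them, and reuse the very same constants at serving time.
import json
import numpy as np

def fit_scaler(train_values: np.ndarray, path: str = "scaler_stats.json") -> None:
    """Compute and save the global mean/std of a training feature."""
    stats = {"mean": float(train_values.mean()), "std": float(train_values.std())}
    with open(path, "w") as f:
        json.dump(stats, f)

def transform(values: np.ndarray, path: str = "scaler_stats.json") -> np.ndarray:
    """Apply the saved training statistics -- used both at training and at serving."""
    with open(path) as f:
        stats = json.load(f)
    return (values - stats["mean"]) / stats["std"]

# Training: fit on the full training set, then transform it.
train_feature = np.array([300.0, 800.0, 1500.0, 2000.0])
fit_scaler(train_feature)
train_scaled = transform(train_feature)

# Serving: a single incoming example is scaled with the same saved constants.
serving_scaled = transform(np.array([1200.0]))
```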
This series/document will cover the following topics related to feature engineering:
- Section 2 will throw some light on the pre-processing operations that are used for feature engineering
- Section 3 will be about Data cleansing and will talk about some of the statistical methods that may be used to detect outliers in the dataset.
- Section 4 will point to my Git repository which has the Jupyter notebooks showing some data cleansing exercises using different approaches
- Section 5 will talk about feature scaling, and section 6 will comprise a notebook relating to feature scaling.
2. Pre-processing operations
Let us talk about some of the pre-processing operations that are used for feature engineering:
- Data cleansing: This involves eliminating or correcting erroneous data
- Feature tuning: It is often required to perform transformations on the data, such as scaling or normalizing, since machine learning models and neural networks are sensitive to the range of numerical features.
- Feature extraction (dimensionality reduction vs feature selection methods): One shouldn't just throw everything at the machine learning model and rely on the training process to determine which features are actually useful. Thus, it is imperative to carry out feature selection and/or dimensionality reduction to reduce the number of features in a dataset. Whilst both ‘feature selection’ and ‘dimensionality reduction’ are used to reduce the number of features in a dataset, there is an important difference:
 - Feature selection simply selects or excludes given features WITHOUT changing them.
 - Dimensionality reduction, by contrast, transforms the features into a lower-dimensional space.
Feature selection identifies the features that best represent the relationships within the feature space as well as with the target that the model will try to predict. Feature selection methods remove features that do not influence the outcome. This reduces the size of the feature space, hence reducing the resource requirements for processing the data as well as the model complexity. I have discussed feature selection and dimensionality reduction here: https://www.dhirubhai.net/pulse/feature-selection-dimensionality-reduction-ajay-taneja/
- Bucketizing and binning: Sometimes it is useful to bucket different data ranges into a one-hot encoding. For example, if you’re dealing with a housing dataset that records the year each house was built, you could bucket the years as shown in the sketch below.
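A minimal sketch with pandas; the 'year_built' column name and the bin edges are illustrative assumptions:

```python
# Bucketize a 'year_built' column into ranges and one-hot encode the buckets.
import pandas as pd

houses = pd.DataFrame({"year_built": [1962, 1985, 1999, 2004, 2015]})

# Bucket the raw years into ranges...
houses["year_bucket"] = pd.cut(
    houses["year_built"],
    bins=[1950, 1980, 2000, 2010, 2020],
    labels=["1950-1980", "1980-2000", "2000-2010", "2010-2020"],
)

# ...then turn each bucket into a one-hot column.
one_hot = pd.get_dummies(houses["year_bucket"], prefix="built")
print(one_hot)
```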
3. Data Cleansing
As mentioned, data cleansing involves eliminating or correcting erroneous data. Outliers are generally defined as samples that lie far away from the mainstream of the data. Outliers in a dataset may be caused by measurement or input error, data corruption, etc. Statistical methods may be used to detect outliers in the dataset; some of these methods are discussed below. However, it should be highlighted that any of these methods must be used carefully. In the end, it comes down to your subject-area knowledge and the investigation of the candidate outlier: it is always possible that an unusual value is part of the natural variation of the process rather than a problematic point.
3.1 Percentile method:
In the percentile method, you decide on a specific percentile threshold. For example, anything above the 98th percentile or below the 2nd percentile may be considered an outlier, and you then trim or cap these samples in the dataset. The percentile method is arbitrary, and you will have to determine the threshold manually based on domain knowledge.
The Jupyter notebook shown below uses a dataset where data above the 98th percentile and below the 2nd percentile is removed; the Airbnb dataset is from Kaggle.
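A minimal sketch of the same idea; the file path and the numeric 'price' column are illustrative assumptions:

```python
# Percentile-based outlier handling: trim (drop) or cap (clip) everything
# outside the 2nd-98th percentile band.
import pandas as pd

df = pd.read_csv("airbnb.csv")
low, high = df["price"].quantile([0.02, 0.98])

# Option 1: trim -- drop the offending rows entirely.
trimmed = df[df["price"].between(low, high)]

# Option 2: cap (winsorize) -- keep the rows but clip the extreme values.
capped = df.assign(price=df["price"].clip(lower=low, upper=high))
```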
3.2 Using Z-score to detect outliers
The Z-score quantifies the unusualness of an observation when your data follows a normal distribution. Z-scores are the number of standard deviations above or below the mean that each value falls. A Z-score of 2 indicates that an observation is 2 standard deviations above the average, whilst a Z-score of -2 indicates it is 2 standard deviations below the average.
Mathematically, the Z-score is given by:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the data.
Percentile vs Z score:
It must be underscored that whilst the percentile method uses the median as its average (the 50th percentile), the Z-score uses the mean as its average. Thus, a Z-score of 0 represents a value equal to the mean. The farther a Z-score is from 0, the more unusual the value is.
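A minimal sketch of Z-score based outlier detection on synthetic data; the threshold of 3 standard deviations is a common convention rather than a rule from these notes:

```python
# Flag values whose Z-score magnitude exceeds 3.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # one injected outlier

z_scores = (values - values.mean()) / values.std()
is_outlier = np.abs(z_scores) > 3

print("flagged as outliers:", values[is_outlier])
cleaned = values[~is_outlier]
```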
3.3 Removal of outliers using the Interquartile Range (IQR)
Unlike the more familiar mean and standard deviation, the interquartile range and the median are robust measures. If the dataset is normally distributed, you can use the standard deviation to determine the percentage of observations that fall specific distances from the mean. However, that does not work if the data is not normally distributed (i.e., a skewed distribution), and the IQR is an excellent alternative.
What is Interquartile Range?
The Interquartile Range (IQR) measures the spread of the middle half of the data – it is the range of the middle 50% of your sample – and is used to assess the variability of where most of the values lie. To find the outliers, a multiple of the IQR (normally 1.5) is subtracted from the 25th percentile, giving a lower limit of Q1 – 1.5 IQR, and added to the 75th percentile, giving an upper limit of Q3 + 1.5 IQR. Any samples beyond the lower and upper limits are classified as “outliers”.
The figure below illustrates the math better:
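As a complement to the figure, here is a minimal sketch of the IQR rule in code; the file path and the 'price' column are illustrative assumptions:

```python
# IQR rule: flag anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
import pandas as pd

df = pd.read_csv("airbnb.csv")
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["price"] < lower) | (df["price"] > upper)]
cleaned = df[df["price"].between(lower, upper)]
```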
4. Notebook examples of Data Cleansing
The Jupyter notebooks in my Github link below illustrate the following:
- Detecting and eliminating outliers using percentile
- Detecting and eliminating outliers using Z-score
- Detecting and eliminating outliers using IQR
This is the Github link: https://github.com/atanejajlr/linkedin_feature_engineering
5. Feature Scaling
5.1 Why Feature Scaling?
Real-world datasets often contain features that vary in magnitude and units. Therefore, in order that the machine learning model interprets all features on the same scale, we have to perform “feature scaling”.
Feature scaling helps algorithms that minimize a cost function, such as gradient descent, converge faster, and it even becomes mandatory in many cases, as discussed below. To understand feature scaling, let us examine the relationship between feature values and the model parameters through an example.
Importance of Feature Scaling:
To understand the importance of feature scaling in the most intuitive sense, think about Principal Component Analysis (PCA)
PCA attempts to locate the principal components by choosing a "u" (the direction along which you project the data points) so that you get the maximum variance. You want the maximum variance because you want to retain the maximum information from your data set.
Suppose your dataset comprises ‘height’ and ‘weight’ features. Then, because of the inherent difference in scales between height and weight, PCA might determine that the direction of maximal variance corresponds to the ‘weight’ axis if no feature scaling is done – which is clearly misleading, because a variation in height of 1 m is highly significant. Hence feature scaling becomes mandatory here.
The same applies to algorithms like K-Nearest Neighbours, where you deal with Euclidean distances.
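A minimal sketch of this effect with scikit-learn; the synthetic height/weight data is illustrative:

```python
# Effect of scaling on PCA: with raw units the 'weight' column (kg) dominates
# the variance, so the first principal component aligns with the weight axis;
# after standardization both features contribute.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
height_m = rng.normal(1.70, 0.10, 200)                    # metres: tiny numeric spread
weight_kg = 45 * height_m + rng.normal(0.0, 8.0, 200)     # kilograms: large numeric spread
X = np.column_stack([height_m, weight_kg])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("first component, raw data:   ", pca_raw.components_[0])     # ~[0, 1]: weight dominates
print("first component, scaled data:", pca_scaled.components_[0])  # both features contribute
```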
5.2 Understanding Feature Scaling through an example
Let us consider the “Hello World” example of machine learning, wherein you’re predicting the price of a house, with the associated features being:
- Size in square feet – feature x1
- Number of bedrooms – feature x2
Here the size in square feet may range from 300 – 2000 square feet and the number of bedrooms may range from 0 – 5. So, the feature x1 takes on a relatively large range of values and the feature x2 takes on a relatively small range of values.
Let us say the model associated with the price prediction is a linear model of the form price = w1·x1 + w2·x2 + b, and let us say we’re predicting the price of a 5-bedroom house of size 2000 square feet.
One choice of the parameters of the model can be (the bias value of b = 50 is assumed here so that the numbers below work out):
w1 = 0.1, w2 = 50, b = 50
This will result in a price evaluation of:
price = 0.1 × 2000 + 50 × 5 + 50 = 200 + 250 + 50 = 500, i.e., roughly 500K.
Let’s say that a price of 500K is approximately the right price; thus, these parameters have been chosen reasonably.
Here, it may be noticed that when the feature value is relatively large (x1 -> 2000 square feet), the corresponding parameter value is small (w1 being 0.1) and when the feature value is relatively small (x2 being 5 bedrooms), the corresponding parameter value is relatively large (w2 being 50).
Let us examine the above visually, with a scatter plot of the features: the size in square feet on the horizontal axis vs the number of bedrooms on the vertical axis, for some of the training examples, as shown below:
Now, let us see how the cost function might look in a contour plot. It may be recalled that, by definition of a “contour”, all points on a particular contour denote the same value of the cost function.
As may be noticed from the figure above, the contours form ellipses which are shorter along one axis (the axis corresponding to w1) and longer along the other (the axis corresponding to w2).
In such a case, a minimization algorithm such as gradient descent might take a very long time to converge, as it may bounce back and forth before it finds the global minimum, as illustrated in the figure below:
How do we solve the above problem?
In such a case, it is useful to scale the features, i.e., x1 (the square footage) and x2 (the number of bedrooms), so that the scaled/transformed features lie between 0 and 1. After the transformation, the scatter plot looks as shown in the figure below:
As may be noticed, the scaled plot is different from the unscaled/untransformed plot. This is because the scaled features x1 and x2 now take on ranges of values that are comparable to each other, and if you now run a minimization algorithm like gradient descent on the scaled features, the contour plot of the cost function looks like the one below:
As may be noticed, the contour plots are no longer tall and skinny, and gradient descent can find a much more direct path to the global minimum, as shown in the figure below.
Thus, to conclude: if you have different features that take on very different ranges of values, an algorithm like gradient descent can converge slowly, but rescaling the features so that they take on comparable values can speed up the minimization significantly.
5.3 Possible ways to scale features
Let us now see the possible ways to scale the features, these include:
- Divide by maximum: Here we take each feature and divide each sample of the feature by the maximum value, so that every value lies between 0 and 1 (0 <= x1 <= 1).
Thus, considering the same example housing price dataset, originally:
300 <= x1 <= 2000 and 0 <= x2 <= 5
Scaling by divide-by-maximum gives x1,scaled = x1 / 2000 and x2,scaled = x2 / 5, thus:
0.15 <= x1,scaled <= 1 and 0 <= x2,scaled <= 1
- Mean normalization: In mean normalization, one starts with the original features and re-scales them so that they are centred around 0; normally, the rescaled features will lie between -1 and +1. The re-scaled feature based on mean normalization is:
x1,scaled = (x1 − μ1) / (max(x1) − min(x1))
That is, for feature x1 we calculate the mean μ1 of x1 over all the training examples and then take the corresponding maximum and minimum values of the feature x1 over all the training examples.
- Z-score normalization: Another common feature scaling method is Z-score normalization. To carry out Z-score normalization we have to calculate the standard deviation of each feature. Z-score normalization for feature x1 is given by:
x1,scaled = (x1 − μ1) / σ1
where μ1 is the mean of feature x1 over all the training examples and σ1 is its standard deviation. A short sketch of all three approaches follows this list.
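Here is that sketch; the sample values are illustrative:

```python
# The three scaling approaches applied to the two housing features
# from the running example.
import numpy as np

x1 = np.array([2000.0, 1500.0, 800.0, 300.0])   # size in square feet
x2 = np.array([5.0, 3.0, 2.0, 0.0])             # number of bedrooms

# 1) Divide by maximum: values end up in (0, 1].
x1_div_max = x1 / x1.max()
x2_div_max = x2 / x2.max()

# 2) Mean normalization: (x - mean) / (max - min), centred around 0.
x1_mean_norm = (x1 - x1.mean()) / (x1.max() - x1.min())

# 3) Z-score normalization: (x - mean) / std.
x1_zscore = (x1 - x1.mean()) / x1.std()

print(x1_div_max, x1_mean_norm, x1_zscore, sep="\n")
```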
5.4 Scikit-learn libraries for feature scaling
There are three different types of scalers in the Scikit-learn library for feature scaling. These include:
- Min-Max Scaler
- Standard Scaler
- Robust Scaler
Min-Max Scaler:
Using the Min-Max scaler, each feature is transformed into the range [0, 1], i.e., the minimum and maximum values of the feature map to 0 and 1. The scaled value of a sample is given by:
x_scaled = (x − min(x)) / (max(x) − min(x))
Standard Scaler:
The Standard Scaler standardizes the features by removing the mean and scaling to unit variance. The standard score of a sample is calculated as:
z = (x − μ) / σ
where μ and σ are the mean and the standard deviation respectively.
The mean and the standard deviation are stored so that they can be reused later during model serving. The Standard Scaler is often used with many machine learning algorithms; however, it may not work well if an individual feature is not more or less standard normally distributed.
Robust Scaler:
This scaler removes the median and scales the data according to a quantile range – the interquartile range (IQR), which lies between the first and the third quartiles.
It must be underscored that standardization of the dataset (i.e., using the Standard Scaler as described above) is often carried out. However, outliers can influence the mean and variance in a negative way, and in such circumstances the median and the interquartile range give better results.
Which is the preferred scaler: Min-Max Scaler, Standard Scaler or Robust Scaler?
The Min-Max Scaler will transform each value in the column proportionally into the range [0, 1].
The Standard Scaler will transform each value in the column based on the mean and the standard deviation. This method is typically used when the distribution is Gaussian.
If there are outliers in the dataset, the Robust Scaler is the preferred option. Alternatively, if the outliers have already been dealt with during exploratory data analysis, the Standard Scaler or Min-Max Scaler may be used, depending on whether the data is normally distributed or not.
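A minimal illustration comparing the three scalers on a toy feature containing one outlier (the values are made up):

```python
# RobustScaler is the least distorted by the single extreme value.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # note the outlier

print("MinMax:  ", MinMaxScaler().fit_transform(X).ravel())
print("Standard:", StandardScaler().fit_transform(X).ravel())
print("Robust:  ", RobustScaler().fit_transform(X).ravel())
```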
6. Notebook illustrations of feature scaling
The Jupyter notebook in the Github link mentioned below shows some example illustrations of feature scaling using scikit-learn libraries.
This is the Github link: https://github.com/atanejajlr/linkedin_feature_engineering
7. Feature Selection Methods: Introduction
Having spoken about data cleansing and the statistical techniques it uses, about feature scaling and its importance in algorithms like Principal Component Analysis and K-Nearest Neighbours as well as during the minimization of a cost function, and having pointed to notebooks demonstrating these techniques, it is time to discuss some of the feature selection methods.
In this article: https://www.dhirubhai.net/pulse/feature-selection-dimensionality-reduction-ajay-taneja/ I have discussed the feature selection methods in sufficient detail – hence this post will focus on pointing out some highlights and notebooks demonstrating the application of these methods.
8. Highlights of the feature selection methods
Feature selection methods may be classified into supervised and unsupervised methods. Whereas supervised methods consider the correlation between the features and the target variable, unsupervised methods do not.
The methods that fall under supervised feature selection include:
- Filter methods
- Wrapper methods and
- Embedded methods
Filter methods:
In filter methods, we start with all the features and select the best subset to give to the machine learning model. In these methods, we compute a correlation matrix that tells us how the features are correlated with one another and with the target variable. Some commonly used correlation measures are Pearson’s correlation, the Kendall tau rank correlation and Spearman’s correlation.
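A minimal sketch of a correlation-based filter; the file path, the 'price' target column and k = 5 are illustrative assumptions:

```python
# Rank features by absolute Pearson correlation with the target, keep the top k.
import pandas as pd

df = pd.read_csv("housing.csv")
target = "price"

corr_with_target = (
    df.select_dtypes("number")   # keep numeric columns only
      .corr()[target]            # full correlation matrix -> column for the target
      .drop(target)
      .abs()
      .sort_values(ascending=False)
)

k = 5
selected_features = corr_with_target.head(k).index.tolist()
print(selected_features)
```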
Wrapper methods:
Popular wrapper methods for feature selection include forward selection, backward elimination and recursive feature elimination.
- Forward selection is a greedy method wherein we select one feature at a time, pass it to the machine learning model and evaluate its importance. We repeat the process, adding one feature in every iteration, until no improvement is seen; at that point we have generated the best subset of features.
- Backward selection is just the reverse of forward selection: we start with “all of the features” and evaluate the model performance while removing one feature at a time.
- Recursive feature elimination: In recursive feature elimination, we use a model to evaluate feature importance; a Random Forest classifier is one model type with which we can evaluate feature importance. First, we select the desired number of features and fit the model. The model ranks the features by importance, and we discard the least important ones; we repeat until the desired number of features remains. Recursive feature elimination often turns out to be the best-performing method of all. The sketch after this list shows these wrapper methods in scikit-learn.
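A minimal sketch, assuming the scikit-learn breast-cancer dataset and a target of 5 selected features; backward selection over many features can be slow:

```python
# Wrapper methods with scikit-learn: forward/backward selection via
# SequentialFeatureSelector and recursive feature elimination via RFE,
# both wrapped around a RandomForestClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Forward selection: start with no features, add the single best one each round.
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward").fit(X, y)

# Backward elimination: start with all features, drop the weakest each round.
backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward").fit(X, y)

# RFE: fit, rank features by importance, recursively discard the least important.
rfe = RFE(model, n_features_to_select=5).fit(X, y)

print("forward :", forward.get_support(indices=True))
print("backward:", backward.get_support(indices=True))
print("RFE     :", rfe.get_support(indices=True))
```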
9. Notebook illustrations of feature selection methods
The Jupyter notebook in the Github link mentioned below shows some example illustrations of various feature selection techniques.
The notebook runs through the different techniques for performing feature selection on a dataset and then compares the evaluation metrics obtained with each subset of features against a baseline model trained with all features.
This is the Github link: https://github.com/atanejajlr/linkedin_feature_engineering