Feature Engineering – Data Cleansing, Transformation and Selection - my notes
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Data pre-processing
All machine learning models require data pre-processing to improve training. The way the data is represented can have a strong influence on how a machine learning model learns from it. For example, models tend to converge faster and more reliably when numerical data is scaled appropriately. The techniques used to select and transform the data are key to increasing the predictive quality of the models.
The art of feature engineering tries to improve the model’s ability to learn while reducing the compute resources required. It does so by transforming and projecting (e.g. dimensionality reduction), eliminating (feature selection methods) or combining the features in the raw data to form a new version of the dataset.
Important: Feature Engineering should be consistent in training and serving
During training you have the entire data set available to you. So, one can use the global properties of individual features in the feature engineering transformation.
- For example, you can compute the standard deviation of a feature and use it to perform normalization. It should be underscored that when you serve the model you must apply the same feature engineering, so that the model receives the same kind of data it was trained on. So, if you normalized the data using the standard deviation, such global constants should be saved and reused during serving. Failing to do so is a very common source of problems in production systems, and such errors can be difficult to debug. A minimal sketch of this pattern follows after this list.
- Or, if you created a one-hot vector for a categorical feature during training, you also need to create a one-hot vector when you serve the model.
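Here is that sketch, assuming a simple standard-deviation-based normalization; the feature values and the file name are illustrative, not from the original notes:

```python
# Minimal sketch: compute the scaling statistics once at training time,
# persist them, and reuse the very same constants at serving time.
import json
import numpy as np

def fit_scaler(train_values: np.ndarray, path: str = "scaler_stats.json") -> None:
    """Compute and save the global mean/std of a training feature."""
    stats = {"mean": float(train_values.mean()), "std": float(train_values.std())}
    with open(path, "w") as f:
        json.dump(stats, f)

def transform(values: np.ndarray, path: str = "scaler_stats.json") -> np.ndarray:
    """Apply the saved training statistics -- used both at training and at serving."""
    with open(path) as f:
        stats = json.load(f)
    return (values - stats["mean"]) / stats["std"]

# Training: fit on the full training set, then transform it.
train_feature = np.array([300.0, 800.0, 1500.0, 2000.0])
fit_scaler(train_feature)
train_scaled = transform(train_feature)

# Serving: a single incoming example is scaled with the same saved constants.
serving_scaled = transform(np.array([1200.0]))
```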
This series/document will cover the following topics related to feature engineering:
- Section 2 will throw some light on the pre-processing operations that are used for feature engineering
- Section 3 will be about Data cleansing and will talk about some of the statistical methods that may be used to detect outliers in the dataset.
- Section 4 will point to my Git repository which has the Jupyter notebooks showing some data cleansing exercises using different approaches
- Section 5 will talk about feature scaling, and section 6 will comprise a notebook relating to feature scaling.
2. Pre-processing operations
Let us talk about some of the pre-processing operations that are used for feature engineering:
- Data cleansing: This involves eliminating or correcting erroneous data
- Feature tuning: It is often required to perform transformations on the data, such as scaling or normalizing, since machine learning models and neural networks are sensitive to the range of numerical features.
- Feature extraction (dimensionality reduction vs feature selection methods): One shouldn't just throw everything at the machine learning model and rely on the training process to determine which features are actually useful. Thus, it is imperative to carry out feature selection and/or dimensionality reduction to reduce the number of features in a dataset. Whilst both ‘feature selection’ and ‘dimensionality reduction’ are used to reduce the number of features in a dataset, there is an important difference:
 - Feature selection simply selects or excludes given features WITHOUT changing them.
 - Dimensionality reduction, by contrast, transforms the features into a lower-dimensional space.
Feature selection identifies the features that best represent the relationships within the feature space as well as with the target that the model will try to predict. Feature selection methods remove features that do not influence the outcome. This reduces the size of the feature space, hence reducing the resource requirements for processing the data as well as the model complexity. I have discussed feature selection and dimensionality reduction here: https://www.dhirubhai.net/pulse/feature-selection-dimensionality-reduction-ajay-taneja/
- Bucketizing and binning: Sometimes it is useful to bucket different data ranges into a one-hot encoding. For example, if you’re dealing with a housing dataset that records the year each house was built, you could bucket the years as shown in the sketch below.
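A minimal sketch with pandas; the 'year_built' column name and the bin edges are illustrative assumptions:

```python
# Bucketize a 'year_built' column into ranges and one-hot encode the buckets.
import pandas as pd

houses = pd.DataFrame({"year_built": [1962, 1985, 1999, 2004, 2015]})

# Bucket the raw years into ranges...
houses["year_bucket"] = pd.cut(
    houses["year_built"],
    bins=[1950, 1980, 2000, 2010, 2020],
    labels=["1950-1980", "1980-2000", "2000-2010", "2010-2020"],
)

# ...then turn each bucket into a one-hot column.
one_hot = pd.get_dummies(houses["year_bucket"], prefix="built")
print(one_hot)
```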
3. Data Cleansing
As mentioned, data cleansing involves eliminating or correcting erroneous data. Outliers are generally defined as samples that lie far away from the mainstream of the data. Outliers in a dataset may be caused by measurement or input error, data corruption, etc. Statistical methods may be used to detect outliers in the dataset; some of these methods are discussed below. However, it should be highlighted that any of these methods must be used carefully. In the end, it comes down to your subject-area knowledge and the investigation of the candidate outlier: it is always possible that an unusual value is part of the natural variation of the process rather than a problematic point.
3.1 Percentile method:
In the percentile method, you decide on a specific percentile threshold. For example, anything above the 98th percentile or below the 2nd percentile may be considered an outlier, and you then trim or cap these samples in the dataset. The percentile method is arbitrary, and you will have to determine the threshold manually based on domain knowledge.
The Jupyter notebook shown below uses a dataset where data above the 98th percentile and below the 2nd percentile is removed; the Airbnb dataset is from Kaggle.
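A minimal sketch of the same idea; the file path and the numeric 'price' column are illustrative assumptions:

```python
# Percentile-based outlier handling: trim (drop) or cap (clip) everything
# outside the 2nd-98th percentile band.
import pandas as pd

df = pd.read_csv("airbnb.csv")
low, high = df["price"].quantile([0.02, 0.98])

# Option 1: trim -- drop the offending rows entirely.
trimmed = df[df["price"].between(low, high)]

# Option 2: cap (winsorize) -- keep the rows but clip the extreme values.
capped = df.assign(price=df["price"].clip(lower=low, upper=high))
```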
3.2 Using Z-score to detect outliers
The Z-score quantifies the unusualness of an observation when your data follows a normal distribution. Z-scores are the number of standard deviations above or below the mean that each value falls. A Z-score of 2 indicates that an observation is 2 standard deviations above the average, whilst a Z-score of -2 indicates it is 2 standard deviations below the average.
Mathematically, the Z-score is given by:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the data.
Percentile vs Z score:
It must be underscored that whilst the percentile method uses the median as its average (the 50th percentile), the Z-score uses the mean as its average. Thus, a Z-score of 0 represents a value equal to the mean. The farther a Z-score is from 0, the more unusual the value is.
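A minimal sketch of Z-score based outlier detection on synthetic data; the threshold of 3 standard deviations is a common convention rather than a rule from these notes:

```python
# Flag values whose Z-score magnitude exceeds 3.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # one injected outlier

z_scores = (values - values.mean()) / values.std()
is_outlier = np.abs(z_scores) > 3

print("flagged as outliers:", values[is_outlier])
cleaned = values[~is_outlier]
```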
3.3 Removal of outliers using the Interquartile Range (IQR)
Unlike the more familiar mean and standard deviation, the interquartile range and the median are robust measures. If the dataset is normally distributed, you can use the standard deviation to determine the percentage of observations that fall specific distances from the mean. However, that does not work if the data is not normally distributed (i.e., a skewed distribution), and the IQR is an excellent alternative.
What is Interquartile Range?
The Interquartile Range (IQR) measures the spread of the middle half of the data – it is the range of the middle 50% of your sample – and is used to assess the variability of where most of the values lie. To find the outliers, a multiple of the IQR (normally 1.5) is subtracted from the 25th percentile, giving a lower limit of Q1 – 1.5 IQR, and added to the 75th percentile, giving an upper limit of Q3 + 1.5 IQR. Any samples beyond the lower and upper limits are classified as “outliers”.
The figure below illustrates the math better:
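As a complement to the figure, here is a minimal sketch of the IQR rule in code; the file path and the 'price' column are illustrative assumptions:

```python
# IQR rule: flag anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
import pandas as pd

df = pd.read_csv("airbnb.csv")
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["price"] < lower) | (df["price"] > upper)]
cleaned = df[df["price"].between(lower, upper)]
```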
4. Notebook examples of Data Cleansing
The Jupyter notebooks in my Github link below illustrate the following:
- Detecting and eliminating outliers using percentile
- Detecting and eliminating outliers using Z-score
- Detecting and eliminating outliers using IQR
This is the Github link: https://github.com/atanejajlr/linkedin_feature_engineering
5. Feature Scaling
5.1 Why Feature Scaling?
Real-world datasets often contain features that vary in magnitude and units. Therefore, in order that the machine learning model interprets all features on the same scale, we have to perform “feature scaling”.
Feature scaling helps algorithms that minimize a cost function, such as gradient descent, converge faster, and it even becomes mandatory in many cases, as discussed below. To understand feature scaling, let us examine the relationship between feature values and the model parameters through an example.
Importance of Feature Scaling:
To understand the importance of feature scaling in the most intuitive sense, think about Principal Component Analysis (PCA)
PCA attempts to locate the principal components by choosing a "u" (the direction along which you project the data points) so that you get the maximum variance. You want the maximum variance because you want to retain the maximum information from your data set.
Suppose your dataset comprises ‘height’ and ‘weight’ features. Then, because of the inherent difference in scales between height and weight, PCA might determine that the direction of maximal variance corresponds to the ‘weight’ axis if no feature scaling is done – which is clearly misleading, because a variation in height of 1 m is highly significant. Hence feature scaling becomes mandatory here.
The same applies to algorithms like K-Nearest Neighbours, where you deal with Euclidean distances.
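A minimal sketch of this effect with scikit-learn; the synthetic height/weight data is illustrative:

```python
# Effect of scaling on PCA: with raw units the 'weight' column (kg) dominates
# the variance, so the first principal component aligns with the weight axis;
# after standardization both features contribute.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
height_m = rng.normal(1.70, 0.10, 200)                    # metres: tiny numeric spread
weight_kg = 45 * height_m + rng.normal(0.0, 8.0, 200)     # kilograms: large numeric spread
X = np.column_stack([height_m, weight_kg])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("first component, raw data:   ", pca_raw.components_[0])     # ~[0, 1]: weight dominates
print("first component, scaled data:", pca_scaled.components_[0])  # both features contribute
```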
5.2 Understanding Feature Scaling through an example
Let us consider the “Hello World” example of machine learning, wherein you’re predicting the price of a house, with the associated features being:
- Size in square feet – feature x1
- Number of bedrooms – feature x2
Here the size in square feet may range from 300 – 2000 square feet and the number of bedrooms may range from 0 – 5. So, the feature x1 takes on a relatively large range of values and the feature x2 takes on a relatively small range of values.
Let us say the model associated with the price prediction is a linear model of the form price = w1·x1 + w2·x2 + b, and let us say we’re predicting the price of a 5-bedroom house of size 2000 square feet.
One choice of the parameters of the model can be (the bias value of b = 50 is assumed here so that the numbers below work out):
w1 = 0.1, w2 = 50, b = 50
This will result in a price evaluation of:
price = 0.1 × 2000 + 50 × 5 + 50 = 200 + 250 + 50 = 500, i.e., roughly 500K.
Let’s say that a price of 500K is approximately the right price; thus, these parameters have been chosen reasonably.
Here, it may be noticed that when the feature value is relatively large (x1 -> 2000 square feet), the corresponding parameter value is small (w1 being 0.1) and when the feature value is relatively small (x2 being 5 bedrooms), the corresponding parameter value is relatively large (w2 being 50).
Let us examine the above visually, with a scatter plot of the features: the size in square feet on the horizontal axis vs the number of bedrooms on the vertical axis, for some of the training examples, as shown below:
Now, let us see how the cost function might look in a contour plot. It may be recalled that, by definition of a “contour”, all points on a particular contour denote the same value of the cost function.
As may be noticed from the figure above, the contours form ellipses which are shorter along one axis (the axis corresponding to w1) and longer along the other (the axis corresponding to w2).
In such a case, a minimization algorithm such as gradient descent might take a very long time to converge, as it may bounce back and forth before it finds the global minimum, as illustrated in the figure below:
How do we solve the above problem?
In such a case, it is useful to scale the features, i.e., x1 (the square footage) and x2 (the number of bedrooms), so that the scaled/transformed features lie between 0 and 1. After the transformation, the scatter plot looks as shown in the figure below:
As may be noticed, the scaled plot is different from the unscaled/untransformed plot. This is because the scaled features x1 and x2 now take on ranges of values that are comparable to each other, and if you now run a minimization algorithm like gradient descent on the scaled features, the contour plot of the cost function looks like the one below:
As may be noticed, the contour plots are no longer tall and skinny, and gradient descent can find a much more direct path to the global minimum, as shown in the figure below.
Thus, to conclude: if you have different features that take on very different ranges of values, an algorithm like gradient descent can converge slowly, but rescaling the features so that they take on comparable values can speed up the minimization significantly.
5.3 Possible ways to scale features
Let us now see the possible ways to scale the features, these include:
- Divide by maximum: Here we take each feature and divide each sample of the feature by the maximum value, so that every value lies between 0 and 1 (0 <= x1 <= 1).
Thus, considering the same example housing price dataset, originally:
300 <= x1 <= 2000 and 0 <= x2 <= 5
Scaling by divide-by-maximum gives x1,scaled = x1 / 2000 and x2,scaled = x2 / 5, thus:
0.15 <= x1,scaled <= 1 and 0 <= x2,scaled <= 1
- Mean normalization: In mean normalization, one starts with the original features and re-scales them so that they are centred around 0; normally, the rescaled features will lie between -1 and +1. The re-scaled feature based on mean normalization is:
x1,scaled = (x1 − μ1) / (max(x1) − min(x1))
That is, for feature x1 we calculate the mean μ1 of x1 over all the training examples and then take the corresponding maximum and minimum values of the feature x1 over all the training examples.
- Z-score normalization: Another common feature scaling method is Z-score normalization. To carry out Z-score normalization we have to calculate the standard deviation of each feature. Z-score normalization for feature x1 is given by:
x1,scaled = (x1 − μ1) / σ1
where μ1 is the mean of feature x1 over all the training examples and σ1 is its standard deviation. A short sketch of all three approaches follows this list.
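Here is that sketch; the sample values are illustrative:

```python
# The three scaling approaches applied to the two housing features
# from the running example.
import numpy as np

x1 = np.array([2000.0, 1500.0, 800.0, 300.0])   # size in square feet
x2 = np.array([5.0, 3.0, 2.0, 0.0])             # number of bedrooms

# 1) Divide by maximum: values end up in (0, 1].
x1_div_max = x1 / x1.max()
x2_div_max = x2 / x2.max()

# 2) Mean normalization: (x - mean) / (max - min), centred around 0.
x1_mean_norm = (x1 - x1.mean()) / (x1.max() - x1.min())

# 3) Z-score normalization: (x - mean) / std.
x1_zscore = (x1 - x1.mean()) / x1.std()

print(x1_div_max, x1_mean_norm, x1_zscore, sep="\n")
```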
5.4 Scikit-learn libraries for feature scaling
There are three different types of scalers in the Scikit-learn library for feature scaling. These include:
- Min-Max Scaler
- Standard Scaler
- Robust Scaler
Min-Max Scaler:
Using the Min-Max scaler, each feature is transformed into the range [0, 1], i.e., the minimum and maximum values of the feature map to 0 and 1. The scaled value of a sample is given by:
x_scaled = (x − min(x)) / (max(x) − min(x))
Standard Scaler:
The Standard Scaler standardizes the features by removing the mean and scaling to unit variance. The standard score of a sample is calculated as:
z = (x − μ) / σ
where μ and σ are the mean and the standard deviation respectively.
The mean and the standard deviation are stored so that they can be reused later during model serving. The Standard Scaler is often used with many machine learning algorithms; however, it may not work well if an individual feature is not more or less standard normally distributed.
Robust Scaler:
This scaler removes the median and scales the data according to a quantile range – the interquartile range (IQR), which lies between the first and the third quartiles.
It must be underscored that standardization of the dataset (i.e., using the Standard Scaler as described above) is often carried out. However, outliers can influence the mean and variance in a negative way, and in such circumstances the median and the interquartile range give better results.
Which is the preferred scaler: Min-Max Scaler, Standard Scaler or Robust Scaler?
The Min-Max Scaler will transform each value in the column proportionally into the range [0, 1].
The Standard Scaler will transform each value in the column based on the mean and the standard deviation. This method is typically used when the distribution is Gaussian.
If there are outliers in the dataset, the Robust Scaler is the preferred option. Alternatively, if the outliers have already been dealt with during exploratory data analysis, the Standard Scaler or Min-Max Scaler may be used, depending on whether the data is normally distributed or not.
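A minimal illustration comparing the three scalers on a toy feature containing one outlier (the values are made up):

```python
# RobustScaler is the least distorted by the single extreme value.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # note the outlier

print("MinMax:  ", MinMaxScaler().fit_transform(X).ravel())
print("Standard:", StandardScaler().fit_transform(X).ravel())
print("Robust:  ", RobustScaler().fit_transform(X).ravel())
```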
6. Notebook illustrations of feature scaling
The Jupyter notebook in the Github link mentioned below shows some example illustrations of feature scaling using scikit-learn libraries.
This is the Github link: https://github.com/atanejajlr/linkedin_feature_engineering
7. Feature Selection Methods: Introduction
Having spoken about data cleansing and the statistical techniques it uses, about feature scaling and its importance in algorithms like Principal Component Analysis and K-Nearest Neighbours as well as during the minimization of a cost function, and having pointed to notebooks demonstrating these techniques, it is time to discuss some of the feature selection methods.
In this article: https://www.dhirubhai.net/pulse/feature-selection-dimensionality-reduction-ajay-taneja/ I have discussed the feature selection methods in sufficient detail – hence this post will focus on pointing out some highlights and notebooks demonstrating the application of these methods.
8. Highlights of the feature selection methods
Feature selection methods may be classified into supervised and unsupervised methods. Whereas supervised methods consider the correlation between the features and the target variable, unsupervised methods do not.
The methods that fall under supervised feature selection include:
- Filter methods
- Wrapper methods and
- Embedded methods
Filter methods:
In filter methods, we start with all the features and select the best subset to give to the machine learning model. In these methods, we compute a correlation matrix that tells us how the features are correlated with one another and with the target variable. Some commonly used correlation measures are Pearson’s correlation, the Kendall tau rank correlation and Spearman’s correlation.
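A minimal sketch of a correlation-based filter; the file path, the 'price' target column and k = 5 are illustrative assumptions:

```python
# Rank features by absolute Pearson correlation with the target, keep the top k.
import pandas as pd

df = pd.read_csv("housing.csv")
target = "price"

corr_with_target = (
    df.select_dtypes("number")   # keep numeric columns only
      .corr()[target]            # full correlation matrix -> column for the target
      .drop(target)
      .abs()
      .sort_values(ascending=False)
)

k = 5
selected_features = corr_with_target.head(k).index.tolist()
print(selected_features)
```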
Wrapper methods:
Popular wrapper methods for feature selection include forward selection, backward elimination and recursive feature elimination.
- Forward selection is a greedy method wherein we select one feature at a time, pass it to the machine learning model and evaluate its importance. We repeat the process, adding one feature in every iteration, until no improvement is seen; at that point we have generated the best subset of features.
- Backward selection is just the reverse of forward selection: we start with “all of the features” and evaluate the model performance while removing one feature at a time.
- Recursive feature elimination: In recursive feature elimination, we use a model to evaluate feature importance; a Random Forest classifier is one model type with which we can evaluate feature importance. First, we select the desired number of features and fit the model. The model ranks the features by importance, and we discard the least important ones; we repeat until the desired number of features remains. Recursive feature elimination often turns out to be the best-performing method of all. The sketch after this list shows these wrapper methods in scikit-learn.
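A minimal sketch, assuming the scikit-learn breast-cancer dataset and a target of 5 selected features; backward selection over many features can be slow:

```python
# Wrapper methods with scikit-learn: forward/backward selection via
# SequentialFeatureSelector and recursive feature elimination via RFE,
# both wrapped around a RandomForestClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Forward selection: start with no features, add the single best one each round.
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward").fit(X, y)

# Backward elimination: start with all features, drop the weakest each round.
backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward").fit(X, y)

# RFE: fit, rank features by importance, recursively discard the least important.
rfe = RFE(model, n_features_to_select=5).fit(X, y)

print("forward :", forward.get_support(indices=True))
print("backward:", backward.get_support(indices=True))
print("RFE     :", rfe.get_support(indices=True))
```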
9. Notebook illustrations of feature selection methods
The Jupyter notebook in the Github link mentioned below shows some example illustrations of various feature selection techniques.
The notebook runs through the different techniques for performing feature selection on a dataset and then compares the evaluation metrics obtained with each subset of features against a baseline model trained with all features.
This is the Github link: https://github.com/atanejajlr/linkedin_feature_engineering