Numeric Features Preprocessing


I have started another interesting course at Coursera:

How to Win a Data Science Competition: Learn from Top Kagglers

Today I watched one of its videos on Numeric Features Preprocessing and would like to share what I learned.


1. Feature Scaling:

Some models do not care about feature scaling while others do; based on this, models can be broadly divided into:

  • Tree-based: Decision Tree, Random Forest, AdaBoost
  • Non-tree-based: Nearest neighbors, Linear SVM, RBF SVM, Neural networks

Starting with the decision tree classifier: it does not change its behavior or its predictions when features are scaled; instead, it focuses on finding the most useful split for each feature.


kNN, linear models, and methods that rely on gradient descent are badly affected by feature scaling.


The solutions presented in the video were:

  • MinMaxScaler (from sklearn) 
  • StandardScaler (from sklearn)


In MinMaxScaler, we first subtract the minimum value from the feature and then divide by the difference between the maximum and the minimum:

  sklearn.preprocessing.MinMaxScaler

   X = (X - X.min())/(X.max() - X.min())
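
A minimal sklearn sketch of the same idea (the toy feature values below are made up for illustration):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # two toy columns with very different ranges (values are made up)
    X = np.array([[1.0, 100.0],
                  [2.0, 250.0],
                  [3.0, 400.0]])

    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)  # every column now lies in [0, 1]
    print(X_scaled)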


In StandardScaler, we first subtract the mean value from the feature and then divide the result by the feature's standard deviation:

    sklearn.preprocessing.StandardScaler

    X = (X - X.mean())/X.std()
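
And a matching sketch with StandardScaler, again on made-up values:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 100.0],
                  [2.0, 250.0],
                  [3.0, 400.0]])

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # each column now has mean ~0 and standard deviation ~1
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))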

Applying either the MinMaxScaler or StandardScaler transformation greatly helps non-tree-based models.




2. Outliers:

Outliers are observations that lie at an abnormally large distance from the other observations in the data; these few points can strongly influence model training results.

To counter outliers, the video gave the following suggestions:

  • Winsorization/Winsorize
  • Rank Transformation (from scipy ... scipy.stats.rankdata)
  • Log transform and Square Root transform

In Winsorization, we clip feature values between two chosen lower and upper bounds, picked at some percentiles of that feature, for example the 1st and 99th percentiles.
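
A minimal sketch of winsorization with NumPy, assuming clipping at the 1st and 99th percentiles (the feature values are made up):

    import numpy as np

    # hypothetical feature with a couple of extreme values
    x = np.array([1.0, 2.0, 3.0, 2.5, 1.8, 1000.0, -500.0])

    # compute the 1st and 99th percentiles and clip everything outside them
    lower, upper = np.percentile(x, [1, 99])
    x_winsorized = np.clip(x, lower, upper)
    print(x_winsorized)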


In the rank transformation, the spacings between properly sorted values are set to be equal. This transformation can, for example, be a better option than MinMaxScaler if we have outliers, because the rank transformation will move the outliers closer to the other objects.


Linear models, KNN, and neural networks can benefit from this kind of transformation if there is no time to handle outliers manually.
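
A small sketch of the rank transformation with scipy.stats.rankdata (the values are made up; the outlier simply becomes the largest rank):

    import numpy as np
    from scipy.stats import rankdata

    # feature with one extreme outlier
    x = np.array([10.0, 20.0, 30.0, 10000.0])

    ranks = rankdata(x)   # -> [1. 2. 3. 4.]; the outlier is no longer far away
    print(ranks)

Keep in mind that to apply the same transformation to test data, one would either store the value-to-rank mapping learned on the train set or rank train and test together; otherwise the ranks are not comparable.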


Then there are the log transform, np.log(1 + x), and the square root transform, np.sqrt(x + 2/3). Both of these transforms drive too-large values closer to the feature's average value. Along with this, values near zero become a bit more distinguishable. Despite their simplicity, one of these transformations can improve your neural network's results significantly.
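
A quick sketch of both transforms with NumPy (the input values are made up):

    import numpy as np

    x = np.array([0.0, 1.0, 10.0, 1000.0])  # heavy right tail, made-up values

    x_log = np.log1p(x)            # equivalent to np.log(1 + x)
    x_sqrt = np.sqrt(x + 2 / 3)    # square root transform mentioned in the course

    print(x_log)
    print(x_sqrt)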


Another important takeaway from this video was: "Another important moment which holds true for all preprocessings is that sometimes, it is beneficial to train a model on concatenated data frames produced by different preprocessings, or to mix models training differently-preprocessed data. Again, linear models, KNN, and neural networks can benefit hugely from this."
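
As a rough illustration of that last point (the column names and data below are hypothetical), one could concatenate differently preprocessed copies of the same features and train on the result:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # hypothetical numeric data frame
    df = pd.DataFrame({"a": [1.0, 5.0, 200.0], "b": [0.1, 0.2, 0.9]})

    minmax = pd.DataFrame(MinMaxScaler().fit_transform(df),
                          columns=[c + "_minmax" for c in df.columns])
    standard = pd.DataFrame(StandardScaler().fit_transform(df),
                            columns=[c + "_std" for c in df.columns])

    # the concatenated frame can then be fed to a linear model, KNN, or neural network
    df_combined = pd.concat([minmax, standard], axis=1)
    print(df_combined)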
