Feature engineering - part two: numerical features

Mahdi Torabi Rad

"Some machine learning projects succeed, and some fail. What makes the difference? Easily the most important factor is the features used" – Pedro Domingos

Feature engineering is one of the most important parts of practical machine learning. It refers to creating features that are more suitable for the algorithm, or that represent the information in the data better than the original features do. In this short article, I give a high-level, easy-to-understand discussion of engineering numerical features. Categorical features were covered in an earlier article.

Numerical features

Numerical features are those that take continuous numerical values. When engineering them, keep in mind that useful features have to conform to the assumptions of the model you will use later and should represent salient aspects of the data. For example, imagine you want to build a model that predicts Airbnb rental prices in a city. Those prices can reasonably be expected to correlate more strongly with the distance to the city center than with latitude or longitude. If the training data contains latitude and longitude instead of the distance to the center, good feature engineering practice is to replace them with that distance.
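As a minimal sketch of that idea, the snippet below derives a distance-to-center feature from latitude and longitude using the haversine formula. The listings DataFrame, its column names, and the center coordinates are hypothetical, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

def haversine_km(lat, lon, center_lat, center_lon):
    """Great-circle distance (in km) between points and a fixed center."""
    lat, lon, center_lat, center_lon = map(np.radians, [lat, lon, center_lat, center_lon])
    dlat = lat - center_lat
    dlon = lon - center_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(center_lat) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Hypothetical listings with 'latitude' and 'longitude' columns
listings = pd.DataFrame({"latitude": [52.52, 52.48], "longitude": [13.40, 13.35]})
CENTER_LAT, CENTER_LON = 52.5200, 13.4050  # assumed city-center coordinates

listings["dist_to_center_km"] = haversine_km(
    listings["latitude"], listings["longitude"], CENTER_LAT, CENTER_LON
)
```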

First and second checks: scale and distribution

The first checks on numerical data are of its scale and distribution. Models that are smooth functions of the input features are sensitive to the scale of the input. Likewise, models that rely on Euclidean distance (the length of the line segment connecting two points in a high-dimensional space) are sensitive to scale; logical functions, on the other hand, are not. Two common feature-scaling methods are standardization and normalization. Their drawbacks are, respectively, mapping to an unbounded range (you can still get very large or very small values) and being sensitive to outliers. If you have sparse data, be cautious with scaling: subtracting a quantity (such as the mean) from the original features destroys sparsity and can burden the model.
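A minimal sketch of both scaling methods using scikit-learn; the toy feature matrix below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # toy feature matrix

# Standardization: zero mean, unit variance; the range stays unbounded
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): maps each feature to [0, 1]; sensitive to outliers
X_norm = MinMaxScaler().fit_transform(X)

# For sparse inputs, StandardScaler(with_mean=False) avoids subtracting the mean
# and therefore preserves sparsity.
```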

Tree-based models do not require feature scaling because logical functions are not sensitive to the inputs' scale. The only exception is when the input scale grows over time, for example when the feature is an accumulated count. In that case, the feature values will eventually extend outside the range the model was trained on, and you might need periodic rescaling to combat this.

The second check on numerical data is of its distribution, i.e., how likely a random variable is to take each of its possible values. Some models are more sensitive to the underlying distribution than others. For example, in linear regression, when the target variable spreads over several orders of magnitude, the errors typically become heteroscedastic and non-normal. That, in turn, violates the assumptions under which the estimator is BLUE (Best Linear Unbiased Estimator), and should therefore be avoided. If your target variable looks like that, you can apply a log transform. More on transformations later in the article.
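A minimal sketch of fitting a linear model on the log of the target and inverting the transform at prediction time; the data here are made up, and log1p/expm1 are used simply to handle zeros safely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Target spans several orders of magnitude (e.g., prices), so model log1p(y)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([12.0, 95.0, 1_050.0, 23_000.0])

model = LinearRegression().fit(X, np.log1p(y))
y_pred = np.expm1(model.predict(X))  # invert the transform to get predictions in the original units
```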

Engineering interaction features

Interaction features let you feed the model more information about how inputs combine. A simple way of constructing one is to use the product of two features as a new feature (similar to a logical AND). More advanced interaction features can be the outputs of other models. Decision trees essentially create interaction features for free.
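A minimal sketch of building pairwise product features with scikit-learn's PolynomialFeatures; the toy matrix is made up.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [4.0, 5.0]])

# Pairwise products only: no squared terms, no bias column
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interactions.fit_transform(X)  # columns: x1, x2, x1*x2
```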

Be careful with the counts

The major problem with counts is that they can grow very fast over time. If that happens, you will end up with some extreme values, and if those values are not handled correctly, they can throw the model off. To avoid problems like that, you can use methods such as binarization. Sometimes binarization is necessary even when the count does not grow too fast. For example, if you want to build a model that predicts how much users like a song, keep in mind that a user who has listened to a song twenty times does not necessarily like it twice as much as a user who has listened to it ten times.
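A minimal sketch of binarizing listen counts; the counts below are made up.

```python
import numpy as np

listen_counts = np.array([0, 1, 3, 20, 500])

# Binarize: "has the user listened at least once?" instead of the raw count
listened = (listen_counts > 0).astype(int)  # -> [0, 1, 1, 1, 1]
```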

In summary, with numerical features: check magnitude, check distributions, try interaction features, and when you have counts, be careful.

More on counts: quantization or binning

Binning contains the scale by grouping the raw counts into several bins and then replacing the raw values with the bins they fall into. This is especially useful because many models struggle with long-tailed distributions. The two most common binning methods are fixed-width binning and quantile binning.
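A minimal sketch of both binning methods with pandas; the counts and bin edges are made up.

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 2, 9, 35, 120, 4500])

# Fixed-width binning with hand-picked edges that grow roughly by powers of ten
fixed_bins = pd.cut(counts, bins=[-1, 10, 100, 1000, np.inf], labels=False)

# Quantile binning: each bin receives roughly the same number of points
quantile_bins = pd.qcut(counts, q=3, labels=False)
```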

Transformation

Transformations, unlike scaling, change the distribution of the data. Two of the most common transformations are log and power.

The log transform maps the interval (0, 1) to (−∞, 0) and the interval (1, ∞) to (0, ∞). In other words, it compresses the range of large numbers and expands the range of small numbers. It should be used when the distribution is heavy-tailed (i.e., has more probability mass in the tail than the normal distribution). The log transform compresses the long tail at the high end into a shorter one and expands the low end into a longer head. In linear regression, the log transform may help satisfy the assumption of error normality.

Power transformation: these transformations change the distribution so that the variance no longer depends on the mean, which is why they are also called variance-stabilizing transformations. For example, for Poisson-distributed data, which are a natural model for counts, a suitable power transformation yields values whose variance no longer depends on the mean. Examples of power transformations include the square-root transformation and the Box-Cox transformation.
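A minimal sketch of the square-root and Box-Cox transforms using NumPy and SciPy; the counts are made up, and note that Box-Cox requires strictly positive inputs.

```python
import numpy as np
from scipy import stats

counts = np.array([1, 4, 9, 16, 100, 400], dtype=float)

sqrt_transformed = np.sqrt(counts)              # square-root transform
boxcox_transformed, lam = stats.boxcox(counts)  # Box-Cox; lam is the fitted lambda
```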
