Feature Engineering: The key to predictive modeling
Cover photo: https://digg.com/2018/the-whys-and-hows-of-learning-machine-learning

Peter Norvig, Google's Research Director, said, "We don't have better algorithms than anyone else; we just have more data." (Source: Forbes)

However, most applications of machine learning in the industry are confronted with the problem of messy and limited data. In a recent hackathon, simply extracting the weekday from the dates and adding it as a column to the dataset gave me a 3% increase in accuracy.

What can we do when we don't have access to rich and exhaustive datasets like Google's, which is most often the case?

Features, the columns of your dataframe, are key to helping machine learning models learn. Better features result in faster training and more accurate predictions. Take, for example, an algorithm that predicts sales at a fashion store over time. Fashion sales peak during the winter, especially around Christmas. Adding a feature that tells how many days away we are from Christmas gives the algorithm a lot more intuition than the date itself.

Similarly, we can engineer such simple features out of our intuition and domain knowledge to make the models more accurate. This process of combining domain knowledge, intuition and data science skills to create features that make models train faster and predict more accurately is called feature engineering.

This article will help you discover what feature engineering is, why it is important, the problems it can solve, how to engineer features, and where you can dive deeper. These techniques can deliver significant value in industry, hackathons and Kaggle competitions. I am sharing a few key ways of thinking about feature engineering that I picked up during my experience in the domain, along with learnings from data science communities such as StackOverflow, ResearchGate and KDnuggets, online courses from Coursera, Udemy and YouTube, and the work of industry mentors such as Brandon Rohrer and Jason Brownlee.

The article consists of the following sections:

  1. What is Feature Engineering and why is it important?
  2. Feature engineering techniques
  3. Suggested readings
  4. Summary

1. What is Feature Engineering and why is it important?

Today, the data science community is empowered by state-of-the-art algorithms, including deep nets and ensemble models, which can identify patterns in data beyond human cognizance. However, these models are limited by the amount of data available. Neural networks, for example, being non-linear learners, require a wealth of data to train (~5 million rows).

One way to circumvent these constraints is to give our models a smarter dataset to learn from. For example, if we feed only the names of customers along with their shopping history into a machine learning model, it would take the model some time to realize that their behavior falls into two segments. Once we add a Male/Female flag, the model knows from the very onset the clusters it would otherwise have had to learn. This process of creating a smarter dataset from the given data by combining data science skills, intuition and domain knowledge is called "feature engineering".

The importance and application of feature engineering is not limited to scenarios constrained by limited data. It can also provide high value in terms of faster training times and similar accuracy with simpler models. For example, the algorithm that won the popular million-dollar Netflix Prize for recommendation systems was never put into production because it was too complicated to provide recommendations in real time. In such scenarios, feature engineering comes to the rescue.

Next, I want to shed light on feature engineering techniques that are followed in our data science community. Though this is not an exhaustive list, I have tried my best to incorporate the most frequently observed aspects of feature engineering that can complement our everyday data science journey. Some of them have been adopted from online open-source communities such as Kaggle, StackOverflow and ResearchGate; others have come out of my own education and experience in the domain.

2. Feature Engineering techniques

Feature engineering and its applications are diverse and vary from one problem to another. However, we can classify the techniques based on the kind of dataset that we are dealing with. This section has been divided into four parts, each catering to a particular type of data:

  • time-series data
  • numerical data
  • text data
  • Kaggle boosters

2.1. Time-series data

Time-series data is simply data ordered (indexed) by time, for example, the transactions data of a credit-card firm, stock prices in the financial markets, or weather data. While modeling time-series data we often discount the importance of the timestamp itself. The following techniques can be used to extract informative features from the timestamp, depending on the use case.

2.1.1. Extracting the hour, weekday and month

For example, consider online food orders on, say, GrubHub and their corresponding delivery times. When we analyze the orders on an intra-day basis, we are mostly concerned with what will happen in the next hour.

In this case the date part of the timestamp does not convey a lot of information. The hour of the day is a better feature that helps the machine learn a direct relation. Additionally, instead of the date, the weekday conveys more useful information, such as whether the order fell on a weekday or a weekend. A sketch of the new features follows.
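
As a minimal sketch (assuming a pandas DataFrame named orders with a datetime column order_time; both names and values are hypothetical), the extraction could look like this:

    import pandas as pd

    # Hypothetical orders data with a raw timestamp column
    orders = pd.DataFrame({
        "order_time": pd.to_datetime(["2019-03-01 11:45",
                                      "2019-03-02 19:10",
                                      "2019-03-03 08:30"]),
        "delivery_minutes": [32, 48, 25],
    })

    # Derive simpler, more learnable features from the raw timestamp
    orders["hour"] = orders["order_time"].dt.hour          # 0-23
    orders["weekday"] = orders["order_time"].dt.dayofweek  # 0 = Monday, 6 = Sunday
    orders["month"] = orders["order_time"].dt.month        # 1-12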

Similarly, depending on the granularity of the analysis we need to do, we can extract relevant features like the ones above from the timestamp.

2.1.2. Using important days

Intuitively, if we think of sales on Amazon, we can say that sales are significantly higher on weekends than on weekdays. Thus, a 1/0 flag indicating whether a row in the dataset corresponds to a weekday or a weekend can convey this information to our machine learning models.

Further, if we plot the daily sales and see them peaking around national holidays such as Christmas and Thanksgiving, we can add a new column containing a 1/0 flag for such important days.

An even more informative feature would be the number of days to the nearest upcoming holiday. A dataset with such engineered features could look like the sketch below:
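
A rough sketch of these flags, assuming a daily sales DataFrame with a date column and a hand-maintained list of holidays (all names and numbers are hypothetical):

    import pandas as pd

    sales = pd.DataFrame({
        "date": pd.date_range("2019-12-20", periods=7, freq="D"),
        "units_sold": [120, 150, 180, 260, 410, 90, 75],
    })
    holidays = pd.to_datetime(["2019-12-25", "2020-01-01"])  # Christmas, New Year

    # 1/0 flags for weekends and for the holidays themselves
    sales["is_weekend"] = (sales["date"].dt.dayofweek >= 5).astype(int)
    sales["is_holiday"] = sales["date"].isin(holidays).astype(int)

    # Days until the nearest upcoming holiday
    def days_to_next_holiday(day):
        upcoming = holidays[holidays >= day]
        return (upcoming.min() - day).days if len(upcoming) else None

    sales["days_to_holiday"] = sales["date"].apply(days_to_next_holiday)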

2.1.3. Other thoughts

Here are a few miscellaneous thoughts on how we can use the timestamps in our data to our advantage:

  • Time zone: The time zone in the timestamp can give information about the geography if it is not already present in the dataset.

  • Binning: We can further bin the hour of the day into 4 categories, namely morning, afternoon, evening and night. Or, if we are dealing with data across months, we can add a seasons column containing winter, spring, summer and autumn (see the sketch after this list).
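
A minimal sketch of both ideas, assuming a DataFrame that already has the hour (0-23) and month (1-12) columns engineered earlier:

    import pandas as pd

    df = pd.DataFrame({"hour": [3, 9, 14, 21], "month": [1, 4, 7, 11]})

    # Bin the hour of day into four coarse, left-closed intervals
    df["part_of_day"] = pd.cut(df["hour"],
                               bins=[0, 6, 12, 18, 24],
                               labels=["night", "morning", "afternoon", "evening"],
                               right=False)

    # Map months to (northern-hemisphere) seasons
    season_map = {12: "winter", 1: "winter", 2: "winter",
                  3: "spring", 4: "spring", 5: "spring",
                  6: "summer", 7: "summer", 8: "summer",
                  9: "autumn", 10: "autumn", 11: "autumn"}
    df["season"] = df["month"].map(season_map)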

The techniques we discussed above also provide insights during descriptive data analysis. For example, if we add the seasons feature to transaction-level data, rolling the data up on the seasons column lets us directly compare sales across the 4 seasons. Next, we will discuss techniques for numerical data, which is by far the most common datatype.

2.2. Numerical Data

Anything containing real numbers (both integers and decimals) is numerical data. Machines can only understand and interpret numbers, and hence numerical data is essential for any machine learning model.

However, most models have underlying assumptions about the data under which they work best. Being able to identify these assumptions and transform the data accordingly is key to getting optimal results from machine learning algorithms.

For example, linear regression assumes that the residuals are normally distributed and that there is a linear relationship between the variables. These assumptions are often easier to satisfy when the output variable Y is close to normally distributed.

 In the book Applied Predictive Modeling by Kuhn and Johnson, the authors claim that many machine learning algorithms work better when the features have symmetrical or unimodal distributions.

In this section I will discuss a few statistical transformations followed by a few more case-specific engineered features that involve techniques like binning the data. Both techniques provide strategies to deal with skewed datasets which can adversely affect model performance.

2.2.1. Statistical transformations

We frequently observe left- or right-skewed distributions in real data. Such skewed distributions can often be brought much closer to a normal distribution by taking the logarithm of the data. This is called a log-transformation of the data.

If the log-transformation does not get the data close to a normal distribution, try a power transformation such as the Box-Cox transformation. Power transformations are mostly required when the data is excessively skewed. The other way out is to remove the outliers that are making the data very skewed and then apply a log-transform.

A popular application of the log-transform can be seen in the Titanic Kaggle problem, where the fares are highly right-skewed. A log-transform of the fares makes the distribution a lot closer to a normal distribution.
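
A quick, hedged sketch of both transformations, assuming a Titanic-style DataFrame with a right-skewed Fare column (the values below are only illustrative):

    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.DataFrame({"Fare": [7.25, 8.05, 26.0, 71.28, 512.33]})

    # log1p = log(1 + x), which also handles zero fares gracefully
    df["Fare_log"] = np.log1p(df["Fare"])

    # Box-Cox needs strictly positive values; it also estimates the power lambda
    df["Fare_boxcox"], fitted_lambda = stats.boxcox(df["Fare"] + 1)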

NOTE: The interpretability of the models changes when we apply statistical transformations, since the coefficients no longer tell us about the original features but about the transformed features. We should take that into account when making inferences from the results.

2.2.2. Binning

Binning is a way to convert continuous numerical variables into discrete variables by categorizing them based on the range of values into which they fall. A simple example would be to bucket Twitter handles by their number of tweets into the 4 quartiles: top 25%, 25-50%, 50-75% and bottom 25%.

Two popular ways of binning data, along with normalization, a closely related scaling technique, are discussed below:

  • Quantile-based binning

The technique discussed above is called quantile-based binning, or adaptive binning. The technique is robust to the range of values within the numerical column and ensures roughly equal representation in each of the categories created. Quantile binning is helpful when dealing with skewed datasets like the ones we discussed in the previous section, as in the sketch below.
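
A short sketch of quantile binning with pandas, using a hypothetical, heavily skewed tweet_count column:

    import pandas as pd

    tweets = pd.DataFrame({
        "handle": ["a", "b", "c", "d", "e", "f", "g", "h"],
        "tweet_count": [3, 5, 8, 13, 40, 90, 450, 3000],
    })

    # Four quantile-based bins: each bin gets roughly the same number of handles
    tweets["tweet_quartile"] = pd.qcut(tweets["tweet_count"], q=4,
                                       labels=["bottom_25", "25_50", "50_75", "top_25"])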

  • Fixed-width binning

In fixed-width binning we manually specify the bins, which usually span equal-sized ranges.

There are several considerations when deciding what bin size to choose. Very wide bins lead to very few categories, causing excessive loss of information and hence hurting model performance. Very narrow bins, on the other hand, lead to many categories, and the curse of dimensionality creeps in once we one-hot encode the dataset for feeding into the models. A sketch follows.
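
A minimal fixed-width binning sketch on a hypothetical age column, with manually chosen 20-year-wide bins:

    import pandas as pd

    people = pd.DataFrame({"age": [4, 17, 23, 38, 41, 66, 79]})

    # Equal-width bins; the edges and labels are an arbitrary illustrative choice
    people["age_bin"] = pd.cut(people["age"],
                               bins=[0, 20, 40, 60, 80],
                               labels=["0-20", "21-40", "41-60", "61-80"])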

  • Normalization

Algorithms that rely on distances or magnitudes, such as linear regression, k-means clustering and kNN, are negatively impacted by features on very different scales and by outliers. To reduce these effects we need to bring the features onto the same scale, or almost the same scale. This can be accomplished by the widely popular technique called normalization.

There are several ways in which we can normalize numerical data. A few of them are discussed here in brief:

Standard score (z-score) = (X - mean(X)) / sd(X)

Min-max scaling = (X - min(X)) / (max(X) - min(X))
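
Both formulas are one-liners in pandas; scikit-learn's StandardScaler and MinMaxScaler offer the same transformations behind a fit/transform interface. A rough sketch on a hypothetical column x:

    import pandas as pd

    df = pd.DataFrame({"x": [2.0, 4.0, 6.0, 8.0, 100.0]})

    # Standard score (z-score): mean 0, standard deviation 1
    df["x_standardized"] = (df["x"] - df["x"].mean()) / df["x"].std()

    # Min-max scaling: squeezes values into the [0, 1] range
    df["x_minmax"] = (df["x"] - df["x"].min()) / (df["x"].max() - df["x"].min())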

2.3. Text data

Text data has been one of the most challenging forms of data for our community. Feature engineering plays an important role in converting text into numeric features that can be used in modeling. These techniques are mostly derived from natural language processing. A detailed exploration of feature engineering for text data is left for a future article.

For now, this article by Dipanjan Sarkar is a great refresher on the prevalent techniques for feature engineering dealing with text data.

2.4. Kaggle boosters

This section covers several techniques which are problem-specific and might come in handy.

2.4.1. Combining less popular categories into "Others"

We often find that the categories of a categorical column are not equally represented. A few categories have very few rows compared to the others. This happens for multiple reasons: the category was newly created and thus has only a few instances, or there is simply a large number of categories (e.g., product categories on Amazon).

Training a model on such sparsely represented categories is not good practice, because the data does not contain enough instances for the model to learn what the category actually represents. Hence, we can bin all the less popular categories within each column into a single "Others" category, as in the sketch below.
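
A hedged sketch: any category covering less than, say, 5% of the rows gets lumped into "Others" (the column name and the threshold are arbitrary choices for illustration):

    import pandas as pd

    df = pd.DataFrame({"product_category": ["books"] * 50 + ["electronics"] * 45 +
                                           ["garden"] * 3 + ["stationery"] * 2})

    freq = df["product_category"].value_counts(normalize=True)
    rare = freq[freq < 0.05].index  # categories below the 5% threshold

    # Keep popular categories, replace the rare ones with "Others"
    df["product_category_clean"] = df["product_category"].where(
        ~df["product_category"].isin(rare), other="Others")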

2.4.2. Extracting substrings from IDs

Sometimes an ID column contains important information about the categories in which an organization stores its data. For example, in the popular Titanic problem on Kaggle, the cabin numbers of the passengers are of the form C123, B099 and A982: the first letter represents the deck, followed by digits representing the room number.

Extracting the first letter from these ID strings makes a very useful feature, as it classifies the passengers by their location on the ship, as in the sketch below.
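
A minimal sketch on a Titanic-style Cabin column (the exact values are illustrative):

    import pandas as pd

    df = pd.DataFrame({"Cabin": ["C123", "B099", "A982", None]})

    # The leading letter encodes the deck; missing cabins stay missing
    df["Deck"] = df["Cabin"].str[0]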

2.4.3. Frequency counts

Very often we have data where an ID corresponds to multiple rows, for example a transactions dataset. There will be multiple rows for each User_ID, though each transaction has a unique transaction ID. Now suppose we have to predict the transaction amounts for each User_ID. In this case we can create a feature that stores the number of times a particular User_ID has transacted on the portal, as in the sketch below.
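
A quick sketch, assuming a transactions DataFrame with User_ID and transaction_id columns (all names and values are hypothetical):

    import pandas as pd

    tx = pd.DataFrame({"transaction_id": [1, 2, 3, 4, 5],
                       "User_ID": ["u1", "u1", "u2", "u3", "u1"],
                       "amount": [20, 35, 15, 50, 10]})

    # Number of transactions per user, repeated on every row of that user
    tx["user_txn_count"] = tx.groupby("User_ID")["transaction_id"].transform("count")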

In my experience, there is no single concrete resource for learning feature engineering; we can only get better with practice. That said, there are a few articles that I would suggest readers explore to learn more about feature engineering.

3. Suggested Readings

4. Summary

Thank you for reading this far. I hope by now you appreciate the importance of feature engineering and its application in our everyday data science work, and, more importantly, that you now have an overview of a few popular techniques used in this space. I am currently working on the next couple of articles in this series, focused on data cleaning and a further in-depth look at feature engineering.

Please feel free to reach out to me at [email protected] for any feedback and comments. If you want access to a better-formatted PDF copy of this article, kindly drop me an email.
