Why is Feature Engineering so Important?

Many of you familiar with ML have probably heard something along the lines of: "Feature Engineering is one of the most important steps in Machine Learning." I know when I heard this for the first time, I struggled to understand what Feature Engineering actually entailed.

This article will hopefully make the term clearer by giving a concrete definition and walking through a practical example.


What is Feature Engineering?

A good way to think about the broad term of feature engineering is as a combination of three smaller umbrella operations:


1) Feature Selection: This is the process of selecting a subset of your existing features for use in an ML model, based on how important they are for predicting your target label. Some techniques used for Feature Selection are Pearson's r correlation (only appropriate when the relationship is roughly linear), the chi-square test (for categorical data), and the built-in feature importance attributes you can find on most tree-based models like Random Forest (a quick sketch of two of these follows the figure below).

A visual depiction of feature selection
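
As a minimal sketch of two of these techniques, the snippet below assumes a House Prices-style training file named train.csv with numeric features and a SalePrice target (both assumptions borrowed from the example later in this article); it is an illustration, not a complete selection workflow.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical load of a training set that has a SalePrice target column
df = pd.read_csv("train.csv")
X = df.select_dtypes("number").drop(columns=["SalePrice"]).fillna(0)
y = df["SalePrice"]

# Pearson's r: linear correlation of each numeric feature with the target
print(X.corrwith(y).abs().sort_values(ascending=False).head(10))

# Built-in importances from a tree-based model
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```

Either ranking gives you a shortlist of candidate features to keep; in practice you would validate the shortlist against held-out data rather than trusting a single score.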

2) Feature Creation: This is the process of creating entirely new features, either from a single existing feature or by combining several. For example, suppose you are working with stock data for NVIDIA that includes an average daily stock price. One feature you could add is the average weekly stock price, which could help your model recognize patterns on a larger time scale (a short sketch of this follows below). In my opinion, Feature Creation is the hardest concept in the entire ML pipeline: you need to be quite creative to come up with new features that actually help your model, and you need strong domain knowledge to generate those ideas in the first place. The process can also be iterative and tedious, since there is a lot of trial and error in discovering which new features improve model performance and which ones make it worse. My example later in this article will give a good idea of the thought process needed to come up with good features.
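
Here is a minimal sketch of that weekly-average idea using pandas. The prices and column names are made up for illustration; a real pipeline would pull them from an actual data source.

```python
import pandas as pd

# Hypothetical daily average prices indexed by date
stock = pd.DataFrame(
    {"daily_avg_price": [480.1, 485.3, 490.0, 487.6, 492.2, 495.0, 498.4]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Trailing 7-day mean: each row sees the average price over the past week
stock["weekly_avg_price"] = (
    stock["daily_avg_price"].rolling(window=7, min_periods=1).mean()
)
print(stock)
```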


3) Feature Transformation: This is the subsection of Feature Engineering that you are most likely already familiar with and use regularly in ML tasks. You transform existing features to make the data cleaner and more compatible with your ML model. This involves things like filling null values with the mean or most frequent value, performing One-Hot Encoding, and scaling your numerical features between 0 and 1. A minimal sketch of these transformations follows.
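
The sketch below wires those three transformations into a single scikit-learn preprocessor. The tiny DataFrame and its column names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny illustrative frame; column names are assumptions, not prescriptions
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, None],
    "Neighborhood": ["NoRidge", None, "OldTown"],
})

preprocess = ColumnTransformer([
    # Numeric: fill nulls with the mean, then scale to [0, 1]
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", MinMaxScaler()),
    ]), ["LotArea"]),
    # Categorical: fill nulls with the most frequent value, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Neighborhood"]),
])

print(preprocess.fit_transform(df))
```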


Example of Feature Creation:

I recently revisited the Kaggle House Prices - Advanced Regression Techniques competition in an attempt to specifically improve my feature engineering skills. This is a popular Kaggle competition for individuals with intermediate ML knowledge. The goal is to predict sale prices for houses in unseen test data based on many factors about each house and its surroundings, such as proximity to railroads and the neighborhood the house is in. My example relates specifically to the neighborhood feature.

In the Kaggle dataset, there are about 25 different neighborhoods within the city of Ames, Iowa (where all of the houses are located). When I first saw this feature, my thought was something along the lines of: "This feature has way too many values, and that kinda sucks." A categorical feature with that many values significantly increases dimensionality once encoded and makes it harder for the model to find patterns within that feature. The best transformation I could think of was a new feature that still captured the general location of each house within Ames without so many distinct values. After a bit of thinking, I realized the general location could be captured through the cardinal directions: North, South, East, West, and Central. Below is a map of the Ames, Iowa neighborhoods that I used to help me with this.

Ames Map

For each neighborhood present in my Kaggle data, I found the corresponding neighborhood on the map above and visually judged which region of Ames it was in: North, West, South, East, or Central. I also added an additional category named ISU for Iowa State University, since university housing prices would likely differ from those in the other regions. To build the new feature, I created a dictionary mapping the Kaggle neighborhoods to these general regions and then applied that mapping to the existing column (a sketch follows below). The resulting feature improved my score from 0.15690 to 0.15542 (the competition metric is RMSE on log prices, so lower is better), which is not a huge numerical difference but is a meaningful improvement from just one additional feature. The major takeaway from this example: try to be as creative as possible when coming up with ideas for new features, and do not shy away from doing manual research of your own, like finding the map above!
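
The snippet below sketches the dictionary-mapping approach. The region assignments shown are illustrative examples, not the exact judgments I made from the map, and the real dictionary covered all ~25 neighborhoods.

```python
import pandas as pd

# Illustrative mapping from dataset neighborhood codes to coarse regions;
# the full dictionary covered every neighborhood, judged by eye from the map
neighborhood_to_region = {
    "NoRidge": "North",
    "NridgHt": "North",
    "Edwards": "West",
    "Mitchel": "South",
    "OldTown": "Central",
    "CollgCr": "ISU",
    # ...remaining neighborhoods mapped the same way
}

# Apply the mapping to create the new lower-cardinality feature
df = pd.DataFrame({"Neighborhood": ["NoRidge", "OldTown", "CollgCr"]})
df["Region"] = df["Neighborhood"].map(neighborhood_to_region)
print(df)
```

One nice property of this approach is that the new Region column one-hot encodes into just six dimensions instead of ~25, which is exactly the dimensionality reduction the feature was designed for.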

In conclusion, Feature Engineering is a very broad term, but you can distill it into the critical elements of Feature Selection, Feature Creation, and Feature Transformation. Creativity and domain knowledge are the most important ingredients in the process. I hope this article helps you out with your ML endeavors!

