
Introduction to Group Feature Selection

If you love data, you must love features.

Features in a Data Science project are the steps to your ladder, the water to your ship, the air to your airplane, the… well, you get it.

Even though they are the bits of truth our model lives through, they can't all exist simultaneously, or at least they shouldn't.

Some features are irrelevant or poorly engineered, and many introduce multicollinearity that hurts our model's accuracy and speed.

With the right features, you are 95% of the way to a model that achieves better accuracy, efficiency, and interpretability, and you get there faster.

In this article, I will share my experience with the existing feature selection methods and what made me engineer a whole (I think) new technique.

What do Data people usually do to select features?

  • Statistical (filter) methods.
  • Recursive/iterative selection methods.
  • Model-based feature importance.

Those are excellent methods, and I use them separately or together all the time, but they introduce some technical issues:

  1. Statistical methods are too simple: they can easily miss non-trivial connections between features that a model could exploit, so they may remove features that would actually help.
  2. Recursive/iterative methods are too slow: if you have 200 features and it takes 30 seconds to train a model, then with 5-fold cross-validation one feature-by-feature pass costs 30 (seconds) * 200 (features) * 5 (folds) = 30,000 seconds, which is over 8 hours.
  3. Feature importance relies on the model's ability to tell essential features from unimportant ones, which is difficult when you test a large number of features at once because the model suffers from multicollinearity.

To address these limitations, I recently started using a modified version of my favorite, the recursive methods: a modification that speeds the process up while protecting the prime features.

Introducing… Group Feature Selection.

Group Feature Selection is a method I used in my last Kaggle competition, Parkinson's Freezing of Gait Prediction (my final position: 205/1379).


The main algorithm

  1. Define a root group of features and get your initial score.
  2. Define groups of features that have a similar meaning but introduce different scales.
  3. Decide on a minimal score improvement per feature (denoted δ).
  4. Add one group at a time: if the group scores better than your best score + (number of features added * δ), the group is good enough for the next test.
  5. Take each feature of a group that passed the test in step 4 and perform downward (backward) feature elimination; this prevents multicollinearity within the group.
  6. If there are more groups, go back to step 4 (a code sketch of the full loop follows this list).

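To make these steps concrete, here is a minimal Python sketch. It is an illustration rather than the exact code from the competition: the scikit-learn model (RandomForestRegressor), the 5-fold CV scorer, and the names root_features, groups and delta are all my own assumptions.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_score(model, X, y, features):
    # The score of a feature subset is its mean 5-fold CV score.
    return cross_val_score(model, X[list(features)], y, cv=5).mean()


def group_feature_selection(X, y, root_features, groups, delta, model=None):
    model = model or RandomForestRegressor(random_state=0)

    selected = list(root_features)                  # step 1: root group
    best_score = cv_score(model, X, y, selected)    # initial score

    for group in groups:                            # step 6: one group at a time
        new_feats = [f for f in group if f not in selected]
        candidate = selected + new_feats
        score = cv_score(model, X, y, candidate)

        # Step 4: the whole group must beat best_score + (features added * delta).
        if not new_feats or score < best_score + len(new_feats) * delta:
            continue                                # reject the group in a single CV loop

        selected, best_score = candidate, score

        # Step 5: downward elimination inside the accepted group.
        for feature in new_feats:
            trimmed = [f for f in selected if f != feature]
            trimmed_score = cv_score(model, X, y, trimmed)
            if trimmed_score >= best_score - delta:
                # The feature adds less than delta on its own, so drop it.
                selected, best_score = trimmed, trimmed_score

    return selected, best_score

Here, groups is a list of feature-name lists (step 2), delta is the minimal per-feature improvement (step 3), and X is assumed to be a pandas DataFrame so that X[features] selects columns by name.
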
Why is this method accurate?

  • Every group is judged by an actual cross-validated score against a per-feature threshold (δ), so a group only stays when it genuinely improves the model.
  • The downward elimination inside each accepted group (step 5) drops the redundant features, so the strong ones survive without the multicollinearity that came with them.

Why is this method fast?

  • The reason is the groups: if a group fails to improve the score by the given threshold, it is discarded as a whole.
  • If a group has (for example) 8 features, we can discard all of them with one CV loop instead of 8, as illustrated below.
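
As a rough back-of-the-envelope comparison, reusing the hypothetical numbers from the limitations section (200 features, 30 seconds per model fit, 5-fold CV) and assuming groups of 8 features:

train_s, folds = 30, 5
n_features, group_size = 200, 8

cv_loop_s = train_s * folds                                    # 150 s for one cross-validated evaluation
per_feature_pass_h = cv_loop_s * n_features / 3600             # ~8.3 h: one CV loop per feature
group_pass_h = cv_loop_s * (n_features // group_size) / 3600   # ~1.0 h: one CV loop per group of 8
# Accepted groups still pay for their within-group elimination loops on top of this,
# but every rejected group costs a single CV loop.

print(f"per-feature pass: {per_feature_pass_h:.1f} h, per-group pass: {group_pass_h:.1f} h")

In the worst case for the grouped approach (every group rejected), the cost is roughly one group-size-th of touching each feature individually.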

When should I use it?

  • I find this method very beneficial when you want to test a large number of features you engineered. It is worth mentioning, though, that this is a very user-defined process; done poorly, it can miss the purpose altogether.


What are the main user-defined points, and how should I define them?

  • Define the root group (step 1) with caution! Choose basic features for the problem you are trying to solve (basic != the best features).
  • Try different group definitions. One option is to take all features that were generated from the same basic feature and call them a group.
  • Another option is to build the groups by scale/method (a small grouping sketch follows this list).

Some examples of the latter:

- All features that take the mean/median/std/min/max of other features.

- All features that take the lag (time series) of size X.

- All features that are model generated.

  • Avoid defining groups that are too small (or you will merely be doing an upward, feature-by-feature selection).
  • Avoid feature groups that are too big, or you will definitely miss valuable features that could not make a big enough impact to pass the test (step 4).
  • Experiment with different minimal score improvement values (δ).
  • A good starting point is the mean per-feature improvement you saw in prior experiments.
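
If your engineered columns follow a naming convention, the by-scale/method grouping can even be built automatically. This is a hypothetical sketch: the suffix convention and the column names below are assumptions on my side, not part of the original write-up.

from collections import defaultdict
import re


def group_by_method(columns):
    # Group engineered features by the method encoded in their suffix
    # (mean/median/std/min/max aggregations, lags of size X, and so on).
    groups = defaultdict(list)
    for col in columns:
        match = re.search(r"_(mean|median|std|min|max|lag\d+)$", col)
        key = match.group(1) if match else "other"
        groups[key].append(col)
    return dict(groups)


columns = ["acc_x_mean", "acc_x_std", "acc_x_lag5", "acc_y_mean", "acc_y_lag5"]
print(group_by_method(columns))
# {'mean': ['acc_x_mean', 'acc_y_mean'], 'std': ['acc_x_std'], 'lag5': ['acc_x_lag5', 'acc_y_lag5']}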

When shouldn’t you use it?

  • If you only introduce a small number of features at each iteration of your Data Science cycle (EDA -> feature engineering -> feature selection -> modeling -> testing), you should use a different method.
  • If you don't want to take on the risk of a user-defined process.

That is all for now!

I hope you will benefit from this approach in your next project, and may the features be with you.
