
Introduction to Group Feature Selection

If you love data, you must love features.

Features in a Data Science project are the steps to your ladder, the water to your ship, the air to your airplane, the… well, you get it.

Even though they are the bits of truth our model lives through, they can't all exist simultaneously, or at least they shouldn't.

Some features are irrelevant or poorly engineered, and many introduce multicollinearity that hurts our model's accuracy and speed.

With the right features, you are 95% of the way to a model that achieves better accuracy, efficiency, and interpretability, and you get there faster.

In this article, I will share my experience with the existing feature selection methods and what made me engineer a whole (I think) new technique.

What do Data people usually do to select features?

  • Statistical (filter) methods.
  • Recursive/iterative selection methods.
  • Model-based feature importance.

Those are excellent methods, and I use them separately or together all the time, but they introduce some technical issues:

  1. Statistical methods are too simple: they can easily miss non-trivial connections between features that a model could exploit, so they may remove features that would actually help.
  2. Recursive/iterative methods are too slow: if you have 200 features and it takes 30 seconds to train a model, then with 5-fold cross-validation one feature-by-feature pass costs 30 (seconds) * 200 (features) * 5 (folds) = 30,000 seconds, which is over 8 hours.
  3. Feature importance relies on the model's ability to tell essential features from unimportant ones, which is difficult when you test a large number of features at once because the model suffers from multicollinearity.

To address these limitations, I recently started using a modified version of my favorite, the recursive methods: a modification that speeds the process up while protecting the prime features.

Introducing… Group Feature Selection.

Group Feature Selection is a method I used in my last Kaggle competition, Parkinson's Freezing of Gait Prediction (my final position: 205/1379).


The main algorithm

  1. Define a root group of features and get your initial score.
  2. Define groups of features that have a similar meaning but introduce different scales.
  3. Decide on a minimal score improvement per feature (denoted δ).
  4. Add one group at a time: if the group scores better than your best score + (number of features added * δ), the group is good enough for the next test.
  5. Take each feature of a group that passed the test in step 4 and perform downward (backward) feature elimination; this prevents multicollinearity within the group.
  6. If there are more groups, go back to step 4 (a code sketch of the full loop follows this list).

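To make these steps concrete, here is a minimal Python sketch. It is an illustration rather than the exact code from the competition: the scikit-learn model (RandomForestRegressor), the 5-fold CV scorer, and the names root_features, groups and delta are all my own assumptions.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_score(model, X, y, features):
    # The score of a feature subset is its mean 5-fold CV score.
    return cross_val_score(model, X[list(features)], y, cv=5).mean()


def group_feature_selection(X, y, root_features, groups, delta, model=None):
    model = model or RandomForestRegressor(random_state=0)

    selected = list(root_features)                  # step 1: root group
    best_score = cv_score(model, X, y, selected)    # initial score

    for group in groups:                            # step 6: one group at a time
        new_feats = [f for f in group if f not in selected]
        candidate = selected + new_feats
        score = cv_score(model, X, y, candidate)

        # Step 4: the whole group must beat best_score + (features added * delta).
        if not new_feats or score < best_score + len(new_feats) * delta:
            continue                                # reject the group in a single CV loop

        selected, best_score = candidate, score

        # Step 5: downward elimination inside the accepted group.
        for feature in new_feats:
            trimmed = [f for f in selected if f != feature]
            trimmed_score = cv_score(model, X, y, trimmed)
            if trimmed_score >= best_score - delta:
                # The feature adds less than delta on its own, so drop it.
                selected, best_score = trimmed, trimmed_score

    return selected, best_score

Here, groups is a list of feature-name lists (step 2), delta is the minimal per-feature improvement (step 3), and X is assumed to be a pandas DataFrame so that X[features] selects columns by name.
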
Why is this method accurate?

  • Every group is judged by an actual cross-validated score against a per-feature threshold (δ), so a group only stays when it genuinely improves the model.
  • The downward elimination inside each accepted group (step 5) drops the redundant features, so the strong ones survive without the multicollinearity that came with them.

Why is this method fast?

  • The reason is the groups: if a group fails to improve the score by the given threshold, it is discarded as a whole.
  • If a group has (for example) 8 features, we can discard all of them with one CV loop instead of 8, as illustrated below.
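
As a rough back-of-the-envelope comparison, reusing the hypothetical numbers from the limitations section (200 features, 30 seconds per model fit, 5-fold CV) and assuming groups of 8 features:

train_s, folds = 30, 5
n_features, group_size = 200, 8

cv_loop_s = train_s * folds                                    # 150 s for one cross-validated evaluation
per_feature_pass_h = cv_loop_s * n_features / 3600             # ~8.3 h: one CV loop per feature
group_pass_h = cv_loop_s * (n_features // group_size) / 3600   # ~1.0 h: one CV loop per group of 8
# Accepted groups still pay for their within-group elimination loops on top of this,
# but every rejected group costs a single CV loop.

print(f"per-feature pass: {per_feature_pass_h:.1f} h, per-group pass: {group_pass_h:.1f} h")

In the worst case for the grouped approach (every group rejected), the cost is roughly one group-size-th of touching each feature individually.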

When should I use it?

  • I find this method very beneficial when you want to test a large number of features you engineered. It is worth mentioning, though, that this is a very user-defined process; done poorly, it can miss the purpose altogether.


What are the main user-defined points, and how should I define them?

  • Define the root group (step 1) with caution! Choose basic features for the problem you are trying to solve (basic != the best features).
  • Try different group definitions. One option is to take all features that were generated from the same basic feature and call them a group.
  • Another option is to build the groups by scale/method (a small grouping sketch follows this list).

Some examples of the latter:

- All features that take the mean/median/std/min/max of other features.

- All features that take the lag (time series) of size X.

- All features that are model generated.

  • Avoid defining groups that are too small (or you will merely be doing an upward, feature-by-feature selection).
  • Avoid feature groups that are too big, or you will definitely miss valuable features that could not make a big enough impact to pass the test (step 4).
  • Experiment with different minimal score improvement values (δ).
  • A good starting point is the mean per-feature improvement you saw in prior experiments.
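
If your engineered columns follow a naming convention, the by-scale/method grouping can even be built automatically. This is a hypothetical sketch: the suffix convention and the column names below are assumptions on my side, not part of the original write-up.

from collections import defaultdict
import re


def group_by_method(columns):
    # Group engineered features by the method encoded in their suffix
    # (mean/median/std/min/max aggregations, lags of size X, and so on).
    groups = defaultdict(list)
    for col in columns:
        match = re.search(r"_(mean|median|std|min|max|lag\d+)$", col)
        key = match.group(1) if match else "other"
        groups[key].append(col)
    return dict(groups)


columns = ["acc_x_mean", "acc_x_std", "acc_x_lag5", "acc_y_mean", "acc_y_lag5"]
print(group_by_method(columns))
# {'mean': ['acc_x_mean', 'acc_y_mean'], 'std': ['acc_x_std'], 'lag5': ['acc_x_lag5', 'acc_y_lag5']}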

When shouldn’t you use it?

  • If you only introduce a small number of features at each iteration of your Data Science cycle (EDA -> feature engineering -> feature selection -> modeling -> testing), you should use a different method.
  • If you don't want to take on the risk of a user-defined process.

That is all for now!

I hope you will benefit from this approach in your next project, and may the features be with you.
