5 advanced Scikit-learn features that will transform the way you code
Deena Gergis
AI & Data Science Expert @ McKinsey · Improving lives, one AI product at a time
Very few packages have achieved what sklearn has. It is not only that it provides almost all of the commonly used ML algorithms, it is also how it provides them. The core code of sklearn is written in Cython, granting optimized performance. Its API has been designed for consistency, readability and extensibility. And on top of the core ML algorithms, sklearn provides additional functionalities for creating end-to-end pipelines. If there is a single adjective that could describe this package, it should be “Beautiful”.
If you have ever worked with sklearn, you are probably familiar with the common methods such as fit, predict and transform, and perhaps with a couple of the preprocessing transformers as well. But the power of this package goes way beyond the commonly used functionalities.
The goal of this article is to highlight some of the very powerful but lesser-known features of sklearn, so that you can unleash the package's full potential. You will get a quick glimpse of what those features are and how you can use them, each illustrated with a very short code snippet. The purpose of the snippets is to illustrate the functionality and syntax only; they do not represent a complete workflow. And finally, scikit-learn version 0.22.1 was used for the examples.
1. Pipelines
Your models will almost always consist of multiple sequential phases, where the output of one phase is the input of the next. For example, a classifier for high-dimensional input would typically include normalisation, dimensionality reduction and the classification model itself.
Sklearn’s pipelines provide an elegant wrapper for chaining those sequential steps. When you use pipelines, you do not have to worry about managing the intermediate objects; all you need to do is specify the steps and call a single fit method. When persisting your model, you only need to pickle one object: the pipeline. Using pipelines will improve your code’s readability, decrease bugs and ease the persistence of your trained model.
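Here is a minimal sketch of that idea. The choice of StandardScaler, PCA and LogisticRegression is illustrative, and `X_train`, `y_train`, `X_test` are assumed to be defined:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Chain normalisation, dimensionality reduction and classification
# into a single estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression()),
])

pipe.fit(X_train, y_train)     # one fit call runs every step in order
y_pred = pipe.predict(X_test)  # transformations are re-applied automatically
```

Persisting the trained model is then a single pickle of `pipe`.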
2. Inline target transformers
For some cases, you could strongly benefit from a non-linear transformation of your target before training your model. For example, a log transformation for a heavy-tailed target is usually a very wise step. When using the model to predict new data, you also need to make sure that you invert this transformation for the predictions.
Here is some good news for you: you do not need to use Pandas or Numpy to create those transformations. You can use sklearn to apply target transformations directly, as illustrated below:
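A minimal sketch using TransformedTargetRegressor; the LinearRegression estimator and the log1p/expm1 pair are illustrative assumptions:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Wrap the regressor so that y is log-transformed before fitting
# and predictions are automatically back-transformed.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # forward transformation of the target
    inverse_func=np.expm1,  # inverse transformation of the predictions
)

model.fit(X_train, y_train)     # X_train, y_train assumed to be defined
y_pred = model.predict(X_test)  # already on the original scale
```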
The following will automatically happen under the hood:
- While training: `regressor.fit(X, func(y))`
- While predicting: `inverse_func(regressor.predict(X))`
3. Feature Union
Even with the sequential steps, you are not limited to only one transformer per step. You can use multiple transformers and concatenate their results in a single step.
In the Pipeline example above, we used a single PCA to transform the normalized data before training. Let’s take an example where we want to use a kernel PCA in addition to the linear PCA: with a FeatureUnion, both PCAs are applied in parallel to the same input, and their results are automatically concatenated.
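A minimal sketch of that setup, with illustrative component counts and the same assumed training data as before:

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

# Linear PCA and kernel PCA run in parallel on the same input;
# their outputs are concatenated into one feature matrix.
union = FeatureUnion([
    ("linear_pca", PCA(n_components=5)),
    ("kernel_pca", KernelPCA(n_components=5, kernel="rbf")),
])

# The union itself is just another pipeline step.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("features", union),  # outputs 5 + 5 = 10 concatenated components
])

X_features = pipe.fit_transform(X_train)  # X_train assumed to be defined
```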
You can also plug feature unions into a pipeline, as done in the sketch above. And of course, you can also write your own custom transformers.
4. Chaining models for rolling predictions
Sometimes you will face a situation where you need to chain multiple models, such that the output of the first model is the input of the second. A very common use case for such chaining is time series modelling: if we need to predict two timesteps, the prediction of y(t+1) will be an input for predicting y(t+2).
With sklearn, you have the option to create that chaining automatically. Your y will not be a 1-D array, but rather a matrix that contains the multiple dependent targets, and the RegressorChain will automatically include the previous target when fitting the next one. During prediction, the chain will predict the next target based on the predictions of the previous one. All you need to do is use the fit and predict methods as usual.
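A minimal sketch, assuming `y_train` is a matrix of shape (n_samples, 2) whose columns are the targets at t+1 and t+2:

```python
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LinearRegression

# Fit one model per target column; each model additionally receives
# the previous targets (their predictions, at predict time) as inputs.
chain = RegressorChain(LinearRegression(), order=[0, 1])

chain.fit(X_train, y_train)     # column 0 feeds the model for column 1
y_pred = chain.predict(X_test)  # shape (n_samples, 2)
```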
5. Feature importance using permutation
Feature importance is usually one of the most valuable modelling insights that we can extract and present to the end user. But depending on the algorithm that you are using, it is not always straightforward to obtain those importances.
Permuting the features can be used to infer the importance of each feature, regardless of the modelling method. The core idea behind it is very intuitive: a single feature is randomly shuffled, and the resulting decrease in the model score is quantified. The larger the decrease, the more important the feature. Sklearn has this method implemented, so you can use it out of the box.
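A minimal sketch, assuming `model` is an already-fitted estimator and `X_val`, `y_val` are held-out data:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature several times and measure the drop in score.
result = permutation_importance(
    model, X_val, y_val,
    n_repeats=10,
    random_state=42,
)

# Mean score decrease per feature: the larger, the more important.
print(result.importances_mean)
```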
Team Manager Data Science at DKV Mobility · 4y
Great article Deena, thanks for sharing! I didn't know the point about inverse functions :-) I like learning new things! Maybe one more thing worth noting down: if you use pipelines, you can always pass keyword arguments to a specific part of your pipeline using a double underscore together with the name of the pipeline step and the argument. This is especially interesting if you use self-developed models together with base functions in a pipeline. For example, use model__sample_weight in the pipeline fit to pass values directly to the sample weight.
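As a sketch of that tip, assuming a pipeline like the one in section 1 with a step named "clf" whose estimator accepts sample_weight:

```python
import numpy as np

# Route a fit parameter to the "clf" step via the step-name prefix.
weights = np.ones(len(y_train))  # hypothetical sample weights
pipe.fit(X_train, y_train, clf__sample_weight=weights)
```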
Senior Software Engineer | Hayah Aktar Podcast Host · 4y
Thank you Deena Gergis for sharing this article. I want to share one more feature: sklearn's ColumnTransformer class. ColumnTransformer allows the application of different transformations to column subsets of the input data. It can be very powerful when combined with Pipelines and GridSearchCV.