5 advanced Scikit-learn features that will transform the way you code


Very few packages have achieved what sklearn has. It is not only that it provides almost all of the commonly used ML algorithms; it is also how it provides them. The core code of sklearn is written in Cython, granting optimized performance. Its API is designed for consistency, readability and extensibility. And on top of the core ML algorithms, sklearn provides additional functionality for creating end-to-end pipelines. If a single adjective could describe this package, it would be “Beautiful”.

If you have ever worked with sklearn, you are probably familiar with the common methods such as fit, predict and transform, and perhaps with a couple of preprocessing methods as well. But the power of this package goes way beyond the commonly used functionalities.

The goal of this article is to highlight some of the powerful yet lesser-known features of sklearn, which will enable you to unleash its maximum potential. You will get a quick glimpse of what those features are and how you can use them. A very short code snippet is provided for each, followed by a reference for more details. The snippets illustrate the functionality and syntax only; they do not represent a complete workflow. And finally, sklearn version 0.22.1 was used.



1. Pipelines

Your models will typically consist of multiple sequential phases, where the output of one phase is the input of the next. For example, a classifier for high-dimensional input would often include normalisation, dimensionality reduction and the classification model itself.

Sklearn’s pipelines provide an elegant wrapper for chaining those sequential steps. When you use pipelines, you do not have to worry about managing the intermediate objects. All you need to do is specify the steps and call a single fit method. When persisting your model, you only need to pickle one object: the pipeline. Using pipelines will improve your code’s readability, decrease bugs and ease the persistence of your trained model.

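A minimal sketch of such a pipeline, using the iris dataset for illustration (the step names are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Normalisation -> dimensionality reduction -> classification, in one object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
    ("model", LogisticRegression()),
])

# A single fit call trains all steps; intermediate outputs are handled internally.
pipe.fit(X, y)
```

Persisting the trained model is then a single `pickle.dump(pipe, f)`.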

Read more



2. Inline target transformers

In some cases, you can strongly benefit from a non-linear transformation of your target before training your model. For example, a log transformation for a heavy-tailed target is usually a very wise step. When using the model to predict new data, you also need to make sure that you invert this transformation for the predictions.

Here is some good news for you: you do not need to use Pandas or Numpy to create those transformations. You can use sklearn to apply target transformations directly, as illustrated below:

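A minimal sketch using TransformedTargetRegressor, assuming a synthetic heavy-tailed target:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
# A heavy-tailed, strictly positive target.
y = np.exp(0.3 * X.ravel() + rng.normal(scale=0.1, size=100))

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting
    inverse_func=np.exp,  # applied to raw predictions
)
model.fit(X, y)
preds = model.predict(X)  # already back on the original scale
```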

The following will automatically happen under the hood: 

  • While training: `regressor.fit(X, func(y))`
  • While predicting: `inverse_func(regressor.predict(X))`

Read more



3. Feature Union

Even with sequential steps, you are not limited to one transformer per step. You can use multiple transformers and concatenate their results in a single step.

In the Pipeline example above, we used a single PCA to transform the normalized data before training. Let’s take an example where we want to use a kernel PCA in addition to the linear PCA. The two PCAs are applied in parallel to the original data, and their results are automatically concatenated.

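A minimal sketch of such a FeatureUnion, applied to the iris features for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

# Linear PCA and kernel PCA run in parallel on the same input.
union = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=2, kernel="rbf")),
])

# The outputs are concatenated column-wise: (n_samples, 2 + 2).
X_combined = union.fit_transform(X)
```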

You can also plug feature unions into a pipeline. And of course, you can also write your own custom transformers.

Read more



4. Chaining models for rolling predictions 

Sometimes you will face a situation where you need to chain multiple models, such that the output of the first model is an input of the second. A very common use case for such chaining is time series modelling: if we need to predict two timesteps, the prediction of y(t+1) will be an input for predicting y(t+2).

With sklearn, you can create that chaining automatically. Your y will not be an array, but rather a matrix that contains the multiple dependent targets. The RegressorChain will automatically include the previous target when fitting the next one. During prediction, the chain will predict the next target based on the predictions of the previous one. All you need to do is use the fit and predict methods as usual.

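A minimal sketch with RegressorChain, using synthetic dependent targets for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
# Two dependent targets, e.g. y(t+1) and y(t+2) in a time-series setting.
y1 = X @ np.array([1.0, -2.0, 0.5])
y2 = 0.8 * y1 + X[:, 0]
Y = np.column_stack([y1, y2])

# The second model receives the first target as an extra input feature.
chain = RegressorChain(LinearRegression(), order=[0, 1])
chain.fit(X, Y)
preds = chain.predict(X)  # targets are predicted sequentially along the chain
```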

Read more


5. Feature importance using permutation

Feature importance is usually one of the most valuable modelling insights that we can obtain and present to the end user. But, depending on the algorithm you are using, it is not always straightforward to compute.

Permuting the features can be used to infer the importance of each feature, regardless of the modelling method. The core idea behind it is very intuitive: a single feature is randomly shuffled, and the decrease in the model score is quantified. The larger the decrease, the more important the feature. Sklearn implements this method, so you can use it out of the box.

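A minimal sketch with permutation_importance (available since sklearn 0.22), using a random forest on iris for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Each feature is shuffled n_repeats times; the score drop is its importance.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # one mean importance per feature
```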

Read more


Now it's your turn: share one other advanced sklearn function in the comments below!

Ahmad Abdulla

PhD in Computer Science | AI & Machine Learning Specialist | Educator & Researcher | Python Programming Expert

4 yr

Thanks?

Amr Helal

Data Science & AI Senior Manager at Sutherland | X Vodafone & Orange | Data Science & AI Instructor

4 yr

Perfect one Deena :)

Dr. Christian Wittrock

Teammanager Data Science bei DKV Mobility

4 yr

Great article Deena, thx for sharing! I didn't know the point about inverse functions :-)... I like learning new things! Maybe worth noting down: if you use pipelines, you can always pass keyword arguments to a specific part of your pipeline using a double underscore together with the name of the pipeline step and the argument. This is especially interesting if you use self-developed models together with base functions in a pipeline. For example, use model__sample_weight in the pipeline's fit to pass values directly to sample_weight.
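A minimal sketch of the double-underscore routing described above (the step names are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()
w = np.ones(10)

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# "model__sample_weight" routes sample_weight to the step named "model" only.
pipe.fit(X, y, model__sample_weight=w)
```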

Marco Mounir

Senior Software Engineer | Hayah Aktar Podcast Host

4 年

Thank you Deena Gergis for sharing this article. I want to share one more function which is sklearn's ColumnTransformer class. ColumnTransformer allows the application of different transformations to column subsets of the input data. It can be very powerful when combined with Pipelines and GridSearchCV.
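A minimal sketch of ColumnTransformer as described (the column indices and transformers are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One numeric column and one categorical column.
X = np.array([[1.0, "a"], [2.0, "b"], [3.0, "a"]], dtype=object)

ct = ColumnTransformer([
    ("num", StandardScaler(), [0]),  # scale the numeric column
    ("cat", OneHotEncoder(), [1]),   # one-hot encode the categorical column
])

# Outputs are concatenated column-wise: 1 scaled + 2 one-hot columns.
X_out = ct.fit_transform(X)
```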
