5 advanced Scikit-learn features that will transform the way you code
Deena Gergis
AI & Data Science Expert @ McKinsey · Improving lives, one AI product at a time
Very few packages have achieved what sklearn has. It is not only that it provides almost all of the commonly used ML algorithms, it is also how it provides them. The core code of sklearn is written in Cython, granting optimized performance. Its API has been designed for consistency, readability and extensibility. And on top of the core ML algorithms, sklearn provides additional functionalities for creating end-to-end pipelines. If there is a single adjective that could describe this package, it should be “Beautiful”.
If you have ever worked with sklearn, you are probably familiar with the common methods such as fit, predict and transform, and perhaps with a couple of the preprocessing transformers as well. But the power of this package goes way beyond the commonly used functionalities.
The goal of this article is to highlight some of the very powerful but lesser-known features of sklearn, so that you can unleash the package's full potential. You will get a quick glimpse of what those features are and how you can use them, each illustrated with a very short code snippet. The purpose of the snippets is to illustrate the functionality and syntax only; they do not represent a complete workflow. And finally, scikit-learn version 0.22.1 was used for the examples.
1. Pipelines
Your models will almost always consist of multiple sequential phases, where the output of one phase is the input of the next. For example, a classifier for high-dimensional input would typically include normalisation, dimensionality reduction and the classification model itself.
Sklearn’s pipelines provide an elegant wrapper for chaining those sequential steps. When you use pipelines, you do not have to worry about managing the intermediate objects; all you need to do is specify the steps and call a single fit method. When persisting your model, you only need to pickle one object: the pipeline. Using pipelines will improve your code’s readability, decrease bugs and ease the persistence of your trained model.
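Here is a minimal sketch of that idea. The choice of StandardScaler, PCA and LogisticRegression is illustrative, and `X_train`, `y_train`, `X_test` are assumed to be defined:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Chain normalisation, dimensionality reduction and classification
# into a single estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression()),
])

pipe.fit(X_train, y_train)     # one fit call runs every step in order
y_pred = pipe.predict(X_test)  # transformations are re-applied automatically
```

Persisting the trained model is then a single pickle of `pipe`.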
2. Inline target transformers
For some cases, you could strongly benefit from a non-linear transformation of your target before training your model. For example, a log transformation for a heavy-tailed target is usually a very wise step. When using the model to predict new data, you also need to make sure that you invert this transformation for the predictions.
Here is some good news for you: you do not need to use Pandas or Numpy to create those transformations. You can use sklearn to apply target transformations directly, as illustrated below:
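A minimal sketch using TransformedTargetRegressor; the LinearRegression estimator and the log1p/expm1 pair are illustrative assumptions:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Wrap the regressor so that y is log-transformed before fitting
# and predictions are automatically back-transformed.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # forward transformation of the target
    inverse_func=np.expm1,  # inverse transformation of the predictions
)

model.fit(X_train, y_train)     # X_train, y_train assumed to be defined
y_pred = model.predict(X_test)  # already on the original scale
```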
The following will automatically happen under the hood:
- While training: `regressor.fit(X, func(y))`
- While predicting: `inverse_func(regressor.predict(X))`
3. Feature Union
Even with the sequential steps, you are not limited to only one transformer per step. You can use multiple transformers and concatenate their results in a single step.
In the Pipeline example above, we used a single PCA to transform the normalized data before training. Let’s take an example where we want to use a kernel PCA in addition to the linear PCA: with a FeatureUnion, both PCAs are applied in parallel to the same input, and their results are automatically concatenated.
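A minimal sketch of that setup, with illustrative component counts and the same assumed training data as before:

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

# Linear PCA and kernel PCA run in parallel on the same input;
# their outputs are concatenated into one feature matrix.
union = FeatureUnion([
    ("linear_pca", PCA(n_components=5)),
    ("kernel_pca", KernelPCA(n_components=5, kernel="rbf")),
])

# The union itself is just another pipeline step.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("features", union),  # outputs 5 + 5 = 10 concatenated components
])

X_features = pipe.fit_transform(X_train)  # X_train assumed to be defined
```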
You can also plug feature unions into a pipeline, as done in the sketch above. And of course, you can also write your own custom transformers.
4. Chaining models for rolling predictions
Sometimes you will face a situation where you need to chain multiple models, such that the output of the first model is the input of the second. A very common use case for such chaining is time series modelling: if we need to predict two timesteps, the prediction of y(t+1) will be an input for predicting y(t+2).
With sklearn, you have the option to create that chaining automatically. Your y will not be a 1-D array, but rather a matrix that contains the multiple dependent targets, and the RegressorChain will automatically include the previous target when fitting the next one. During prediction, the chain will predict the next target based on the predictions of the previous one. All you need to do is use the fit and predict methods as usual.
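A minimal sketch, assuming `y_train` is a matrix of shape (n_samples, 2) whose columns are the targets at t+1 and t+2:

```python
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LinearRegression

# Fit one model per target column; each model additionally receives
# the previous targets (their predictions, at predict time) as inputs.
chain = RegressorChain(LinearRegression(), order=[0, 1])

chain.fit(X_train, y_train)     # column 0 feeds the model for column 1
y_pred = chain.predict(X_test)  # shape (n_samples, 2)
```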
5. Feature importance using permutation
Feature importance is usually one of the most valuable modelling insights that we can extract and present to the end user. But depending on the algorithm that you are using, it is not always straightforward to obtain those importances.
Permuting the features can be used to infer the importance of each feature, regardless of the modelling method. The core idea behind it is very intuitive: a single feature is randomly shuffled, and the resulting decrease in the model score is quantified. The larger the decrease, the more important the feature. Sklearn has this method implemented, so you can use it out of the box.
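A minimal sketch, assuming `model` is an already-fitted estimator and `X_val`, `y_val` are held-out data:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature several times and measure the drop in score.
result = permutation_importance(
    model, X_val, y_val,
    n_repeats=10,
    random_state=42,
)

# Mean score decrease per feature: the larger, the more important.
print(result.importances_mean)
```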
Team Manager Data Science at DKV Mobility · 4y
Great article Deena, thanks for sharing! I didn't know the point about inverse functions :-) I like learning new things! Maybe one more thing worth noting down: if you use pipelines, you can always pass keyword arguments to a specific part of your pipeline using a double underscore together with the name of the pipeline step and the argument. This is especially interesting if you use self-developed models together with base functions in a pipeline. For example, use model__sample_weight in the pipeline fit to pass values directly to the sample weight.
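As a sketch of that tip, assuming a pipeline like the one in section 1 with a step named "clf" whose estimator accepts sample_weight:

```python
import numpy as np

# Route a fit parameter to the "clf" step via the step-name prefix.
weights = np.ones(len(y_train))  # hypothetical sample weights
pipe.fit(X_train, y_train, clf__sample_weight=weights)
```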
Senior Software Engineer | Hayah Aktar Podcast Host · 4y
Thank you Deena Gergis for sharing this article. I want to share one more feature: sklearn's ColumnTransformer class. ColumnTransformer allows the application of different transformations to column subsets of the input data. It can be very powerful when combined with Pipelines and GridSearchCV.