Column Transformer and Pipelines in Machine Learning

Introduction:

When starting out or participating in competitions, it may seem convenient to pre-process data in separate, ad hoc stages. In a production machine learning project, however, every new record must go through the same preprocessing before it reaches the model. Rewriting or manually replaying each step every time is time-consuming and error-prone. Pipelines solve this problem: a machine learning pipeline executes all the preprocessing steps in sequence, and a Column Transformer lets us declare the per-column transformations in a single object that is fitted and applied with one call.

Column Transformer:

Column Transformer (ColumnTransformer in scikit-learn) lets us apply different transformations to different columns of our data, for example handling numerical and categorical columns separately. To use it, we pass a list of tuples, where each tuple contains a name, a transformer object, and the column or columns that transformer should be applied to. Columns that are not listed are dropped by default, or passed through unchanged if we set remainder='passthrough'.

To demonstrate the Column Transformer, I use a toy COVID data set.

We will build the following transformations:

  • Missing value imputation
  • Ordinal Encoding
  • One Hot Encoding

Now we will compare, in detail, the code written with and without a Column Transformer:

Figure 1: Importing libraries and data set

Without Column Transformer

Figure 2: Without Column Transformer
Figure 3: Concatenating data
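
To make the manual approach concrete, here is a minimal sketch. The data frame below is a hypothetical stand-in for the toy COVID data set: the column names (age, gender, fever, cough, city) and all values are assumptions made for illustration, not taken from the original data. Each transformer is fitted on its own columns and the pieces are then stitched back together by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [60, 27, 42, 31, 65, 84],
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "fever": [103.0, 100.0, np.nan, 98.0, 101.0, np.nan],
    "cough": ["Mild", "Mild", "Strong", "Mild", "Strong", "Mild"],
    "city": ["Kolkata", "Delhi", "Delhi", "Mumbai", "Kolkata", "Mumbai"],
})

# 1. Missing value imputation on the numeric column.
fever = SimpleImputer(strategy="mean").fit_transform(df[["fever"]])

# 2. Ordinal encoding for a column whose categories have a natural order.
cough = OrdinalEncoder(categories=[["Mild", "Strong"]]).fit_transform(df[["cough"]])

# 3. One hot encoding for nominal columns (first category dropped).
nominal = OneHotEncoder(drop="first").fit_transform(df[["gender", "city"]]).toarray()

# 4. The column that needs no transformation.
age = df[["age"]].to_numpy()

# Concatenate every piece by hand, keeping track of the column order ourselves.
manual = np.concatenate([age, fever, cough, nominal], axis=1)
print(manual.shape)
```

Every transformer here is a separate object, and the final column order depends entirely on the order used in the concatenation.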

With Column Transformer

Figure 4: With Column Transformer
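
The same three transformations can be declared in one ColumnTransformer. This is a sketch under the same assumptions as before: the data frame and its column names are hypothetical stand-ins for the toy COVID data set, and the step names (impute_fever and so on) are arbitrary labels chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [60, 27, 42, 31, 65, 84],
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "fever": [103.0, 100.0, np.nan, 98.0, 101.0, np.nan],
    "cough": ["Mild", "Mild", "Strong", "Mild", "Strong", "Mild"],
    "city": ["Kolkata", "Delhi", "Delhi", "Mumbai", "Kolkata", "Mumbai"],
})

# Each tuple is (name, transformer, columns).
transformer = ColumnTransformer(
    transformers=[
        ("impute_fever", SimpleImputer(strategy="mean"), ["fever"]),
        ("encode_cough", OrdinalEncoder(categories=[["Mild", "Strong"]]), ["cough"]),
        ("onehot_nominal", OneHotEncoder(drop="first"), ["gender", "city"]),
    ],
    remainder="passthrough",  # keep "age" unchanged instead of dropping it
    sparse_threshold=0,       # always return a dense array
)

# One call fits every transformer and concatenates the results.
transformed = transformer.fit_transform(df)
print(transformed.shape)
```

The output columns follow the order of the transformers list, with the passthrough columns appended at the end, so no manual concatenation is needed.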

Machine Learning Pipelines:

A machine learning pipeline is a series of connected steps in which the output of each step is passed as input to the next. This is similar to a neural network, where the output of one layer becomes the input of the next layer. Just as a physical pipeline carries water from one place to another, a machine learning pipeline carries data through each step until the final output is produced. Pipelines also shorten production code, because the whole sequence is fitted, saved, and applied as a single object.
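
The chaining idea can be sketched with scikit-learn's Pipeline. The tiny numeric array, the choice of steps, and the step names below are assumptions made for illustration; any sequence of transformers followed by a final estimator would work the same way.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with one missing value.
X = np.array([
    [1.0, 20.0], [2.0, np.nan], [3.0, 40.0],
    [4.0, 50.0], [5.0, 60.0], [6.0, 70.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# The output of each step is the input of the next one.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

pipe.fit(X, y)                 # fits every step in order
predictions = pipe.predict(X)  # replays the same steps, then predicts
print(list(pipe.named_steps))
```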

Demonstration of Machine Learning Pipelines:

  1. Without using machine learning pipelines:

Figure 5: Importing libraries and data set
Figure 6: Applying train test split
Figure 7: Applying feature engineering and column concatenation
Figure 8: Model implementation and model saving

Production side code without using pipelines:

Figure 9: Production Code without using pipelining
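
A rough sketch of what this looks like in code, under stated assumptions: the data frame, its columns (age, city, label), and the decision tree model are hypothetical, and the fitted objects are pickled to in-memory bytes rather than files to keep the example self-contained. The point is that three separate artifacts must be saved, loaded, and applied in the right order.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data; names and values are illustrative only.
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0, 29.0, 41.0],
    "city": ["Delhi", "Mumbai", "Delhi", "Kolkata", "Mumbai", "Kolkata"],
    "label": [0, 1, 0, 1, 0, 1],
})

# --- Training side: fit each step separately ---
imputer = SimpleImputer(strategy="mean").fit(train[["age"]])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["city"]])
X_train = np.hstack([
    imputer.transform(train[["age"]]),
    encoder.transform(train[["city"]]).toarray(),
])
model = DecisionTreeClassifier(random_state=0).fit(X_train, train["label"])

# Three separate artifacts have to be saved.
artifacts = {
    "imputer": pickle.dumps(imputer),
    "encoder": pickle.dumps(encoder),
    "model": pickle.dumps(model),
}

# --- Production side: load everything and repeat the steps in the same order ---
imputer_p = pickle.loads(artifacts["imputer"])
encoder_p = pickle.loads(artifacts["encoder"])
model_p = pickle.loads(artifacts["model"])

new_row = pd.DataFrame({"age": [np.nan], "city": ["Delhi"]})
X_new = np.hstack([
    imputer_p.transform(new_row[["age"]]),
    encoder_p.transform(new_row[["city"]]).toarray(),
])
prediction = model_p.predict(X_new)
print(prediction)
```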

As we can see, the production code has to re-implement every feature engineering step. This is tedious and error-prone, because the steps must be applied in exactly the same order as on the training side.

  2. Using machine learning pipelines:

Figure 10: Importing libraries and data set and applying train test split
Figure 11: Creating pipeline nodes using Column Transformer
Figure 12: Creating pipeline
Figure 13: Exploring pipelines and saving model
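
The steps above can be sketched as follows. The data, column names, and the decision tree model are hypothetical choices for illustration. The ColumnTransformer serves as the preprocessing node, and the model is the last step of the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; names and values are illustrative only.
data = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0, 29.0, 41.0, 63.0, 38.0],
    "city": ["Delhi", "Mumbai", "Delhi", "Kolkata",
             "Mumbai", "Kolkata", "Delhi", "Mumbai"],
    "label": [0, 1, 0, 1, 0, 1, 1, 0],
})
X = data.drop(columns="label")
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Pipeline node: per-column preprocessing.
preprocess = ColumnTransformer(
    transformers=[
        ("impute_age", SimpleImputer(strategy="mean"), ["age"]),
        ("onehot_city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# The full pipeline: preprocessing followed by the model.
pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=0)),
])

pipe.fit(X_train, y_train)     # transformers are fitted on training data only
y_pred = pipe.predict(X_test)  # the same transformations are replayed on test data

# Individual steps remain accessible by name for inspection.
print(pipe.named_steps["preprocess"].transformers_[0][1].statistics_)
```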

Production code using pipelines:

Figure 14: Production code using pipelines
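
With a pipeline, the production side reduces to loading one object and calling predict on raw data. The sketch below is hypothetical in its data, column names, and model, and it pickles to in-memory bytes instead of a file so that it stays self-contained.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data; names and values are illustrative only.
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0, 29.0, 41.0],
    "city": ["Delhi", "Mumbai", "Delhi", "Kolkata", "Mumbai", "Kolkata"],
    "label": [0, 1, 0, 1, 0, 1],
})
features = ["age", "city"]

pipe = Pipeline(steps=[
    ("preprocess", ColumnTransformer(
        transformers=[
            ("impute_age", SimpleImputer(strategy="mean"), ["age"]),
            ("onehot_city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ],
        sparse_threshold=0,
    )),
    ("model", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(train[features], train["label"])

# Training side: a single artifact holds preprocessing and model together.
saved = pickle.dumps(pipe)

# Production side: load once, then predict directly on raw input.
loaded = pickle.loads(saved)
new_row = pd.DataFrame({"age": [np.nan], "city": ["Delhi"]})
prediction = loaded.predict(new_row)
print(prediction)
```

Because the preprocessing travels with the model, the order of the steps cannot drift between training and production.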

Conclusion:

By using column transformers and machine learning pipelines, we reduce the number of lines of code, which makes the code easier to read and keeps the production-side code simple.

