Data Transformations in Machine Learning |2 - Part 10
In the last article, we discussed several data transformation techniques: the Log Transformer, Reciprocal Transformer, Square Transformer, Square Root Transformer, and others.
In this article, we continue with the two remaining techniques: binning (discretization) and binarization.
Data Transformation Techniques:
Binning (discretization) and binarization are techniques that transform continuous numerical data (height, weight, mass, temperature, energy, speed, length, etc.) into discrete or binary representations.
1. Binning / Discretization
Binning is the process of grouping a set of continuous or numerical data points into a smaller number of discrete “bins” for analysis.
What are Bins? Bins are intervals or ranges into which you divide the range of your continuous numerical data.
Why do we create Bins?
For example, consider the AGE feature. Instead of using individual ages, you might create bins like “0–10,” “11–20,” and so on. This groups the ages into a small number of categories, or bins.
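The age example can be sketched in a few lines of plain Python. This is only an illustration: the bin edges beyond "11-20", the "51+" catch-all bin, and the helper name `age_to_bin` are assumptions made for the sketch, not something from the article's notebook.

```python
from bisect import bisect_left

# Inclusive upper edge of each bin (assumed values, extending the "0-10", "11-20" pattern).
BIN_UPPER_EDGES = [10, 20, 30, 40, 50]
# One label per bin, plus a catch-all bin for anything above the last edge.
BIN_LABELS = ["0-10", "11-20", "21-30", "31-40", "41-50", "51+"]


def age_to_bin(age: int) -> str:
    """Return the label of the bin that contains `age` (hypothetical helper)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first upper edge that is >= age,
    # which is also the index of the matching label.
    return BIN_LABELS[bisect_left(BIN_UPPER_EDGES, age)]


ages = [3, 10, 11, 25, 47, 68]
binned = [age_to_bin(a) for a in ages]
print(binned)
```

Each individual age is replaced by the label of the interval it falls into, so six distinct values collapse into five categories here.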
When you use the KBinsDiscretizer class in scikit-learn for binning, you will encounter a parameter named 'strategy'. This parameter controls how the bin widths are defined during discretization. The available strategies are 'uniform', 'quantile', and 'kmeans'.
By experimenting with these strategies in the Colab notebook below, you can plot the transformed data and see how each strategy affects the resulting bin configuration.
Google Colaboratory
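As a rough sketch of how the three strategies differ, the snippet below fits KBinsDiscretizer on the same small array once per strategy and prints the learned bin edges. The sample values and the choice of `n_bins=3` are made up for illustration and are not taken from the notebook; depending on your scikit-learn version, the 'quantile' strategy may also emit a warning about its default settings.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up, right-skewed sample values (illustrative only), as a single feature column.
X = np.array([1, 2, 3, 4, 5, 6, 20, 40, 60, 100], dtype=float).reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    # encode="ordinal" returns the bin index (0, 1, 2) instead of a one-hot matrix.
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes = disc.fit_transform(X).ravel().astype(int)
    edges = disc.bin_edges_[0]  # learned edges for the first (only) feature
    results[strategy] = (edges, codes)
    print(f"{strategy:>8}: edges={np.round(edges, 2)} codes={codes}")
```

With 'uniform', all bins have the same width, so skewed data piles up in the first bin. With 'quantile', each bin holds roughly the same number of points. With 'kmeans', the edges are placed between one-dimensional k-means cluster centers.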
2. Binarization:
Binarization is the process of converting numerical data into binary form, typically 0s and 1s. It involves setting a threshold value, and any data point above the threshold is marked as 1, while those below or equal to the threshold are marked as 0.
Take temperatures as an example: anything above a chosen threshold is considered “hot” (1), and anything at or below it is considered “not hot” (0).
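The temperature example can be sketched with scikit-learn's Binarizer. The readings and the 30.0 threshold below are assumed values chosen for illustration, not data from the article.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up temperature readings in degrees Celsius (one feature column).
temps_c = np.array([[18.0], [25.0], [30.0], [30.5], [41.0]])

# Values strictly greater than the threshold become 1; all others become 0.
binarizer = Binarizer(threshold=30.0)
is_hot = binarizer.fit_transform(temps_c).ravel().astype(int)
print(is_hot)
```

Note that the reading exactly equal to the threshold (30.0) is marked 0, which matches the "below or equal" rule described above.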
You can see this in practice in the Colab notebook below:
Google Colaboratory
Key differences between the two:
1. Nature:
2. Output:
3. Method:
4. Flexibility:
In conclusion, our exploration of data binning and binarization in machine learning underscores the versatility and significance of tailoring our data to align with the demands of diverse models.
From the structured discretization introduced by binning to the simplicity of binary representation through binarization, each technique serves a crucial role in reshaping our datasets.
That's it for this article; we'll continue the discussion in the next one.
Previous article: 9. Data Transformation in ML.
Next article: 11. Column Transformer in ML.