Normalization in Machine Learning
Sandeepkumar Belamagi
Data Analyst | Machine Learning | Python & ML Pipelines | Power BI Expert | MLOps Learner | Transitioning to Data Science / ML Engineer
What is Normalization in Machine Learning?
Normalization is a scaling technique applied during data preparation that brings the values of the numeric columns in a dataset onto a common scale. Not every dataset needs it. It is needed only when the features fed to a machine learning model have different ranges.
Methods of Data Normalization:
1) Decimal Scaling
2) Min-Max Normalization
3) Z-Score Normalization (Zero-Mean Normalization)
Implementing different methods of Data normalization:
1. Decimal Scaling
Decimal scaling normalizes a value by moving its decimal point. The number of places to move is determined by the maximum absolute value in the given set of data. If $V_i$ is a value of attribute A, then the normalized value $U_i$ is given by:
Decimal Scale Normalization formula:
$U_i = \frac{V_i}{10^{j}}$
where j is the smallest integer such that $\max|U_i| < 1$.
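As a minimal sketch of the rule above, the function below searches for the smallest j and then divides every value by 10^j. The name decimal_scale is hypothetical, and the sketch assumes a non-empty list and restricts j to non-negative integers (so data that is already below 1 in absolute value is left unchanged).

```python
def decimal_scale(values):
    """Divide every value by 10**j so that the largest absolute value is below 1.

    Assumption: j is restricted to non-negative integers and `values` is non-empty.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    # Increase j until 10**j is strictly larger than the largest absolute value.
    while max_abs >= 10 ** j:
        j += 1
    return [v / 10 ** j for v in values], j


# Illustrative data (not from the article).
scaled_values, j = decimal_scale([-986, 120, 455])
print(j, scaled_values)
```

With this illustrative data the largest absolute value is 986, so j is 3 and every value is divided by 1000.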
2. Min-Max Normalization
Also known as min-max scaling or rescaling, this is the simplest method. It rescales the range of each feature to [0, 1] or [-1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max of 0 to 1 is given as:
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
where x is an original value and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span from 160 pounds to 200 pounds. To rescale this data, we first subtract 160 from each student's weight and then divide the result by 40 (the difference between the maximum and minimum weights). A weight of 160 pounds maps to 0 and a weight of 200 pounds maps to 1.
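The weight example can be sketched in a few lines of Python. The function name min_max_scale, its new_min and new_max parameters, and the two intermediate weights (170 and 180) are assumptions added for illustration. Only the 160 and 200 pound endpoints come from the example above.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is the same, so the denominator (max - min) would be zero.
        raise ValueError("min-max scaling is undefined when all values are equal")
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# 160 and 200 are the article's endpoints; 170 and 180 are made-up values in between.
weights = [160, 170, 180, 200]
scaled_01 = min_max_scale(weights)              # target range [0, 1]
scaled_pm1 = min_max_scale(weights, -1.0, 1.0)  # target range [-1, 1]
print(scaled_01)
print(scaled_pm1)
```

In practice a library scaler such as scikit-learn's MinMaxScaler is often used instead, because it stores the minimum and maximum learned from the training data and reuses them on new data.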
3. Z-Score Normalization / Standardization (Zero-Mean Normalization)
Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks). The general method of calculation is to determine the mean and standard deviation of each feature. Next we subtract the mean from each value of that feature. Then we divide the mean-centered values of each feature by its standard deviation:
$x' = \frac{x - \bar{x}}{\sigma}$
where x is the original feature vector, $\bar{x} = \text{average}(x)$ is the mean of that feature vector, and $\sigma$ is its standard deviation.
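The three steps (compute the mean and standard deviation, subtract the mean, divide by the standard deviation) can be sketched with Python's statistics module. The function name z_score and the sample data are hypothetical. The sketch also assumes the population standard deviation (statistics.pstdev), which the article does not specify. Using the sample standard deviation instead would give slightly different numbers.

```python
import statistics


def z_score(values):
    """Standardize `values` to zero mean and unit variance.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread, so the division is undefined.
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mean) / std for v in values]


# Illustrative data (not from the article): mean 5, population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_score(data)
print(standardized)
print(statistics.fmean(standardized), statistics.pstdev(standardized))
```

After the transformation the mean of the feature is 0 and its standard deviation is 1, up to floating-point rounding.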
Some machine learning algorithms benefit from normalization and standardization, particularly when Euclidean distance is used. For example, if one of the variables in a K-Nearest Neighbors (KNN) model is in the 1000s and the other is in the 0.1s, the first variable will dominate the distance almost entirely. In this scenario, normalization or standardization can be beneficial.
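The effect on Euclidean distance can be illustrated with a small sketch. The three points are made-up data with one feature in the 1000s and one in the 0.1s. The helper min_max_columns is hypothetical and assumes that every column contains at least two distinct values.

```python
import math

# Hypothetical points: (feature in the 1000s, feature in the 0.1s).
query = (1000.0, 0.1)
a = (1010.0, 0.9)  # close to query on feature 1, far away on feature 2
b = (3000.0, 0.1)  # far from query on feature 1, identical on feature 2


def min_max_columns(rows):
    """Min-max scale each column of `rows` to [0, 1] (assumes no constant column)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


# Raw distances: the large-valued feature decides almost everything.
raw_a = math.dist(query, a)
raw_b = math.dist(query, b)

# Distances after scaling both features to [0, 1].
q_s, a_s, b_s = min_max_columns([query, a, b])
scaled_a = math.dist(q_s, a_s)
scaled_b = math.dist(q_s, b_s)

print(raw_a, raw_b)
print(scaled_a, scaled_b)
```

On the raw data, point a looks roughly 200 times closer to the query than point b, even though a sits at the opposite end of the second feature's range. After scaling, the two distances are almost equal, because each feature now contributes on the same scale.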
When to use normalization and standardization:
Thank you...!!!