Normalization and Standardization in Data Science: When to apply one, when to apply the other?
I’m going to bring you what is probably the topic that generates the most doubts among those who are just starting their journey in data science.
What is the difference between normalization and standardization? When to use one technique or the other? What is the practical result of applying normalization and standardization? Even more experienced professionals may not be able to clearly describe the difference. There is no standard, no international regulation on this.
Thus, many people use a name they prefer, a name they understand, a name they interpret. Furthermore, if the professional comes from the statistics field, for example, they give it one name. If the professional comes from computer science, they give it another name. As they become more experienced in data science, they mix the names even more. This is normal because there is no single regulation, no single rule in data science. So, the two concepts end up causing a lot of confusion.
We are not working with assumptions here; we work with data, with facts. So, I will show you both concepts in practice. Several multivariate analysis techniques assume normalized or standardized data. If I don’t do this, I will be violating the requirements of the algorithms I will use. So, I will have to apply normalization and standardization at various times. We apply one technique or the other. And then, of course, the question is: what is the difference? When to apply one, when to apply the other? I will explain all this to you now. It is not trivial.
Preprocessing Holy Grail
Data preprocessing is a fundamental step in data analysis and machine learning. So, we apply normalization and standardization to perform preprocessing.
Considering the data science process, we start with defining the business problem. We then extract the data and perform an exploratory analysis to understand how the data is organized. We apply some cleaning strategy as necessary.
And then, we apply the preprocessing, which is usually the last step before modeling, when we will apply the machine learning or statistical algorithm. Ok?
So, preprocessing generally happens just before modeling. And this is where we apply, or can apply, normalization and standardization, two techniques frequently used in data preprocessing.
You will apply one or the other, normalization or standardization, depending on the data you have at hand, the algorithm you will use at the modeling stage, the scale, and the format of the data. So, the choice between one technique or the other starts with understanding the data. That is why exploratory analysis is so important.
You have to look at the data, check the pattern you have there. And, based on that, apply the technique as needed. Ok? Let’s see what these techniques are and when to use one or the other.
Normalization
Normalization is used to resize data values to a common range without distorting the differences in the value ranges.
There is a difference between data and information. Normalization is used to modify the format of the data, but it cannot modify the information. Modifying the data is allowed. We have several strategies for this, various mathematical techniques. Normalization is mathematical; I will show you the formula shortly.
I can manipulate the data however I want, according to what I need for the final result. But I cannot modify the information, ok? It has to be done with great care and criteria. Normally, data is resized to a range between 0 and 1. This is useful when the data has different scales, and you want to compare them in an equitable way.
For example, imagine I have two variables in my dataset: age and salary. And I am creating a model to predict whether an employee will resign or not.
So, I have two input variables, age and salary. Age is usually in two digits, right? If it is an employee, they are certainly more than 10 years old. They may even be over 100 years old, so I will also consider three digits, no problem. But usually, the scale is two digits, right? Age, 30 years, 45, 55, 18 years, and so on.
The salary is probably in the thousands. So, the person earns 8 thousand, 10 thousand, 15 thousand, 30 thousand, 50 thousand, and so on. Ok? The variables are on different scales. Depending on the algorithm I will use, I cannot leave it like this. The algorithm expects data on the same scale. So, what do I do? I can apply normalization, which is basically this mathematical formula, quite beautiful, by the way, which you are seeing now:
x_norm = (x - x_min) / (x_max - x_min)
In other words, I apply a normalization formula to the age variable and apply the formula to the salary variable. As a result, they will be on the same scale. I am modifying the data without modifying the information. It’s a mathematical trick.
Many algorithms have this premise that the data is on the same scale. I apply normalization to meet the premise according to the algorithm that will be used. Almost all the algorithms we will use have the premise that the data should be on the same scale.
See, x is the value of the attribute, for example, age. I subtract the minimum value, so I look at the entire age column, check the minimum value. Then I take each element, each row of the age column, subtract the minimum value from the entire column. Divide by the maximum value minus the minimum value. The result is x-norm, normalized x.
In other words, a mathematical trick that changes the data scale but does not change the information. The information will remain the same, that is, the age of each employee, etc. The point is that it will be on a different scale. This is essentially normalization.
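To make the formula concrete, here is a minimal sketch in plain Python. The function name `min_max_normalize` and the sample ages and salaries are my own assumptions, chosen only for illustration; they are not data from a real project.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    spread = highest - lowest
    if spread == 0:
        # A constant column has no range to rescale; map every value to 0.0.
        return [0.0 for _ in values]
    return [(x - lowest) / spread for x in values]


# Hypothetical sample values, chosen only for illustration.
ages = [18, 30, 45, 55]
salaries = [8_000, 10_000, 15_000, 50_000]

ages_norm = min_max_normalize(ages)
salaries_norm = min_max_normalize(salaries)

print(ages_norm)      # smallest age becomes 0.0, largest becomes 1.0
print(salaries_norm)  # smallest salary becomes 0.0, largest becomes 1.0
```

After the call, both columns live in the same range, and the order of the employees inside each column is unchanged.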
Normalization is the process of resizing numerical values in the dataset to a common scale without distorting the differences in value ranges or losing information. If you change the information, you are completely modifying the purpose of the analysis.
You have to take the raw data; there is some information there, which is what you want to use to train an algorithm, but maybe the data is not in the proper format. You modify the data with mathematical tricks without losing the information.
Generally, normalization is done so that values are between zero and one. This is extremely useful when variables have different units or a wide range in their intervals.
Common Normalization Methods
Oops, but wait a minute, how come? Yeah, but who said I can’t apply a different formula? Who said? Is it forbidden? No. If I keep that rule of modifying the data without modifying the information, I can apply the formula however I want. This is the big problem. Many people find it difficult to understand this right from the start, right?
Normalization is almost a standard you use when you want to perform this kind of activity. But nothing prevents you from modifying the formula, especially if you want to customize the data according to some characteristic, need, and so on. I always avoid doing this, I always try to work directly with the standard, but I want you to know that it is possible to use modifications of the technique.
These are the common methods, but you may find others for normalization, ok? I will make a comparison shortly, but to wrap up normalization: when do you use it? When you have data on different scales! You then put the data on the same scale, in the range between 0 and 1. You can apply, for example, Min-Max Scaling, which can be done with the scikit-learn library in Python, using one line of code. Did you understand the concept of normalization? So, let’s move on to standardization.
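As a sketch of that one line of code, scikit-learn provides `MinMaxScaler` in `sklearn.preprocessing`. The small age and salary matrix below is hypothetical, made up just to have something to scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one row per employee, columns are [age, salary].
X = np.array([
    [18, 8_000],
    [30, 10_000],
    [45, 15_000],
    [55, 50_000],
], dtype=float)

# The "one line": fit learns each column's min and max, transform rescales.
X_norm = MinMaxScaler().fit_transform(X)

print(X_norm)
```

Each column is scaled independently, so age and salary both end up between 0 and 1 regardless of their original units.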
Standardization
It will soon become very clear why the difference between normalization and standardization causes so many doubts. Follow me.
Standardization is used to resize data so that it has a mean of 0 and a standard deviation of 1.
In normalization, I will change the data scale so that the data is in a range between 0 and 1.
In standardization, I will change the data format, I will change the scale, so that the variable’s mean is 0 and the standard deviation is 1. It is a subtle difference, it is a detail. But the detail, you know, makes all the difference.
So, normalization and standardization, in both cases, we resize the data. These are preprocessing techniques. The way this is done is what is different. In normalization, I put the data in a range between 0 and 1. I am not concerned with the mean and the deviation. I want a range between 0 and 1.
In standardization, I am concerned with the mean and the deviation. I will explain why shortly. In this case, we resize the data so that the distribution has a mean of 0 and a standard deviation of 1. Once again, I am modifying the data without modifying the information.
This gives the data the same mean and standard deviation as the standard normal distribution. Hence the name standardization. Keep in mind that the shape of the distribution does not change: data that was skewed before standardization is still skewed after it.
The normal distribution is also called Gaussian, ok? Shortly, I will bring more details about it. Standardization is useful in algorithms that expect features centered at 0 with a standard deviation of 1, like many machine learning algorithms.
How It Works
A machine learning algorithm is born from a research effort. Researchers develop an algorithm to solve a problem. They collect the data and put the data in a specific format. They then train the algorithm, create the model, and solve the problem.
They announce to the community: “Hey, community, I solved the problem. You can use my algorithm. It’s free to use.” But there is a detail. To use my algorithm, you have to use the data in the same format I used during the research.
In several cases, the algorithms are researched and trained exactly with data scaled this way. That is, mean 0 and standard deviation equal to 1.
Here is the standardization formula. Basically, it is called Z-score standardization.
z = (x - μ) / σ
I take the value of X, subtract the mean. That letter that looks like the letter U, which is actually the Greek letter Mu (μ), is the mean. Divided by sigma (σ), which is the standard deviation. See that the mathematical formula is different.
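Here is a minimal sketch of that calculation in plain Python, using only the standard `statistics` module. The function names and the salary sample are my own assumptions for illustration. I use the population standard deviation (`pstdev`); the sample version (`stdev`) is an equally common choice. The last line also checks the skewness, a measure of the distribution's shape, before and after.

```python
import statistics


def z_score_standardize(values):
    """Apply z = (x - mean) / std to every value in a list."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column has no spread; define every z-score as 0.0 here.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


def skewness(values):
    """Average cubed z-score: 0 for symmetric data, positive for a right tail."""
    return statistics.fmean([z ** 3 for z in z_score_standardize(values)])


# Hypothetical right-skewed sample (salaries in thousands), for illustration.
salaries = [8, 9, 10, 10, 12, 15, 30, 50]
salaries_std = z_score_standardize(salaries)

print(round(statistics.fmean(salaries_std), 6))   # mean after standardizing
print(round(statistics.pstdev(salaries_std), 6))  # std after standardizing
print(round(skewness(salaries), 4), round(skewness(salaries_std), 4))
```

The mean lands on 0 and the standard deviation on 1, while the two skewness values printed at the end are the same: the z-score moves and rescales the data, it does not reshape it.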
Did you notice that the mathematical formula is extremely simple? Basic arithmetic operation. The problem is not in the mathematics, as many people think. It is also not in the programming because I do this here with one line of code. The problem is in the concept. I am here trying to explain it to you in detail.
When we go to programming in Python, I will apply these techniques with one line of code. I don’t need to create a mathematical formula. But I have to know how to choose one or the other, right?
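For reference, a sketch of what that one line can look like with scikit-learn's `StandardScaler`. The age and salary matrix is hypothetical. Note that `StandardScaler` divides by the population standard deviation, which is why the check below uses NumPy's default `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one row per employee, columns are [age, salary].
X = np.array([
    [18, 8_000],
    [30, 10_000],
    [45, 15_000],
    [55, 50_000],
], dtype=float)

# The "one line": fit learns each column's mean and standard deviation,
# transform subtracts the mean and divides by the standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```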
Standardization is used to resize the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. Thus, the data is transformed to have a distribution with a specific mean and standard deviation. This is useful in algorithms that assume all features are centered around 0 and have the same variation.
Common Standardization Methods
The most common is z-score Normalization. But wait a minute, how come? You have been talking about standardization and now you are talking about normalization?
So, don’t be mad at me. I am just the messenger, ok? People call the standardization technique z-score normalization.
The name, the nomenclature, does not make much difference. What you need is to preprocess the data, resize it as needed for the algorithm, train the algorithm, create the model, solve the problem, deliver the project, make the client happy, and move on to the next.
In practice, what matters is solving the business problem. We apply normalization or standardization depending on the data and the algorithm being used. I will teach this to you in some practical projects.
So, people call the standardization technique z-score normalization. People say, “I will apply normalization,” and apply z-score when, in fact, they are applying standardization. But the name they gave the technique is z-score normalization.
In daily practice, you will find everything. That is why I am making this article as a guide, a starting point for you to understand these concepts well.
Considerations
Let’s see some considerations, follow me.
Well, here is where I basically welcome you to the world of chaos, complete chaos. How many algorithms do we have today? Hundreds, considering the standard algorithms and their variations. In other words, for each algorithm I decide to use, I have to check the algorithm’s specifications, for example, in the framework’s documentation. I need to know whether that algorithm expects data with a normal distribution, a standard normal distribution, or simply data on the same scale, so that I can decide between normalization and standardization.
It is useless to try to create that from-to lookup table: if it is this algorithm, it will be this technique. This is a waste of time. You have to know the algorithm, apply it daily, and then choose the ideal preprocessing technique.
Can the data already be on the appropriate scale? Yes, it is possible. It is not very common, but it is possible. In this case, do I need to apply normalization or standardization? Maybe not. It is not necessarily mandatory.
Normally, the data will come in different scales. Therefore, you may have to apply one strategy or the other. Then you have to analyze each variable, analyze the algorithm you are using and apply the technique.
Are you going to change the algorithm? Great, then maybe you have to go back to preprocessing and change the technique. Remember, we do data science. Science involves experimentation. You have to know the techniques, the strategies, the algorithms to reach the final result.
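One way to keep that experimentation cheap, sketched below under my own assumptions, is to bundle the scaler and the algorithm together with scikit-learn's `make_pipeline`, so that swapping one means swapping the pair. The data is synthetic, the resignation rule is made up, and the two pairings are only examples, not a recommendation of which scaler belongs with which algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: columns are [age, salary]; the label is synthetic.
X = np.column_stack([
    rng.integers(18, 65, size=200),
    rng.integers(3_000, 50_000, size=200),
]).astype(float)
y = (X[:, 1] < 15_000).astype(int)  # made-up rule: low salary -> resigns

# Each candidate carries its own preprocessing step.
candidates = {
    "knn + min-max": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "logistic + z-score": make_pipeline(StandardScaler(), LogisticRegression()),
}

for name, model in candidates.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))  # accuracy on the same data
```

Changing the algorithm then becomes a matter of adding another entry to `candidates` with whatever preprocessing its documentation asks for.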
I hope that by this point, you already have a clear understanding of the difference between normalization and standardization, at least the conceptual difference. But to close the understanding, let’s look at data, right? After all, that’s what we do, data science.
Original Data
I have here a small table with original data. Raw data, extracted directly from the source. Variable A, variable B, variable C, with different values.
You already notice that in variable B, for example, there is a scale difference. I have a data point that is in the tens, value 50, and other points that are in the hundreds. So, I already noticed a scale difference there.
There is something important. You do not apply preprocessing only to change the scale. Sometimes you apply preprocessing, resize the data, because it simplifies mathematical operations, sometimes even speeds up the process. This is very simple for you to understand.
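Stored as floating-point numbers, 600 and 0.02 occupy the same space in the computer’s memory, so the gain is not in storage. The gain is in the arithmetic: when all variables sit in a small, comparable range, the calculations are numerically better behaved, and iterative optimizers such as gradient descent tend to converge in fewer steps. Considering a large dataset, this can make all the difference in mathematical calculations, in processing time.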
Many people ignore the computer, take the computer out of the equation, forget that everything we do is on the computer. So, optimizing processing is also our job. Computer science is one of the pillars of data science. So, many times, I apply resizing not only to put data on the same scale but to simplify mathematical steps, since everything in machine learning boils down to mathematics.
Well, then I chose an algorithm, checked the specification, and the algorithm says the following: the data must be on the same scale, preferably with values between 0 and 1. What technique do I have to apply here? Normalization, right?
Min-Max Scaling
I then apply the mathematical formula, using Min-Max, and the data will have this format in the end. See that I modified the data, but I did not modify the information.
Look, for example, at variable A. In variable A, I have the number 100, then it increases proportionally to 200, increases proportionally to 300, right?
Now look at variable A with the normalized data. See that it increases proportionally, from 0 to 0.5 to 1. In other words, I put the data in the range between 0 and 1, maintaining the pattern I have in that variable, but I modified the data. The information was not lost, it is still there, but now it is on a different scale, and many algorithms expect to receive the data in this format. This is normalization.
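The numbers for variable A can be checked in a few lines of plain Python. This is only a sketch of the Min-Max calculation on the three values quoted above; the checks on the gaps and on reversing the transform are my own addition, to illustrate what "the information was not lost" means in practice.

```python
# Variable A from the example: 100, 200, 300.
variable_a = [100, 200, 300]

lowest, highest = min(variable_a), max(variable_a)
normalized_a = [(x - lowest) / (highest - lowest) for x in variable_a]
print(normalized_a)  # [0.0, 0.5, 1.0]

# The pattern survives: two equal steps before, two equal steps after.
original_gaps = [b - a for a, b in zip(variable_a, variable_a[1:])]
normalized_gaps = [b - a for a, b in zip(normalized_a, normalized_a[1:])]
print(original_gaps)    # [100, 100]
print(normalized_gaps)  # [0.5, 0.5]

# The transform is reversible, so the raw values can always be recovered.
restored = [x * (highest - lowest) + lowest for x in normalized_a]
print(restored)  # [100.0, 200.0, 300.0]
```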
But with the same data, I noticed that the algorithm did not give a good result. So I decided to change the algorithm. I checked the specification again. The algorithm says the following: the data must have a standard normal distribution, must follow a standard normal distribution. The original data is the same, but I changed the algorithm on the other end.
So now I have to change the preprocessing strategy. Done. I apply standardization using Z-score Normalization and observe the result now.
The data was resized, right? I now have a negative value. Can I have a negative value? Yes, no problem. Because in practice, I modified the data. But the information is the same. Except that each column now has a mean of 0 and a standard deviation of 1.
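If you calculate the mean of each column, it will be zero. The standard deviation will be equal to 1. Note that a standard deviation is never negative. What can be positive or negative is each standardized value: plus 1 means the point is one standard deviation above the mean, minus 1 means one standard deviation below it.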
I put the data here in very characteristic values, right? So that it would be a very didactic example. In practice, you may have other values, of course, as long as the mean is zero and the standard deviation is equal to 1.
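As a sketch, here is the z-score applied to variable A (100, 200, 300) with Python's standard `statistics` module. There are two common conventions for the standard deviation, dividing by n (population) or by n - 1 (sample), and they give slightly different standardized values. Which one a given table or library uses is something to check; I am not assuming either for the table in this article.

```python
import statistics

# Variable A from the example: 100, 200, 300.
variable_a = [100, 200, 300]

mean = statistics.fmean(variable_a)
std_population = statistics.pstdev(variable_a)  # divides by n
std_sample = statistics.stdev(variable_a)       # divides by n - 1

z_population = [(x - mean) / std_population for x in variable_a]
z_sample = [(x - mean) / std_sample for x in variable_a]

print([round(z, 4) for z in z_population])  # [-1.2247, 0.0, 1.2247]
print([round(z, 4) for z in z_sample])      # [-1.0, 0.0, 1.0]

# Checking the result: mean 0 and standard deviation 1 (population convention).
print(round(statistics.fmean(z_population), 6))
print(round(statistics.pstdev(z_population), 6))  # 1.0
```

In both cases the individual values can be negative, while the standard deviation itself is a single non-negative number per column.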
See that I also resized the data. But I did it differently. Why? Because my algorithm requires this, or the data has a specific format that is suitable for normalization or standardization.
Phew! I hope I have achieved the goal of definitively explaining the difference between normalization and standardization, because this causes many doubts. Many! This is a topic that normally complicates a lot. And during multivariate analysis, we will apply normalization and standardization all the time.
So I made a point of bringing the concept now, we are using the fundamentals, and we will apply this concept throughout several projects. If there is still any doubt, feel free to leave a message for me, and I will respond with great pleasure.
Thank you for accompanying me this far.