Quartiles
Claudiu Clement
CTO @ e-Comas and PhD in Stats, sharing simplified insights on e-commerce analytics and eRetailer trends.
This is one of the last articles here. The newsletter has moved to Substack.
Table of contents:
1. What are quartiles
Quartiles are position indicators that divide a sequence of numbers into 4 equal parts.
Let’s look at the schema below.
We have a sequence with n=12 (numbers from 14 to 57) and let’s imagine these represent the number of tractors some 12 farms have in the northern region of Statistics Land.
Quartile analysis is part of descriptive statistics and, consequently, helps us better understand the data at hand. Specifically, it helps us understand what is called central tendency:
“A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data”
2. When can we use them
Ok, let’s come back to our farms. A businessman asks us to help him better understand this northern region, where he wants to open a farm. His main question is: “Will I be in the top 25% of farms by tractor count if I open a farm with 43 tractors?”
Hmm… Great question.
As we can see in Figure 1 above, I have already split the data into 4 quarters.
The interpretation is the following:
I know. I went too fast. Let’s look below to see how we actually calculate these values.
3. How to calculate them manually
Before calculating any quartiles, we need to order the sequence from lowest to highest (I already did that in Figure 1).
Next, let’s look at Figures 2.1 and 2.2 below for some clarification.
Next, let’s calculate the positions of the quartiles before calculating their values.
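The positions discussed next are consistent with the common (n + 1) rule, where the k-th quartile sits at position k(n + 1)/4 in the ordered sequence. Here is a minimal sketch, assuming that this is the rule behind the figures (the function name is only illustrative):

```python
def quartile_positions(n):
    """Return the 1-based positions of Q1, Q2 and Q3 under the (n + 1) rule."""
    return [k * (n + 1) / 4 for k in (1, 2, 3)]

# Our sequence has n = 12 farms.
print(quartile_positions(12))  # [3.25, 6.5, 9.75]
```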
So, what do these 3.25 / 6.5 / 9.75 mean?
These represent the positions of the quartile values in this sequence of numbers.
In Figure 2.1, I have written the index of each number in blue. For Q1, for example, the position is 3.25, so somewhere between the 3rd and 4th numbers, and closer to the 3rd than to the 4th. We can now intuitively state that Q1 lies somewhere between 16 and 17 (index 3 = 16, index 4 = 17).
But what is the number?
What we can do is take the average of these two numbers, which gives us the value of Q1.
So Q1 = (16+17)/2 = 16.5
For Q2, the position is 6.5, so somewhere between the 6th and 7th index. We average the 6th and 7th values of our sequence, thus Q2 = (32+40)/2 = 36.
The same goes for Q3: its position is 9.75, so somewhere between the 9th and 10th positions. Q3 = (50+52)/2 = 51.
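The averaging step can be sketched in a few lines of Python. Figure 1 is not reproduced here, so the list below is partly an assumption: only 14, 16, 17, 32, 40, 50, 52 and 57 are quoted in the text, while 15, 25, 45 and 55 are placeholders that do not influence the result. The function name is illustrative as well.

```python
import math

# Sorted tractor counts. Only 14, 16, 17, 32, 40, 50, 52 and 57 come from the
# article; 15, 25, 45 and 55 are placeholders for values not quoted in the text.
tractors = [14, 15, 16, 17, 25, 32, 40, 45, 50, 52, 55, 57]

def midpoint_quartile(sorted_values, k):
    """Average the two neighbours of the k-th quartile position (k = 1, 2, 3)."""
    position = k * (len(sorted_values) + 1) / 4   # 1-based, e.g. 3.25 for Q1
    lower = sorted_values[math.floor(position) - 1]
    upper = sorted_values[math.ceil(position) - 1]
    return (lower + upper) / 2

q1, q2, q3 = (midpoint_quartile(tractors, k) for k in (1, 2, 3))
print(q1, q2, q3)  # 16.5 36.0 51.0
```

When a position happens to be a whole number, both neighbours are the same element, so the function simply returns that value.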
Now, if you enter my numbers into the website above, you will get slightly different results. Why? Because the position of Q1 in our case is precisely 3.25, which means the exact value sits one quarter of the way from 16 to 17. The website shows 16.25, which is calculated as (16*3+17)/4. The value is 3 times closer to 16 than it is to 17.
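The weighted version can be sketched as a linear interpolation between the two neighbours, using the fractional part of the position as the weight. As before, this is a sketch under assumptions: the function name is illustrative, and 15, 25, 45 and 55 are placeholder values for numbers the text does not quote.

```python
import math

# Only 14, 16, 17, 32, 40, 50, 52 and 57 come from the article;
# 15, 25, 45 and 55 are placeholders that do not affect the quartiles.
tractors = [14, 15, 16, 17, 25, 32, 40, 45, 50, 52, 55, 57]

def interpolated_quartile(sorted_values, k):
    """Linearly interpolate at the (n + 1) position of the k-th quartile."""
    position = k * (len(sorted_values) + 1) / 4      # 1-based position
    lower_index = math.floor(position)               # 1-based index below
    fraction = position - lower_index                # 0.25 for Q1 here
    lower = sorted_values[lower_index - 1]
    # Clamp so a position at the very end of the list stays in range.
    upper = sorted_values[min(lower_index, len(sorted_values) - 1)]
    return lower + fraction * (upper - lower)

print(interpolated_quartile(tractors, 1))  # 16.25
```

With the same rule, Q2 stays at 36.0 because its position falls exactly halfway, while Q3 comes out as 51.5 rather than 51.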
4. How to calculate them using Python
Ok, now moving to Python. For this calculation, we will use NumPy.
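A minimal sketch of what such a call might look like, using np.quantile with its default settings. The array is partly an assumption: 15, 25, 45 and 55 stand in for values that the text does not quote, and they do not change the quartiles.

```python
import numpy as np

# Only 14, 16, 17, 32, 40, 50, 52 and 57 come from the article;
# 15, 25, 45 and 55 are placeholders for the values not quoted in the text.
tractors = np.array([14, 15, 16, 17, 25, 32, 40, 45, 50, 52, 55, 57])

# Default settings: linear interpolation between neighbouring values.
q1, q2, q3 = np.quantile(tractors, [0.25, 0.5, 0.75])
print(q1, q2, q3)  # 16.75 36.0 50.5
```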
Alright, ok. But something is off, isn’t it? We have Q1 = 16.75, compared to the 16.5 we calculated or the 16.25 that the above website calculates.
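Ok, I won’t bore you too much. The problem lies in the default setting of np.quantile, more on it here. By default it interpolates linearly at position (n-1)*q counted from zero, which for Q1 is 11*0.25 = 2.75, so three quarters of the way from 16 to 17.
A simple way to fix this and stick with the reasoning from this article is to add interpolation='midpoint' (named method='midpoint' in NumPy 1.22 and later, where interpolation is deprecated), and we’re all set.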
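Here is a hedged sketch of the midpoint setting, assuming NumPy 1.22 or later, where the argument is called method. It also tries method="weibull", which to my understanding places the quartiles at the (n + 1) positions used in the manual section. The array again contains placeholder values (15, 25, 45 and 55) for numbers the text does not quote.

```python
import numpy as np

# Only 14, 16, 17, 32, 40, 50, 52 and 57 come from the article;
# 15, 25, 45 and 55 are placeholders for the values not quoted in the text.
tractors = np.array([14, 15, 16, 17, 25, 32, 40, 45, 50, 52, 55, 57])
quartiles = [0.25, 0.5, 0.75]

# Midpoint: average of the two neighbouring values, as in the manual section.
midpoint = np.quantile(tractors, quartiles, method="midpoint")
print(midpoint.tolist())  # [16.5, 36.0, 51.0]

# Weibull: linear interpolation at position q * (n + 1).
weibull = np.quantile(tractors, quartiles, method="weibull")
print(weibull.tolist())  # [16.25, 36.0, 51.5]
```

On older NumPy versions, the midpoint call would use interpolation="midpoint" instead, and the "weibull" option is not available.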
In Figure 1, I wrote “Interquartile Range”, which represents the midspread, or the middle 50% of our data. It also helps us understand which values can be considered outliers. I wrote a few things about this here.
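As a small illustration, the interquartile range and a pair of outlier limits can be computed from the midpoint quartiles found earlier. The 1.5 × IQR limits are the common Tukey convention and an assumption here, since the article does not say which outlier rule it has in mind.

```python
# Midpoint quartiles from the manual calculation in this article.
q1, q3 = 16.5, 51.0

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1

# Tukey's convention (an assumption here): values outside these
# limits are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)  # 34.5 -35.25 102.75
```

Under this convention, none of the 12 farms (14 to 57 tractors) would count as an outlier.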
In the end, let’s not forget our businessman. Let’s answer his question: will he be in the top 25% of farms if his farm owns 43 tractors?
The answer is no. To be in the top 25%, he will need at least 51 tractors, the value of Q3.
Conclusion
We’ve looked at what quartiles are, when we can use them, and how they’re calculated. I hope it is a bit clearer now.
Until next time, keep learning.