Out of the Box Insights with Boxplot

Lazy Sunday. No wonder my mind started visualizing. Science says mental visualization is a powerful tool for boosting creativity. The same goes for data analysis: visualization is a powerful tool for bringing out facts that the numbers alone don't reveal.

How do we compare a set of variables? To describe and summarize a dataset, which measurement technique do we consider? Do we only look at the mean?

Let's explore a hypothetical scenario.

Suppose we have collected historical spot viewership rating points (TRPs) for various television channels. Based on these historical TRPs, we want to select the most efficient channel, one that meets the stringent CPRP (Cost Per Rating Point) commitment stipulated in the agency-client contract.

Here we're focusing on one particular target audience, genre, and market. For simplicity's sake, assume three channels are shortlisted with identical prime time/non-prime time dispersion, and that all three satisfy the desired reach % commitment. Hence, the only deciding factor is the rating points (TRPs), which we normalized to a 10-second spot length. We also assume the average cost per spot is approximately the same for all three channels.

We collected the spot TRPs for these three channels: Channel X, Channel Y, and Channel Z. Let's look at the average TRP (mean):

Channel      Mean TRP
Channel X    0.50
Channel Y    0.31
Channel Z    0.84

Based on the mean, Channel Z is the clear winner. However, the mean alone gives an incomplete picture. Let's look at the standard deviation to gauge the spread.

Channel      Std
Channel X    0.31
Channel Y    0.24
Channel Z    0.90

OK, so Channel Z has both the highest mean and the highest standard deviation. But there's more to it that the numbers don't highlight. Here comes the boxplot! Boxplots visualize the distribution of observations through the five-number summary, showing the centre (median) and the spread (IQR and range), and they flag potential outliers. So, here are our boxplots for all three channels:

[Figure: boxplots of spot TRPs for Channel X, Channel Y, and Channel Z]
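
A chart of this kind can be produced roughly as follows. This is only a minimal sketch, not the original analysis code: the data is synthetic (right-skewed lognormal draws with made-up parameters), and the helper `make_synthetic_trps` and the column names `channel` and `trp` are hypothetical names chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display

import numpy as np
import pandas as pd
import seaborn as sns


def make_synthetic_trps(n_spots=200, seed=0):
    """Build a long-format table of made-up, right-skewed spot TRPs (not the article's data)."""
    rng = np.random.default_rng(seed)
    frames = []
    # Hypothetical lognormal parameters, chosen only to produce skewed data.
    for channel, mu, sigma in [("Channel X", -0.9, 0.6),
                               ("Channel Y", -1.4, 0.7),
                               ("Channel Z", -0.6, 1.0)]:
        trps = rng.lognormal(mean=mu, sigma=sigma, size=n_spots)
        frames.append(pd.DataFrame({"channel": channel, "trp": trps}))
    return pd.concat(frames, ignore_index=True)


df = make_synthetic_trps()

# One box per channel; by default seaborn draws points lying more than
# 1.5 * IQR beyond the box edges as individual outlier markers.
ax = sns.boxplot(data=df, x="channel", y="trp")
ax.set_xlabel("Channel")
ax.set_ylabel("Spot TRP (synthetic)")
```

To keep the picture, `ax.figure.savefig("trp_boxplot.png")` would write it to disk. With real data, only the construction of `df` changes: one row per spot, with the channel name and its TRP.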

The first thing to notice is that there are a lot of outliers in the data, with many unusually high values, particularly for Channels Y and Z. Also, the distributions are not symmetric and are skewed heavily to the right (this would be clearer with a histogram).

For instance, for Channel Z, the unusually high data points are pulling the mean up, so our first approach of using the mean (simple average) to summarize this data was not appropriate. Instead, we should look at the median, which is much less sensitive to outliers.
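
To see how differently the two measures react, here is a tiny sketch using only Python's standard library. The numbers are made up for illustration and are not taken from the channel data.

```python
from statistics import mean, median

# Made-up spot TRPs for illustration only (not the article's data).
typical_spots = [0.55, 0.60, 0.62, 0.64, 0.70]
with_spike = typical_spots + [6.90]  # add one unusually high spot

mean_before, mean_after = mean(typical_spots), mean(with_spike)
median_before, median_after = median(typical_spots), median(with_spike)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

A single extreme spot moves the mean from about 0.62 to about 1.67, while the median only shifts from 0.62 to 0.63.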

Channel      Median TRP
Channel X    0.62
Channel Y    0.24
Channel Z    0.64

When we compared the means, the outliers made Channel Z appear far more attractive than Channel X. However, the tables turn when we use the median: Channel X now looks almost equally good. We can also dig deeper into what causes the outliers. For example, a certain program may be generating very high TRPs because something related to the show has been in the news recently.

Secondly, Channel Z has the highest median TRP. However, its interquartile range (IQR, the length of the box) is also the highest. This indicates a higher spread: the values for Z are more dispersed and lie further away from the median, leading to a larger variance and standard deviation. Channel X, by contrast, shows a smaller spread, indicating that more of its data points sit close to its median.

So, in my opinion, Channel X could be the better choice: its median is almost the same as Channel Z's, but its variance is far lower, which makes it more dependable. (Channel Y varies even less, but its median TRP is much lower.) Who doesn't want a little normality in life?

Here are the Summary Statistics:

         Channel X   Channel Y   Channel Z
mean     0.50        0.31        0.84
std      0.31        0.24        0.90
min      0.00        0.00        0.00
25%      0.21        0.15        0.19        (first quartile)
50%      0.62        0.24        0.64        (median)
75%      0.79        0.39        1.10        (third quartile)
max      4.68        2.20        6.92
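
The quartiles above are enough to work out each channel's IQR and to check whether the maximum counts as an outlier. The sketch below assumes the usual boxplot convention of placing the upper fence at Q3 + 1.5 × IQR (the default in seaborn and matplotlib). The helper name `iqr_and_upper_fence` is hypothetical.

```python
# Quartiles and maxima copied from the summary statistics: (Q1, Q3, max).
quartiles = {
    "Channel X": (0.21, 0.79, 4.68),
    "Channel Y": (0.15, 0.39, 2.20),
    "Channel Z": (0.19, 1.10, 6.92),
}


def iqr_and_upper_fence(q1, q3, k=1.5):
    """Return the IQR and the upper fence Q3 + k * IQR (k = 1.5 is the common convention)."""
    iqr = q3 - q1
    return iqr, q3 + k * iqr


for channel, (q1, q3, max_trp) in quartiles.items():
    iqr, fence = iqr_and_upper_fence(q1, q3)
    is_outlier = max_trp > fence
    print(f"{channel}: IQR = {iqr:.2f}, upper fence = {fence:.2f}, "
          f"max = {max_trp:.2f}, max beyond fence: {is_outlier}")
```

The IQRs come out at 0.58, 0.24, and 0.91 for Channels X, Y, and Z. The upper fences are roughly 1.7, 0.75, and 2.5, so the maximum of every channel lies well beyond its fence, which is consistent with the high outliers seen in the boxplots.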

So, here we get a pretty comprehensive look at this data, which demonstrates how useful boxplots are for comparing sets of observations, and how the mean can be misleading in certain situations.

I know what you're thinking! Boxplots can hide some aspects of a distribution's shape, and histograms do a better job of displaying shape. Yes, but that's for another day!

PS: I used Python with Seaborn and Pandas to analyze the data and create the graph. Also, I haven't gone into the theoretical details of boxplots; you can visit the Khan Academy tutorials to brush up on the concept.

#statistic #datascience #data #stats #dataanalytics #python

