Seeing is Believing: Visualizing Data for Better Analytics
When people say they love data, they usually mean they love data visualization. They rarely want to write endless SQL queries or learn how maximum likelihood estimation fits a logistic regression model. They want to see the data in a format that allows for easy interpretation.
Although poor data visualization can woefully mislead, good data visualization is an excellent way to understand data.
In my experience as a data analyst, visualization is often left until after the analysis is done to support the data narrative. The analysis is usually driven by descriptive and inferential summary statistics. What is the average house price? What is the standard deviation? What is the standard error? How many observations do we have? Does the ANOVA test report significant differences? Are the assumptions largely met?
These are very useful outputs, but they have their limitations. Many make assumptions about the data that do not hold, leading to conclusions that are not grounded in truth.
My main concern is that you are seeing only part of the picture and are, to some degree, flying blind.
So, how can we better understand our data without simply looking at the raw numbers or relying entirely on summary statistics?
Here are my thoughts. And, for the many that are already doing this, please share your experiences in the comments!
Know Your Data Distributions
I use an example from the “House Sales in King County, USA” dataset from Kaggle. In this case, we are trying to predict house prices. So, what does the data look like? A good first step is a histogram, so we can see how these prices are distributed.
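A minimal sketch of how this might look in R with ggplot2, assuming the Kaggle CSV has been saved locally as kc_house_data.csv (the file name is a placeholder for wherever you downloaded it):

```r
library(tidyverse)

# Load the Kaggle export; the file name is an assumption about where you saved it.
kc <- read_csv("kc_house_data.csv")

# Histogram of raw sale prices to show the shape of the distribution.
ggplot(kc, aes(x = price)) +
  geom_histogram(bins = 50) +
  labs(x = "Sale price (USD)", y = "Count")
```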
The first thing we notice is that there are some outliers causing the distribution to have a very long tail. This is useful to know but is obstructing our view. Let’s remove these extremely expensive houses and see what happens to our output.
This looks a lot better now. We can see that the mean (red dotted line) has been dragged out by outliers and skew. The data is clearly not normally distributed.
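A sketch of the trimmed version; the $2m cut-off is purely an illustrative choice:

```r
# Trim the extreme tail; the $2m threshold is an illustrative choice, not a rule.
kc_trim <- kc %>% filter(price < 2e6)

ggplot(kc_trim, aes(x = price)) +
  geom_histogram(bins = 50) +
  # Dotted red line marks the mean, which the skew drags to the right of the median.
  geom_vline(aes(xintercept = mean(price)), colour = "red", linetype = "dotted") +
  labs(x = "Sale price (USD)", y = "Count")
```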
Density plots are an alternative to histograms which can also be used to great effect. Rather than getting counts, we get the shape of the full data distribution.
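In ggplot2, this is a one-geom change from the histogram above:

```r
# Density plot of the same trimmed prices: shape of the distribution, not counts.
ggplot(kc_trim, aes(x = price)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  labs(x = "Sale price (USD)", y = "Density")
```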
What have we learned about the distribution of housing prices? It is positively skewed and has many outliers. The median value is $450,000 but the mean is $540,088. This immediately raises a question: which number is more important? If we cut the data by the number of bedrooms in the house, do we compare means or medians?
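Both numbers come straight from the raw data:

```r
kc %>%
  summarise(median_price = median(price),  # 450,000
            mean_price   = mean(price))    # ~540,088
```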
Although the central limit theorem will allow you to run t-tests and ANOVA on skewed data distributions (the sampling distribution of the mean tends towards normality once n > 30 or so), the mean is still a number that doesn’t tell us much here.
There is nothing typical about the average value of highly skewed data. The median, on the other hand, gives us a better view of what a typical house sells for.
So, I would recommend focusing on the median and, if using significance tests, relying on non-parametric tests. Summary statistics alone would have shown that the skew made the mean a problematic metric, but the visualization makes this far more intuitive.
What about categorical variables? Once again, visualizing the distribution makes life easier for us.
Most houses have between two and five bedrooms. We have one mansion with 33 bedrooms, which might be a genuine outlier or a data entry error. Either way, if someone has a house with 10 bedrooms or more, our dataset will not be much help, as we have very few data points at that end of the distribution.
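A simple bar chart of the counts makes this easy to see:

```r
# Treat bedrooms as categorical so every count gets its own bar.
ggplot(kc, aes(x = factor(bedrooms))) +
  geom_bar() +
  labs(x = "Number of bedrooms", y = "Count of houses")
```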
Know How the Variables Relate to Each Other
Bivariate relationships should always be visualized. Aside from making it easier to assess whether there may be non-linear relationships, it also provides an intuitive sense of the effect size.
This is not to say that the formal tests should be scrapped. Data visualization on its own is dangerous and should only inform what is tested, rather than being the test itself.
But let’s say we want to understand the relationship between the number of bedrooms and the median price of a house. The scatter plot clearly shows that as the number of bedrooms increases, the median house price increases.
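One way to sketch this is to aggregate to the median before plotting:

```r
# Aggregate to the median price per bedroom count, then plot the summaries.
kc %>%
  group_by(bedrooms) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = bedrooms, y = median_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "Median sale price (USD)")
```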
Additionally, you could create overlapping density plots to compare the full distributions.
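For example, restricting to the most common bedroom counts so the plot stays readable (the filter thresholds are illustrative):

```r
# Overlapping densities for the most common bedroom counts.
kc %>%
  filter(bedrooms %in% 2:5, price < 2e6) %>%
  ggplot(aes(x = price, fill = factor(bedrooms))) +
  geom_density(alpha = 0.3) +
  labs(x = "Sale price (USD)", y = "Density", fill = "Bedrooms")
```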
This shows that there are still many exceptions to the rule, as would be expected. The summary statistics painted a very simple picture of median house prices rising with the number of bedrooms; the more nuanced interpretation is that this is a tendency subject to considerable variation, which highlights the need for proper inferential tests to assess whether a difference really exists.
What about two continuous variables? In this case, we are looking at the living space in square feet and the value of the property.
The initial plot is not very clear. Most values are crammed into the bottom-left corner, while the rest of the plot stretches out to accommodate the outliers. It seems to suggest a positive relationship, but it’s messy.
If, however, we make each data point partly transparent, we see a lot more. We can also add in a smooth spline to give us an initial intuition for whether we have a linear or non-linear relationship.
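A sketch of both tweaks together:

```r
ggplot(kc, aes(x = sqft_living, y = price)) +
  # Transparency reveals density where thousands of points overlap.
  geom_point(alpha = 0.1) +
  # Default smoother (a GAM at this sample size); for intuition only, not a fitted model.
  geom_smooth() +
  labs(x = "Living space (sq ft)", y = "Sale price (USD)")
```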
We aren’t estimating anything yet; this is just something that helps us come up with ideas.
The relationship seems to be positive and most likely linear (the curvature is probably overfitting in the sparser regions of the feature space), but it would be worth testing whether a non-linear term outperforms a linear one when building the model.
Geospatial Data
The housing dataset contains longitude and latitude data, which can be used to create maps. Do we have houses from an entire city? Is it only certain neighborhoods? Is it a whole state?
The initial map shows that the data is confined to one small pocket of the US.
Let’s zoom in so we have a better view.
Any realtor worth their salt will tell you that location matters. So, which locations are associated with higher housing prices? This is a very easy thing to visualize.
As can be seen, the south is associated with lower house prices, but there is a reasonable amount of variation.
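A minimal sketch using the lat and long columns already in the dataset; log-scaling the colour stops the skewed prices from washing out the palette:

```r
# Plot each sale at its coordinates; log10(price) keeps the skew from dominating the colours.
ggplot(kc, aes(x = long, y = lat, colour = log10(price))) +
  geom_point(alpha = 0.3, size = 0.7) +
  coord_quickmap() +  # keep the aspect ratio roughly map-like
  labs(x = "Longitude", y = "Latitude", colour = "log10(price)")
```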
Which Tool Should I Use?
This is up to you and depends on your skill set. I personally recommend avoiding an over-reliance on tools like Excel and Tableau which can be quite restrictive compared to open-source tools.
R can be daunting for non-coders, but once you get the hang of it, it is an incredibly powerful way of exploring a dataset.
If you do use R for data visualization, make sure you develop a clear coding style (the “Tidyverse” packages are excellent, and the magrittr pipe (“%>%”) was a game changer for me) and build your own functions where appropriate to limit the potential for keystroke errors. The advantage of these open-source programs is that they are entirely configurable, so you can develop your own visualization style and plot things more creatively. Most importantly, they allow you to both wrangle and visualize data in the same environment, making them all-purpose tools.
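A minimal sketch of what I mean, assuming a recent tidyverse; the function name and defaults are purely illustrative:

```r
# A reusable histogram helper so the same styling isn't retyped (and mistyped)
# for each variable explored. Uses tidyverse "curly-curly" to pass the column.
plot_trimmed_histogram <- function(data, var, upper = 2e6, bins = 50) {
  data %>%
    filter({{ var }} < upper) %>%
    ggplot(aes(x = {{ var }})) +
    geom_histogram(bins = bins) +
    labs(y = "Count")
}

kc %>% plot_trimmed_histogram(price)
```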
Seeing is Believing
To conclude, next time you are running an analysis, I hope you find the time to visualize your data first so you have a better handle on what you are working with. Whether you are building a dashboard or a machine learning model, data visualization is a key part of the analytical process.