Seeing is Believing: Visualizing Data for Better Analytics
When people say they love data, they usually mean they love data visualization. They rarely want to write endless SQL queries or learn how maximum likelihood estimation fits a logistic regression model. They want to see the data in a format that allows for easy interpretation.
Although poor data visualization can woefully mislead, good data visualization is an excellent way to understand data.
In my experience as a data analyst, visualization is often left until after the analysis is done to support the data narrative. The analysis is usually driven by descriptive and inferential summary statistics. What is the average house price? What is the standard deviation? What is the standard error? How many observations do we have? Does the ANOVA test report significant differences? Are the assumptions largely met?
These are very useful outputs, but they have their limitations. Many make assumptions about the data that do not hold, leading to conclusions that are not grounded in truth.
My main concern is that you are seeing only part of the picture and are, to some degree, flying blind.
So, how can we better understand our data without simply looking at the raw numbers or relying entirely on summary statistics?
Here are my thoughts. And, for the many that are already doing this, please share your experiences in the comments!
Know Your Data Distributions
I use an example from the “House Sales in King County, USA” dataset from Kaggle. In this case, we are trying to predict house prices. So, what does the data look like? A good first step is a histogram, so we can see how these prices are distributed.
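A minimal sketch of how this might look in R with ggplot2, assuming the Kaggle CSV has been saved locally as kc_house_data.csv (the file name is a placeholder for wherever you downloaded it):

```r
library(tidyverse)

# Load the Kaggle export; the file name is an assumption about where you saved it.
kc <- read_csv("kc_house_data.csv")

# Histogram of raw sale prices to show the shape of the distribution.
ggplot(kc, aes(x = price)) +
  geom_histogram(bins = 50) +
  labs(x = "Sale price (USD)", y = "Count")
```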
The first thing we notice is that there are some outliers causing the distribution to have a very long tail. This is useful to know but is obstructing our view. Let’s remove these extremely expensive houses and see what happens to our output.
This looks a lot better now. We can see that the mean (red dotted line) has been dragged out by outliers and skew. The data is clearly not normally distributed.
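A sketch of the trimmed version; the $2m cut-off is purely an illustrative choice:

```r
# Trim the extreme tail; the $2m threshold is an illustrative choice, not a rule.
kc_trim <- kc %>% filter(price < 2e6)

ggplot(kc_trim, aes(x = price)) +
  geom_histogram(bins = 50) +
  # Dotted red line marks the mean, which the skew drags to the right of the median.
  geom_vline(aes(xintercept = mean(price)), colour = "red", linetype = "dotted") +
  labs(x = "Sale price (USD)", y = "Count")
```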
Density plots are an alternative to histograms which can also be used to great effect. Rather than getting counts, we get the shape of the full data distribution.
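In ggplot2, this is a one-geom change from the histogram above:

```r
# Density plot of the same trimmed prices: shape of the distribution, not counts.
ggplot(kc_trim, aes(x = price)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  labs(x = "Sale price (USD)", y = "Density")
```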
What have we learned about the distribution of housing prices? It is positively skewed and has many outliers. The median value is $450,000 but the mean is $540,088. This immediately raises a question: which number is more important? If we cut the data by the number of bedrooms in the house, do we compare means or medians?
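Both numbers come straight from the raw data:

```r
kc %>%
  summarise(median_price = median(price),  # 450,000
            mean_price   = mean(price))    # ~540,088
```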
Although the central limit theorem will allow you to run t-tests and ANOVA on skewed data distributions (the sampling distribution of the mean tends towards normality once n > 30 or so), the mean is still a number that doesn’t tell us much here.
There is nothing typical about the average value of highly skewed data. The median, on the other hand, gives us a better view of what a typical house sells for.
So, I would recommend focusing on the median and, if using significance tests, relying on non-parametric tests. Summary statistics alone would have shown that the skew made the mean a problematic metric, but the visualization makes this far more intuitive.
What about categorical variables? Once again, visualizing the distribution makes life easier for us.
Most houses have between two and five bedrooms. We have one mansion with 33 bedrooms, which might be a genuine outlier or a data entry error. Either way, if someone has a house with 10 bedrooms or more, our dataset will not be much help, as we have very few data points at that end of the distribution.
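A simple bar chart of the counts makes this easy to see:

```r
# Treat bedrooms as categorical so every count gets its own bar.
ggplot(kc, aes(x = factor(bedrooms))) +
  geom_bar() +
  labs(x = "Number of bedrooms", y = "Count of houses")
```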
Know How the Variables Relate to Each Other
Bivariate relationships should always be visualized. Aside from making it easier to assess whether there may be non-linear relationships, it also provides an intuitive sense of the effect size.
This is not to say that the formal tests should be scrapped. Data visualization on its own is dangerous and should only inform what is tested, rather than being the test itself.
But let’s say we want to understand the relationship between the number of bedrooms and the median price of a house. The scatter plot clearly shows that as the number of bedrooms increases, the median house price increases.
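One way to sketch this is to aggregate to the median before plotting:

```r
# Aggregate to the median price per bedroom count, then plot the summaries.
kc %>%
  group_by(bedrooms) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = bedrooms, y = median_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "Median sale price (USD)")
```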
Additionally, you could create overlapping density plots to compare the full distributions.
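For example, restricting to the most common bedroom counts so the plot stays readable (the filter thresholds are illustrative):

```r
# Overlapping densities for the most common bedroom counts.
kc %>%
  filter(bedrooms %in% 2:5, price < 2e6) %>%
  ggplot(aes(x = price, fill = factor(bedrooms))) +
  geom_density(alpha = 0.3) +
  labs(x = "Sale price (USD)", y = "Density", fill = "Bedrooms")
```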
This shows that there are still many exceptions to the rule, as would be expected. The summary statistics painted a very simple picture of median house prices rising with the number of bedrooms; the more nuanced interpretation is that this is a tendency subject to considerable variation, which highlights the need for proper inferential tests to assess whether a difference really exists.
What about two continuous variables? In this case, we are looking at the living space in square feet and the value of the property.
The initial plot is not very clear. Most values are crammed into the bottom-left corner, while the rest of the plot stretches out to accommodate the outliers. It seems to suggest a positive relationship, but it’s messy.
If, however, we make each data point partly transparent, we see a lot more. We can also add in a smooth spline to give us an initial intuition for whether we have a linear or non-linear relationship.
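A sketch of both tweaks together:

```r
ggplot(kc, aes(x = sqft_living, y = price)) +
  # Transparency reveals density where thousands of points overlap.
  geom_point(alpha = 0.1) +
  # Default smoother (a GAM at this sample size); for intuition only, not a fitted model.
  geom_smooth() +
  labs(x = "Living space (sq ft)", y = "Sale price (USD)")
```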
We aren’t estimating anything yet; this is just something that helps us come up with ideas.
The relationship seems to be positive and most likely linear (the curvature is probably overfitting in the sparser regions of the feature space), but it would be worth testing whether a non-linear term outperforms a linear one when building the model.
Geospatial Data
The housing dataset contains longitude and latitude data, which can be used to create maps. Do we have houses from an entire city? Is it only certain neighborhoods? Is it a whole state?
The initial map shows that the data is confined to one small pocket of the US.
Let’s zoom in so we have a better view.
Any realtor worth their salt will tell you that location matters. So, which locations are associated with higher housing prices? This is a very easy thing to visualize.
As can be seen, the south is associated with lower house prices, but there is a reasonable amount of variation.
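A minimal sketch using the lat and long columns already in the dataset; log-scaling the colour stops the skewed prices from washing out the palette:

```r
# Plot each sale at its coordinates; log10(price) keeps the skew from dominating the colours.
ggplot(kc, aes(x = long, y = lat, colour = log10(price))) +
  geom_point(alpha = 0.3, size = 0.7) +
  coord_quickmap() +  # keep the aspect ratio roughly map-like
  labs(x = "Longitude", y = "Latitude", colour = "log10(price)")
```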
Which Tool Should I Use?
This is up to you and depends on your skill set. I personally recommend avoiding an over-reliance on tools like Excel and Tableau which can be quite restrictive compared to open-source tools.
R can be daunting for non-coders, but once you get the hang of it, it is an incredibly powerful way of exploring a dataset.
If you do use R for data visualization, make sure you develop a clear coding style (the “Tidyverse” packages are excellent, and the magrittr pipe (“%>%”) was a game changer for me) and build your own functions where appropriate to limit the potential for keystroke errors. The advantage of these open-source programs is that they are entirely configurable, so you can develop your own visualization style and plot things more creatively. Most importantly, they allow you to both wrangle and visualize data in the same environment, making them all-purpose tools.
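A minimal sketch of what I mean, assuming a recent tidyverse; the function name and defaults are purely illustrative:

```r
# A reusable histogram helper so the same styling isn't retyped (and mistyped)
# for each variable explored. Uses tidyverse "curly-curly" to pass the column.
plot_trimmed_histogram <- function(data, var, upper = 2e6, bins = 50) {
  data %>%
    filter({{ var }} < upper) %>%
    ggplot(aes(x = {{ var }})) +
    geom_histogram(bins = bins) +
    labs(y = "Count")
}

kc %>% plot_trimmed_histogram(price)
```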
Seeing is Believing
To conclude, next time you are running an analysis, I hope you find the time to visualize your data first so you have a better handle on what you are working with. Whether you are building a dashboard or a machine learning model, data visualization is a key part of the analytical process.