Viz 101 - Scatterplots
My Visualizations (Viz) 101 series is intended for people who want to work in analytics or are just starting out. I will provide you with a description of many different types of visualizations, what they are best used for, and a technical guide on how to create them in Excel (because pretty much everyone uses Excel and has a copy available) and Tableau (because it's an amazing BI tool and Tableau Public is free). In each article I will walk you through a scenario with real-world, publicly available data to provide you with a framework of how you could answer similar questions on your own using various techniques and visualizations.
Why should you listen to me? Here are my credentials: I have a master's degree in Mathematical Finance. I've used Excel for over a decade and while I have used Tableau for considerably less time, I am a Tableau Desktop Certified Associate. I've been working for the past four years as an analyst and data architect for an academic medical center. I live and breathe data and how it is handled and presented is something that I am passionate about. Good enough? Great. Let's get going.
Scatterplots
Scatterplots are good for comparing things on two dimensions. This could be to see if the two dimensions are related (think age and height of a person) or to compare multiple individual items on two dimensions (think sales and profit of inventory).
Scenario and Data Retrieval
I am interested in looking at the weather in South Carolina (SC) and seeing if there is a relationship between the amount of rainfall and the average temperature of a given area (each weather station). I am going to use the High Plains Regional Climate Center's CLIMOD located here to obtain this data.
Because I want to get all available weather stations in SC, I need to select Multi-Station and then Data for Date Range, so I can get the amount of data I want (one month).
After making those selections, the Options selection automatically opens up. I need to select HTML, so I can easily copy the data into Excel. Then I enter 2021-07-01 for my start date and 2021-07-31 for my end date to get the month of July. First, I will get the average temperature, so I have to change the variable to Avg temp and just leave the Summary on mean. I also chose to the leave the default Allowable missing days at 5. This setting identifies weather stations where the data was not recorded for 5 or more days during the selected time period.
All that is left now is under the Station/Area selection to select State and South Carolina.
Now click go and copy the contents of the table into Excel. I have to leave the top row blank in Excel so I can type the headers as I am unable to copy them.
Now, I just repeat the steps above but for the Variable I enter Precipitation and leave the Summary on sum. I then copy that data into a separate sheet in the same Excel workbook that I copied the average temperatures and name the tabs "Temp" and "Rain".
Scatterplots in Excel
In my previous post I aggregated raw data for a pie chart. In this case, the data I retrieved is already aggregated, so even though I am using my raw data, it is still aggregated. Scatterplots can be utilized for either aggregated or non-aggregated data. In that way they are a bit different than other types of graphs as most of the time, bar graphs, line graphs, and pie charts are all using aggregated data.
As I get started, I immediately notice that there are a lot more weather stations that reported rain (589) than those that reported temperatures (97). I need to see how many of the weather stations I have reported both rain and temperature (I need two dimensions to use a scatterplot). Since there are less weather stations that reported temperature, I will start with that and see how many of those reported rain as well. To do this, I am going to write a formula. I first name Column D "Rain" and then in cell D2, enter the formula =VLOOKUP(A2,Rain!A$1:C$589,2,FALSE). This may look complicated, but what it is doing is taking the name of the weather station in cell A2, comparing that to the list on the other tab called "Rain", if there is an exact match, it grabs the value in the second column, which is the Total Precipitation. Then I just need to copy that formula to the bottom of my data. Any time the station name on my Temp worksheet does not match with one from the Rain worksheet, the formula returns the #N/A error. Fortunately, there appear to only be seven from the Temp worksheet that do not match the Rain worksheet, so I have a lot of data points! The M value means the weather station missed more than 5 days, so I will need to exclude those as well.
To exclude the #N/A and M values, I need to add a filter. In Excel, all I need to do is navigate to the Data tab, click anywhere in my header row, and then click the Filter button on the ribbon. This creates a dropdown button next to each header.
In the dropdowns for Mean Avg Temperature and Rain, I can uncheck #N/A and M.
Now I'm ready to make my scatterplot with 70 good values. To do this, I simply highlight both columns, navigate to Insert and select Scatterplot similar to the way I did when creating the line chart for stock prices in one of my previous articles. If I go ahead and change the title to something more meaningful I end up with this chart.
领英推荐
Now in previous posts, the meaning of the axes was pretty obvious and did not really require explanation. Here, however, it is a little more difficult to tell which is which. If I think about, I may realize that it makes more since for the average temperate in July to be around 80 (in Fahrenheit) and that 80" of rain in month would be a lot, but I can make the graph easier to interpret by simply adding axis titles. To do that, I just click on the graph, hit the +, and check Axis Titles.
Then, I simply double click on the Axis Titles to rename them, add a subtitle to the chart for more context, and I have a scatterplot that shows absolutely no correlation between average temperature and total rainfall. For fun, I could repeat this with a year's worth of data (2020-07-01 - 2021-06-30) and I would find what appears to be a slight negative correlation. Meaning, the more rain, the lower the average temperature. Sometimes, more data is required!
Scatterplots in Tableau
Now, I want to show how to do the same thing, but in Tableau. If you do not have Tableau Public, you can get it for free here. The only catch to Tableau Public is that if you want to save your work, you must publish it to the public. This means you do not want to use Tableau Public for any proprietary, private, or otherwise restricted data. In all my examples, I will be using publicly available data, so that will not be an issue for this series. Alternatively, you may purchase a Tableau Creator license here (for $840 annually) or use a free, 14-day trial here which will allow you to save workbooks without posting them publicly.
The first step in Tableau is connecting to my data set, which is in .xlsx format (Excel). Conveniently, that's the first option to Connect to under To a File. Simply navigate through the File Explorer that opens as you would to open any other file. You've done this before, and so have I.
Now, I need to drag both the Rain and Temp worksheets onto the canvas in Tableau and establish their relationship. A relationship in Tableau is kind of like a join and it allows Tableau to utilize both tables by connecting them through a common field or multiple fields. This is same procedure as I did with the line charts in my previous article. In this case, I will use the Name field to define the relationship.
To use a scatterplot in Tableau, my fields must be set as measures. Tableau has defined them as dimensions because there is a mixture of numeric values for and the string "M" for weather stations that missed more than 5 days. All I need to do is right-click on Mean Avg Temperature and Total Precipitation and tell Tableau to Convert to Measure and to Number (decimal).
Now, all that's left is to make the scatterplot. If I drag the Total Precipitation and Mean Avg Temperature to the Rows and Columns respectively, I get a graph that only has one point.
Well, for starters, the formula in Columns and Rows has CNTD in front of it. That means tableau is doing a distinct count of the data behind those two fields. This isn't what I want. I want the to show the average and total (sum) like my fields say! To do that, I just select the dropdown on each pill and change it to the appropriate aggregate.
Hang on, I still only have one mark. Why? Well, average and sum are still aggregates, just like the count distinct. That means Tableau is aggregating my data for me. To bring in more detail, I need to add the Name field to the Details card of my graph.
Now, all that's left is for me to update my title and axis labels.
Summary
You now have a taste for retrieving online data to answer a question and how to visualize it in Excel and Tableau using a scatterplot. Here are three things to keep in mind with scatterplots:
I hope you found this helpful and if you did please consider liking or sharing the article. Stay tuned for future articles in this series.
MBA, CPC, CPAR, CRCR, CSBI
3 年Great lesson Andrew. I remember doing these when working on my MBA.?