Viz 101 - Column Charts
My Visualizations (Viz) 101 series is intended for people who want to work in analytics or are just starting out. I will provide you with a description of many different types of visualizations, what they are best used for, and a technical guide on how to create them in Excel (because pretty much everyone uses Excel and has a copy available) and Tableau (because it's an amazing BI tool and Tableau Public is free). In each article I will walk you through a scenario with real-world, publicly available data to provide you with a framework of how you could answer similar questions on your own using various techniques and visualizations.
Why should you listen to me? Here are my credentials: I have a master's degree in Mathematical Finance. I've used Excel for over a decade and while I have used Tableau for considerably less time, I am a Tableau Desktop Certified Associate. I've been working for the past four years as an analyst and data architect for an academic medical center. I live and breathe data and how it is handled and presented is something that I am passionate about. Good enough? Great. Let's get going.
Column Charts
Column (vertical) charts, are common visualizations used for quick comparisons between categories or a dimension. When looking across a category, column charts can make it easy to identify the minimum and maximum or make relative comparisons.
Scenario and Data Retrieval
I have a family member who has COPD and I want to know which area hospital(s) have better than expected readmission rates, meaning that the hospital readmits patients with COPD less often than expected.
Let's say I am new to the area, so the first thing I need to do is determine which hospitals are nearby. I can do this by going to Medicare's website for hospital comparisons. I live in the Augusta, GA area, so I am going to use 30912 for my zip code.
By default, CMS searches for hospitals within 50 miles of the entered zip code, which is fine with me; however, there are a few different types of hospitals listed and I am only interested in acute care hospitals. I am not a part of a part of the military, so I am not going to select the acute care hospitals for VA or DoD.
Now I have my list of hospitals: AU Medical Center, University Hospital, Doctors Hospital, Aiken Regional Medical Center, Burke Medical Center, and University McDuffie County Regional Medical Center. To get the data I want on COPD readmission rates, I am going to navigate to a different website. CMS has an enormous amount of publicly available data on health care providers. Simply go to their website and for my purposes, I will navigate to the Hospital Compare data.
Just for comparing hospitals, CMS has over 70 data sets available, but they provide a way to search through them. I can easily find what I am looking for by searching for "readmission". The top result is Hospital Readmission Reduction Program, which contains what I want. On the page that opens, there is a filter section on the right side which I use to filter to my list of hospitals (see above) and I can filter to the information of interest for COPD readmission rates (READM-30-COPD-HRRP).
I can preview the data right on CMS' website and I see that Burke County has several fields with N/A (Not Applicable), so either they did not report this information to CMS or they had too few patients for CMS to report it while maintaining patient anonymity. No matter, I will just export all the data anyway by selecting "CSV for Excel".
Uh-oh! Turns out Doctors Hospital and University Hospital are not unique names in the CMS data, so I will delete the data for the hospitals outside my area to clean up my data by holding CTRL and left clicking the row numbers. I can then right-click and select "Delete" to remove the entire row. Now I am left with the six correct results. Always preview your data before creating a visualization. Here, simply looking at the the number of rows clued me into the fact that I had more than the expected six hospitals.
For what I am doing, I don't need all of these columns of data, but there are not that many and my data set is small, so I am going to leave them there. When dealing with large data sets, limiting the columns or fields you include may help to improve performance. I have made all the changes I need. Now, I am ready to create a visualization.
Column Charts in Excel
To create a new, blank column chart in Excel, simply navigate to Insert > Insert Column or Bar Chart > Clustered Column as shown below.
Now I have a blank chart. To add data, right-click on the blank chart and select "Select Data..." which will open the Select Data Source dialog box.
On the Select Data Source dialog box, click on "Add" under the Legend Entries (Series). This is where to add metrics or the numerical data. In this case, the Excess Readmission Ratio which is an observed over expected measure. This means if the ratio is above one then the observed was higher than expected. In the case of readmissions, a higher ratio means more readmissions occurred than expected, i.e. the hospital performed worse than expected. I will add the column header to the series name and then add the values for each row in the series values. Now I have my data on the chart.
But my chart does not display the hospital names on it. Excel automatically labeled the series with 1, 2, 3... because I have not assigned any axis labels yet. To do that, from the Select Data Source dialog, I click on "Edit" under the Horizontal (Category) Axis Labels. In the same way that I selected my data, I will now select the facility names. Note that I do not need to select a series name like I did for my data selection. Voila! Now my chart displays the hospital names and their readmission ratios.
Even though the values are relatively close, we can fairly easily tell that Doctors Hospital is the lowest (best performing) and University Hospital is the highest (worst performing). In fact, University looks to be the only area hospital with a worse than expected (higher than 1) readmission ratio for COPD patients. We also see that Excel will still show a place holder for Burke County even though it had no numerical value. In some scenarios this may be helpful to show that I did not leave out something, but for me, I want to hide it since it has no data. Excel makes this very simple. All I have to do is right-click on the row number (5) for Burke County and select "Hide". Not only does this hide the data in my worksheet, but it also hides it from the graph. I still have the data (as opposed to if I deleted it), so if I need to go back later and see it, I can un-hide it.
In addition to hiding the Burke County result, Excel also automatically changed the y-axis range from 0 - 1.2 to 0.9 - 1.06. This zoomed in view makes the differences much more apparent and easier to compare the hospitals, but it also visually exaggerates the differences. It no longer shows that the hospitals are in tight competition with Doctors pulling ahead of the others and University falling behind. In this instance, I find this view undesirable and am going to adjust the y-axis to include 0 for a better frame of reference.
To change the y-axis range, I simply right-click on the y-axis and select "Format Axis".
领英推荐
On the Format Axis panel, I enter 0 for the Minimum Bounds in the Axis Options and Excel automatically changes the Maximum back to the original 1.2. I can edit this if I want, but for now, I will leave it as is.
Now my chart is looking great, but the default title is not very descriptive. To edit the title, left-click on it twice. I want my title to be very informative, so I have included information about my filters (Augusta, GA Area), the metric being viewed (COPD Excess Read. Ratios), the applicable time period (through 6/30/2018, which is the "End Date" field in my data set), how to read it (Lower is better), and where the data came from (data.medicare.gov).
As noted previously, it is fairly easy to pick out the lowest and highest bars, but how those in the middle compare to each other is not quite as clear. One way to make this easier for the consumer is to apply a sort to the graph. Since lower is better, I am going to sort this graph in ascending order (lowest to highest). To do that, I am going to highlight my measure (Excess Readmission Ratio), navigate to the Data tab, and click "Sort Smallest to Largest". This will sort my data and the graph at the same time. It is important to remember when sorting (if you select only a single column as I did below), to select "Expand current selection" so that your entire data set will be sorted by the field you have selected. Otherwise, just the selected field will be sorted and the values will no longer correspond to the facility names (or any other field) in your data set.
The consumer can now easily tell the rank of each area hospital: #1 Doctors, #2 AU, #3 University McDuffie, #4 Aiken Reg., and #5 University.
Now assume that I am an employee of University's health system which includes University Hospital and University Hospital McDuffie. I want to color my hospitals one color and all the competitors in another color. There is not a nice way to do this in Excel, but I can manually select each bar and color them through Format > Shape Fill. Unfortunately, University is clearly not an area leader for this metric and that is readily made apparent when the University hospitals are highlighted in a different color than its competitors.
Column Charts in Tableau
Now, I want to show how to do the same thing, but in Tableau. If you do not have Tableau Public, you can get it for free here. The only catch to Tableau Public is that if you want to save your work, you must publish it to the public. This means you do not want to use Tableau Public for any proprietary, private, or otherwise restricted data. In all my examples, I will be using publicly available data, so that will not be an issue for this series. Alternatively, you may purchase a Tableau Creator license here (for $840 annually) or use a free, 14-day trial here which will allow you to save workbooks without posting them publicly.
The first step in Tableau is connecting to my data set, which is in .csv format (a type of text file). Conveniently, that's the second option to Connect to under To a File. Simply navigate through the File Explorer that opens as you would to open any other file. You've done this before.
Tableau previews the data in a format that looks very similar to an Excel worksheet. If I wanted to I could change field names here, hide unwanted fields, or change field data types, and I see that Tableau has set the Excess Readmission Ratio to a text value denoted by an "Abc" above the field name. To change this to a number, all I have to do is click on the "Abc" and change it to "#" in the drop down. Note that Tableau changes the N/A value from Burke County to a null because it cannot store the text value as a number.
To begin making a visualization I click on the worksheet tab "Sheet 1". If you have not used a Business Intelligence (BI) tool before, it can seem a little strange to be working in an environment where the data is not visible, like in Excel. However, creating the visualization will be much simpler in Tableau. To make a bar chart in Tableau, all I have to do is drag the Facility Name field from my dimensions to Columns and the Excess Readmission Ratio measure to Rows. Done.
Now, just like before I want to get rid of the Burke County value since it is null. I could filter it out, similar to what I did in Excel when I hid the row, but there is an easier and more general solution that applies to all null values instead of dealing with them individually. At the bottom right of the chart I just made, there is a little flag that pops up when there are null values. In this case, it displays "1 null". If I click on that flag, Tableau presents me with the option to filter out all null values by clicking "Filter data"
Unlike Excel, Tableau kept my y-axis at 0, so I do not have to adjust that. However, my current view looks a little squished. Almost all of the facility names are shortened with ellipses. I can change the fit of the view from Standard to Entire View to provide some more room so the facility names are not cut off.
I'll now change the title of my worksheet to be the same information I entered in my Excel chart title. I'll then sort the chart in ascending order by clicking the ascending order sort button on the toolbar.
Now for that final touch of highlighting the University hospitals. Like Excel, I could color by facility name and manually change the color of each facility to what I want, but unlike Excel, I can create a new dimension and color by it instead. This is a much more robust method because in the future if I want to create this graph again or expand on this data set to include the other metrics instead of just the one on COPD, I can reuse the dimension I create now to apply the same colors over and over again instead of having to manually color them like I would in Excel.
I am going to create an In System dimension with the calculation: IF CONTAINS([Facility Name],"UNIVERSITY") THEN "In System" ELSE "Competitor" END. Then all I have to do is drag my new dimension to the Color for the chart and I'm done. Note, the CONTAINS function is case sensitive.
Summary
You now have a taste for retrieving online data to answer a question and how to visualize it in Excel and Tableau using a column chart. Here are three things to keep in mind with column charts:
I hope you found this helpful and if you did please consider liking or sharing the article. Stay tuned for future articles in this series.
Special thanks to my sister-in-law, Rebekah Smith, who proofread this article.
MBA, CPC, CPAR, CRCR, CSBI
3 年Thank you. I will definitely check it out
MBA, CPC, CPAR, CRCR, CSBI
3 年Great lesson