Dependent Plots
Welcome to this fifth article in our Designing charts like a stoic series!
Introduction
Today, we will cover dependent plots, which are plots that do not have any axis mapped to an independent variable. For example: scatter plots. These plots are used to visualize raw data, as opposed to data resulting from a PIVOT transformation. They are especially effective at visualizing relationships between two measures (dependent variables).
Terminology
The following terminology is used in all articles of this series:
- A visual can be either a chart, a map, or a diagram.
- A variable is a column in a spreadsheet or database table.
- A dimension is shorthand for an independent variable.
- A measure is shorthand for a dependent variable.
- A continuable variable is a discrete variable that could be made continuous.
History
Historical references will be provided for specific dependent plot types whenever available.
Structure
Dependent plots map dependent variables on both the horizontal and vertical axes. In some rare cases like the rug plot, a single axis is used (horizontal or vertical). Usually, the horizontal axis supports duplicate values (e.g. scatter plot, tick plot), but some dependent plots like the lollipop plot and mast plot only support unique values for the measure mapped onto the horizontal axis.
Use cases
The Financial Times Visual Vocabulary identifies five major use cases for dependent plots: correlation, deviation, distribution, change over time, and magnitude. Nevertheless, there are quite a few more worth mentioning, which makes this family of plots one of the most versatile.
Positive relationship
Two variables have a positive relationship when their correlation coefficient is close to 1. This can be visualized on a scatter plot where points tend to cluster around a rising straight line. On the following dataset, this correlation coefficient is approximately equal to 0.815, making GDP per capita and Happiness index meaningfully correlated.
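As a minimal sketch, and assuming that the correlation coefficient mentioned above is Pearson's r (the article does not name it explicitly), it could be computed as follows. The function name and the sample values are illustrative only and are not the dataset discussed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration (hypothetical, not the article's data).
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
happiness = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(gdp, happiness)  # close to 1: strong positive relationship
```

A value near 1 corresponds to points clustering around a rising line, a value near -1 to a falling line, and a value near 0 to no linear relationship.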
Negative relationship
Two variables have a negative relationship when their correlation coefficient is close to -1. This can be visualized on a scatter plot where points tend to cluster around a falling straight line. On the following dataset, this correlation coefficient is approximately equal to -0.787, making GDP per capita and infant mortality negatively correlated.
Non-linear relationship
Two variables have a non-linear relationship when they are associated but the rate of change of one with respect to the other is not constant. This can be visualized on a scatter plot where points tend to cluster around a smooth curve.
No relationship
Two variables have no relationship when their correlation coefficient is close to 0. This can be visualized on a scatter plot where points show no discernible pattern. On the following dataset, this correlation coefficient is approximately equal to -0.346, which indicates at most a weak negative association between GDP per capita and Gini coefficient.
Clusters
Two measures visualized with a scatter plot make it relatively easy to perform cluster analysis. Individual clusters can be visualized with colors, symbols, or outlines. The automatic detection of clusters is a non-trivial task, especially if the number of clusters is not known in advance.
Gaps
Gaps within a univariate dataset can be visualized with a rug plot. Such gaps might be indicative of missing data, or might highlight an interesting pattern. On the following visual, we observe the latter.
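The gaps that a rug plot reveals visually can also be located numerically. The sketch below sorts the values and ranks the distances between neighbors. The function name and the sample data are hypothetical and only serve as an illustration.

```python
def largest_gaps(values, top=3):
    """Return the `top` widest gaps between consecutive sorted values.

    Each gap is reported as a (width, lower, upper) tuple, widest first.
    """
    ordered = sorted(values)
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    return sorted(gaps, reverse=True)[:top]


# Illustrative data with one obvious hole between 4 and 12.
sample = [1, 2, 3, 4, 12, 13, 14, 15]
widest = largest_gaps(sample, top=1)
```

Whether a wide gap signals missing data or a genuine pattern still has to be judged in context.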
Outliers
Outliers within a univariate dataset can be visualized with a rug plot as well.
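One common way to flag candidate outliers before highlighting them on a rug plot is Tukey's interquartile-range rule. This is an assumption on our part, since the article does not prescribe a detection method. The sketch below relies on statistics.quantiles from the Python standard library; the function name and the data are illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up measurements with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 13, 12, 95]
flagged = iqr_outliers(readings)
```

The multiplier k = 1.5 is the conventional default; a larger value flags only more extreme points.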
Deviation
Deviation from a baseline can be visualized with any dependent chart.
Distribution
The rug plot is a space-efficient alternative to the histogram for visualizing the distribution of a numeric variable. It is particularly effective for identifying clusters, gaps, and outliers. Unlike the histogram, however, it cannot be used to read off the number of observations falling in each bin of the distribution.
Magnitude
Magnitude can be visualized with any dependent plot.
Change over time
A bivariate dependent chart with a temporal variable mapped to its horizontal axis is a great way to visualize change over time. The following batton plot goes a step further: it was produced by converting a bivariate time series into a differential time series, thereby combining the benefits of dependent plots and differential charts.
Change over space
A bivariate dependent chart with a geospatial variable mapped to one of its axes is a great way to visualize change over space. Furthermore, this type of visualization is not limited to geospatial variables, and can be used instead with any non-temporal variable.
Types
Dependent plots can be classified in relation to the types of variables mapped to their horizontal and vertical axes. This systematic approach makes it possible to design plot types that are rarely used, or even discover new plot types that are genuinely original.
Lollipop plot
The lollipop plot is recommended for the following configuration:
- Horizontal axis: continuous measure with unique values
- Vertical axis: intensive measure
Mast plot
The mast plot is recommended for the following configuration:
- Horizontal axis: continuous measure with unique values
- Vertical axis: extensive measure
Strip plot
The strip plot is recommended for the following configuration:
- Horizontal axis: discrete measure
- Vertical axis: continuous measure
It is a more intuitive (yet less information-rich) alternative to the box plot.
Rug plot
The rug plot is recommended for the following configuration:
- Horizontal axis: continuous measure
- No vertical axis
The rug plot is a univariate alternative to the strip plot.
The rug plot should not be confused with the carpet plot.
Scatter plot
The scatter plot is recommended for the following configuration:
- Horizontal axis: continuous measure with possible duplicate values
- Vertical axis: continuous measure
This plot's name is sometimes spelled scatterplot.
The scatter plot is certainly the most important plot of all the ones covered in this article.
The origins of the scatter plot are not entirely clear, but this paper by Michael Friendly & Daniel Denis suggests that the first one was produced in 1833 by the English scientist John Frederick W. Herschel. This other article by Dan Kopf provides some additional context.
By default, a scatter plot should always be designed using a circle mark. This mark is the simplest of all after the point mark, and makes it easier to visualize closely-located data points. Alternatively, a point mark can be used. The square mark might look better to some, but brings no benefits over the point mark. Other symbols such as triangle, diamond, or cross should be reserved for cases where an additional discrete measure of low cardinality (up to 5) must be visualized using symbols.
Batton plot
The batton plot is a differential plot recommended for the following configuration:
- Horizontal axis: continuous measure
- Vertical axis: intensive measures
Stick plot
The stick plot is a differential plot recommended for the following configuration:
- Horizontal axis: continuous measure
- Vertical axis: extensive measures
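The configurations listed above can be summarized as a small lookup table. The sketch below simply transcribes them; the dictionary and function names are hypothetical, and None stands for a missing vertical axis.

```python
# (horizontal axis, vertical axis) for each plot type, as listed above.
PLOT_CONFIGURATIONS = {
    "lollipop plot": ("continuous measure with unique values", "intensive measure"),
    "mast plot": ("continuous measure with unique values", "extensive measure"),
    "strip plot": ("discrete measure", "continuous measure"),
    "rug plot": ("continuous measure", None),
    "scatter plot": (
        "continuous measure with possible duplicate values",
        "continuous measure",
    ),
    "batton plot": ("continuous measure", "intensive measures"),
    "stick plot": ("continuous measure", "extensive measures"),
}


def plots_for(horizontal, vertical):
    """Names of the plot types recommended for a pair of axis mappings."""
    return [
        name
        for name, config in PLOT_CONFIGURATIONS.items()
        if config == (horizontal, vertical)
    ]


candidates = plots_for("discrete measure", "continuous measure")
```

An empty result for a given pair of variable types points to a configuration that this article does not cover.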
Options
Dependent plots offer a wide range of options that deserve to be reviewed carefully.
Color
The color of marks can be used to visualize an additional measure.
Color legends can be omitted whenever the meaning of the colors is obvious.
Size
The size of marks can be used to visualize an additional measure.
Symbol
The symbol of marks can be used to visualize an additional measure. This can make dependent plots like a scatter plot visually confusing though, and this technique should be used only when data points are naturally clustered. Otherwise, the use of color or tone should be preferred.
Aspect ratio
In some cases, the aspect ratio of dependent plots should be set carefully, especially when designing scatter plots for two congruent measures. This is especially important when the visual is designed to showcase a linear association between two variables. In that case, a 1:1 aspect ratio should be used.
Regression line
Whenever a dependent plot is designed to showcase an association between two dependent variables, a regression line should be drawn. If the association is not linear, a regression curve should be drawn instead, but this feature might not be supported by some data visualization tools.
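For the linear case, a regression line is commonly obtained with an ordinary least-squares fit. The sketch below assumes that method, which the article does not specify, and uses a hypothetical function name with made-up data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

The two endpoints of the line to draw are then (min(xs), slope * min(xs) + intercept) and (max(xs), slope * max(xs) + intercept).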
Highlight
Color can also be used to highlight certain data points like outliers.
Opacity
When multiple marks overlap on a dependent chart, opacity should be used to improve readability.
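To see why opacity helps, consider how identical semi-transparent marks accumulate when they are stacked on the same spot. Assuming standard "over" alpha compositing, which most rendering tools use, n marks of opacity alpha combine into an effective opacity of 1 - (1 - alpha)^n. The function name below is hypothetical.

```python
def stacked_opacity(alpha, n):
    """Effective opacity of n identical marks of opacity `alpha` drawn on top
    of one another, assuming standard 'over' alpha compositing."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - alpha) ** n


# Effective opacity for 1, 2, 5, and 10 overlapping marks at alpha = 0.2.
levels = [stacked_opacity(0.2, n) for n in (1, 2, 5, 10)]
```

Dense regions therefore appear darker than sparse ones, so lower opacity values suit plots with heavier overlap.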
Issues
Dependent plots can be affected by two main issues: overplotting and misinterpretation.
Overplotting
Overplotting can occur when too many data points are rendered within too small an area. As a result, the overlapping of marks makes it difficult to properly assess the association between measures. Three main techniques can be used to address this issue:
Misinterpretation
Correlation (association) should not be confused with causation. For more on this topic, please read The Book of Why, by Judea Pearl and Dana Mackenzie. While the automatic detection of correlations is relatively straightforward (for most datatypes at least), the automatic extraction of causality models from raw datasets largely remains an open issue.
Statistical plots
Dependent plots are frequently used in conventional statistical plots.
Bland-Altman plot
The Bland-Altman plot is made of a scatter plot and some reference lines.
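In the usual construction of this plot, each point is the mean of a pair of measurements plotted against their difference, and the reference lines mark the mean difference (bias) and the limits of agreement at bias ± 1.96 standard deviations. The sketch below follows that convention; the function name and the data are illustrative.

```python
import statistics


def bland_altman(a, b):
    """Points and reference lines for a Bland-Altman plot of paired measurements."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # horizontal axis
    diffs = [x - y for x, y in zip(a, b)]        # vertical axis
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                 # sample standard deviation
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
    }


# Made-up paired readings from two hypothetical instruments.
result = bland_altman([10, 12, 14, 16], [9, 13, 13, 17])
```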
Biplot
The biplot is made of two layered scatter plots.
Galbraith plot
The Galbraith plot is a beautiful scatter plot layered with a radial rug plot.
Partial regression plot
The partial regression plot is a scatter plot layered with a regression line.
Partial residual plot
The partial residual plot is a scatter plot layered with a regression curve.
Poincaré plot
The Poincaré plot is a scatter plot layered with an ellipse.
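The points of a Poincaré plot are conventionally built by pairing each value of a series with its successor. The sketch below shows that step only, not the fitting of the ellipse; the function name and the sample series are hypothetical.

```python
def poincare_points(series, lag=1):
    """Pair each value with the one `lag` steps later: (x[n], x[n + lag])."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return list(zip(series[:-lag], series[lag:]))


# Illustrative series of four consecutive measurements.
points = poincare_points([800, 810, 790, 805])
```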
Run chart
The run chart (also called run-sequence plot) is a simple point plot.
Volcano plot
The volcano plot can be produced with a scatter plot and a simple data transformation.
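In its usual form, the transformation maps each observation to the base-2 logarithm of its fold change on the horizontal axis and the negative base-10 logarithm of its p-value on the vertical axis. The sketch below assumes that convention, which the article does not spell out, and uses a hypothetical function name.

```python
import math


def volcano_point(fold_change, p_value):
    """Map (fold change, p-value) to (log2 fold change, -log10 p-value)."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return math.log2(fold_change), -math.log10(p_value)


# A fourfold increase with p = 0.001 lands in the upper-right region.
x, y = volcano_point(4.0, 0.001)
```

With this mapping, large changes spread out to the left and right, while statistically significant results rise toward the top.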
Alternatives
The term dependent plot encompasses a wide range of visuals. As such, there are no real alternatives to them, but some related visuals should be mentioned. Furthermore, it should be obvious that the set of dependent plots outlined in this article is not exhaustive, for the wide range of datatypes defined by Principia Data is an open invitation to data visualization practitioners to create datatype-specific visuals that were never produced before.
Geopoint map
The geopoint map is a geospatial alternative to the scatter plot.
Heat map
The heat map (or heatmap) can be thought of as a discretized version of a scatter plot, for which the two dependent variables mapped to the horizontal and vertical axes have been discretized into a pair of independent variables, and the color measure visualizes counts.
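The discretization described above can be sketched as a two-dimensional binning step: each point is assigned to a grid cell, and the cells are counted. The function name, the bin widths, and the sample points are illustrative assumptions.

```python
from collections import Counter


def bin_counts(points, x_width, y_width):
    """Count points per (column, row) cell of a regular grid."""
    if x_width <= 0 or y_width <= 0:
        raise ValueError("bin widths must be positive")
    return Counter(
        (int(x // x_width), int(y // y_width)) for x, y in points
    )


# Four made-up points binned into 1 x 1 cells.
cells = bin_counts([(0.5, 0.5), (0.7, 0.2), (1.5, 0.1), (2.9, 2.9)], 1.0, 1.0)
```

Each key of the result is a cell of the heat map, and each count is the value encoded by its color.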
Connected scatter plot
The connected scatter plot is a variation on the scatter plot with an extra ordered measure.
Conclusion
By design, dependent plots are really effective at visualizing raw data. Nevertheless, care should be taken in selecting the most appropriate plot for the variables being visualized, in relation to their datatypes. Furthermore, options such as opacity, highlights, or regression lines can significantly enhance their visual appeal and interpretability.
Next week's article will cover univariate bar charts.