
Dependent Plots

Ismael Chang Ghalimi

CEO @ STOIC

Published: February 3, 2020


Welcome to this fifth article in our Designing charts like a stoic series!

Introduction

Today, we will cover dependent plots, which are plots that do not have any axis mapped to an independent variable. The scatter plot is the best-known example. These plots are used to visualize raw data, as opposed to data resulting from a PIVOT transformation. They are especially effective at visualizing relationships between two measures (dependent variables).

Terminology

The following terminology is used in all articles of this series:

A visual can be either a chart, a map, or a diagram.
A variable is a column in a spreadsheet or database table.
A dimension is shorthand for an independent variable.
A measure is shorthand for a dependent variable.
A continuable variable is a discrete variable that could be made continuous.

History

Historical references will be provided for specific dependent plot types whenever available.

Structure

Dependent plots map dependent variables on both the horizontal and vertical axes. In some rare cases like the rug plot, a single axis is used (horizontal or vertical). Usually, the horizontal axis supports duplicate values (e.g. scatter plot, tick plot), but some dependent plots like the lollipop plot and mast plot only support unique values for the measure mapped onto the horizontal axis.

Use cases

The Financial Times Visual Vocabulary identifies five major use cases for dependent plots: correlation, deviation, distribution, change over time, and magnitude. Nevertheless, there are quite a few more worth mentioning, which makes this family of plots one of the most versatile.

Positive relationship

Two variables have a positive relationship when their correlation coefficient is close to 1. This can be visualized on a scatter plot where points tend to cluster around a rising straight line. On the following dataset, this correlation coefficient is approximately equal to 0.815, making GDP per capita and Happiness index meaningfully correlated.
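The correlation coefficient mentioned above is usually the Pearson coefficient. Here is a minimal sketch of how it can be computed, assuming two numeric columns of equal length. The function name pearson_r and the sample values are made up for illustration and are not the article's dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, chosen only to show a rising trend.
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
happiness = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(gdp, happiness)
```

A value of r near 1 corresponds to points hugging a rising line, a value near -1 to a falling line, and a value near 0 to no linear association.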

Negative relationship

Two variables have a negative relationship when their correlation coefficient is close to -1. This can be visualized on a scatter plot where points tend to cluster around a falling straight line. On the following dataset, this correlation coefficient is approximately equal to -0.787, making GDP per capita and infant mortality negatively correlated.

Non-linear relationship

Two variables have a non-linear relationship when they are correlated but their rate of change is not constant. This can be visualized on a scatter plot where points tend to cluster around a smooth curve.

No relationship

Two variables have no relationship when their correlation coefficient is close to 0. This can be visualized on a scatter plot where points are randomly distributed. On the following dataset, this correlation coefficient is approximately equal to -0.346, which indicates at most a weak negative association between GDP per capita and Gini coefficient rather than a clear relationship.

Clusters

Two measures visualized with a scatter plot make it relatively easy to perform cluster analysis. Individual clusters can be visualized with colors, symbols, or outlines. The automatic detection of clusters is a non-trivial task, especially if the number of clusters is not known in advance.
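To illustrate why the number of clusters matters, here is a minimal sketch of k-means, one common clustering technique (the article does not name a specific algorithm). It assumes the number of clusters is known, since the caller must supply one starting centroid per cluster. The function name kmeans and the sample points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2D points, starting from caller-supplied centroids."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for px, py in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (px - centroids[i][0]) ** 2 + (py - centroids[i][1]) ** 2,
            )
            groups[nearest].append((px, py))
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups


# Hypothetical data: two visually obvious clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Each resulting group can then be drawn with its own color, symbol, or outline. When the number of clusters is unknown, this sketch does not apply as is, which is part of what makes automatic detection difficult.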

Gaps

Gaps within a univariate dataset can be visualized with a rug plot. Such gaps might be indicative of missing data, or might highlight an interesting pattern. On the following visual, we observe the latter.
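The gaps that a rug plot reveals visually can also be listed numerically. The sketch below sorts the values and reports the widest intervals between consecutive values. The function name largest_gaps and the sample values are hypothetical.

```python
def largest_gaps(values, top=3):
    """Return the widest gaps between consecutive sorted values.

    Each gap is a tuple (width, lower_value, upper_value).
    """
    ordered = sorted(values)
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]


# Hypothetical univariate data with one obvious gap between 3 and 10.
values = [12, 1, 3, 11, 2, 10]
widest = largest_gaps(values, top=1)
```

Whether a reported gap signals missing data or an interesting pattern still has to be judged from context.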

Outliers

Outliers within a univariate dataset can be visualized with a rug plot as well.
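Outliers can also be flagged numerically before they are highlighted on the plot. The sketch below uses Tukey's fences (1.5 times the interquartile range), a common convention that the article does not prescribe. The function name iqr_outliers and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data: one value sits far away from the rest.
values = [10, 12, 11, 13, 12, 11, 10, 50]
outliers = iqr_outliers(values)
```

Other definitions of an outlier (for example, a z-score threshold) would flag different points, so the choice of rule is an assumption worth stating next to the visual.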

Deviation

Deviation from a baseline can be visualized with any dependent plot.

Distribution

The rug plot is a space-efficient alternative to the histogram for visualizing the distribution of a numeric variable. It is particularly effective for identifying clusters, gaps, and outliers. But unlike the histogram, it cannot be used to evaluate the size of bins in the distribution.

Magnitude

Magnitude can be visualized with any dependent plot.

Change over time

A bivariate dependent plot with a temporal variable mapped to its horizontal axis is a great way to visualize change over time. The following batton plot goes a step further: it was produced by converting a bivariate time series into a differential time series, thereby combining the benefits of dependent plots and differential charts.

Change over space

A bivariate dependent plot with a geospatial variable mapped to one of its axes is a great way to visualize change over space. Furthermore, this type of visualization is not limited to geospatial variables: it can be used with any other non-temporal variable.

Types

Dependent plots can be classified in relation to the types of variables mapped to their horizontal and vertical axes. This systematic approach makes it possible to design plot types that are rarely used, or even discover new plot types that are genuinely original.

Lollipop plot

The lollipop plot is recommended for the following configuration:

Horizontal axis: continuous measure with unique values
Vertical axis: intensive measure

Mast plot

The mast plot is recommended for the following configuration:

Horizontal axis: continuous measure with unique values
Vertical axis: extensive measure

Strip plot

The strip plot is recommended for the following configuration:

Horizontal axis: discrete measure
Vertical axis: continuous measure

It is a more intuitive (yet less information-rich) alternative to the box plot.

Rug plot

The rug plot is recommended for the following configuration:

Horizontal axis: continuous measure
No vertical axis

The rug plot is a univariate alternative to the strip plot.

The rug plot should not be confused with the carpet plot.

Scatter plot

The scatter plot is recommended for the following configuration:

Horizontal axis: continuous measure with possible duplicate values
Vertical axis: continuous measure

This plot's name is sometimes spelled scatterplot.

The scatter plot is certainly the most important plot of all the ones covered in this article.

The origins of the scatter plot are not entirely clear, but this paper by Michael Friendly & Daniel Denis suggests that the first one was produced in 1833 by the English scientist John Frederick W. Herschel. This other article by Dan Kopf provides some additional context.

By default, a scatter plot should be designed using a circle mark. This mark is the simplest of all after the point mark, and makes it easier to visualize closely located data points. Alternatively, a point mark can be used. The square mark might look better to some, but brings no benefits over the point mark. Other symbols such as triangle, diamond, or cross should be reserved for cases where an additional discrete measure of low cardinality (up to 5) must be visualized using symbols.

Batton plot

The batton plot is a differential plot recommended for the following configuration:

Horizontal axis: continuous measure
Vertical axis: intensive measures

Stick plot

The stick plot is a differential plot recommended for the following configuration:

Horizontal axis: continuous measure
Vertical axis: extensive measures

Options

Dependent plots offer a wide range of options that deserve to be reviewed carefully.

Color

The color of marks can be used to visualize an additional measure.

Color legends can be omitted whenever the meaning of the colors is obvious.

Size

The size of marks can be used to visualize an additional measure.

Symbol

The symbol of marks can be used to visualize an additional measure. This can make dependent plots like the scatter plot visually confusing though, so this technique should be used only when data points are naturally clustered. Otherwise, the use of color or tone should be preferred.

Aspect ratio

In some cases, the aspect ratio of dependent plots should be set carefully, especially when designing scatter plots for two congruent measures. This matters most when the visual is designed to showcase a linear association between two variables. In that case, a 1:1 aspect ratio should be used.

Regression line

Whenever a dependent plot is designed to showcase an association between two dependent variables, a regression line should be drawn. If the association is not linear, a regression curve should be drawn instead, but this feature might not be supported by some data visualization tools.
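For the linear case, the regression line is typically fitted by ordinary least squares. Here is a minimal sketch, assuming that method. The function name fit_line and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

The line can then be drawn between the points (min(xs), slope * min(xs) + intercept) and (max(xs), slope * max(xs) + intercept). A regression curve for a non-linear association would need a different model, which this sketch does not cover.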

Highlight

Color can also be used to highlight certain data points like outliers.

Opacity

When multiple marks overlap on a dependent plot, opacity should be used to improve readability.

Issues

Dependent plots can be affected by two main issues: overplotting and misinterpretation.

Overplotting

Overplotting can occur when too many data points are rendered within too small an area. As a result, the overlapping of marks makes it difficult to properly assess the association between measures. Three main techniques can be used to address this issue:

Sampling
Opacity
Heat map
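The first of these techniques, sampling, can be sketched as follows: draw a uniform random subset of the points before rendering them. The function name sample_points, the default seed, and the sample data are hypothetical.

```python
import random


def sample_points(points, k, seed=0):
    """Return at most k points drawn uniformly at random without replacement."""
    if k >= len(points):
        return list(points)
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every time
    return rng.sample(points, k)


# Hypothetical dataset that is too dense to plot in full.
points = [(i, i % 7) for i in range(10_000)]
subset = sample_points(points, 500)
```

Uniform sampling preserves the overall shape of the point cloud on average, but it can drop rare outliers, so it is best combined with a check for points that must remain visible.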

Misinterpretation

Correlation (association) should not be confused with causation. For more on this topic, please read The Book of Why, by Judea Pearl and Dana Mackenzie. While the automatic detection of correlations is relatively straightforward (for most datatypes at least), the automatic extraction of causality models from raw datasets largely remains an open issue.

Statistical plots

Dependent plots are frequently used in conventional statistical plots.

Bland-Altman plot

The Bland-Altman plot is made of a scatter plot and some reference lines.
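In the conventional construction of this plot, each pair of measurements is plotted as its mean (horizontal) against its difference (vertical), and the reference lines are the mean difference and the limits of agreement at plus or minus 1.96 standard deviations. The sketch below assumes that convention. The function name bland_altman and the sample values are hypothetical.

```python
import statistics


def bland_altman(a, b):
    """Return the plotted points and reference levels of a Bland-Altman plot.

    a and b are paired measurements of the same quantity by two methods.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired sequences of length >= 2")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # horizontal coordinates
    diffs = [x - y for x, y in zip(a, b)]        # vertical coordinates
    bias = statistics.mean(diffs)                # central reference line
    sd = statistics.stdev(diffs)                 # sample standard deviation
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


# Hypothetical paired measurements.
method_a = [10.0, 12.0, 14.0]
method_b = [9.0, 12.0, 15.0]
means, diffs, bias, limits = bland_altman(method_a, method_b)
```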

Biplot

The biplot is made of two layered scatter plots.

Galbraith plot

The Galbraith plot is a beautiful scatter plot layered with a radial rug plot.

Partial regression plot

The partial regression plot is a scatter plot layered with a regression line.

Partial residual plot

The partial residual plot is a scatter plot layered with a regression curve.

Poincaré plot

The Poincaré plot is a scatter plot layered with an ellipse.
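The points of a Poincaré plot are usually built by pairing each value of a series with the value that follows it. Here is a minimal sketch of that data preparation, assuming a lag of one. The function name poincare_points and the sample series are hypothetical, and the fitting of the ellipse is not covered.

```python
def poincare_points(series):
    """Pair each value with its successor: (x[n], x[n + 1])."""
    return list(zip(series, series[1:]))


# Hypothetical series of successive measurements.
series = [800, 810, 790, 805]
pairs = poincare_points(series)
```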

Run chart

The run chart (also called run-sequence plot) is a simple point plot.

Volcano plot

The volcano plot can be produced with a scatter plot and a simple data transformation.
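In common practice, the data transformation behind a volcano plot maps each observation's fold change to its base-2 logarithm (horizontal axis) and its p-value to the negative base-10 logarithm (vertical axis). The sketch below assumes that convention. The function name volcano_point and the sample values are hypothetical.

```python
import math


def volcano_point(fold_change, p_value):
    """Map (fold change, p-value) to (log2 fold change, -log10 p-value)."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return math.log2(fold_change), -math.log10(p_value)


# Hypothetical observation: a fourfold increase with p = 0.01.
x, y = volcano_point(4.0, 0.01)
```

After this transformation, large effects sit far to the left or right, and statistically significant ones sit high on the plot.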

Alternatives

The term dependent plot encompasses a wide range of visuals. As such, there are no real alternatives to them, but some related visuals should be mentioned. Furthermore, the set of dependent plots outlined in this article is not exhaustive: the wide range of datatypes defined by Principia Data is an open invitation to data visualization practitioners to create datatype-specific visuals that have never been produced before.

Geopoint map

The geopoint map is a geospatial alternative to the scatter plot.

Heat map

The heat map (or heatmap) can be thought of as a discretized version of a scatter plot, for which the two dependent variables mapped to the horizontal and vertical axes have been discretized into a pair of independent variables, and the color measure visualizes counts.
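The discretization described above can be sketched as a two-dimensional binning step: each point is assigned to a grid cell, and the number of points per cell becomes the color measure. The function name bin_counts, the bin widths, and the sample points are hypothetical.

```python
import math
from collections import Counter


def bin_counts(points, x_width, y_width):
    """Count points per grid cell; cells are keyed by (column, row) indices."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / x_width), math.floor(y / y_width))
        counts[cell] += 1
    return counts


# Hypothetical points binned on a 1 x 1 grid.
points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.5), (1.7, 1.2)]
cells = bin_counts(points, x_width=1.0, y_width=1.0)
```

The choice of bin widths plays the same role as the choice of bin size in a histogram: bins that are too wide hide structure, and bins that are too narrow leave most cells empty.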

Connected scatter plot

The connected scatter plot is a variation on the scatter plot with an extra ordered measure.

Conclusion

By design, dependent plots are really effective at visualizing raw data. Nevertheless, care should be taken in selecting the most appropriate plot for the variables being visualized, in relation to their datatypes. Furthermore, options such as opacity, highlights, or regression lines can significantly enhance their visual appeal and interpretability.

Next week's article will cover univariate bar charts.
