ggplot2 - A 101 approach to the 101 graphics package
One thing is for sure: since I started using R, not only has data become easier to manipulate, but the graphics I produce from it have become drastically more appealing.
ggplot2 is probably the first graphics package you will be introduced to when exploring the R language, and it is worth mastering because it is both useful and easy to use when visualizing your data. This led me to create the present 101 approach to the package. With the examples presented, I intend to explain each of its main functionalities as well as some hacks.
As stated in the ggplot2 guide, everything starts with three key components:
- Data;
- A set of aesthetic mappings between variables in the data and visual properties, and
- At least one layer which describes how to render each observation. Layers are usually created with a “geom” function.
The methodology is stated as follows:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>), stat = <STAT>, position = <POSITION>) +
<COORDINATE_FUNCTION> + <FACET_FUNCTION> + <SCALE_FUNCTION> + <THEME_FUNCTION>
As a first example, the following code uses the “iris” dataset to compare its Sepal Length and Width in a scatter plot. The iris dataset has 150 rows of data and 5 columns with the following information: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.
EXAMPLE 1
ggplot(data = iris) + geom_point(mapping = aes(x=Sepal.Length , y=Sepal.Width, color = Species))
Well, the first milestone was achieved, creating a ggplot, so let's now understand each component. As stated, I will explain both the required layers (geom and aesthetics) and the non-required ones.
To start, let's understand what "geoms" and "aesthetics" are. Geoms are the layers that describe how the data is drawn (points, lines, bars, …). Aesthetics describe how the variables are mapped to the visual properties of those geoms. Different types of aesthetic attributes work better with different types of variables. For example, color and shape work well with categorical variables, while size works well for continuous variables.
The first example starts by stating the data frame used, with “data = iris”. Then we add the geom, in this case “geom_point()”, a two-variable geom that creates a scatter plot. Inside the geom we define its aesthetics with “mapping = aes()”. The fundamental aesthetics needed to create an output are the "x" and "y" axes. In this case the "x" axis represents the Sepal Length and the "y" axis the Sepal Width. As a last aesthetic we added color to the data points with “color = Species”. Here we are not specifying a fixed color, as we could do by placing “color = "blue"” outside “aes()”, but saying that the color should represent the different Species in the data frame.
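To make the difference between a mapped color and a fixed color concrete, here is a minimal sketch that builds both variants of the first example. The object names p_mapped and p_fixed are only illustrative.

```r
library(ggplot2)

# Mapped: color sits inside aes(), so it varies with the Species column
p_mapped <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species))

# Fixed: color sits outside aes(), so every point is drawn in blue
p_fixed <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width), color = "blue")

p_mapped
p_fixed
```

The mapped version also gets a legend automatically, while the fixed version does not, since there is nothing to explain.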
EXAMPLE 2
ggplot(data = iris, mapping = aes(x=Sepal.Length , y=Sepal.Width, color = Species)) + geom_point() + geom_density2d() + scale_color_manual(values = c("#E7B800", "#FC4E07","#00AFBB")) + labs(title="Iris", x="Represents the Sepal Length",y="Represents the Sepal Width") + facet_wrap(~Species)
This example starts with a different way of mapping the aesthetics. In the first example we mapped them inside “geom_point()”, but in this one we are mapping them inside the “ggplot()” call. What's the difference? This is useful when you want to use more than one geom in the same ggplot, because this way the aesthetics are inherited by every geom and not only by a certain one. Breaking down each part:
color – controls the outline color of points / lines / …
geom_density2d() – creates a density estimation of the variables by drawing contour lines;
scale_color_manual – lets you specify your own values to the aesthetics in the data, in this case, the “color”;
labs – lets you insert labels in the output. In this case, we are introducing a title for the output and different labels for the "x" and "y" axis;
facet_wrap() – lets you make subplots of the data. In this case, we are separating the data according to the Species.
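As a rough illustration of how the placement of “aes()” changes what each geom sees, the sketch below builds the same two layers twice: once with the color mapping declared globally and once with it declared only on the points. The names p_global and p_local are just labels for this sketch.

```r
library(ggplot2)

# Global mapping: both geoms inherit x, y and color,
# so the contour lines are drawn per Species
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_density2d()

# Local mapping: color is known only to geom_point(), so geom_density2d()
# draws one set of contours for the whole dataset in its default color
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  geom_density2d()

p_global
p_local
```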
EXAMPLE 3
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_smooth(mapping = aes(group = Species, linetype = Species), se = FALSE) + geom_point(mapping = aes(shape = Species, color = Species), size = 3) + scale_shape_manual(values = c(8, 18, 15)) + scale_linetype_manual(values = c(1,2,3)) + scale_color_viridis_d() + theme_minimal() + theme(legend.position = "top")
geom_smooth() – adds a smoothed trend line to the data; for a dataset of this size the default is a local regression (loess). There are two aesthetics given to the smooth:
group – defines the grouping structure of the data, in this case, by Species;
linetype – Specifies the type of line to plot, having the following codes:
0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
Through the usage of “scale_linetype_manual”, we can force which type of lines we want for each line.
se = FALSE – controls whether the confidence interval around the smooth is displayed. In this case, it is not.
shape – controls the shape of points, having the following codes:
0–14: hollow shapes (outline drawn with color)
15–20: solid shapes (filled with color)
21–25: filled shapes (outline drawn with color, interior drawn with fill)
Just like the “linetype”, we can force which shape to give to each set of data through the usage of “scale_shape_manual”;
scale_color_viridis_d() – similar to “scale_color_manual”, but here we use the discrete viridis color scale, which is designed to remain distinguishable for viewers with common forms of colour blindness;
theme_minimal() – themes are used to give plots a consistent customized look. “theme_minimal()” is one of many built-in themes;
theme(legend.position = "top") – is used to force the position of the legend. With the “theme()” function one can customize the individual components of the plot.
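The shape codes are easier to grasp with a small variation of Example 3. This sketch uses shape 21, one of the shapes that takes both a color (border) and a fill (interior); the black border and the bottom legend position are arbitrary choices for illustration.

```r
library(ggplot2)

p_shape <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  # shape 21 is a circle with a border (color) and an interior (fill)
  geom_point(aes(fill = Species), shape = 21, color = "black", size = 3) +
  theme_minimal() +
  theme(legend.position = "bottom")

p_shape
```

With a hollow or solid shape (0 to 20) the fill mapping would have no visible effect, so color would have to carry the Species information instead.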
EXAMPLE 4
ggplot(mdf, aes(x=Inicio, y=value, colour = day, group=day)) + geom_point() + geom_line() + geom_text(data = subMax, aes(label=value), vjust=-.5) + theme(legend.position = "right",axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title="Lisbon Metro Train Flow 2020", x="",y="No of trains in circulation") + scale_colour_manual(values = c("darkred", "steelblue", "darkgreen", "#F79646", "#8064A2"))
In line with my last post on R, where I acquired data through APIs, this example uses information from Lisbon's Metro collected on 5 consecutive days of 2020. It represents the number of trains in circulation, per period, per day. It has 105 entries (rows) and 3 columns: the time period, the day and the number of trains.
geom_line() – used to make line plots;
geom_text() – useful for labeling plots. We are passing it a subset of the data (“data = subMax”) to show just the maximum value for each time period. The label is defined via “aes(label=value)”. “vjust” controls the vertical justification of the text, i.e., its positioning relative to the point.
axis.text.x = element_text() – we are editing the text elements of the "x" axis.
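The construction of “subMax” is not shown above, so the following is only one possible way to build such a subset. The tiny “mdf” below is made-up toy data with invented numbers, used purely so the sketch runs; only the column names (Inicio, day, value) are taken from the example.

```r
library(ggplot2)

# Toy stand-in for the real data: values are invented for illustration
mdf <- data.frame(
  Inicio = rep(c("06:30", "07:00", "07:30"), times = 2),
  day    = rep(c("Mon", "Tue"), each = 3),
  value  = c(10, 14, 18, 11, 16, 17)
)

# Keep, for each time period, the row(s) holding the largest value
subMax <- mdf[mdf$value == ave(mdf$value, mdf$Inicio, FUN = max), ]

p_metro <- ggplot(mdf, aes(x = Inicio, y = value, colour = day, group = day)) +
  geom_point() +
  geom_line() +
  # only the rows in subMax receive a text label
  geom_text(data = subMax, aes(label = value), vjust = -0.5)

p_metro
```

Because “geom_text()” gets its own “data” argument, the points and lines still use the full data frame while the labels are limited to the subset.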
EXAMPLE 5
ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) + geom_bar(alpha = 1/2, position = "dodge") + scale_fill_brewer(palette="Reds")
In the present case, the “diamonds” dataset, which ships with ggplot2, was used.
A “geom_bar()” takes only one positional aesthetic, either "x" or "y", because the other axis is computed for us. However, we may add a fill aesthetic, which helps the visual presentation.
“geom_bar()” shows the distribution of categorical variables. It uses a statistical transformation, by default the count, to compute the bar heights. The computed value can be swapped for another one (e.g., the proportion), i.e. geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)); older ggplot2 versions write this as stat(prop).
alpha – the alpha controls the transparency of points / lines / smooth / …
position – the position argument is especially useful in bar charts, where it determines how the elements within a bar are arranged. The default is “stack”; if we don’t want a stacked bar chart we have three options: “identity”, “dodge” or “fill”;
- “identity” will place each object where it falls in the context of the graph. It will overlap the bars;
- “fill” will make every bar of the same height, making the proportions comparable;
- “dodge” places each object directly beside each other;
scale_fill_brewer – applies one of the ColorBrewer palettes (here “Reds”) to the fill aesthetic, meaning the color inside the bars / areas / …
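The proportion variant and the “fill” position can both be checked numerically. This is a small sketch assuming ggplot2 3.3.0 or later (where “after_stat()” is available); p_prop and p_fill are illustrative names.

```r
library(ggplot2)

# Bar heights as proportions of all diamonds; group = 1 makes the
# proportion be computed over the whole dataset rather than per bar
p_prop <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1))

# position = "fill" rescales every stacked bar to the same total height
p_fill <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill")

# The proportions across all cuts add up to 1
sum(layer_data(p_prop)$y)

p_prop
p_fill
```

Without “group = 1”, each bar would be treated as its own group and every proportion would be 1, which is a common surprise with this idiom.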
EXAMPLE 6
ggplot(data = diamonds, mapping = aes(x = color, fill = cut)) + geom_bar(position = "fill") + geom_text(data=percentData, aes(y=n,label=ratio), position=position_fill(vjust=0.5)) + scale_fill_brewer(palette="Blues") + coord_flip()
The main differences in this example are:
- flipping the coordinates, i.e., swapping the "x" and "y" axes, which is done via “coord_flip”;
- the “fill” position, which makes the proportions comparable;
- the labeling of the percentages. For this step I had to summarize the data first, using the dplyr package (coming in a later article), to get the percentage values. This has to be run before the plot above, since the plot refers to “percentData”:
library(dplyr)
percentData <- diamonds %>% group_by(color) %>% count(cut) %>% mutate(ratio=scales::percent(n/sum(n)))
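A quick sanity check on a summary like this is to confirm that the shares within each color add up to 100%. The sketch below recomputes the ratios as plain numbers instead of formatted strings; “shareData” and “totals” are hypothetical names used only here.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Same idea as percentData, but keeping the ratio numeric
shareData <- diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Each color's shares should add up to 1
totals <- shareData %>%
  group_by(color) %>%
  summarise(total = sum(share))

totals
```

Keeping the ratio numeric until the last moment makes checks like this possible; the formatting with “scales::percent” can then be applied only when building the label.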
EXAMPLE 7
ggplot(iris,mapping = aes(x = Sepal.Length)) + geom_histogram(aes(fill = Species), color="#e9ecef", alpha=0.6, position = 'identity', binwidth=0.1) + geom_density(aes(color = Species), size = 1.5) + scale_fill_manual(values = c("darkred", "steelblue", "darkgreen")) + scale_color_manual(values = c("darkred", "steelblue", "darkgreen")) + scale_x_continuous(breaks=seq(4,8,0.5))
What is the difference between a bar chart and a histogram? In theory, a bar chart provides a visual presentation of categorical data while a histogram is used to plot the distribution of interval data. In the ggplot2 package, however, the two are closely related; older versions of the docs even state: “geom_histogram is an alias for geom_bar plus “stat_bin” ”.
In this example we are using both a "geom_histogram" and a “geom_density”. The latter computes and draws a kernel density estimate, i.e., a smoothed version of the histogram.
Here, I wanted to show the clear difference between fill and color, using the "fill" aesthetic for the histogram bars (color within the bars) and the "color" aesthetic for the density (color only on the density line). Therefore, when changing the colors manually, there was the need to change both "color" and "fill".
binwidth – defines the width of the histogram bars
scale_x_continuous – changes the "x" axis scale. In this case, the breaks follow a sequence from 4 to 8 in intervals of 0.5.
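One thing to keep in mind with Example 7 is that the histogram is drawn in counts while the density curve is drawn in density units, so the curves may look compressed next to the bars. As a possible variation (not the code used for the plot above), the histogram can be put on the density scale with “after_stat(density)”, assuming ggplot2 3.3.0 or later:

```r
library(ggplot2)

p_dens <- ggplot(iris, aes(x = Sepal.Length)) +
  # y = after_stat(density) rescales the bars so each Species' bars
  # have a total area of 1, matching the scale of geom_density()
  geom_histogram(aes(y = after_stat(density), fill = Species),
                 color = "#e9ecef", alpha = 0.6,
                 position = "identity", binwidth = 0.1) +
  geom_density(aes(color = Species)) +
  scale_x_continuous(breaks = seq(4, 8, 0.5))

p_dens
```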
EXAMPLE 8
To create a pie chart using ggplot2 we need to change the coordinates of a “geom_bar” to polar coordinates, using “coord_polar”, where one can specify which of the "x" and "y" coordinates will be the angle (theta), and the starting point. This example shows a normal pie chart that has two variables (hence the "x" axis having the value “”). In the next and final example you will see a windrose chart that compares three variables of the data.
df <- data.frame(height=c("Very Low", "Low", "Medium", "High", "Very High"), number=c(3,5,4,7,9))
ggplot(df, aes(x = "", y = number, fill = height)) + geom_bar(width = 1, stat="identity") + coord_polar(theta = "y", start=pi / 3) + ggtitle("Pie Chart") +labs(x="",y="") + scale_fill_viridis_d()
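To see that a pie chart really is just a stacked bar in polar coordinates, the sketch below builds the bar first and then adds “coord_polar” on top of it. The data frame “pie_df” repeats the values from the example under a separate name so the sketch is self-contained.

```r
library(ggplot2)

pie_df <- data.frame(
  height = c("Very Low", "Low", "Medium", "High", "Very High"),
  number = c(3, 5, 4, 7, 9)
)

# Step 1: a single stacked bar (x is the empty string, so there is one bar)
p_bar <- ggplot(pie_df, aes(x = "", y = number, fill = height)) +
  geom_bar(width = 1, stat = "identity")

# Step 2: the same layer, with y bent around the circle as the angle
p_pie <- p_bar + coord_polar(theta = "y")

p_bar
p_pie
```

The layer data is identical in both plots; only the coordinate system used to draw it changes.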
EXAMPLE 9
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, fill = Species)) + geom_bar(width = 0.1, stat="identity") + coord_polar("y", start=pi / 2) + ggtitle("Windrose Chart")
This concludes the presentation of the ggplot2 package. I truly hope it helped you understand not only how to work with it but also how each part truly works.
Useful references
https://r4ds.had.co.nz/data-visualisation.html#geometric-objects
https://stackoverflow.com/questions/28253587/variable-label-position-in-ggplot-line-chart
https://cran.r-project.org/web/packages/ggplot2/vignettes/ggplot2-specs.html
https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
https://stackoverflow.com/questions/7263849/what-do-hjust-and-vjust-do-when-making-a-plot-using-ggplot
https://stackoverflow.com/questions/14570293/special-variables-in-ggplot-count-density-etc
https://plotnine.readthedocs.io/en/stable/tutorials/miscellaneous-show-counts-on-a-stacked-bar-plot.html