Correlation plots in R
Dr. Saurav Das
Research Director | Farming Systems Trial | Rodale Institute | Soil Health, Biogeochemistry of Carbon & Nitrogen, Environmental Microbiology, and Data Science | Outreach & Extension | Vibe coding
In the field of statistics, the term ‘correlation’ commonly refers to the relationship that exists between two or more random variables. More precisely, it quantifies the degree of linear association between these variables.
(Note: For the purpose of our discussion, I will be using the ‘mtcars’ dataset. However, the mathematical background and underlying equations will not be covered in this article.)
Our main focus in this article is to construct correlation matrix plots, both with dedicated plotting packages and without them. We will start with the ‘corrplot’ package.
# Load the dataset
data("mtcars")
# We will use the "corrplot" library
library(corrplot)
# Compute the correlation matrix with cor() and plot it
corrplot(cor(mtcars))
It’s important to understand what the circles and the color key on the right side of the plot signify. They represent the ‘correlation coefficient,’ also known as the ‘R-value,’ which indicates the strength and direction of the linear relationship between two variables. The value of ‘R’ can range from -1 to +1. A value of +1 indicates a perfect positive correlation: as one variable increases, the other increases in exact linear proportion. A value of -1 indicates a perfect negative correlation: as one variable increases, the other decreases. A value of 0 signifies no linear relationship, although a nonlinear relationship may still exist. In the plot, the size of each circle is proportional to the absolute value of the coefficient, and its color shows the sign and magnitude.
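As a quick sanity check on these ranges, the sketch below computes the coefficient with base R's cor() for a few toy vectors and for one pair of mtcars columns. The vectors x and z are made up for illustration and are not part of the article's analysis.

```r
# Toy vectors (hypothetical, for illustration only)
x <- 1:10
r_pos <- cor(x, 2 * x + 3)   # y rises linearly with x, so r is +1
r_neg <- cor(x, -0.5 * x)    # y falls linearly with x, so r is -1

z <- -5:5
r_none <- cor(z, z^2)        # clear dependence, but not linear, so r is 0

# A real pair from mtcars: heavier cars tend to have lower mpg
r_cars <- cor(mtcars$mpg, mtcars$wt)   # roughly -0.87

print(c(r_pos = r_pos, r_neg = r_neg, r_none = r_none, r_cars = r_cars))
```

The z example shows why a coefficient of 0 only means "no linear relationship": z^2 is fully determined by z, yet the coefficient is zero.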
Customizing the ‘corrplot’ Visualization:
The presentation of the correlation matrix can be modified according to your preference. The method can be set to ‘square,’ ‘circle,’ or ‘number,’ altering the matrix’s visual representation. Additionally, the ‘type’ can be changed to display either the upper or lower half of the matrix (keeping in mind that these are mirror images of each other). Therefore, you can choose to exhibit the upper, lower, or the entire matrix based on your needs.
corrplot(
  cor(mtcars),
  method = "square",
  type = "upper",
  tl.col = "black",
  tl.cex = 2,
  col = colorRampPalette(c("purple", "dark green"))(200)
)
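The mirror-image property mentioned above can be checked directly in base R. This is a small sketch; the names m and n_pairs are chosen here for illustration.

```r
m <- cor(mtcars)

# The matrix equals its own transpose, so the upper and lower halves mirror each other
isSymmetric(m)

# The diagonal holds each variable's correlation with itself
all.equal(unname(diag(m)), rep(1, ncol(m)))

# Number of distinct variable pairs shown in one triangle: 11 * 10 / 2
n_pairs <- sum(upper.tri(m))
n_pairs
```

This is why plotting only the upper or lower triangle loses no information.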
You also have the option to generate a mixed-type matrix with the following code. In this example, the upper section utilizes the ‘square’ method, while the lower section displays numerical values corresponding to the correlation coefficients.
corrplot.mixed(cor(mtcars),
               upper = "square",
               lower = "number",
               addgrid.col = "black",
               tl.col = "black")
2. Using package ggcorrplot
You can achieve the same results using the package “ggcorrplot”.
library(ggcorrplot)
ggcorrplot(cor(mtcars))
To customize:
There are many customization options. I have used only a few here, and you can modify them to suit your preference.
ggcorrplot(cor(mtcars),
           method = "circle",
           type = "lower",
           outline.color = "black",
           lab_size = 6) # lab_size only takes effect when lab = TRUE
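A few other ggcorrplot arguments that can be combined are hc.order (reorder the variables by hierarchical clustering), lab (print the coefficients on the tiles), and colors (low, mid, and high fill colors). A minimal sketch, with values chosen arbitrarily for illustration:

```r
library(ggcorrplot)

# p is a hypothetical name for the resulting ggplot object
p <- ggcorrplot(cor(mtcars),
                hc.order = TRUE,      # cluster similar variables together
                type = "lower",       # show only the lower triangle
                lab = TRUE,           # print the correlation coefficients
                lab_size = 3,         # size of the printed coefficients
                colors = c("purple", "white", "darkgreen"))
p
```

Because the result is a regular ggplot object, further ggplot2 layers and themes can be added to it with +.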
3. Without a dedicated correlation-plot package (using ggplot2, so you can customize more)
Now, first, let's calculate and create a correlation matrix, and then we will see how to create a visualization using ggplot.
For this, we will use the “rstatix” package.
library(rstatix)
cor_test <- cor_mat(mtcars) #to create the correlation matrix
cor_test
You can also get a p-value matrix using the following code:
cor_p <- cor_pmat(mtcars)
cor_p
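Each entry of such a p-value matrix comes from a test of whether the corresponding correlation differs from zero. As a minimal sketch, assuming the default Pearson method, a single pair can be checked with base R's cor.test(). The name res is hypothetical.

```r
# Pearson correlation test for one pair of mtcars columns
res <- cor.test(mtcars$mpg, mtcars$wt)

res$estimate   # the correlation coefficient r for mpg vs. wt
res$p.value    # p-value for the null hypothesis that the true correlation is 0
```

Comparing this p-value with the mpg/wt entry of the matrix is a simple way to confirm you are reading the matrix correctly.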
Now that we have the matrix of r-values, we can reshape it into long format: one column for the first variable (rowname), one for the second variable (cor_var), and one for the actual r value. The following code does this.
library(tidyr) # provides gather()
df <- cor_test %>% gather(-rowname, key = cor_var, value = r)
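If you prefer not to depend on gather(), a similar long-format table can be sketched in base R with as.table(). The name df_base is hypothetical, and the coefficients are rounded to two decimals here only for readability.

```r
# Correlation matrix from base R, rounded for readability
m <- round(cor(mtcars), 2)

# as.table() + as.data.frame() yields one row per (row, column) pair
df_base <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(df_base) <- c("rowname", "cor_var", "r")

head(df_base)
nrow(df_base)   # 11 variables x 11 variables = 121 rows
```

The column names match the ones used in the gather() call, so this table can be passed to the same ggplot code.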
Now that we have gathered the data, it is easy to plot with the ggplot2 package. We will use the geom_tile function to reproduce plots like those above. The following code makes a basic tile-plot correlation matrix.
library(ggplot2)
df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
  labs(x = "variables", y = "variables")
Suppose we want to customize it. You can use the standard ggplot2 functions to do so, for example:
df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
  labs(x = "variables", y = "variables") +
  scale_fill_gradient(low = "light yellow", high = "dark green")
To add the actual values to the plot, you can use the following code:
df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
  labs(x = "variables", y = "variables") +
  scale_fill_gradient(low = "light yellow", high = "dark green") +
  geom_text(aes(label = r))
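Because r runs from -1 to +1, a diverging color scale centered at zero can make the sign easier to read than a two-color gradient. The sketch below is one option, not the palette used in this article. It rebuilds the long-format data with base R so that it runs on its own, and the names long and p are hypothetical.

```r
library(ggplot2)

# Long-format correlations built with base R (rounded for display)
long <- as.data.frame(as.table(round(cor(mtcars), 2)))
names(long) <- c("rowname", "cor_var", "r")

p <- ggplot(long, aes(rowname, cor_var, fill = r)) +
  geom_tile(color = "white") +                 # white borders between tiles
  geom_text(aes(label = r), size = 3) +        # print the coefficient on each tile
  scale_fill_gradient2(low = "purple", mid = "white", high = "darkgreen",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = "variables", y = "variables")
p
```

Fixing the limits at -1 and +1 keeps the colors comparable across plots made from different datasets.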
4. Another informative package is “PerformanceAnalytics”, which shows significance levels (from p-values), distributions (histograms), and correlation coefficients in a single chart. For example:
library(PerformanceAnalytics)
chart.Correlation(mtcars) # pass the raw data; the function computes the correlations itself
The red stars in the figure indicate the level of significance: * = 0.05, ** = 0.01, *** = 0.001.
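The star notation can be reproduced by hand. The sketch below assumes the conventional cutoffs of 0.05, 0.01, and 0.001. The helper p_to_stars is hypothetical and written for illustration; it is not a function from PerformanceAnalytics.

```r
# Hypothetical helper: map p-values to significance stars
p_to_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 1),
                   labels = c("***", "**", "*", ""),
                   include.lowest = TRUE))
}

# One example p-value per band
p_to_stars(c(0.0004, 0.004, 0.04, 0.4))   # "***" "**" "*" ""

# Applied to a real test on two mtcars columns
p_to_stars(cor.test(mtcars$mpg, mtcars$wt)$p.value)
```

More stars mean a smaller p-value, that is, stronger evidence against a true correlation of zero.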
5. Exploring the ‘lares’ Package:
This interesting package adds a new dimension to our analysis. It ranks the pairwise correlations and displays them in descending order, which is particularly useful for identifying and analyzing the most strongly correlated variables.
library(lares)
corr_cross(mtcars, rm.na = TRUE, max_pvalue = 0.05, top = 15, grid = TRUE)
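The ranking idea itself can be sketched in base R. This is only an approximation of the concept: it sorts pairs by absolute correlation and does not replicate the p-value filtering or the plot that corr_cross produces. The names m, idx, and ranked are hypothetical.

```r
m <- cor(mtcars)

# Keep each pair once by taking the upper triangle (diagonal excluded)
idx <- which(upper.tri(m), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)

# Sort by absolute correlation, strongest first
ranked <- ranked[order(-abs(ranked$r)), ]
rownames(ranked) <- NULL

head(ranked, 5)   # the five most strongly correlated pairs
```

Sorting on the absolute value keeps strong negative correlations near the top alongside strong positive ones.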
If you have any questions, please leave them in the comments. If you found this post helpful, you can buy me a coffee: https://www.buymeacoffee.com/dashboard
Biotechnology & Biostatistics
1 year ago: Thank you, the article has really helped.
Nice graph but be careful -- simple (2-variable) correlations can in fact reflect a third (or more) "lurking variable" that is the reason for the apparent correlation. Always use your domain knowledge to see if these make sense.
Great work, helped me a lot. please give an example of how to arrange data for correlation analysis