
Correlation plots in R

Dr. Saurav Das

Research Director | Farming Systems Trial | Rodale Institute | Soil Health, Biogeochemistry of Carbon & Nitrogen, Environmental Microbiology, and Data Science | Outreach & Extension | Vibe coding

Published: August 29, 2023

In the field of statistics, the term ‘correlation’ commonly refers to the relationship that exists between two or more random variables. More precisely, it quantifies the degree of linear association between these variables.

(Note: For the purpose of our discussion, I will be using the ‘mtcars’ dataset. However, the mathematical background and underlying equations will not be covered in this article.)

Our main focus in this article is to construct correlation matrix plots, both with the help of dedicated packages and without them. We will start with the ‘corrplot’ package.

1. Using the corrplot package

# Load the dataset
data("mtcars")

# We will use the "corrplot" library
library(corrplot)

# Compute the correlation matrix with cor() and plot it
corrplot(cor(mtcars))

It’s important to understand what the circles and the color key on the right side of the plot signify. They represent the ‘correlation coefficient,’ also known as the ‘R-value,’ which indicates the strength and direction of the linear relationship between two variables. The value of ‘R’ ranges from -1 to +1. A value of +1 is a perfect positive correlation: as one variable increases, the other increases along a straight line. A value of -1 is a perfect negative correlation: as one variable increases, the other decreases. A value of 0 signifies no linear relationship, although a nonlinear one may still exist. In the plot, the color of each circle encodes the sign and magnitude of the coefficient, and the size of the circle grows with its absolute value.
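To make the interpretation concrete, here is a minimal sketch that looks at the numbers behind the plot. It uses only base R; the choice of the mpg and wt columns is mine, purely for illustration.

```r
# Pearson correlation between fuel economy (mpg) and weight (wt) in mtcars
r <- cor(mtcars$mpg, mtcars$wt)
r  # negative: heavier cars tend to have lower mpg

# The same number sits in the full matrix that corrplot() draws
r_mat <- cor(mtcars)
r_mat["mpg", "wt"]

# Every variable correlates perfectly with itself, so the diagonal is all 1s
diag(r_mat)
```

Printing round(r_mat, 2) next to the plot is a quick way to check that you are reading the colors and circle sizes correctly.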

Customizing the ‘corrplot’ Visualization

The presentation of the correlation matrix can be modified according to your preference. The method can be set to ‘square,’ ‘circle,’ or ‘number,’ altering the matrix’s visual representation. Additionally, the ‘type’ can be changed to display either the upper or lower half of the matrix (keeping in mind that these are mirror images of each other). Therefore, you can choose to exhibit the upper, lower, or the entire matrix based on your needs.

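corrplot(
  cor(mtcars),
  method = "square",   # shape used for each cell
  type = "upper",      # show only the upper triangle
  tl.col = "black",    # color of the variable labels
  tl.cex = 2,          # size of the variable labels
  col = colorRampPalette(c("purple", "dark green"))(200)
)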
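The remark that the two halves are mirror images can be checked directly. The following is a small base R sketch, not part of corrplot itself, confirming that the correlation matrix is symmetric, which is why showing one triangle loses no information.

```r
r_mat <- cor(mtcars)

# TRUE when the matrix equals its own transpose
isSymmetric(r_mat)

# Values above the diagonal, and their mirrored counterparts below it
upper_vals <- r_mat[upper.tri(r_mat)]
lower_vals <- t(r_mat)[upper.tri(r_mat)]
all.equal(upper_vals, lower_vals)
```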

You also have the option to generate a mixed-type matrix with the following code. In this example, the upper section utilizes the ‘square’ method, while the lower section displays numerical values corresponding to the correlation coefficients.

corrplot.mixed(cor(mtcars),
  upper = "square",
  lower = "number",
  addgrid.col = "black",
  tl.col = "black")

2. Using the ggcorrplot package

You can achieve the same results using the package “ggcorrplot”.

library(ggcorrplot)
ggcorrplot(cor(mtcars))

To customize:

There are many customization options. I have used only a few here; you can modify them based on your preference.

ggcorrplot(cor(mtcars),
  method = "circle",
  type = "lower",
  outline.color = "black",
  lab_size = 6)  # label size; only takes effect when lab = TRUE is also set
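As a small variation on the call above, here is a sketch that turns the coefficient labels on so that lab_size has a visible effect. The label size of 3 is my own choice, not the author's, and the default square tiles are used to leave room for the text.

```r
library(ggcorrplot)

p <- ggcorrplot(
  cor(mtcars),
  type = "lower",
  lab = TRUE,              # print the coefficient inside each tile
  lab_size = 3,            # size of those labels
  outline.color = "black"
)
p
```

Because ggcorrplot() returns a ggplot object, further ggplot2 layers such as a title or a theme can be added to p with +.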

3. Without a dedicated correlation-plot package (using ggplot2, so you can customize more)

Now, first, let's calculate and create a correlation matrix, and then we will see how to create a visualization using ggplot.


For this, we will use the “rstatix” package.

library(rstatix)

cor_test <- cor_mat(mtcars) #to create the correlation matrix
cor_test

You can also get a p-value matrix using the following code:

cor_p <- cor_pmat(mtcars)
cor_p
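To see what a single entry of such a p-value matrix means, here is a minimal base R sketch for one pair of variables. I am assuming that the matrix reports the same kind of pairwise test that cor.test() performs; the mpg and wt columns are picked only as an example.

```r
# Pearson correlation test for one pair of variables
test <- cor.test(mtcars$mpg, mtcars$wt)

test$estimate  # the r-value for this pair
test$p.value   # p-value for the null hypothesis of zero correlation

# Is the correlation significant at the 5% level?
test$p.value < 0.05
```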

Now that we have the matrix of r-values, we can reshape it into long format: one column for the row variable, one column for the column variable (the keys), and one column for the actual r-value. The following code does this.

library(tidyr)  # provides gather()
df <- cor_test %>% gather(-rowname, key = cor_var, value = r)
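If you prefer to avoid extra packages for the reshaping step, the same long layout can be sketched in base R. This is an alternative of my own, not the author's method; the column names rowname, cor_var, and r are chosen to match the ones used in this article.

```r
# Correlation matrix, rounded to two decimals for readable labels
r_mat <- round(cor(mtcars), 2)

# as.table() + as.data.frame() gives one row per cell of the matrix
df_long <- as.data.frame(as.table(r_mat), responseName = "r",
                         stringsAsFactors = FALSE)
names(df_long)[1:2] <- c("rowname", "cor_var")

head(df_long)
```

With 11 variables in mtcars, the result has 11 x 11 = 121 rows, one per pair (including each variable paired with itself).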

Now that we have gathered the data, it is easy to plot with the ggplot2 package. We will use the geom_tile function to reproduce the plots above. The following code makes a basic tile-plot correlation matrix.

library(ggplot2)
df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
  labs(x = "variables", y = "variables")

To customize the plot, you can use the standard ggplot2 functions. For example:

df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
  labs(x = "variables", y = "variables") +
  scale_fill_gradient(low = "light yellow", high = "dark green")

If you want to add the actual values to your plot, you can use the following code:

df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
  labs(x = "variables", y = "variables") +
  scale_fill_gradient(low = "light yellow", high = "dark green") +
  geom_text(aes(label = r))
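Because r runs from -1 to +1, a diverging color scale centered on zero can make the sign easier to read than a two-color gradient. The sketch below is one possible variation, not the author's code: the colors, label size, and the base R reshaping step are my own assumptions.

```r
library(ggplot2)

# Long-format correlations built with base R (rounded for the labels)
r_mat <- round(cor(mtcars), 2)
df_long <- as.data.frame(as.table(r_mat), responseName = "r")
names(df_long)[1:2] <- c("rowname", "cor_var")

p <- ggplot(df_long, aes(rowname, cor_var, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = r), size = 3) +
  # Diverging scale: one color for negative r, another for positive r
  scale_fill_gradient2(low = "purple", mid = "white", high = "darkgreen",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = "variables", y = "variables")
p
```

Fixing limits to c(-1, 1) keeps the color key comparable across datasets, even when the observed correlations cover a narrower range.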

4. Another informative package is “PerformanceAnalytics”, which gives you the significance level (from the p-value), the distribution of each variable (histograms), and the correlation coefficients. For example:

library(PerformanceAnalytics)

# Pass the raw data: chart.Correlation() computes the correlations itself
chart.Correlation(mtcars)

The red stars in the figure indicate the level of significance: * = 0.05, ** = 0.01, *** = 0.001.
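The star notation can be reproduced by hand for any p-value. Here is a small base R sketch, assuming the conventional thresholds of 0.05, 0.01, and 0.001 for one, two, and three stars; the helper name p_to_stars is hypothetical and not part of any package.

```r
# Hypothetical helper: map p-values to significance stars
p_to_stars <- function(p) {
  as.character(cut(
    p,
    breaks = c(0, 0.001, 0.01, 0.05, 1),
    labels = c("***", "**", "*", ""),
    include.lowest = TRUE
  ))
}

# Example: stars for the mpg vs. wt correlation in mtcars
p_val <- cor.test(mtcars$mpg, mtcars$wt)$p.value
p_to_stars(p_val)
```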

5. Exploring the ‘lares’ Package

This interesting package adds a new dimension to our analysis. It ranks the pairwise correlations and displays them in descending order, which is particularly useful for identifying and analyzing the most strongly correlated variables.

library(lares)
corr_cross(mtcars, rm.na = TRUE, max_pvalue = 0.05, top = 15, grid = TRUE)
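The ranking idea itself can be sketched in base R, which helps clarify what such a ranked plot is showing. This is a simplified illustration under my own assumptions (no p-value filtering, each pair counted once), not a reimplementation of the package.

```r
r_mat <- cor(mtcars)

# Positions of the cells above the diagonal, so each pair appears once
idx <- which(upper.tri(r_mat), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(r_mat)[idx[, "row"]],
  var2 = colnames(r_mat)[idx[, "col"]],
  r    = r_mat[idx]
)

# Rank pairs from the strongest to the weakest absolute correlation
pairs <- pairs[order(-abs(pairs$r)), ]
head(pairs, 5)
```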

If you have any questions, you can leave them in the comments. If this post helped you, you can buy me a coffee: https://www.buymeacoffee.com/dashboard

R for Soil Science


Malambo Nchimunya Muloongo

Biotechnology & Biostatistics

1 year ago

Thank you.. the article has really helped


罗大伟

1 year ago

Nice graph but be careful -- simple (2-variable) correlations can in fact reflect a third (or more) "lurking variable" that is the reason for the apparent correlation. Always use your domain knowledge to see if these make sense.


Dr. Pradeep Kumar Dash

1 year ago

Great work, helped me a lot. please give an example of how to arrange data for correlation analysis
