Correlation plots in R


In the field of statistics, the term ‘correlation’ commonly refers to the relationship that exists between two or more random variables. More precisely, it quantifies the degree of linear association between these variables.

(Note: For the purpose of our discussion, I will be using the ‘mtcars’ dataset. However, the mathematical background and underlying equations will not be covered in this article.)

Our main focus in this article is to construct correlation matrix plots, both with the help of dedicated packages and without them. We will start with the ‘corrplot’ package.

1. Using package corrplot

# load the dataset
data("mtcars")
# we will use the "corrplot" library
library(corrplot)
# cor() computes the correlation matrix; corrplot() draws it
corrplot(cor(mtcars))

It’s important to understand what the circles and the color key on the right side of the plot signify. They represent the ‘correlation coefficient,’ also known as the ‘r value,’ which indicates the strength and direction of the linear relationship between two variables. The value of r ranges from -1 to +1. A value of +1 indicates a perfect positive linear correlation: as one variable increases, the other increases as well. A value of -1 indicates a perfect negative linear correlation: as one variable increases, the other decreases. A value of 0 signifies no linear relationship. The color of each circle encodes the sign and magnitude of r, and the area of the circle reflects the absolute value of r, so stronger correlations appear as larger circles.
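To make these properties concrete, here is a minimal sketch using only base R. The variable names r_mpg_wt and r_mat are my own and do not come from any package.

```r
# Pearson correlation between fuel economy (mpg) and weight (wt)
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt)
r_mpg_wt  # negative: heavier cars tend to have lower mpg

# Full correlation matrix for all columns of mtcars
r_mat <- cor(mtcars)

# Every coefficient lies between -1 and +1
range(r_mat)
# Each variable is perfectly correlated with itself (diagonal of 1s)
all.equal(unname(diag(r_mat)), rep(1, ncol(r_mat)))
# The matrix is symmetric, so the upper and lower halves mirror each other
isSymmetric(r_mat)
```

The symmetry is the reason a plot of only the upper or lower half loses no information.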

Customizing the ‘corrplot’ Visualization:

The presentation of the correlation matrix can be modified according to your preference. The method can be set to ‘square,’ ‘circle,’ or ‘number,’ altering the matrix’s visual representation. Additionally, the ‘type’ can be changed to display either the upper or lower half of the matrix (keeping in mind that these are mirror images of each other). Therefore, you can choose to exhibit the upper, lower, or the entire matrix based on your needs.

corrplot(
 cor(mtcars),
 method = "square",
 type = "upper",
 tl.col = "black",
 tl.cex = 2,
 col = colorRampPalette(c("purple", "darkgreen"))(200)
)

You also have the option to generate a mixed-type matrix with the following code. In this example, the upper section utilizes the ‘square’ method, while the lower section displays numerical values corresponding to the correlation coefficients.

corrplot.mixed(cor(mtcars),
 upper = "square",
 lower = "number",
 addgrid.col = "black",
 tl.col = "black")

2. Using package ggcorrplot

You can achieve the same results using the package “ggcorrplot”.

library(ggcorrplot)
ggcorrplot(cor(mtcars))

To customize:

There are a lot of customization options. I have used only a few here; you can modify them based on your preference.

ggcorrplot(cor(mtcars),
 method = "circle",
 type = "lower",
 outline.color = "black",
 lab = TRUE,    # print the coefficients; lab_size only applies when lab = TRUE
 lab_size = 6)


3. Without a dedicated correlation-plot package (using ggplot2, so you can customize more)

Now, first, let's calculate and create a correlation matrix, and then we will see how to create a visualization using ggplot.

For this, we will use the package “rstatix”:
library(rstatix)
cor_test <- cor_mat(mtcars) # create the correlation matrix
cor_test

You can also get a matrix of p-values using the following code:

cor_p <- cor_pmat(mtcars)
cor_p
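To see where a single p-value of this kind comes from, here is a small sketch for one pair of variables using base R's cor.test(). I am assuming the default Pearson test here; I have not checked that it matches the defaults of cor_pmat() exactly. The name ct is my own.

```r
# Test whether the mpg-wt correlation differs from zero
# (cor.test() defaults to Pearson's product-moment correlation)
ct <- cor.test(mtcars$mpg, mtcars$wt)

ct$estimate  # the correlation coefficient r
ct$p.value   # p-value for the null hypothesis that the true correlation is 0
ct$conf.int  # 95% confidence interval for r
```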

Now that we have the matrix of r values, we can reshape it into long format: one column for the row variable, one column for the column variable (the keys), and one column for the actual r value, using the following code.

library(tidyr) # provides gather()
df <- cor_test %>% gather(-rowname, key = cor_var, value = r)
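If you would rather avoid extra packages for this step, the same kind of long table can be built with base R alone. This is only a sketch: the object name df_base is my own, and I round to two decimals just to keep the printed values short.

```r
# Correlation matrix as a plain numeric matrix, rounded for display
r_mat <- round(cor(mtcars), 2)

# as.table() followed by as.data.frame() unrolls the matrix
# into one row per cell: (row variable, column variable, value)
df_base <- as.data.frame(as.table(r_mat), stringsAsFactors = FALSE)
names(df_base) <- c("rowname", "cor_var", "r")

head(df_base)
```

The result has 11 x 11 = 121 rows, one for every pair of variables including each variable paired with itself.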

Now that we have gathered the data, it is easy to plot with the ggplot2 package. We will use the geom_tile function to reproduce the plots above. You can use the following code to make a basic tile-plot correlation matrix.

library(ggplot2)
df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
 labs(x = "variables", y = "variables")

To customize the plot, you can use the usual ggplot2 functions. For example:

df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
 labs(x = "variables", y = "variables") +
 scale_fill_gradient(low = "lightyellow", high = "darkgreen")

If you want to add the actual values to your plot, you can use the following code:

df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
 labs(x = "variables", y = "variables") +
 scale_fill_gradient(low = "lightyellow", high = "darkgreen") +
 geom_text(aes(label = r))
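Because r is signed, a diverging color scale centered on zero can make positive and negative correlations easier to tell apart than a single light-to-dark gradient. The following is a sketch of that idea, not a required step. It builds the long table with base R so that it runs on its own, and the names df_long and p and the color choices are my own.

```r
library(ggplot2)

# Long-format correlations built with base R
# (columns named rowname, cor_var and r, as in the text)
df_long <- as.data.frame(as.table(round(cor(mtcars), 2)),
                         stringsAsFactors = FALSE)
names(df_long) <- c("rowname", "cor_var", "r")

p <- ggplot(df_long, aes(rowname, cor_var, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = r), size = 3) +
  # diverging palette: one color for negative r, another for positive r,
  # white at 0, with the scale fixed to the full -1 to +1 range
  scale_fill_gradient2(low = "purple", mid = "white", high = "darkgreen",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = "variables", y = "variables")
p
```

Fixing the limits to -1 and +1 keeps the colors comparable across plots made from different datasets.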

4. Another informative package is “PerformanceAnalytics”, which shows the distribution of each variable (histograms), pairwise scatterplots, and the correlation coefficients with their significance levels. For example:

library(PerformanceAnalytics)
# pass the raw data, not cor(mtcars): the function computes the correlations itself
chart.Correlation(mtcars, histogram = TRUE)

The red stars in the figure indicate the level of significance (p-value): * = 0.05, ** = 0.01, *** = 0.001.
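The star notation can be reproduced by hand. Here is a base R sketch that tests each variable against mpg and maps the p-values to stars, assuming the conventional thresholds of 0.05, 0.01 and 0.001. It is an illustration of the notation, not the code PerformanceAnalytics runs internally, and the names vars, p_vals and stars are my own.

```r
# p-value of the correlation between mpg and every other variable
vars <- setdiff(names(mtcars), "mpg")
p_vals <- sapply(vars, function(v) cor.test(mtcars$mpg, mtcars[[v]])$p.value)

# Map p-values to stars:
# p <= 0.001 -> "***", p <= 0.01 -> "**", p <= 0.05 -> "*", otherwise ""
stars <- as.character(cut(p_vals,
                          breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", ""),
                          include.lowest = TRUE))

data.frame(variable = vars, p = signif(p_vals, 2), stars = stars)
```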

5. Exploring the ‘lares’ package:

This interesting package adds a new dimension to our analysis. It ranks the correlations between pairs of variables and displays them in descending order, which is particularly useful for identifying and analyzing the most strongly correlated variables.

library(lares)
# keep correlations with a p-value of at most 0.05 and show the top 15 pairs
corr_cross(mtcars, rm.na = TRUE, max_pvalue = 0.05, top = 15, grid = TRUE)
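The ranking idea itself is simple enough to sketch in base R. This is a simplified approximation, not what lares does internally: it applies no p-value filter and does no special handling of categorical columns. The names idx, pairs_df and ranked are my own.

```r
r_mat <- cor(mtcars)

# Keep each pair once: positions in the upper triangle, diagonal excluded
idx <- which(upper.tri(r_mat), arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(r_mat)[idx[, "row"]],
  var2 = colnames(r_mat)[idx[, "col"]],
  r    = r_mat[idx]
)

# Rank the pairs by the strength (absolute value) of the correlation
ranked <- pairs_df[order(abs(pairs_df$r), decreasing = TRUE), ]
head(ranked, 15)
```

With 11 variables there are 55 distinct pairs, so looking only at the top of the ranked table is usually enough.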

If you have any questions, you can put them in the comment box. If you felt this post helped you, you can buy me a coffee: https://www.buymeacoffee.com/dashboard

Malambo Nchimunya Muloongo

Biotechnology & Biostatistics

1 year ago

Thank you.. the article has really helped

Nice graph but be careful -- simple (2-variable) correlations can in fact reflect a third (or more) "lurking variable" that is the reason for the apparent correlation. Always use your domain knowledge to see if these make sense.

Great work, helped me a lot. please give an example of how to arrange data for correlation analysis
