Data Wrangling in R

R, with its suite of packages known as the tidyverse, offers unparalleled capabilities for data manipulation. This post gives an overview of the primary data manipulation functions, with a few hypothetical examples.

# Install tidyverse (if you don't already have the package installed on your system)
install.packages("tidyverse", dependencies = TRUE)

# Load the library
library(tidyverse)

Before diving into data manipulation, it's essential to grasp the concepts of data structure in R: observations, variables, and values.

  • Observations: Each observation represents an individual unit of analysis. In a dataset, an observation might be a person, a company, a country, or any entity you are studying. In R's data frames or tibbles, each observation is typically a row.
  • Variables: Variables capture specific characteristics of an observation. For instance, if the observation is a person, the variables could be their age, name, or height. In a dataset, variables are usually represented as columns.
  • Values: These are the actual data points for a variable for a specific observation. For example, if "Age" is a variable, then "25" might be the value of the Age variable for one specific observation (or person).

Together, these three concepts structure data into a format that's easy to analyze. A typical dataset in R is essentially a collection of values (data points), organized into variables (columns), for a series of observations (rows).
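
To make these three concepts concrete, here is a tiny hypothetical example (the names and ages below are invented purely for illustration):

# A tiny hypothetical dataset: tibble() comes from the tidyverse loaded above
people <- tibble(
  Name = c("Asha", "Ben"),   # variable: Name (one column)
  Age  = c(25, 31)           # variable: Age (one column)
)

people
# Each of the two rows is one observation (a person); the cell "25" is the
# value of the Age variable for the first observation.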

Tidy data graphic from "R for Data Science" by Garrett Grolemund and Hadley Wickham

Data wrangling

1. Pivoting Data

Transforming data from wide to long format (or vice-versa) is often a starting point for analysis.

• Pivot Longer (Melting): Converts data from a wide format to a long one.

# Demo dataset
df <- tibble( ID = c(1, 2, 3), Score_A = c(85, 90, 88), Score_B = c(80, 87, 84) )
        
df
# A tibble: 3 × 3
     ID Score_A Score_B
  <dbl>   <dbl>   <dbl>
1     1      85      80
2     2      90      87
3     3      88      84        
#pivot longer
df_long <- df %>% pivot_longer(cols = starts_with("Score"), names_to = "Subject", values_to = "Score")


#longer version of the dataset
df_long
# A tibble: 6 × 3
     ID Subject Score
  <dbl> <chr>   <dbl>
1     1 Score_A    85
2     1 Score_B    80
3     2 Score_A    90
4     2 Score_B    87
5     3 Score_A    88
6     3 Score_B    84        

• Pivot Wider (Casting): Essentially the inverse of pivoting longer; it spreads the data back into a wider format.

#longer to wider
df_wide <- df_long %>% pivot_wider(names_from = "Subject", values_from = "Score")

#wide dataset
df_wide

# A tibble: 3 × 3
     ID Score_A Score_B
  <dbl>   <dbl>   <dbl>
1     1      85      80
2     2      90      87
3     3      88      84
        

2. Sorting and Filtering Data

• Sorting: Organizes your data based on specific criteria.

• Filtering: Extracts specific data based on conditions.

# Sorting by Score_A 
df_sorted <- df %>% arrange(Score_A) 
df_sorted
# A tibble: 3 × 3
     ID Score_A Score_B
  <dbl>   <dbl>   <dbl>
1     1      85      80
2     3      88      84
3     2      90      87

# Filtering rows where Score_A is above 85 
df_filtered <- df %>% filter(Score_A > 85)

df_filtered
# A tibble: 2 × 3
     ID Score_A Score_B
  <dbl>   <dbl>   <dbl>
1     2      90      87
2     3      88      84        
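
filter() also accepts multiple conditions, which are combined with AND when separated by commas; a quick sketch on the same demo data:

# Keep rows where both scores exceed their thresholds
df %>% filter(Score_A > 85, Score_B > 85)
# Only the row for ID 2 (Score_A = 90, Score_B = 87) satisfies both conditions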

3. Splitting Text-to-Columns

• tidyr::separate(): Effortlessly splits a column into multiple columns.

# Demo dataset
df_name <- tibble(Full_Name = c("Steve_Hervey", "Jane_Smith", "Elon_Musk"))
df_name
# A tibble: 3 × 1
  Full_Name   
  <chr>       
1 Steve_Hervey
2 Jane_Smith  
3 Elon_Musk   


# Splitting the first and last name
df_name_split <- df_name %>% separate(Full_Name, into = c("First", "Last"), sep = "_")
df_name_split
# A tibble: 3 × 2
  First Last  
  <chr> <chr> 
1 Steve Hervey
2 Jane  Smith 
3 Elon  Musk          
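
tidyr also provides unite(), the inverse of separate(); a brief sketch that re-joins the two columns created above:

# Re-combine First and Last into a single underscore-separated column
df_name_split %>% unite("Full_Name", First, Last, sep = "_")
# This returns the original single-column layout, with values like "Steve_Hervey"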

4. Merging and Combining Data

• Joining Data: Combine datasets based on a common identifier.

#demo dataset one
df1 <- tibble(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df1
# A tibble: 3 × 2
     ID Name   
  <dbl> <chr>  
1     1 Alice  
2     2 Bob    
3     3 Charlie

#demo dataset 2 
df2 <- tibble(ID = c(1, 2,3), Score = c(95, 87, 92))
df2
# A tibble: 3 × 2
     ID Score
  <dbl> <dbl>
1     1    95
2     2    87
3     3    92

# Using left_join() to join the two datasets; by = names the common (reference)
# column used for matching, which here is ID

left_joined <- left_join(df1, df2, by = "ID")
left_joined
# A tibble: 3 × 3
     ID Name    Score
  <dbl> <chr>   <dbl>
1     1 Alice      95
2     2 Bob        87
3     3 Charlie    92        
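
left_join() is just one of several dplyr joins, and the others use the same by = syntax; a short sketch on the same demo tables:

# Keep only IDs present in both tables (same 3 rows here, since every ID matches)
inner_join(df1, df2, by = "ID")

# Keep all IDs from either table (again the same 3 rows here)
full_join(df1, df2, by = "ID")

# Keep rows of df1 that have no match in df2 (empty here, since all IDs match)
anti_join(df1, df2, by = "ID")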

5. Grouping Data

• Grouping: Segment the data, then compute summaries on those segments.

df_long %>% group_by(Subject) %>% summarise(Mean_Score = mean(Score, na.rm = TRUE))
# A tibble: 2 × 2
  Subject Mean_Score
  <chr>        <dbl>
1 Score_A       87.7
2 Score_B       83.7        
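
summarise() can compute several statistics per group at once; a minimal sketch on the same df_long, adding a count and a standard deviation:

df_long %>%
  group_by(Subject) %>%
  summarise(
    n          = n(),                         # observations per subject
    Mean_Score = mean(Score, na.rm = TRUE),   # mean as above
    SD_Score   = sd(Score, na.rm = TRUE)      # spread within each subject
  )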

6. Text Functions with stringr

• Detect Patterns: Use stringr to find specific sequences.

names <- c("Alice", "Bob", "Charlie")

# Detect if names contain the letter 'l'
str_detect(names, "l")
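
The call above returns one logical value per element. Related stringr helpers such as str_replace() and str_to_upper() follow the same vectorised pattern; a brief sketch:

# str_detect() returns TRUE for "Alice" and "Charlie" and FALSE for "Bob"

# Replace the first 'l' in each name with 'L'
str_replace(names, "l", "L")   # "ALice" "Bob" "CharLie"

# Convert every name to upper case
str_to_upper(names)            # "ALICE" "BOB" "CHARLIE"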

7. Transposing Data

• Sometimes, switching rows and columns is what's needed:

df_transposed <- as.data.frame(t(df))        
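
One caveat worth noting: t() first coerces the data frame to a matrix, so every value must end up sharing a single type; if df contained a character column, all transposed values would become character. A quick check on the result above:

class(df_transposed)   # "data.frame" (no longer a tibble)
# The rows of df_transposed now correspond to ID, Score_A and Score_B.
# Transposing is therefore best reserved for fully numeric tables.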

8. Modifying Data with mutate

• With mutate(), you can create or modify columns.

df_with_total <- df %>% mutate(Total_Score = Score_A + Score_B)
df_with_total
# A tibble: 3 × 4
     ID Score_A Score_B Total_Score
  <dbl>   <dbl>   <dbl>       <dbl>
1     1      85      80         165
2     2      90      87         177
3     3      88      84         172        
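
mutate() also pairs nicely with conditional helpers such as dplyr's case_when(); a quick sketch that adds a hypothetical Grade column (the cut-offs below are invented for illustration):

# Assign a letter grade based on Score_A (hypothetical cut-offs)
df %>%
  mutate(
    Grade = case_when(
      Score_A >= 90 ~ "A",   # top band
      Score_A >= 85 ~ "B",   # middle band
      TRUE          ~ "C"    # everything else
    )
  )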

9. Deduplicating Data

• distinct(): Get rid of duplicate rows.

df_dupes <- tibble(ID = c(1, 2, 2, 3), Name = c("Alice", "Bob", "Bob", "Charlie")) 
df_dupes
# A tibble: 4 × 2
     ID Name   
  <dbl> <chr>  
1     1 Alice  
2     2 Bob    
3     2 Bob    
4     3 Charlie


df_unique <- df_dupes %>% distinct()
df_unique
# A tibble: 3 × 2
     ID Name   
  <dbl> <chr>  
1     1 Alice  
2     2 Bob    
3     3 Charlie        
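
distinct() can also deduplicate on selected columns only; with .keep_all = TRUE it keeps the remaining columns of the first matching row. A brief sketch on the same data:

# Keep one row per ID, retaining the other columns of the first match
df_dupes %>% distinct(ID, .keep_all = TRUE)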

Conclusion

This is just a glimpse into the data manipulation capabilities of R and the tidyverse. Each function has its own wealth of parameters and capabilities that cater to all kinds of scenarios, offering a versatile approach to data analysis.


To answer Dr. Pradeep Kumar Dash

For correlation tests and to visualize the results using the "corrplot" package in R, the data should typically be structured in a wide format. Each variable (column) you want to include in the correlation analysis should be numeric.

Now let's create a demo dataset (this is not a real dataset, just a hypothetical one):

# Creating a static dataset
data <- data.frame(
  ID = 1:10,
  Age = c(25, 32, 28, 45, 29, 34, 40, 23, 38, 27),
  Height_cm = c(167, 175, 169, 173, 170, 168, 172, 165, 174, 171),
  Weight_kg = c(68, 74, 70, 76, 72, 71, 73, 69, 75, 70),
  Avg_Sleep_hrs = c(7, 8, 6.5, 7.5, 8, 6, 7, 7.5, 6.5, 8)
)

# Printing the dataset
print(data)
   ID Age Height_cm Weight_kg Avg_Sleep_hrs
1   1  25       167        68           7.0
2   2  32       175        74           8.0
3   3  28       169        70           6.5
4   4  45       173        76           7.5
5   5  29       170        72           8.0
6   6  34       168        71           6.0
7   7  40       172        73           7.0
8   8  23       165        69           7.5
9   9  38       174        75           6.5
10 10  27       171        70           8.0        

Before performing the correlation test, there's an essential preparatory step to highlight. For our dataset, we want to exclude the identifier column, which in our case is the "ID" column. The reasoning behind this is that IDs typically don't have any meaningful numeric relationship with other variables, making them irrelevant for correlation analyses.

Why and How to Exclude Columns in R? In R, data frame and matrix indexing works by specifying rows and columns using the [row, column] format. When you want to exclude specific rows or columns, you can use negative indexing.

• Key Point: To exclude the "ID" column (which is the first column in our dataset), you can use the code:

name_of_your_dataframe[,-1]        

Here, [,-1] implies that we're keeping all rows (since there's nothing before the comma) and excluding the first column (because of the -1 after the comma).

Breaking it Down:

  • The [ ] brackets are used for indexing.
  • The format inside is [row, column].
  • If you want to remove specific rows, specify them before the comma; to target columns, place them after the comma.
  • The negative sign (-) before a number indicates exclusion of that specific row or column (see the short example after this list).
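
As a short illustration of these rules, either of the following drops the ID column from the demo data; select() is the dplyr alternative (available since the tidyverse is loaded):

# Base R: keep all rows, drop the first column by position
data[, -1]

# dplyr: drop the ID column by name
data %>% select(-ID)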

Now let's run the correlation test and create the correlation matrix

# Computing the correlation matrix (excluding the ID column)
cor_matrix <- cor(data[, -1])
print(cor_matrix)
                     Age Height_cm  Weight_kg Avg_Sleep_hrs
Age            1.0000000 0.6824145 0.87910167   -0.20156158
Height_cm      0.6824145 1.0000000 0.85834395    0.22797276
Weight_kg      0.8791017 0.8583439 1.00000000    0.08185035
Avg_Sleep_hrs -0.2015616 0.2279728 0.08185035    1.00000000        

And to plot using the corrplot package:

# Install the package, if not already installed
install.packages("corrplot")

# Load the library
library(corrplot)

# Plotting and computing the correlation at the same time
corrplot(cor(data[, -1]))
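
corrplot() also accepts display options such as method and type; a brief sketch using the matrix computed above:

# Show only the upper triangle, with coloured squares instead of circles
corrplot(cor_matrix, method = "color", type = "upper")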

This post provides a concise overview. For a comprehensive guide, the R documentation and the tidyverse website are invaluable resources. Happy data wrangling! I will keep updating this post from time to time to include more relevant code and transformations. For further reading: https://r4ds.had.co.nz/wrangle-intro.html

