Data Wrangling in R
Dr. Saurav Das
Research Director | Farming Systems Trial | Rodale Institute | Soil Health, Biogeochemistry of Carbon & Nitrogen, Environmental Microbiology, and Data Science | Outreach & Extension | Vibe coding
R, with its suite of packages known as the tidyverse, offers powerful capabilities for data manipulation. This post gives an overview of the primary data-manipulation functions, with a few hypothetical examples.
#install tidyverse (if you don't already have the package installed on your system)
install.packages("tidyverse", dependencies = TRUE)
#load the library
library(tidyverse)
Before diving into data manipulation, it's essential to grasp the concepts of data structure in R: observations, variables, and values.
Together, these three concepts structure data into a format that's easy to analyze. A typical dataset in R is essentially a collection of values (data points), organized into variables (columns), for a series of observations (rows).
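For example, a tiny hypothetical tibble makes this concrete: each row is one observation, each column a variable, and every cell holds a single value.
#hypothetical illustration: rows = observations, columns = variables, cells = values
plots <- tibble(
  Plot  = c("A", "B", "C"),        # identifier variable
  Crop  = c("Corn", "Soy", "Oat"), # categorical variable
  Yield = c(11.2, 3.4, 4.8)        # numeric variable
)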
Data wrangling
1. Pivoting Data
Transforming data from wide to long format (or vice-versa) is often a starting point for analysis.
- Pivot Longer (Melting): converts data from a wide format to a long one.
#demo-data sets
df <- tibble( ID = c(1, 2, 3), Score_A = c(85, 90, 88), Score_B = c(80, 87, 84) )
df
# A tibble: 3 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 1 85 80
2 2 90 87
3 3 88 84
#pivot longer
df_long <- df %>% pivot_longer(cols = starts_with("Score"), names_to = "Subject", values_to = "Score")
#longer version of the dataset
df_long
# A tibble: 6 × 3
ID Subject Score
<dbl> <chr> <dbl>
1 1 Score_A 85
2 1 Score_B 80
3 2 Score_A 90
4 2 Score_B 87
5 3 Score_A 88
6 3 Score_B 84
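A small optional variation (not used above): pivot_longer() also takes a names_prefix argument, which strips a common prefix such as "Score_" from the new Subject labels.
#optional: strip the "Score_" prefix from the Subject labels
df %>% pivot_longer(cols = starts_with("Score"), names_to = "Subject",
                    names_prefix = "Score_", values_to = "Score")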
- Pivot Wider (Casting): essentially the inverse of pivoting longer; it spreads the data back into a wider format.
#longer to wider
df_wide <- df_long %>% pivot_wider(names_from = "Subject", values_from = "Score")
#wide dataset
df_wide
# A tibble: 3 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 1 85 80
2 2 90 87
3 3 88 84
2. Sorting and Filtering Data
- Sorting: arrange() organizes your data based on specific criteria.
- Filtering: filter() extracts rows that meet given conditions.
# Sorting by Score_A
df_sorted <- df %>% arrange(Score_A)
df_sorted
# A tibble: 3 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 1 85 80
2 3 88 84
3 2 90 87
# Filtering rows where Score_A is above 85
df_filtered <- df %>% filter(Score_A > 85)
df_filtered
# A tibble: 2 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 2 90 87
2 3 88 84
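Two common variations on the same demo data: arrange(desc()) sorts in descending order, and filter() accepts several conditions at once (all must hold).
#descending sort and multi-condition filter (same demo data)
df %>% arrange(desc(Score_A))             # highest Score_A first
df %>% filter(Score_A > 85, Score_B > 85) # both conditions must be TRUE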
3. Splitting Text-to-Columns
- tidyr::separate(): splits one column into multiple columns.
#demo dataset
df_name <- tibble( Full_Name = c("Steve_Hervey", "Jane_Smith", "Elon_Musk") )
df_name
# A tibble: 3 × 1
Full_Name
<chr>
1 Steve_Hervey
2 Jane_Smith
3 Elon_Musk
#splitting the first and last name
df_name_split <- df_name %>% separate(Full_Name, into = c("First", "Last"), sep = "_")
df_name_split
# A tibble: 3 × 2
First Last
<chr> <chr>
1 Steve Hervey
2 Jane Smith
3 Elon Musk
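The reverse operation is tidyr::unite(), which pastes columns back together; a quick sketch using the split data from above:
#re-combining First and Last into a single column (reverse of separate)
df_name_split %>% unite("Full_Name", First, Last, sep = "_")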
4. Merging and Combining Data
- Joining Data: combine datasets based on a common identifier.
#demo dataset one
df1 <- tibble(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df1
# A tibble: 3 × 2
ID Name
<dbl> <chr>
1 1 Alice
2 2 Bob
3 3 Charlie
#demo dataset two
df2 <- tibble(ID = c(1, 2, 3), Score = c(95, 87, 92))
df2
# A tibble: 3 × 2
ID Score
<dbl> <dbl>
1 1 95
2 2 87
3 3 92
#use left_join() to join the two datasets; the 'by' argument names the key column to match on (here, ID)
left_joined <- left_join(df1, df2, by = "ID")
left_joined
# A tibble: 3 × 3
ID Name Score
<dbl> <chr> <dbl>
1 1 Alice 95
2 2 Bob 87
3 3 Charlie 92
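left_join() keeps every row of the first table; the other join verbs use the same syntax and differ only in which rows they keep.
#other join types, same syntax (shown on the demo tables)
inner_join(df1, df2, by = "ID") # only rows with a match in both tables
full_join(df1, df2, by = "ID")  # all rows from both tables
anti_join(df1, df2, by = "ID")  # rows of df1 with no match in df2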
5. Grouping Data
- Grouping: group_by() segments the data, and summarise() computes on each segment.
df_long %>% group_by(Subject) %>% summarise(Mean_Score = mean(Score, na.rm = TRUE))
# A tibble: 2 × 2
Subject Mean_Score
<chr> <dbl>
1 Score_A 87.7
2 Score_B 83.7
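summarise() can return several statistics per group at once, and n() counts the rows in each group; for example:
#several group-wise summaries in one call
df_long %>%
  group_by(Subject) %>%
  summarise(Mean_Score = mean(Score, na.rm = TRUE),
            SD_Score   = sd(Score, na.rm = TRUE),
            n_obs      = n())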
6. Text Functions with stringr
- Detect Patterns: use stringr's str_detect() to find specific character sequences.
# Detect whether each name contains the letter 'l'
names <- c("Alice", "Bob", "Charlie")
str_detect(names, "l")
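str_detect() returns a logical vector (TRUE FALSE TRUE here, since only "Alice" and "Charlie" contain an "l"). A few other stringr helpers, shown on the same vector:
#a few more stringr functions on the same vector
str_replace(names, "l", "L") # replace the first "l" in each string
str_to_upper(names)          # convert to upper case
str_length(names)            # number of characters in each string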
7. Transposing Data
- Sometimes, switching rows and columns is what's needed. Base R's t() transposes the data, and as.data.frame() converts the result back to a data frame:
df_transposed <- as.data.frame(t(df))
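Note that t() first converts the data frame to a matrix, so mixed-type columns are coerced to character. One way around this (a sketch, not the only approach) is to transpose only the numeric columns and use the IDs as column names:
#transpose only the numeric scores, using ID as the new column names
df_t <- as.data.frame(t(df[, -1]))
colnames(df_t) <- df$ID
df_t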
8. Modifying Data with mutate
- With mutate(), you can create or modify columns.
df_with_total <- df %>% mutate(Total_Score = Score_A + Score_B)
df_with_total
# A tibble: 3 × 4
ID Score_A Score_B Total_Score
<dbl> <dbl> <dbl> <dbl>
1 1 85 80 165
2 2 90 87 177
3 3 88 84 172
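mutate() also combines well with case_when() for conditional columns; a sketch on the same demo data (the grade cut-offs are made up):
#conditional column with case_when() (hypothetical cut-offs)
df %>% mutate(Grade = case_when(
  Score_A >= 90 ~ "A",
  Score_A >= 85 ~ "B",
  TRUE          ~ "C"
))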
9. Deduplicating Data
- distinct(): removes duplicate rows.
df_dupes <- tibble(ID = c(1, 2, 2, 3), Name = c("Alice", "Bob", "Bob", "Charlie"))
df_dupes
# A tibble: 4 × 2
ID Name
<dbl> <chr>
1 1 Alice
2 2 Bob
3 2 Bob
4 3 Charlie
df_unique <- df_dupes %>% distinct()
df_unique
# A tibble: 3 × 2
ID Name
<dbl> <chr>
1 1 Alice
2 2 Bob
3 3 Charlie
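distinct() can also deduplicate on selected columns while keeping the others, via .keep_all = TRUE:
#keep the first row for each unique Name, retaining the other columns
df_dupes %>% distinct(Name, .keep_all = TRUE)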
Conclusion
This is just a glimpse into the data manipulation capabilities of R and the tidyverse. Each function has its own wealth of parameters and capabilities that cater to all kinds of scenarios, offering a versatile approach to data analysis.
To answer Dr. Pradeep Kumar Dash's question:
For correlation tests and to visualize the results using the "corrplot" package in R, the data should typically be structured in a wide format. Each variable (column) you want to include in the correlation analysis should be numeric.
Now let's create a demo dataset (this is not real data, just hypothetical):
# Creating a static dataset
data <- data.frame(
ID = 1:10,
Age = c(25, 32, 28, 45, 29, 34, 40, 23, 38, 27),
Height_cm = c(167, 175, 169, 173, 170, 168, 172, 165, 174, 171),
Weight_kg = c(68, 74, 70, 76, 72, 71, 73, 69, 75, 70),
Avg_Sleep_hrs = c(7, 8, 6.5, 7.5, 8, 6, 7, 7.5, 6.5, 8)
)
# Printing the dataset
print(data)
ID Age Height_cm Weight_kg Avg_Sleep_hrs
1 1 25 167 68 7.0
2 2 32 175 74 8.0
3 3 28 169 70 6.5
4 4 45 173 76 7.5
5 5 29 170 72 8.0
6 6 34 168 71 6.0
7 7 40 172 73 7.0
8 8 23 165 69 7.5
9 9 38 174 75 6.5
10 10 27 171 70 8.0
Before performing the correlation test, there's an essential preparatory step to highlight. For our dataset, we want to exclude the identifier column, which in our case is the "ID" column. The reasoning behind this is that IDs typically don't have any meaningful numeric relationship with other variables, making them irrelevant for correlation analyses.
Why and How to Exclude Columns in R? In R, data frame and matrix indexing works by specifying rows and columns in the [row, column] format. When you want to exclude specific rows or columns, you can use negative indexing.
- Key Point: to exclude the "ID" column (the first column in our dataset), you can use:
name_of_your_dataframe[,-1]
Here, [,-1] implies that we're keeping all rows (since there's nothing before the comma) and excluding the first column (because of the -1 after the comma).
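Applied to our demo data, that looks like:
#first three rows, with the ID column dropped
head(data[, -1], 3)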
Now let's run the correlation test and build the correlation matrix.
#compute the correlation matrix (excluding the ID column)
cor_matrix <- cor(data[,-1])
print(cor_matrix)
Age Height_cm Weight_kg Avg_Sleep_hrs
Age 1.0000000 0.6824145 0.87910167 -0.20156158
Height_cm 0.6824145 1.0000000 0.85834395 0.22797276
Weight_kg 0.8791017 0.8583439 1.00000000 0.08185035
Avg_Sleep_hrs -0.2015616 0.2279728 0.08185035 1.00000000
And to plot it using the corrplot package:
#install the package if it is not already installed
install.packages("corrplot")
#load the library
library(corrplot)
#compute the correlation and plot it in one step
corrplot(cor(data[,-1]))
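corrplot() also takes display options; for instance, method changes the glyph and type = "upper" draws only the upper triangle (see the corrplot documentation for the full list):
#alternative display: numbers, upper triangle only
corrplot(cor_matrix, method = "number", type = "upper")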
This post provides a concise overview. For a comprehensive guide, the R documentation and tidyverse website are invaluable resources. Happy data wrangling! I will keep updating this post from time to time to include more relevant code and transformations. For further reading: https://r4ds.had.co.nz/wrangle-intro.html