Data Wrangling in R
Dr. Saurav Das
Research Director | Farming Systems Trial | Rodale Institute | Soil Health, Biogeochemistry of Carbon & Nitrogen, Environmental Microbiology, and Data Science | Outreach & Extension | Vibe coding
R, with its suite of packages known as the tidyverse, offers powerful capabilities for data manipulation. This post gives an overview of the primary data-manipulation functions, with a few hypothetical examples.
#install tidyverse (if you don't already have the package installed on your system)
install.packages("tidyverse", dependencies = TRUE)
#load the library
library(tidyverse)
Before diving into data manipulation, it's essential to grasp the concepts of data structure in R: observations, variables, and values.
Together, these three concepts structure data into a format that's easy to analyze. A typical dataset in R is essentially a collection of values (data points), organized into variables (columns), for a series of observations (rows).
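For example, a tiny hypothetical tibble makes this concrete: each row is one observation, each column a variable, and every cell holds a single value.
#hypothetical illustration: rows = observations, columns = variables, cells = values
plots <- tibble(
  Plot  = c("A", "B", "C"),        # identifier variable
  Crop  = c("Corn", "Soy", "Oat"), # categorical variable
  Yield = c(11.2, 3.4, 4.8)        # numeric variable
)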
Data wrangling
1. Pivoting Data
Transforming data from wide to long format (or vice-versa) is often a starting point for analysis.
- Pivot Longer (Melting): converts data from a wide format to a long one.
#demo-data sets
df <- tibble( ID = c(1, 2, 3), Score_A = c(85, 90, 88), Score_B = c(80, 87, 84) )
df
# A tibble: 3 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 1 85 80
2 2 90 87
3 3 88 84
#pivot longer
df_long <- df %>% pivot_longer(cols = starts_with("Score"), names_to = "Subject", values_to = "Score")
#longer version of the dataset
df_long
# A tibble: 6 × 3
ID Subject Score
<dbl> <chr> <dbl>
1 1 Score_A 85
2 1 Score_B 80
3 2 Score_A 90
4 2 Score_B 87
5 3 Score_A 88
6 3 Score_B 84
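A small optional variation (not used above): pivot_longer() also takes a names_prefix argument, which strips a common prefix such as "Score_" from the new Subject labels.
#optional: strip the "Score_" prefix from the Subject labels
df %>% pivot_longer(cols = starts_with("Score"), names_to = "Subject",
                    names_prefix = "Score_", values_to = "Score")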
- Pivot Wider (Casting): essentially the inverse of pivoting longer; it spreads the data back into a wider format.
#longer to wider
df_wide <- df_long %>% pivot_wider(names_from = "Subject", values_from = "Score")
#wide dataset
df_wide
# A tibble: 3 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 1 85 80
2 2 90 87
3 3 88 84
2. Sorting and Filtering Data
- Sorting: arrange() organizes your data based on specific criteria.
- Filtering: filter() extracts rows that meet given conditions.
# Sorting by Score_A
df_sorted <- df %>% arrange(Score_A)
df_sorted
# A tibble: 3 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 1 85 80
2 3 88 84
3 2 90 87
# Filtering rows where Score_A is above 85
df_filtered <- df %>% filter(Score_A > 85)
df_filtered
# A tibble: 2 × 3
ID Score_A Score_B
<dbl> <dbl> <dbl>
1 2 90 87
2 3 88 84
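Two common variations on the same demo data: arrange(desc()) sorts in descending order, and filter() accepts several conditions at once (all must hold).
#descending sort and multi-condition filter (same demo data)
df %>% arrange(desc(Score_A))             # highest Score_A first
df %>% filter(Score_A > 85, Score_B > 85) # both conditions must be TRUE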
3. Splitting Text-to-Columns
- tidyr::separate(): splits one column into multiple columns.
#demo dataset
df_name <- tibble( Full_Name = c("Steve_Hervey", "Jane_Smith", "Elon_Musk") )
df_name
# A tibble: 3 × 1
Full_Name
<chr>
1 Steve_Hervey
2 Jane_Smith
3 Elon_Musk
#splitting the first and last name
df_name_split <- df_name %>% separate(Full_Name, into = c("First", "Last"), sep = "_")
df_name_split
# A tibble: 3 × 2
First Last
<chr> <chr>
1 Steve Hervey
2 Jane Smith
3 Elon Musk
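The reverse operation is tidyr::unite(), which pastes columns back together; a quick sketch using the split data from above:
#re-combining First and Last into a single column (reverse of separate)
df_name_split %>% unite("Full_Name", First, Last, sep = "_")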
4. Merging and Combining Data
- Joining Data: combine datasets based on a common identifier.
#demo dataset one
df1 <- tibble(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df1
# A tibble: 3 × 2
ID Name
<dbl> <chr>
1 1 Alice
2 2 Bob
3 3 Charlie
#demo dataset two
df2 <- tibble(ID = c(1, 2, 3), Score = c(95, 87, 92))
df2
# A tibble: 3 × 2
ID Score
<dbl> <dbl>
1 1 95
2 2 87
3 3 92
#use left_join() to join the two datasets; the 'by' argument names the key column to match on (here, ID)
left_joined <- left_join(df1, df2, by = "ID")
left_joined
# A tibble: 3 × 3
ID Name Score
<dbl> <chr> <dbl>
1 1 Alice 95
2 2 Bob 87
3 3 Charlie 92
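left_join() keeps every row of the first table; the other join verbs use the same syntax and differ only in which rows they keep.
#other join types, same syntax (shown on the demo tables)
inner_join(df1, df2, by = "ID") # only rows with a match in both tables
full_join(df1, df2, by = "ID")  # all rows from both tables
anti_join(df1, df2, by = "ID")  # rows of df1 with no match in df2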
5. Grouping Data
- Grouping: group_by() segments the data, and summarise() computes on each segment.
df_long %>% group_by(Subject) %>% summarise(Mean_Score = mean(Score, na.rm = TRUE))
# A tibble: 2 × 2
Subject Mean_Score
<chr> <dbl>
1 Score_A 87.7
2 Score_B 83.7
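summarise() can return several statistics per group at once, and n() counts the rows in each group; for example:
#several group-wise summaries in one call
df_long %>%
  group_by(Subject) %>%
  summarise(Mean_Score = mean(Score, na.rm = TRUE),
            SD_Score   = sd(Score, na.rm = TRUE),
            n_obs      = n())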
6. Text Functions with stringr
- Detect Patterns: use stringr's str_detect() to find specific character sequences.
# Detect whether each name contains the letter 'l'
names <- c("Alice", "Bob", "Charlie")
str_detect(names, "l")
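str_detect() returns a logical vector (TRUE FALSE TRUE here, since only "Alice" and "Charlie" contain an "l"). A few other stringr helpers, shown on the same vector:
#a few more stringr functions on the same vector
str_replace(names, "l", "L") # replace the first "l" in each string
str_to_upper(names)          # convert to upper case
str_length(names)            # number of characters in each string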
7. Transposing Data
- Sometimes, switching rows and columns is what's needed. Base R's t() transposes the data, and as.data.frame() converts the result back to a data frame:
df_transposed <- as.data.frame(t(df))
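Note that t() first converts the data frame to a matrix, so mixed-type columns are coerced to character. One way around this (a sketch, not the only approach) is to transpose only the numeric columns and use the IDs as column names:
#transpose only the numeric scores, using ID as the new column names
df_t <- as.data.frame(t(df[, -1]))
colnames(df_t) <- df$ID
df_t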
8. Modifying Data with mutate
- With mutate(), you can create or modify columns.
df_with_total <- df %>% mutate(Total_Score = Score_A + Score_B)
df_with_total
# A tibble: 3 × 4
ID Score_A Score_B Total_Score
<dbl> <dbl> <dbl> <dbl>
1 1 85 80 165
2 2 90 87 177
3 3 88 84 172
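mutate() also combines well with case_when() for conditional columns; a sketch on the same demo data (the grade cut-offs are made up):
#conditional column with case_when() (hypothetical cut-offs)
df %>% mutate(Grade = case_when(
  Score_A >= 90 ~ "A",
  Score_A >= 85 ~ "B",
  TRUE          ~ "C"
))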
9. Deduplicating Data
- distinct(): removes duplicate rows.
df_dupes <- tibble(ID = c(1, 2, 2, 3), Name = c("Alice", "Bob", "Bob", "Charlie"))
df_dupes
# A tibble: 4 × 2
ID Name
<dbl> <chr>
1 1 Alice
2 2 Bob
3 2 Bob
4 3 Charlie
df_unique <- df_dupes %>% distinct()
df_unique
# A tibble: 3 × 2
ID Name
<dbl> <chr>
1 1 Alice
2 2 Bob
3 3 Charlie
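distinct() can also deduplicate on selected columns while keeping the others, via .keep_all = TRUE:
#keep the first row for each unique Name, retaining the other columns
df_dupes %>% distinct(Name, .keep_all = TRUE)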
Conclusion
This is just a glimpse into the data manipulation capabilities of R and the tidyverse. Each function has its own wealth of parameters and capabilities that cater to all kinds of scenarios, offering a versatile approach to data analysis.
To answer Dr. Pradeep Kumar Dash's question:
For correlation tests and to visualize the results using the "corrplot" package in R, the data should typically be structured in a wide format. Each variable (column) you want to include in the correlation analysis should be numeric.
Now let's create a demo dataset (this is not real data, just hypothetical):
# Creating a static dataset
data <- data.frame(
ID = 1:10,
Age = c(25, 32, 28, 45, 29, 34, 40, 23, 38, 27),
Height_cm = c(167, 175, 169, 173, 170, 168, 172, 165, 174, 171),
Weight_kg = c(68, 74, 70, 76, 72, 71, 73, 69, 75, 70),
Avg_Sleep_hrs = c(7, 8, 6.5, 7.5, 8, 6, 7, 7.5, 6.5, 8)
)
# Printing the dataset
print(data)
ID Age Height_cm Weight_kg Avg_Sleep_hrs
1 1 25 167 68 7.0
2 2 32 175 74 8.0
3 3 28 169 70 6.5
4 4 45 173 76 7.5
5 5 29 170 72 8.0
6 6 34 168 71 6.0
7 7 40 172 73 7.0
8 8 23 165 69 7.5
9 9 38 174 75 6.5
10 10 27 171 70 8.0
Before performing the correlation test, there's an essential preparatory step to highlight. For our dataset, we want to exclude the identifier column, which in our case is the "ID" column. The reasoning behind this is that IDs typically don't have any meaningful numeric relationship with other variables, making them irrelevant for correlation analyses.
Why and How to Exclude Columns in R? In R, data frame and matrix indexing works by specifying rows and columns in the [row, column] format. When you want to exclude specific rows or columns, you can use negative indexing.
- Key Point: to exclude the "ID" column (the first column in our dataset), you can use:
name_of_your_dataframe[,-1]
Here, [,-1] implies that we're keeping all rows (since there's nothing before the comma) and excluding the first column (because of the -1 after the comma).
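Applied to our demo data, that looks like:
#first three rows, with the ID column dropped
head(data[, -1], 3)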
Now let's run the correlation test and build the correlation matrix.
#compute the correlation matrix (excluding the ID column)
cor_matrix <- cor(data[,-1])
print(cor_matrix)
Age Height_cm Weight_kg Avg_Sleep_hrs
Age 1.0000000 0.6824145 0.87910167 -0.20156158
Height_cm 0.6824145 1.0000000 0.85834395 0.22797276
Weight_kg 0.8791017 0.8583439 1.00000000 0.08185035
Avg_Sleep_hrs -0.2015616 0.2279728 0.08185035 1.00000000
And to plot it using the corrplot package:
#install the package if it is not already installed
install.packages("corrplot")
#load the library
library(corrplot)
#compute the correlation and plot it in one step
corrplot(cor(data[,-1]))
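corrplot() also takes display options; for instance, method changes the glyph and type = "upper" draws only the upper triangle (see the corrplot documentation for the full list):
#alternative display: numbers, upper triangle only
corrplot(cor_matrix, method = "number", type = "upper")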
This post provides a concise overview. For a comprehensive guide, the R documentation and tidyverse website are invaluable resources. Happy data wrangling! I will keep updating this post from time to time to include more relevant code and transformations. For further reading: https://r4ds.had.co.nz/wrangle-intro.html