FUN FACT: find those duplicates!
Samantha Bell
Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics
Using duplicated() in R
I thought I would share this fun and helpful R function, which makes it easy to find duplicate IDs in a dataset (a common task in many data analysis roles).
The function is called duplicated and it comes with base R, so no need to download a new package in order to test it out.
The best part about this function is that it allows for some flexibility on how you want to mark duplicates:
1) Only AFTER the first instance
2) All EXCEPT the last instance
3) Include EVERY id that has a duplicate
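As a quick illustration of the three options above, here is a minimal sketch on a tiny made-up vector. The `ids` values and the variable names are invented for this illustration and are not from any real dataset.

```r
# Toy vector of ids (hypothetical): "A" appears three times, "B" and "C" once each
ids <- c("A", "B", "A", "C", "A")

# 1) Flag every repeat AFTER the first instance
after_first <- duplicated(ids)

# 2) Flag every instance EXCEPT the last one (scan from the end)
before_last <- duplicated(ids, fromLast = TRUE)

# 3) Flag EVERY instance of an id that has a duplicate (either direction)
every_dup <- after_first | before_last

print(after_first)  # FALSE FALSE  TRUE FALSE  TRUE
print(before_last)  #  TRUE FALSE  TRUE FALSE FALSE
print(every_dup)    #  TRUE FALSE  TRUE FALSE  TRUE
```

Only option 3 flags all three "A" entries, which is usually what you want when pulling every affected record for review.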
The documentation can be found at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/duplicated, but taking a look at this example I put together will give you a good idea of how it works.
Copy and paste the text below into RStudio to give it a try yourself:

library(tidyverse)

# Create a sample data frame with some repeated patient IDs
myData <- data.frame(c(rep(10, 50), 11:60), rnorm(100))
colnames(myData) <- c("Patient_Id", "Result")  # Name the columns
head(myData)  # View the head to verify it is set up correctly

# Add variables to test different ways of marking duplicate IDs
myData <- myData %>%
  mutate(
    # Marks all repeats of the patient ID after the first instance
    test1 = duplicated(Patient_Id),
    # Works backwards: marks every instance except the last one
    test2 = duplicated(Patient_Id, fromLast = TRUE),
    # Marks ALL duplicates because it checks from both directions
    test3 = duplicated(Patient_Id) | duplicated(Patient_Id, fromLast = TRUE),
    # Why don't we want to use & instead of | ?
    # Because & leaves off BOTH the first and last instance
    test4 = duplicated(Patient_Id) & duplicated(Patient_Id, fromLast = TRUE)
  )

# We know that we should have 50 rows with Patient_Id 10
# Let's see what the different methods captured:
table(myData$test1)
table(myData$test2)
table(myData$test3)
table(myData$test4)

# Take a look at the data and see how it works!
View(myData)
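If you would rather confirm the counts in the console than scan the viewer, here is a small base R sketch that rebuilds the same ID layout and counts the TRUE flags for each approach. The names `ids`, `fwd`, and `bwd` are my own for this sketch.

```r
# Same id layout as the example: fifty rows with id 10, then ids 11 to 60 once each
ids <- c(rep(10, 50), 11:60)

fwd <- duplicated(ids)                   # repeats after the first instance
bwd <- duplicated(ids, fromLast = TRUE)  # instances before the last one

n_after_first <- sum(fwd)        # 49: misses the first row with id 10
n_before_last <- sum(bwd)        # 49: misses the last row with id 10
n_every_dup   <- sum(fwd | bwd)  # 50: catches every row with id 10
n_middle_only <- sum(fwd & bwd)  # 48: misses both the first and the last

print(c(n_after_first, n_before_last, n_every_dup, n_middle_only))
```

Only the version combined with | recovers all 50 rows, which matches what the tables above should show.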
Have fun!
Sam