FUN FACT: find those duplicates!
Andrew Wulf @andreuuuw


Using duplicated() in R


I thought I would share this fun & helpful R function which can be used to easily find duplicate ids in a dataset (a common task in many data analysis roles). 

The function is called duplicated and it comes with base R, so no need to download a new package in order to test it out.

The best part about this function is that it allows for some flexibility on how you want to mark duplicates:

1) Only AFTER the first instance
2) All EXCEPT the last instance
3) Include EVERY id that has a duplicate
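
To make the three options concrete, here is a minimal sketch on a tiny vector. The vector `ids` and the variable names are made up for illustration; only `duplicated()` and its `fromLast` argument come from base R.

```r
# Hypothetical example vector: id 10 appears three times, 11 and 12 once each
ids <- c(10, 10, 10, 11, 12)

# 1) Only AFTER the first instance
after_first <- duplicated(ids)
print(after_first)   # FALSE  TRUE  TRUE FALSE FALSE

# 2) All EXCEPT the last instance
before_last <- duplicated(ids, fromLast = TRUE)
print(before_last)   #  TRUE  TRUE FALSE FALSE FALSE

# 3) EVERY id that has a duplicate (flagged from either direction)
every_copy <- after_first | before_last
print(every_copy)    #  TRUE  TRUE  TRUE FALSE FALSE
```

Ids that appear only once (11 and 12 here) come back FALSE in all three cases.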


The documentation can be found at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/duplicated, but taking a look at this example I put together will give you a good idea of how it works. 

Copy and paste the text below into RStudio to give it a try yourself:

library(tidyverse) # Only dplyr (mutate and %>%) is actually used below


# Create a sample dataframe with some repeated patient ids (50 rows with Id 10, then Ids 11 to 60 once each)
myData <- data.frame(c(rep(10, 50), 11:60), rnorm(100))

colnames(myData) <- c("Patient_Id", "Result") # Name the columns

head(myData) # View head to verify it is set up correctly


# Add variables to test different ways of marking duplicate Ids

myData <- myData %>% mutate(

 test1 = duplicated(Patient_Id), # This will mark all repeats of the patient ID after the first instance

 test2 = duplicated(Patient_Id, fromLast = TRUE), # This scans from the end, so it marks every repeat EXCEPT the last instance

 test3 = duplicated(Patient_Id) | duplicated(Patient_Id, fromLast = TRUE), # This will mark ALL duplicates because it checks from both directions

 test4 = duplicated(Patient_Id) & duplicated(Patient_Id, fromLast = TRUE) # Why not use & instead of | ? Because & leaves off BOTH the first and last instance

)


# We know there are 50 rows with Patient_Id 10, and every other Id appears only once

# Let's see how many rows each method marked TRUE (expected: test1 = 49, test2 = 49, test3 = 50, test4 = 48):

table(myData$test1)

table(myData$test2)

table(myData$test3)

table(myData$test4)




# Take a look at the data and see how it works!


View(myData)
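
Once the flags make sense, the same calls can be used to pull rows out directly. The sketch below is one possible follow-up in base R, not part of the original example: it rebuilds the sample data with column names set up front, and the seed and the names `repeated_ids`, `all_copies` and `first_only` are my own assumptions.

```r
# Rebuild the sample data so this block runs on its own
set.seed(1)  # assumption: fixed seed for reproducibility, not in the original
myData <- data.frame(
  Patient_Id = c(rep(10, 50), 11:60),
  Result = rnorm(100)
)

# Which ids occur more than once?
repeated_ids <- unique(myData$Patient_Id[duplicated(myData$Patient_Id)])

# Every row that belongs to a repeated id
all_copies <- myData[myData$Patient_Id %in% repeated_ids, ]

# One row per id, keeping the first occurrence of each
first_only <- myData[!duplicated(myData$Patient_Id), ]

nrow(all_copies)  # 50 rows, all with Patient_Id 10
nrow(first_only)  # 51 rows: one for Id 10 plus Ids 11 to 60
```

Filtering on `!duplicated(...)` keeps the first row for each id; adding `fromLast = TRUE` would keep the last row instead.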


Have fun!


Sam


Stephen Mack

Analytics / Facilitator Decision Management Professional. MB ENTP who gets the big picture

4 years ago

Hey thanks. That's pretty slick. Learn something new every day in R.
