Is there a more elegant way to find duplicated records?

Question

I've got 81,000 records in my test frame, and duplicated() shows that 2,039 of them are identical matches. One answer to "Find duplicated rows (based on 2 columns) in Data Frame in R" suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:

dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
colnames(dup) <- c("dup") #renames column for simplicity
df2 <- cbind(df, dup) #bind to original df
df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated

But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?

In my case I'm working with scraped data, and I need to figure out whether the duplicates exist in the original or were introduced by my scraping.

Answer 1:

duplicated(df) will give you a logical vector (every value is either TRUE or FALSE), which you can then use as an index into the rows of your data frame.

# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ]  #note the comma

You can put it all together in one line:

df[duplicated(df$var), ]  # again, the comma, to indicate we are selecting rows
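
One thing to keep in mind: duplicated() flags only the second and later occurrences of a value, so the first copy of each repeated record is not in the result. If you want to see every copy side by side (which may help when checking where a duplicate came from), one common idiom is to combine a forward pass with a backward pass using fromLast = TRUE. A minimal sketch, assuming a made-up data frame; the names df and var follow the question, while the data and the id column are hypothetical:

```r
# Hypothetical toy data; `df` and `var` mirror the names used in the question
df <- data.frame(var = c("a", "b", "a", "c", "b", "a"),
                 id  = 1:6,
                 stringsAsFactors = FALSE)

# Second and later occurrences only (what duplicated() gives by default)
later_copies <- df[duplicated(df$var), ]

# Every row whose value of var occurs more than once, first copies included
all_copies <- df[duplicated(df$var) | duplicated(df$var, fromLast = TRUE), ]

# Sort so that matching records sit next to each other
all_copies <- all_copies[order(all_copies$var), ]
print(all_copies)
```

Here later_copies holds rows 3, 5 and 6, while all_copies also includes rows 1 and 2, the first occurrences of "a" and "b".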

Answer 2:

doops <- duplicated(df$var)  # TRUE for each repeated value of var
uniques <- df[!doops, ]      # first occurrence of every value
duplicates <- df[doops, ]    # second and later occurrences

This is the logic I generally use when I am trying to remove duplicate entries from a data frame.
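
When splitting a frame this way, note that a logical index and a negative which() index behave differently when nothing is duplicated: which() returns an empty vector, and an empty negative index selects zero rows rather than all of them. A small sketch of the difference, assuming hypothetical data (the names pos and lgl are only for illustration):

```r
# Hypothetical data with no repeated values in var
df <- data.frame(var = c("x", "y", "z"), id = 1:3)

pos <- which(duplicated(df$var))   # integer(0): nothing is duplicated
lgl <- duplicated(df$var)          # FALSE FALSE FALSE

nrow(df[-pos, ])   # 0: an empty negative index selects no rows
nrow(df[!lgl, ])   # 3: the logical index keeps every row
```

For that reason the logical form is usually the safer choice when the data may contain no duplicates at all.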

Source: https://stackoverflow.com/questions/13594968/is-there-a-more-elegant-way-to-find-duplicated-records

Tags

duplicates
