Checking duplicates across two columns in R [duplicate]

Submitted by 主宰稳场 on 2019-12-11 06:55:31

Question


For example, my data set is like this:

  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3
3  CDE  DEF   0.3
4  BCD  ABC   0.5

unique and duplicated do not detect that rows 3 and 4 are duplicates of rows 2 and 1 with Var1 and Var2 swapped.

Since my data set is quite large, is there an efficient way to keep only the unique rows? Like this:

  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3

For your convenience, you can use:

dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

Also, if possible, is there a way to produce a distribution table for the top 20 levels of Var1 (which has more than 10,000 levels)?

P.S. I have tried dat$count <- dat(as.character(dat$Var1))[as.character(dat$Var1)], but it just takes too long to run.


Answer 1:


Another option would be to sort columns Var1 and Var2 rowwise and then apply duplicated.

idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat[idx, ]
#  Var1 Var2 value
#1  ABC  BCD   0.5
#2  DEF  CDE   0.3
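
For the second part of the question (a frequency table of the 20 most common Var1 levels), one possible sketch, which is not part of the original answer, is to tabulate Var1 once and sort the counts. The names counts and top20 are chosen here for illustration only.

```r
# Sample data from the question
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

# Count every Var1 level in a single pass
counts <- table(dat$Var1)

# Order the counts from most to least frequent and keep at most 20
top20 <- head(sort(counts, decreasing = TRUE), 20)
top20
```

With the four-row sample every level occurs once, so the table only becomes informative on the full data set.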



Answer 2:


I would start by sorting Var1 and Var2 within each row, and then use unique. When you have only two columns, you can just use pmax and pmin:

dat <- data.frame(
   Var1 = c("ABC", "DEF", "CDE", "BCD"),
   Var2 = c("BCD", "CDE", "DEF", "ABC"),
   value = c(0.5, 0.3, 0.3, 0.5))


library(dplyr)
dat %>% mutate(v1 = pmax(as.character(Var1), as.character(Var2)),
               v2 = pmin(as.character(Var1), as.character(Var2))) %>%
  select(v1, v2, value) %>% unique()

#   v1  v2 value
# 1 BCD ABC   0.5
# 2 DEF CDE   0.3

However, it might be a bit more complicated when you have more columns VarN.
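
For more than two key columns, one way to generalize is to sort the key columns within each row and then call duplicated on the sorted matrix. The following is a minimal sketch, assuming there are no missing values in the key columns (sort drops NA, which would break the matrix shape). The column Var3 and the names dat3, key_cols, sorted_keys and dedup are hypothetical and do not appear in the original data.

```r
# Hypothetical three-column example (Var3 is not in the original data)
dat3 <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                   Var2 = c("BCD", "CDE", "DEF", "ABC"),
                   Var3 = c("XYZ", "XYZ", "XYZ", "XYZ"),
                   value = c(0.5, 0.3, 0.3, 0.5),
                   stringsAsFactors = FALSE)

key_cols <- c("Var1", "Var2", "Var3")

# Sort the key columns within each row; apply() returns one column
# per input row, so transpose to get one row per input row again
sorted_keys <- t(apply(dat3[key_cols], 1, sort))

# Keep the first row of every unordered combination of keys
dedup <- dat3[!duplicated(sorted_keys), ]
dedup
```

Because apply works row by row, this may be slow on very large data; with exactly two columns the pmin/pmax approach above avoids the row-wise loop.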



Source: https://stackoverflow.com/questions/52613914/checking-duplicates-cross-two-columns-in-r
