Question
For example, my data set is like this:
  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3
3  CDE  DEF   0.3
4  BCD  ABC   0.5
unique and duplicated may not be able to detect the duplication of rows 3 and 4. Since my data set is quite large, is there any efficient way to keep only the unique rows? Like this:
  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3
For your convenience, you can use:
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
Var2 = c("BCD", "CDE", "DEF", "ABC"),
value = c(0.5, 0.3, 0.3, 0.5))
Also, if possible, is there any way to also produce a distribution table for the top 20 variables based on Var1 (which has more than 10,000 levels)?
P.S. I have tried dat$count <- table(as.character(dat$Var1))[as.character(dat$Var1)], but it just takes too long to run.
Answer 1:
Another option would be to sort the columns Var1 and Var2 row-wise and then apply duplicated:
idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat[idx, ]
# Var1 Var2 value
#1 ABC BCD 0.5
#2 DEF CDE 0.3
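For the follow-up question about a frequency table of the top 20 Var1 values, a sketch building on the same deduplication idea (table plus sort should be much faster than a per-row lookup; the names `dat_unique` and `top20` are just for illustration):

```r
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

# Keep only the unique unordered pairs, as above
idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat_unique <- dat[idx, ]

# Count each Var1 level, sort descending, and take the top 20
top20 <- head(sort(table(as.character(dat_unique$Var1)), decreasing = TRUE), 20)
top20
```

With 10,000+ levels, head() simply truncates the sorted counts to the 20 most frequent.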
Answer 2:
I would start by sorting Var1 and Var2 first, and then use unique. When you have only two columns you can just use pmax and pmin:
dat <- data.frame(
Var1 = c("ABC", "DEF", "CDE", "BCD"),
Var2 = c("BCD", "CDE", "DEF", "ABC"),
value = c(0.5, 0.3, 0.3, 0.5))
library(dplyr)
dat %>% mutate(v1 = pmax(as.character(Var1), as.character(Var2)),
v2 = pmin(as.character(Var1), as.character(Var2))) %>%
select(v1, v2, value) %>% unique()
# v1 v2 value
# 1 BCD ABC 0.5
# 2 DEF CDE 0.3
However, it might be a bit more complicated when you have more columns VarN.
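For that more general case, one option is to sort all the Var columns within each row, as in the other answer; a sketch, assuming the columns are named Var1, Var2, ... so they can be selected with a pattern:

```r
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

# Select every column whose name starts with "Var"
var_cols <- grep("^Var", names(dat), value = TRUE)

# Sort the Var values within each row, then drop duplicated combinations
sorted <- t(apply(dat[var_cols], 1, function(x) sort(as.character(x))))
dat[!duplicated(sorted), ]
```

This reduces to the pmax/pmin solution when there are exactly two Var columns, but also handles three or more.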
Source: https://stackoverflow.com/questions/52613914/checking-duplicates-cross-two-columns-in-r