Question
For example, my data set is like this:
  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3
3  CDE  DEF   0.3
4  BCD  ABC   0.5
unique and duplicated may not be able to detect the duplication of rows 3 and 4. Since my data set is quite large, is there any efficient way to keep only the unique rows? Like this:
  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3
For your convenience, you can use:
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
Var2 = c("BCD", "CDE", "DEF", "ABC"),
value = c(0.5, 0.3, 0.3, 0.5))
Also, if possible, is there any way to also produce a distribution table for the top 20 variables based on Var1 (which has more than 10,000 levels)?
P.S. I have tried dat$count <- table(as.character(dat$Var1))[as.character(dat$Var1)], but it just takes too long to run.
Answer 1:
Another option would be to sort the columns Var1 and Var2 row-wise and then apply duplicated:
idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat[idx, ]
# Var1 Var2 value
#1 ABC BCD 0.5
#2 DEF CDE 0.3
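For the follow-up question about a frequency table of the top 20 Var1 values, a sketch building on the same deduplication idea (table plus sort should be much faster than a per-row lookup; the names `dat_unique` and `top20` are just for illustration):

```r
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

# Keep only the unique unordered pairs, as above
idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat_unique <- dat[idx, ]

# Count each Var1 level, sort descending, and take the top 20
top20 <- head(sort(table(as.character(dat_unique$Var1)), decreasing = TRUE), 20)
top20
```

With 10,000+ levels, head() simply truncates the sorted counts to the 20 most frequent.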
Answer 2:
I would start by sorting Var1 and Var2 first, and then use unique. When you have only two columns you can just use pmax and pmin:
dat <- data.frame(
Var1 = c("ABC", "DEF", "CDE", "BCD"),
Var2 = c("BCD", "CDE", "DEF", "ABC"),
value = c(0.5, 0.3, 0.3, 0.5))
library(dplyr)
dat %>% mutate(v1 = pmax(as.character(Var1), as.character(Var2)),
v2 = pmin(as.character(Var1), as.character(Var2))) %>%
select(v1, v2, value) %>% unique()
# v1 v2 value
# 1 BCD ABC 0.5
# 2 DEF CDE 0.3
However, it might be a bit more complicated when you have more columns VarN.
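For that more general case, one option is to sort all the Var columns within each row, as in the other answer; a sketch, assuming the columns are named Var1, Var2, ... so they can be selected with a pattern:

```r
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

# Select every column whose name starts with "Var"
var_cols <- grep("^Var", names(dat), value = TRUE)

# Sort the Var values within each row, then drop duplicated combinations
sorted <- t(apply(dat[var_cols], 1, function(x) sort(as.character(x))))
dat[!duplicated(sorted), ]
```

This reduces to the pmax/pmin solution when there are exactly two Var columns, but also handles three or more.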
Source: https://stackoverflow.com/questions/52613914/checking-duplicates-cross-two-columns-in-r