Finding unique combinations irrespective of position [duplicate]

Submitted by 空扰寡人 on 2019-11-29 06:48:15

Maybe something like this:

indx <- !duplicated(t(apply(df, 1, sort))) # finds non-duplicates among the sorted rows
df[indx, ] # keeps only the non-duplicates according to that index
#   a b c
# 1 1 2 3
# 3 3 1 4

If your data.frame is quite big, speed may matter to you. You can find duplicated sets much faster with the following idea.

Imagine assigning each possible value in the rows a prime number and computing the product for each row. For example, for the given df we can take primenums = c(2,3,5,7) and get the products c(30, 30, 70). Duplicates in this product vector then correspond to duplicated sets in our data.frame. Since multiplication is computed much faster than any kind of sorting, you gain efficiency. The code is as follows.

require("numbers")
primenums <- Primes(100)[1:4]
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult)
df[my_indx,]

Here we initialize the vector primenums with the help of the function Primes from the package numbers, but you can construct it manually in some other way.
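To make the mapping concrete, here is a minimal sketch. The question's original df is not shown in this answer, so the data.frame below is one possible reconstruction consistent with the outputs above (it yields the products c(30, 30, 70) and the filtered rows shown earlier):

```r
# one possible df consistent with the outputs shown above (a reconstruction)
df <- data.frame(a = c(1, 2, 3), b = c(2, 1, 1), c = c(3, 3, 4))
primenums <- c(2, 3, 5, 7)  # prime assigned to each value 1..4
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]))
dfmult
# [1] 30 30 70
df[!duplicated(dfmult), ]
#   a b c
# 1 1 2 3
# 3 3 1 4
```

Rows 1 and 2 both map to 2 * 3 * 5 = 30, so row 2 is dropped as a duplicate set.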

Take a look at the example. Here I compare the efficiency of the different approaches.

require("numbers")

# generate all unique combinations 10 out of 20
allcomb <- t(combn(20,10))
# make sample of 1 million rows
set.seed(789)
df <- allcomb[sample(nrow(allcomb), 1e6, T),]
# let's sort the matrix to show we have duplicates
df <- df[do.call(order, lapply(1:ncol(df), function(i) df[, i])), ]
head(df, 10)
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#  [1,]    1    2    3    4    5    6    7    8    9    10
#  [2,]    1    2    3    4    5    6    7    8    9    10
#  [3,]    1    2    3    4    5    6    7    8    9    10
#  [4,]    1    2    3    4    5    6    7    8    9    10
#  [5,]    1    2    3    4    5    6    7    8    9    11
#  [6,]    1    2    3    4    5    6    7    8    9    11
#  [7,]    1    2    3    4    5    6    7    8    9    11
#  [8,]    1    2    3    4    5    6    7    8    9    11
#  [9,]    1    2    3    4    5    6    7    8    9    11
# [10,]    1    2    3    4    5    6    7    8    9    11

# to be fair, we need to permute the numbers within rows before searching for identical sets
df <- t(apply(df, 1, function(z) z[sample(10,10)] ))
df <- as.data.frame(df)
names(df) <- letters[1:10]
# what does it look like now?
head(df, 10)
#     a b c  d  e  f  g h  i j
# 1   2 3 7  9 10  1  4 8  5 6
# 2   4 2 6  3  8 10  9 1  5 7
# 3   4 2 6  8  5  1 10 7  3 9
# 4   6 8 5  4  2  1 10 9  7 3
# 5  11 2 7  6  8  1  9 4  5 3
# 6   9 6 3 11  4  2  8 7  5 1
# 7   5 2 3 11  1  8  6 9  7 4
# 8   3 9 7  1  2  5  4 8 11 6
# 9   6 2 8  3  4  1 11 5  9 7
# 10  4 6 3  9  7  2  1 5 11 8

# now let's shuffle the rows to make df more realistic
df <- df[sample(nrow(df), nrow(df)),]

Now that the data.frame is ready, we can test the different algorithms.

system.time(indx <- !duplicated(t(apply(df, 1, sort))) )
#   user  system elapsed 
# 119.75    0.06  120.03
# doesn't impress, frankly speaking

library(sets)
system.time(indx <- !duplicated(apply(df, 1, as.set)) )
#  user  system elapsed 
# 91.60    0.00   91.89
# better, but we want faster! =)

# now lets check out the method with prime numbers
primenums <- Primes(100)[1:20]
# [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71
system.time({
  dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
  my_indx <- !duplicated(dfmult) })
# user  system elapsed 
# 6.44    0.16    6.61
# not bad, is it? but let's compare the results
identical(indx, my_indx)
# [1] TRUE

# So, if there is no difference, why wait more? ;)

There is one important assumption here: we use as.matrix(df), but what if the data.frame contains non-numeric variables as well? A more general solution is as follows:

system.time({
  dfmult <- apply(
    apply(df, 2, function(colmn) as.integer(factor(colmn, 
                                                   levels = unique(c(as.matrix(df)))))),
    1, function(z) prod(primenums[z]) )
  my_indx <- !duplicated(dfmult) })
#  user  system elapsed 
# 27.48    0.34   27.84
# distinctly slower, but still much faster than the previous methods
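As a sanity check that the factor mapping really handles non-numeric data, here is a small sketch with character values (the data here is hypothetical, not from the original question):

```r
# hypothetical character data: rows 1 and 2 hold the same set {"x", "y"}
dfc <- data.frame(a = c("x", "y", "x"), b = c("y", "x", "z"),
                  stringsAsFactors = FALSE)
lev <- unique(c(as.matrix(dfc)))  # "x" "y" "z" -> integer codes 1, 2, 3
m <- apply(dfc, 2, function(colmn) as.integer(factor(colmn, levels = lev)))
primenums <- c(2, 3, 5)
dfmult <- apply(m, 1, function(z) prod(primenums[z]))
dfmult
# [1]  6  6 10
dfc[!duplicated(dfmult), ]
```

Rows 1 and 2 both code to the set {1, 2} and hence to the same product 6, so the second is dropped.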

And what if we have a great many columns or a great many distinct values? In this case, instead of prod() we can use sum(log()) (which is probably computed even faster for large numbers). Take a look at this.

pr <- Primes(5e7)
length(pr)   
# [1] 3001134
system.time(N <- sum(log(pr)))
# user  system elapsed 
# 0.12    0.00    0.13
N
# [1] 49993718

It's hard to imagine a df with 3 million columns, but here it's fine. This approach lets us handle a df of practically any size, with as many columns as our RAM can hold.
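The sum-of-logs variant itself isn't spelled out above; a minimal sketch might look like this (the primes are written out so the sketch doesn't depend on the numbers package, and the rounding step is my addition, to guard against floating-point sums of the same logs differing in the last bits when the terms are added in a different order):

```r
# first 20 primes, written out to avoid the numbers-package dependency
primenums <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
               31, 37, 41, 43, 47, 53, 59, 61, 67, 71)
logpr <- log(primenums)
# toy data: rows 1 and 2 contain the same set of values
df <- rbind(c(1, 2, 3), c(3, 2, 1), c(1, 2, 4))
dfsum <- apply(df, 1, function(z) sum(logpr[z]))
my_indx <- !duplicated(round(dfsum, 10))  # round before comparing doubles
my_indx
# [1]  TRUE FALSE  TRUE
```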

As an alternative approach, the package sets provides a fast way of checking for set equality:

library(sets)
df.sets <- apply(df, 1, as.set)
df.sets
#[[1]]
#{1, 2, 3}
#[[2]]
#{1, 2, 3}
#[[3]]
#{1, 3, 4}
df[!duplicated(df.sets),]
#  a b c
#1 1 2 3
#3 3 1 4