Remove duplicate tuples after sorting the tuple in R

Question

I have a question regarding removing duplicates after sorting within a tuple in R.

Let's say I have a matrix of values (called df here, although cbind returns a matrix rather than a data frame):

df<-cbind(c(1,2,7,8,5,1),c(5,6,3,4,1,8),c(1.2,1,-.5,5,1.2,1))

and I take its first two columns as a and b:

a=df[,1]
b=df[,2]
temp<-cbind(a,b)

What I am doing is removing duplicates based on a sorted tuple. For example, I want to keep a = 1, 2, 7, 8, 1 and b = 5, 6, 3, 4, 8, with the entries a[5] and b[5] removed. This is for determining interactions between two objects: 1 vs 5, 2 vs 6, etc. Since 5 vs 1 is the same as 1 vs 5, I want to remove it.

The route I started to take was as follows. I created a function that sorts each row and collected the results back into a matrix, like so:

sortme<-function(i){sort(temp[i,])}
sorted<-t(sapply(1:nrow(temp),sortme))

and got the following result:

     a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 5
[6,] 1 8

I then call unique on the sorted result:

unique(sorted)

which gives:

     a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 8

I also use !duplicated to get a logical (TRUE/FALSE) vector, which I can use to subset my original dataset and pull out the matching values from a separate column (here, the third one).

T_F<-!duplicated(sorted)
final_df<-df[T_F,]

What I want to know is whether I'm going about this the right way for a very large dataset, or whether there is already a built-in function that does this.

Answer 1:

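Depending on what you mean by "a very large dataset", you might gain some speed by applying the sorting function only to those rows whose sums are duplicated. Two rows can only contain the same pair of values if their row sums are equal, so a row with a unique sum cannot be a duplicate and does not need to be sorted.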

theSums<-.rowSums(temp,m=nrow(temp),n=ncol(temp))

almostSorted <- do.call(rbind, tapply(seq_len(nrow(temp)), theSums,
  function(x) {
    if(length(x) == 1L) {
      return(cbind(x, temp[x, , drop = FALSE]))
    } else {
      return(cbind(x, t(apply(temp[x, ], 1, sort))))
    }
  }
))

(sorted <- almostSorted[order(almostSorted[, 1]), -1])

[1,] 1 5
[2,] 2 6
[3,] 7 3
[4,] 8 4
[5,] 1 5
[6,] 1 8
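Rows 3 and 4 stay unsorted in that output, which is harmless because their sums are unique. As a rough check, here is a minimal sketch that simplifies the tapply step into a logical mask and compares the result against sorting every row. The names candidate, partly, keep_partial and keep_full are made up for this illustration and are not part of the answer above.

```r
# Example data from the question
a <- c(1, 2, 7, 8, 5, 1)
b <- c(5, 6, 3, 4, 1, 8)
temp <- cbind(a, b)

# Only rows sharing a row sum can hold the same unordered pair
theSums <- rowSums(temp)

# Rows whose sum occurs more than once are the only candidates for sorting
candidate <- duplicated(theSums) | duplicated(theSums, fromLast = TRUE)

# Sort just the candidate rows; leave all other rows as they are
partly <- temp
if (any(candidate)) {
  partly[candidate, ] <- t(apply(temp[candidate, , drop = FALSE], 1, sort))
}

# Mask from the partially sorted matrix vs. mask from sorting every row
keep_partial <- !duplicated(partly)
keep_full <- !duplicated(t(apply(temp, 1, sort)))

stopifnot(all(keep_partial == keep_full))
```

Whether this is actually faster than sorting everything depends on how many rows share a sum, so it is worth timing on the real data.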

Answer 2:

I might replace your sortme function and the sapply call with sort and apply:

sorted <- t(apply(df[, 1:2], 1, sort))
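To round this out, the sketch below applies !duplicated to that result and, as an assumed alternative that is not taken from either answer, builds the sorted pair with pmin and pmax. Both are vectorised base R functions, so for exactly two columns they avoid calling sort once per row. The names keep_apply, lo, hi and keep_vec are hypothetical.

```r
# Example data from the question (a numeric matrix, despite the name df)
df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -.5, 5, 1.2, 1))

# Approach from the answer: sort each row of the first two columns
sorted <- t(apply(df[, 1:2], 1, sort))
keep_apply <- !duplicated(sorted)

# Assumed alternative for two columns: element-wise minimum and maximum
lo <- pmin(df[, 1], df[, 2])
hi <- pmax(df[, 1], df[, 2])
keep_vec <- !duplicated(cbind(lo, hi))

# Both masks should mark the same rows as first occurrences
stopifnot(all(keep_apply == keep_vec))

# Subset the original matrix, keeping the third column alongside the pair
final_df <- df[keep_vec, ]
```

On the example data this drops the fifth row (5 vs 1), since it repeats the first row (1 vs 5). The pmin/pmax version only covers pairs; for tuples of three or more values the row-wise sort is still needed.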

Source: https://stackoverflow.com/questions/10560395/remove-duplicate-tuples-after-sorting-the-tuple-in-r

Tags

duplicates

tuples