Question
I have a question regarding removing duplicates after sorting within a tuple in R.
Let's say I have a matrix of values (cbind returns a matrix, not a data frame):
df<-cbind(c(1,2,7,8,5,1),c(5,6,3,4,1,8),c(1.2,1,-.5,5,1.2,1))
and take its first two columns as a and b:
a=df[,1]
b=df[,2]
temp<-cbind(a,b)
What I want is to remove duplicates based on the sorted tuple, treating each (a, b) pair as unordered. For example, I want to keep a=1,2,7,8,1 and b=5,6,3,4,8, with the entries a[5] and b[5] removed. This is for determining interactions between two objects: 1 vs 5, 2 vs 6, etc. Since 5 vs 1 is the same as 1 vs 5, I want to remove it.
The route I started to take was as follows. I created a function that sorts each row, and collected the results back into a matrix:
sortme<-function(i){sort(temp[i,])}
sorted<-t(sapply(1:nrow(temp),sortme))
and got the following results
     a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 5
[6,] 1 8
I then call unique on the sorted result:
unique(sorted)
which gives
     a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 8
I also use !duplicated to get a logical vector that I can use to index my original dataset and pull out values from a separate column.
T_F<-!duplicated(sorted)
final_df<-df[T_F,]
What I want to know is whether I'm going about this the right way for a very large dataset, or whether there is already a built-in function that does this.
Answer 1:
Depending on what you mean by "a very large dataset", you might gain some speed by applying the sorting function only to those rows whose sums are duplicated, since two rows can only be equal after sorting if their sums match.
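# Row sums: rows that match after sorting must share a sum
theSums <- .rowSums(temp, m = nrow(temp), n = ncol(temp))
# Group row indices by sum; sort only the groups with more than one row
almostSorted <- do.call(rbind, tapply(seq_len(nrow(temp)), theSums,
  function(x) {
    if (length(x) == 1L) {
      return(cbind(x, temp[x, , drop = FALSE]))
    } else {
      return(cbind(x, t(apply(temp[x, ], 1, sort))))
    }
  }
))
# Restore the original row order and drop the index column
(sorted <- almostSorted[order(almostSorted[, 1]), -1])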
[1,] 1 5
[2,] 2 6
[3,] 7 3
[4,] 8 4
[5,] 1 5
[6,] 1 8
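From here the remaining step is the same as in the question: run duplicated on the partly sorted matrix and index the original data. Below is a minimal sketch of the whole idea in a slightly different form; it is a variation, not the code above. It assumes the df and temp from the question and flags candidate rows with duplicated on the row sums instead of tapply. The names candidate, partlySorted and keep are made up for illustration.

```r
# Data from the question
df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -.5, 5, 1.2, 1))
temp <- cbind(a = df[, 1], b = df[, 2])

# Two rows can only match after sorting if their sums are equal,
# so flag every row whose sum occurs more than once
theSums <- rowSums(temp)
candidate <- duplicated(theSums) | duplicated(theSums, fromLast = TRUE)

# Sort only the candidate rows; all other rows are left as they are
partlySorted <- temp
if (any(candidate)) {
  partlySorted[candidate, ] <- t(apply(temp[candidate, , drop = FALSE], 1, sort))
}

# Logical index of first occurrences, applied to the original data
keep <- !duplicated(partlySorted)
final_df <- df[keep, , drop = FALSE]
```

With the example data only rows 1 and 5 share a sum, so only those two rows get sorted, and final_df keeps rows 1, 2, 3, 4 and 6.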
Answer 2:
I might replace your function sortme and the sapply call with sort and apply:
sorted <- t(apply(df[, 1:2], 1, sort))
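If only two columns are involved, pmin and pmax might be worth a look as well, since they work on whole columns at once and avoid the row-by-row sort. This is an alternative technique, not something the question or the line above uses, and the names lo, hi and keep are made up for the sketch. It assumes the df from the question.

```r
# Data from the question
df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -.5, 5, 1.2, 1))

# Smaller and larger member of each pair, computed over whole columns
lo <- pmin(df[, 1], df[, 2])
hi <- pmax(df[, 1], df[, 2])

# TRUE for the first occurrence of each unordered pair
keep <- !duplicated(cbind(lo, hi))

# Keep the matching rows of the original data, third column included
final_df <- df[keep, , drop = FALSE]
```

Here keep plays the same role as T_F in the question, so it can be used to pull values out of any other column of df.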
Source: https://stackoverflow.com/questions/10560395/remove-duplicate-tuples-after-sorting-the-tuple-in-r