Specific removing all duplicates with R [duplicate]

问题

This question already has answers here:

Finding ALL duplicate rows, including “elements with smaller subscripts” (5 answers)

Remove all duplicate rows including the “reference” row [duplicate] (3 answers)

Closed 4 years ago.

For example I have two columns:

And I need to display only values which are unique for column Var1:

I can do you like this:

 mydata=mydata[!unique(mydata$Var1),]

But when I use the same formula for my large data set with about 1 million observations, nothing happens - the sample size is still the same. Could you please explain my why?

Thank you!

回答1:

With data.table (as it seem to be tagged with it) I would do

indx <- setDT(DT)[, .I[.N == 1], by = Var1]$V1 
DT[indx]
#    Var1 Var2
# 1:    4    8
# 2:    5   67
# 3:    6   12

Or... as @eddi reminded me, you can simply do

DT[, if(.N == 1) .SD, by = Var1]

Or (per the mentioned duplicates) with v >= 1.9.5 you could also do something like

setDT(DT, key = "Var1")[!(duplicated(DT) | duplicated(DT, fromLast = TRUE))]

回答2:

You can use this:

df <- data.frame(Var1=c(1,1,2,2,3,3,4,5,6), Var2=c(12,65,68,98,49,24,8,67,12) );
df[ave(1:nrow(df),df$Var1,FUN=length)==1,];
##   Var1 Var2
## 7    4    8
## 8    5   67
## 9    6   12

This will work even if the Var1 column is not ordered, because ave() does the necessary work to collect groups of equal elements (even if they are non-consecutive in the grouping vector) and map the result of the function call (length() in this case) back to each element that was a member of the group.

Regarding your code, it doesn't work because this is what unique() and its negation returns:

unique(df$Var1);
## [1] 1 2 3 4 5 6
!unique(df$Var1);
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

As you can see, unique() returns the actual unique values from the argument vector. Negation returns true for zero and false for everything else.

Thus, you end up row-indexing using a short logical vector (it will be short if there were any duplicates removed by unique()) consisting of TRUE where there were zeroes, and FALSE otherwise.

来源：https://stackoverflow.com/questions/29955163/specific-removing-all-duplicates-with-r

标签

duplicates

unique

data.table