Subsetting a data frame to the rows not appearing in another data frame

六眼飞鱼酱① 提交于 2019-12-20 04:10:02

问题


I have a data frame A with observations

    Var1   Var2  Var3
     1       3    4
     2       5    6
     4       5    7
     4       5    8
     6       7    9

and data frame B with observations

    Var1   Var2  Var3
     1       3    4
     2       5    6

which is basically a subset of A. Now I want to select observations in A NOT in B, i.e, the data frame C with observations

    Var1   Var2  Var3
     4       5    7
     4       5    8
     6       7    9

Is there a way I can do this in R? The data frames I've used are just arbitrary data.


回答1:


dplyr has a nice anti_join function that does exactly that:

> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
  Var1 Var2 Var3
1    6    7    9
2    4    5    8
3    4    5    7



回答2:


Using sqldf is an option.

require(sqldf)

C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')



回答3:


One approach could be to paste all the columns of A and B together, limiting to the rows in A whose pasted representation doesn't appear in the pasted representation of B:

A[!(do.call(paste, A) %in% do.call(paste, B)),]
#   Var1 Var2 Var3
# 3    4    5    7
# 4    4    5    8
# 5    6    7    9

One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:

combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > length(B),]
#   Var1 Var2 Var3
# 5    4    5    7
# 6    4    5    8
# 7    6    7    9

Basically I used rbind to append A below B and then limited to rows that are both non-duplicated and that are not originally from B.




回答4:


Another option:

C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]

Output:

  Var1 Var2 Var3
3    4    5    7
4    4    5    8
5    6    7    9



回答5:


Using data.table you could do an anti-join as follows:

library(data.table)
setDT(df1)[!df2, on = names(df1)]

which gives the desired result:

   Var1 Var2 Var3
1:    4    5    7
2:    4    5    8
3:    6    7    9


来源:https://stackoverflow.com/questions/35809923/subsetting-a-data-frame-to-the-rows-not-appearing-in-another-data-frame

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!