R selecting all rows from a data frame that don't appear in another

Posted by 这一生的挚爱 on 2019-11-30 11:49:25

Here's another way:

x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#        number   fruit ID1 ID2
# item1 number1 papayas  22  33
# item3 number3 peaches 441  25
# item4 number4  apples 123  13

Edit: modified to preserve row names.
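For anyone who wants to run this end to end, here is a minimal sketch. The data frames below are an assumption: they are rebuilt from the sample values that appear elsewhere on this page, with every column read as character. The idea is to stack test2 on top of test, flag every row that has an identical copy further down, and keep only the unflagged rows that came from test2.

```r
# Sample data (assumed layout: row names item1.., all columns character)
test <- read.table(text = "number fruit ID1 ID2
item1 number1 apples 22 33
item2 number2 oranges 13 33
item3 number3 peaches 44 25
item4 number4 apples 12 13", header = TRUE, colClasses = "character")

test2 <- read.table(text = "number fruit ID1 ID2
item1 number1 papayas 22 33
item2 number2 oranges 13 33
item3 number3 peaches 441 25
item4 number4 apples 123 13
item5 number3 peaches 44 25
item6 number4 apples 12 13
item7 number1 apples 22 33", header = TRUE, colClasses = "character")

# Stack test2 first, so its rows occupy positions 1..nrow(test2)
x <- rbind(test2, test)

# TRUE for every row that has an identical copy later in x
has_later_copy <- duplicated(x, fromLast = TRUE)

# Keep rows that came from test2 and have no later copy
from_test2 <- seq_len(nrow(x)) <= nrow(test2)
only_in_test2 <- x[!has_later_copy & from_test2, ]
print(only_in_test2)
```

Because test2 comes first in the rbind, its row names survive unchanged, which is what the edit about preserving row names relies on.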

Here are two ways to solve this, one using data.table and one using sqldf:

library(data.table)
test<- fread('
item number fruit ID1 ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
')
test2<- fread('
item number fruit ID1 ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"
')

data.table approach. The key columns determine which columns are compared, so you can choose them:

setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13
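As a sketch of the "choose which columns to compare" point: data.table also accepts the join columns through the on= argument, so no key has to be set. The tables below are an assumption, rebuilt from the sample values with character columns, and here the anti-join is written from the test2 side (rows of test2 with no match in test).

```r
library(data.table)

# Sample data (assumed: all columns character)
test <- data.table(
  item   = paste0("item", 1:4),
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13")
)
test2 <- data.table(
  item   = paste0("item", 1:7),
  number = c("number1", "number2", "number3", "number4",
             "number3", "number4", "number1"),
  fruit  = c("papayas", "oranges", "peaches", "apples",
             "peaches", "apples", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12", "22"),
  ID2    = c("33", "33", "25", "13", "25", "13", "33")
)

# Anti-join ignoring the item column: rows of test2 with no match in test
res_all <- test2[!test, on = c("number", "fruit", "ID1", "ID2")]

# Same idea, but comparing on number and fruit only
res_nf <- test2[!test, on = c("number", "fruit")]

print(res_all)
print(res_nf)
```

Leaving item out of on= matters here, because the same values sit under different item labels in the two tables.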

SQL approach, using sqldf:

library(sqldf)
sqldf('select * from test except select * from test2')
item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13

The following should get you there:

# For each pair of columns, find the rows of test2 whose value is absent from test
rows <- unique(unlist(mapply(function(x, y) 
          sapply(setdiff(x, y), function(d) which(x==d)), test2, test)))
test2[rows, ]

What's happening here is:

  • mapply performs a column-wise comparison between the two datasets.
  • setdiff finds the values in each column of the former (test2) that do not appear in the same column of the latter.
  • which identifies the rows of the former that hold those values.
  • unique(unlist(....)) collects the distinct row indices.
  • That index vector is then used to filter the former, i.e. test2.

Results:

       number   fruit ID1 ID2
item1 number1 papayas  22  33
item3 number3 peaches 441  25
item4 number4  apples 123  13
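The steps listed above can be sketched one at a time as follows. This is a simplified variant, not the exact call from the answer: it uses which(x %in% setdiff(x, y)) in place of the nested sapply, and the data frames are an assumption rebuilt from the sample values. It also shows a limitation of comparing column by column: a row is only flagged when at least one of its values is new to that column, so a new combination of already-seen values slips through.

```r
test <- read.table(text = "number fruit ID1 ID2
item1 number1 apples 22 33
item2 number2 oranges 13 33
item3 number3 peaches 44 25
item4 number4 apples 12 13", header = TRUE, colClasses = "character")

test2 <- read.table(text = "number fruit ID1 ID2
item1 number1 papayas 22 33
item2 number2 oranges 13 33
item3 number3 peaches 441 25
item4 number4 apples 123 13
item5 number3 peaches 44 25
item6 number4 apples 12 13
item7 number1 apples 22 33", header = TRUE, colClasses = "character")

# Hypothetical helper: row indices of `a` holding a value that is new
# to the matching column of `b`, collected over all columns
flag_rows <- function(a, b) {
  per_column <- mapply(function(x, y) which(x %in% setdiff(x, y)),
                       a, b, SIMPLIFY = FALSE)
  sort(unique(unlist(per_column, use.names = FALSE)))
}

rows <- flag_rows(test2, test)
print(test2[rows, ])

# Limitation: this extra row is not in test, but each of its values
# occurs somewhere in the matching column of test, so it is not flagged
extra <- data.frame(number = "number1", fruit = "oranges",
                    ID1 = "22", ID2 = "33",
                    row.names = "item8", stringsAsFactors = FALSE)
rows_with_extra <- flag_rows(rbind(test2, extra), test)
```

If whole-row comparison is what you need, one of the row-wise approaches on this page is the safer choice.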

Edit:

Make sure that test and test2 are data.frames and not matrices, since mapply iterates over each element of a matrix but over each column of a data.frame:

test  <- as.data.frame(test,  stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)
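A quick way to see the difference, as a small sketch with a made-up 2 x 2 character matrix (the names m, u and v are only for illustration):

```r
m <- matrix(c("a", "b", "c", "d"), nrow = 2,
            dimnames = list(NULL, c("u", "v")))

# On a matrix, the function is called once per cell (4 calls, length 1 each)
per_element <- mapply(function(x) length(x), m)

# On a data frame, the function is called once per column (2 calls, length 2 each)
per_column <- mapply(function(x) length(x),
                     as.data.frame(m, stringsAsFactors = FALSE))

print(per_element)
print(per_column)
```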

Make a new row-ID column in test2, merge the data frames, and select those rows whose IDs aren't in the merged result.

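test2 <- cbind(test2, id=seq_len(nrow(test2)))

matches <- merge(test, test2)$id

# Filter with a logical index: test2[-matches, ] would return zero rows if nothing matched
test2 <- test2[!test2$id %in% matches, ]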
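Here is the merge idea as a runnable sketch, assuming data frames rebuilt from the sample values with character columns. It filters with a logical index rather than a negative one, because negative indexing with an empty vector selects nothing at all, which would wipe out the result when no rows match.

```r
test <- read.table(text = "number fruit ID1 ID2
item1 number1 apples 22 33
item2 number2 oranges 13 33
item3 number3 peaches 44 25
item4 number4 apples 12 13", header = TRUE, colClasses = "character")

test2 <- read.table(text = "number fruit ID1 ID2
item1 number1 papayas 22 33
item2 number2 oranges 13 33
item3 number3 peaches 441 25
item4 number4 apples 123 13
item5 number3 peaches 44 25
item6 number4 apples 12 13
item7 number1 apples 22 33", header = TRUE, colClasses = "character")

# Tag each row of test2 so it can be traced through the merge
test2$id <- seq_len(nrow(test2))

# merge() joins on all shared columns; the surviving ids are the matched rows
matched_ids <- merge(test, test2)$id

# Keep the unmatched rows and drop the helper column again
only_in_test2 <- test2[!test2$id %in% matched_ids, names(test2) != "id"]
print(only_in_test2)

# The pitfall with negative indexing when nothing matched
nothing_matched <- integer(0)
rows_left <- nrow(test2[-nothing_matched, ])
```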

Here's another approach, but I'm not sure how well it would scale.

test2[!apply(test2, 1, paste, collapse = "") %in% 
        apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"

Note that this approach keeps duplicated rows of test2, whereas the rbind/duplicated approach drops them. Compare, for example, if test2 had duplicates:

test2 <- rbind(test2, test2[1:3, ])

## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#       number    fruit     ID1   ID2 
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"

## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
  apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
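One more caveat with pasting rows together using collapse = "": two different rows can produce the same string, which gives a false match. A small sketch of the problem and one possible workaround, assuming a separator character that never occurs in the data (row_key is a hypothetical helper, and the tiny data frames are made up for illustration):

```r
# Hypothetical helper: one key per row, columns joined by a separator
# that is assumed not to appear in the data
row_key <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\x1f"))
}

a <- data.frame(x = c("1", "12"), y = c("23", "3"), stringsAsFactors = FALSE)
b <- data.frame(x = "12", y = "3", stringsAsFactors = FALSE)

# Without a separator both rows of a collapse to "123", so both seem to be in b
naive <- apply(a, 1, paste, collapse = "") %in%
  apply(b, 1, paste, collapse = "")

# With a separator only the second row of a matches
safe <- row_key(a) %in% row_key(b)

print(a[!safe, ])
```

Building the keys with do.call(paste, ...) also works column by column instead of looping over rows, which may help with the scaling concern, though I have not benchmarked it.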