问题
I'm still working through the lessons on DataCamp for R, so please forgive me if this question seems naïve.
Consider the following (very contrived) sample:
library(dplyr)
library(tibble)
type <- c("Dog", "Cat", "Cat", "Cat")
name <- c("Ella", "Arrow", "Gabby", "Eddie")
pets = tibble(name, type)
name <- c("Ella", "Arrow", "Dog")
type <- c("Dog", "Cat", "Calvin")
favorites = tibble(name, type)
anti_join(favorites, pets, by = "name")
setdiff(favorites, pets, by = "name")
Both of these return exactly the same data:
> anti_join(favorites, pets, by = "name")
# A tibble: 1 × 2
name type
<chr> <chr>
1 Dog Calvin
> setdiff(favorites, pets, by = "name")
# A tibble: 1 × 2
name type
<chr> <chr>
1 Dog Calvin
The documentation for each of them seems to indicate only a subtle difference: that setdiff
returns rows, but anti_join
does not. From my testing, this doesn't appear to be the case.
Can someone explain to me the true differences between these two, and perhaps provide a better example that illustrates the differences more clearly? (This is an area where DataCamp hasn't been particularly helpful.)
回答1:
Both subset the first parameter, but setdiff
requires the columns to be the same:
library(dplyr)
setdiff(mtcars, mtcars[1:30, ])
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
setdiff(mtcars, mtcars[1:30, 1:6])
#> Error in setdiff_data_frame(x, y): not compatible: Cols in x but not y: `carb`, `gear`, `am`, `vs`, `qsec`.
whereas anti_join
is a join, so doesn't:
anti_join(mtcars, mtcars[1:30, 1:3])
#> Joining, by = c("mpg", "cyl", "disp")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
回答2:
One difference between setdiff and anti_join is that you can select columns in anti_join. For example:
df1 <- data.frame(a = c(1,3,5,4), b = c(5,6,7,8))
df2 <- data.frame( a = c(1,2,3,4), b = c(5,9,7,8))
#df1 looks like df2 look like
# a b a b
# 1 5 1 5
# 3 6 2 9
# 5 7 3 7
# 4 8 4 8
set_diff(x,y)
#The first and last rows of df1 and df2 are identical.
#So set_diff(df1,df2) will return the 2nd and 3rd row of df1
#a b
#3 6
#5 7
#If I do the same thing with anti_join, I get the same results
anti_join(df1,df2)
#a b
#3 6
#5 7
#However,...if I only care about values in df1 column 'b' that are different from the
# corresponding value in column b of df2... I can use the option "by" parameter..
anti_join(df1,df2, by = 'b')
#Since column the only number in column b of df1 that is different
#from the corresponding value in df2 is row two,
#this returns row 2 of df1
#a b
#3 6
Another difference is that in set_diff, the two dataframes must have the same columns.
#Keeping df1 identical to df1 in the previous example...
# and df2 the same but with an additional column
df1 <- data.frame(a = c(1,3,5,4), b = c(5,6,7,8))
df2 <- data.frame(a = c(1,2,3,4), b = c(5,9,7,8), l = c(9,9,9,9))
#df1 looks like df2 look like
# a b a b c
# 1 5 1 5 9
# 3 6 2 9 9
# 5 7 3 7 9
# 4 8 4 8 9
setdiff(df1,df2)
#Returns:
# Error in setdiff_data_frame(x, y) :
# not compatible: Cols in y but not x: `l`.
anti_join(df1,df2)
#Ignores column 3 of df2, since there is no corresponding column in df1.
#Returns: rows in df1 in which (a,b) are not equal to (a,b) in df2
#(which will be identical to the output when df2 didn't have
#a third column).
# a b
# 3 6
# 5 7
anti_join(df1,df2, by = 'b')
#Since column the only number in column b of df1 that is different
# from the corresponding value in df2 is row two, this returns row 2 of df1...
#...same as when df2 only had two columns
来源:https://stackoverflow.com/questions/46854072/in-dplyr-what-are-the-intrinsic-differences-between-setdiff-and-anti-join