Question
It seems to me that subset and filter (from dplyr) produce the same result. My question is: do they differ at some point, for example in speed or in the data sizes they can handle? Are there occasions when it is better to use one rather than the other?
Example:
library(dplyr)
df1<-subset(airquality, Temp>80 & Month > 5)
df2<-filter(airquality, Temp>80 & Month > 5)
summary(df1$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
summary(df2$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.00 39.00 64.00 64.51 84.00 168.00 14
Answer 1:
They do indeed produce the same result, and they are very similar in concept.
The advantage of subset is that it is part of base R and doesn't require any additional packages. With small data sets it also seems to be a bit faster than filter (about five times faster in your example, but that's measured in microseconds).
As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 rows, filter outpaces subset by roughly 200 microseconds. At 153,000 rows, filter is about two and a half times faster by median (measured in milliseconds).
So in terms of human time, I don't think there's much difference between the two.
The other advantage of filter (and this is a bit of a niche advantage) is that it can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.
library(dplyr)
library(microbenchmark)
# Original example
microbenchmark(
df1<-subset(airquality, Temp>80 & Month > 5),
df2<-filter(airquality, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 95.598 107.7670 118.5236 119.9370 125.949 167.443 100 a
filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997 100 b
# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: microseconds
expr min lq mean median uq max neval cld
subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392 100 b
filter 968.586 985.4475 1056.686 1023.862 1036.765 2489.644 100 a
# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark(
df1<-subset(air, Temp>80 & Month > 5),
df2<-filter(air, Temp>80 & Month > 5)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659 100 b
filter 5.046148 5.169164 10.27829 5.387484 6.738167 65.38937 100 a
Answer 2:
One additional difference not yet mentioned is that filter discards rownames, while subset doesn't:
filter(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
3 15.8 4 351.0 264 4.22 3.170 14.5 0 1 5 4
4 19.7 4 145.0 175 3.62 2.770 15.5 0 1 5 6
5 15.0 4 301.0 335 3.54 3.570 14.6 0 1 5 8
subset(mtcars, gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 4 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 4 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 4 301.0 335 3.54 3.570 14.6 0 1 5 8
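If the row names matter, one option that does not depend on how filter treats them is to move them into an ordinary column before filtering. A minimal sketch, assuming the tibble package is installed; the column name "model" is only an illustrative choice:

```r
library(dplyr)
library(tibble)

# Move the row names into a regular column (called "model" here,
# an arbitrary name chosen for this example), then filter as usual.
kept <- mtcars %>%
  rownames_to_column("model") %>%
  filter(gear == 5)

# The car names survive as data rather than as row names.
kept$model
```

The same column can be turned back into row names afterwards with tibble::column_to_rownames() if a downstream function expects them.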
Answer 3:
In the main use cases they behave the same:
library(dplyr)
identical(
filter(starwars, species == "Wookiee"),
subset(starwars, species == "Wookiee"))
# [1] TRUE
But they have quite a few differences, including (I was as exhaustive as possible but might have missed some):
- subset can be used on matrices
- filter can be used on databases
- filter drops row names
- subset has a select argument
- subset recycles its condition argument
- filter supports conditions as separate arguments
- filter supports the .data pronoun
- filter supports some rlang features
- filter supports grouping
- filter supports n() and row_number()
- filter is stricter
- filter is a bit faster when it counts
- subset has methods in other packages
subset can be used on matrices
subset(state.x77, state.x77[,"Population"] < 400)
# Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
# Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
Though columns can't be used directly as variables in the subset argument:
subset(state.x77, Population < 400)
Error in subset.matrix(state.x77, Population < 400) : object 'Population' not found
Neither works with filter
filter(state.x77, state.x77[,"Population"] < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"
filter(state.x77, Population < 400)
Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"
filter can be used on databases
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
tbl(con,"mtcars") %>%
filter(hp < 65)
# # Source: lazy query [?? x 11]
# # Database: sqlite 3.19.3 [:memory:]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset can't
tbl(con,"mtcars") %>%
subset(hp < 65)
Error in subset.default(., hp < 65) : object 'hp' not found
filter drops row names
filter(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset doesn't
subset(mtcars, hp < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
subset has a select argument
dplyr, by contrast, follows the tidyverse principle that each function should do one thing, so select is a separate function.
identical(
subset(starwars, species == "Wookiee", select = c("name", "height")),
filter(starwars, species == "Wookiee") %>% select(name, height)
)
# [1] TRUE
It also has a drop argument, which mostly makes sense when used together with the select argument.
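As a small illustration of how drop interacts with select, here is a sketch in base R only, reusing the hp < 65 condition from the examples above:

```r
# Without drop, selecting a single column still returns a data frame.
as_df <- subset(mtcars, hp < 65, select = mpg)
class(as_df)

# With drop = TRUE, the one-column result is simplified to a plain vector.
as_vec <- subset(mtcars, hp < 65, select = mpg, drop = TRUE)
class(as_vec)
```

This mirrors the drop argument of "[" for data frames, which is what subset passes it on to.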
subset recycles its condition argument
half_iris <- subset(iris,c(TRUE,FALSE))
dim(iris) # [1] 150 5
dim(half_iris) # [1] 75 5
filter doesn't
half_iris <- filter(iris,c(TRUE,FALSE))
Error in filter_impl(.data, quo) : Result must have length 150, not 2
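If the recycling behaviour is what you want with filter, the pattern can be expanded to full length explicitly. A minimal sketch, using n() to get the number of rows:

```r
library(dplyr)

# Build a full-length logical vector instead of relying on recycling.
# n() gives the number of rows in the current data (or group).
half_iris <- filter(iris, rep(c(TRUE, FALSE), length.out = n()))

# Every other row is kept, starting with the first.
dim(half_iris)
```

Being explicit about the length also makes it obvious what happens when the number of rows is not a multiple of the pattern length.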
filter supports conditions as separate arguments
Conditions are passed through ..., so several conditions can be given as separate arguments. This is equivalent to combining them with &, but can be more readable because it sidesteps logical operator precedence and benefits from automatic indentation.
identical(
subset(starwars,
(species == "Wookiee" | eye_color == "blue") &
mass > 120),
filter(starwars,
species == "Wookiee" | eye_color == "blue",
mass > 120)
)
filter supports the use of the .data pronoun
mtcars %>% filter(.data[["hp"]] < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter supports some rlang features
x <- "hp"
library(rlang)
mtcars %>% filter(!!sym(x) < 65)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
filter65 <- function(data,var){
data %>% filter(!!enquo(var) < 65)
}
mtcars %>% filter65(hp)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
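More recent versions of rlang and dplyr also provide the {{ }} ("embrace") operator, which combines enquo() and !! in one step. A sketch of the same kind of wrapper written that way; filter_below and its threshold argument are hypothetical names introduced for this example:

```r
library(dplyr)

# filter_below() is a hypothetical helper for illustration.
# {{ var }} forwards the column the caller typed into filter().
filter_below <- function(data, var, threshold) {
  data %>% filter({{ var }} < threshold)
}

res <- mtcars %>% filter_below(hp, 65)
res
```

subset has no comparable mechanism, which is one reason its documentation recommends it for interactive use rather than for programming.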
filter supports grouping
iris %>%
group_by(Species) %>%
filter(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 3 x 5
# # Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.6 3.6 1.0 0.2 setosa
# 2 5.1 2.5 3.0 1.1 versicolor
# 3 4.9 2.5 4.5 1.7 virginica
iris %>%
group_by(Species) %>%
subset(Petal.Length < quantile(Petal.Length,0.01))
# # A tibble: 2 x 5
# # Groups: Species [1]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.3 3.0 1.1 0.1 setosa
# 2 4.6 3.6 1.0 0.2 setosa
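To get a per-group condition with subset, the group-wise value has to be computed by hand. One possible base R sketch uses ave() to compute the 1% quantile within each species; the variable names cutoff and grouped_subset are only illustrative:

```r
# Per-group 1% quantile of Petal.Length, repeated for every row of its group.
cutoff <- ave(iris$Petal.Length, iris$Species,
              FUN = function(x) quantile(x, 0.01))

# Keep the rows that fall below their own group's cutoff.
grouped_subset <- subset(iris, Petal.Length < cutoff)
grouped_subset
```

This selects one row per species, like the grouped filter call above, but the grouping logic lives in the condition rather than in the data.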
filter supports n() and row_number()
filter(iris, row_number() < n()/30)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
filter is stricter
It triggers errors if the input is suspicious.
filter(iris, Species = "setosa")
# Error: `Species` (`Species = "setosa"`) must not be named, do you need `==`?
identical(subset(iris, Species = "setosa"), iris)
# [1] TRUE
df1 <- setNames(data.frame(a = 1:3, b=5:7),c("a","a"))
# df1
# a a
# 1 1 5
# 2 2 6
# 3 3 7
filter(df1, a > 2)
#Error: Column `a` must have a unique name
subset(df1, a > 2)
# a a.1
# 3 3 7
filter is a bit faster when it counts
Borrowing the dataset that Benjamin built in his answer (153,000 rows), it's about twice as fast, though this should rarely be a bottleneck.
air <- lapply(1:1000, function(x) airquality) %>% bind_rows
microbenchmark::microbenchmark(
subset = subset(air, Temp>80 & Month > 5),
filter = filter(air, Temp>80 & Month > 5)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# subset 8.771962 11.551255 19.942501 12.576245 13.933290 108.0552 100 b
# filter 4.144336 4.686189 8.024461 6.424492 7.499894 101.7827 100 a
subset has methods in other packages
subset is an S3 generic, just as dplyr::filter is, but as a base function subset is more likely to have methods defined in other packages. One prominent example is zoo:::subset.zoo.
Answer 4:
Interesting. I was comparing the resulting datasets, and I couldn't find an explanation for why the "[" operator behaved differently (i.e., why it also returned rows of NAs):
# Subset for year = 2013, four different ways
sub <- brfss2013 %>% filter(iyear == "2013")
dim(sub)
#[1] 486088 330
sum(is.na(sub$iyear))  # number of rows with a missing iyear
#[1] 0
sub2 <- filter(brfss2013, iyear == "2013")
dim(sub2)
#[1] 486088 330
sum(is.na(sub2$iyear))
#[1] 0
sub3 <- brfss2013[brfss2013$iyear == "2013", ]
dim(sub3)
#[1] 486093 330
sum(is.na(sub3$iyear))
#[1] 5
sub4 <- subset(brfss2013, iyear == "2013")
dim(sub4)
#[1] 486088 330
sum(is.na(sub4$iyear))
#[1] 0
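The five extra rows are consistent with how logical indexing treats missing values in R: where the comparison evaluates to NA, "[" returns a row filled with NA, whereas subset and filter treat NA as FALSE and drop the row. A small sketch with a made-up data frame (not the brfss2013 data):

```r
# Toy data with one missing year (illustrative values only).
df <- data.frame(iyear = c("2013", "2014", NA, "2013"), x = 1:4)

# The comparison yields TRUE, FALSE, NA, TRUE.
cond <- df$iyear == "2013"

# "[" keeps a placeholder row of NAs for the NA in the condition.
by_bracket <- df[cond, ]

# subset() treats NA in the condition as FALSE.
by_subset <- subset(df, iyear == "2013")

# Wrapping the condition in which() makes "[" behave the same way.
by_which <- df[which(cond), ]

c(bracket = nrow(by_bracket), subset = nrow(by_subset), which = nrow(by_which))
```

So when indexing with "[" on a column that may contain NA, wrapping the condition in which() (or adding & !is.na(...)) avoids the spurious rows.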
Answer 5:
Another difference is that subset does more than filter: it can also select columns (and takes a drop argument), whereas dplyr splits filtering and selecting into two separate functions.
subset(df, select=c("varA", "varD"))
dplyr::select(df,varA, varD)
Answer 6:
An additional advantage of filter is that it plays nicely with grouped data. subset ignores groupings.
So when the data is grouped, subset will still evaluate the condition against the whole data, but filter will evaluate it within each group.
# setup
library(tidyverse)
data.frame(a = 1:2) %>% group_by(a) %>% subset(length(a) == 1)
# returns empty table
data.frame(a = 1:2) %>% group_by(a) %>% filter(length(a) == 1)
# returns all rows
Source: https://stackoverflow.com/questions/39882463/difference-between-subset-and-filter-from-dplyr