Difference between subset and filter from dplyr

Happy的楠姐 2020-12-14 06:03

It seems to me that subset and filter (from dplyr) give the same result. But my question is: is there at some point a potential difference, for example in speed, data sizes i

6 Answers
  • 2020-12-14 06:28

    Interesting. I was trying to see the difference in terms of the resulting dataset, and I couldn't find an explanation for why the "[" operator behaved differently (i.e., why it also returned NAs):

    # Subset for year = 2013 with filter (piped)
    sub <- brfss2013 %>% filter(iyear == "2013")
    dim(sub)
    #[1] 486088    330
    sum(is.na(sub$iyear))   # number of rows with a missing year
    #[1] 0

    # Same filter, called directly
    sub2 <- filter(brfss2013, iyear == "2013")
    dim(sub2)
    #[1] 486088    330
    sum(is.na(sub2$iyear))
    #[1] 0

    # Base R "[" with a logical index
    sub3 <- brfss2013[brfss2013$iyear == "2013", ]
    dim(sub3)
    #[1] 486093    330
    sum(is.na(sub3$iyear))
    #[1] 5

    # Base R subset
    sub4 <- subset(brfss2013, iyear == "2013")
    dim(sub4)
    #[1] 486088    330
    sum(is.na(sub4$iyear))
    #[1] 0
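
    The extra rows most likely come from how "[" treats missing values: a logical index that is NA yields a row filled with NA, whereas subset (and, per its documentation, filter) keep only rows where the condition is TRUE. A minimal sketch on a made-up data frame (toy is hypothetical and stands in for the BRFSS data, which is not reproduced here):

    ```r
    # Toy data with one missing year (hypothetical, for illustration only)
    toy <- data.frame(iyear = c("2013", "2014", NA, "2013"), id = 1:4)

    # "[" returns a row of NAs wherever the condition evaluates to NA
    by_bracket <- toy[toy$iyear == "2013", ]

    # subset() treats an NA condition as FALSE and drops the row
    by_subset <- subset(toy, iyear == "2013")

    # Wrapping the condition in which() makes "[" drop those rows too
    by_which <- toy[which(toy$iyear == "2013"), ]

    nrow(by_bracket)              # 3
    sum(is.na(by_bracket$iyear))  # 1
    nrow(by_subset)               # 2
    nrow(by_which)                # 2
    ```

    With five missing iyear values in the full data, this would account for the five extra rows in sub3.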
    
  • 2020-12-14 06:29

    An additional advantage of filter is that it plays nice with grouped data. subset ignores groupings.

    So when the data is grouped, subset will still make reference to the whole data, but filter will only reference the group.

    # setup
    library(tidyverse)
    
    data.frame(a = 1:2) %>% group_by(a) %>% subset(length(a) == 1) 
    # returns empty table
    
    data.frame(a = 1:2) %>% group_by(a) %>% filter(length(a) == 1) 
    # returns all rows
    
  • 2020-12-14 06:35

    They are, indeed, producing the same result, and they are very similar in concept.

    The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (roughly five times faster in your example, but that's measured in microseconds).

    As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 200 microseconds (median). And at 153,000 records, filter is roughly 2.5 times faster (measured in milliseconds).

    So in terms of human time, I don't think there's much difference between the two.

    The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.

    Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

    library(dplyr)
    library(microbenchmark)
    
    # Original example
    # Name the expressions so the output rows are labelled subset / filter
    microbenchmark(
      subset = subset(airquality, Temp > 80 & Month > 5),
      filter = filter(airquality, Temp > 80 & Month > 5)
    )
    
    Unit: microseconds
       expr     min       lq     mean   median      uq      max neval cld
     subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a 
     filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b
    
    
    # 15,300 rows
    air <- lapply(1:100, function(x) airquality) %>% bind_rows
    
    microbenchmark(
      subset = subset(air, Temp > 80 & Month > 5),
      filter = filter(air, Temp > 80 & Month > 5)
    )
    
    Unit: microseconds
       expr      min        lq     mean   median       uq      max neval cld
     subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
     filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a 
    
    # 153,000 rows
    air <- lapply(1:1000, function(x) airquality) %>% bind_rows
    
    microbenchmark(
      subset = subset(air, Temp > 80 & Month > 5),
      filter = filter(air, Temp > 80 & Month > 5)
    )
    
    Unit: milliseconds
       expr       min        lq     mean    median        uq      max neval cld
     subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
     filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a 
    
  • 2020-12-14 06:42

    In the main use cases they behave the same:

    library(dplyr)
    identical(
      filter(starwars, species == "Wookiee"),
      subset(starwars, species == "Wookiee"))
    # [1] TRUE
    

    But they have quite a few differences, including (I was as exhaustive as possible but might have missed some):

    • subset can be used on matrices
    • filter can be used on databases
    • filter drops row names
    • subset drops attributes other than class, names and row names
    • subset has a select argument
    • subset recycles its condition argument
    • filter supports conditions as separate arguments
    • filter supports the .data pronoun
    • filter supports some rlang features
    • filter supports grouping
    • filter supports n() and row_number()
    • filter is stricter
    • filter is a bit faster when it counts
    • subset has methods in other packages

    subset can be used on matrices

    subset(state.x77, state.x77[,"Population"] < 400)
    #         Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
    # Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432
    # Wyoming        376   4566        0.6    70.29    6.9    62.9   173  97203
    

    However, columns can't be used directly as variables in the subset argument:

    subset(state.x77, Population < 400)
    

    Error in subset.matrix(state.x77, Population < 400) : object 'Population' not found

    Neither works with filter

    filter(state.x77, state.x77[,"Population"] < 400)
    

    Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"

    filter(state.x77, Population < 400)
    

    Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"

    filter can be used on databases

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "mtcars", mtcars)
    tbl(con,"mtcars") %>% 
      filter(hp < 65)
    
    # # Source:   lazy query [?? x 11]
    # # Database: sqlite 3.19.3 [:memory:]
    #       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #   1  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
    #   2  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
    

    subset can't

    tbl(con,"mtcars") %>% 
      subset(hp < 65)
    

    Error in subset.default(., hp < 65) : object 'hp' not found

    filter drops row names

    filter(mtcars, hp < 65)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
    

    subset doesn't

    subset(mtcars, hp < 65)
    #              mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # Merc 240D   24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # Honda Civic 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
    

    subset drops attributes other than class, names and row names

    cars_head <- head(cars)
    attr(cars_head, "info") <- "head of cars dataset"
    attributes(subset(cars_head, speed > 0))
    #> $names
    #> [1] "speed" "dist" 
    #> 
    #> $row.names
    #> [1] 1 2 3 4 5 6
    #> 
    #> $class
    #> [1] "data.frame"
    
    attributes(filter(cars_head, speed > 0))
    #> $names
    #> [1] "speed" "dist" 
    #> 
    #> $row.names
    #> [1] 1 2 3 4 5 6
    #> 
    #> $class
    #> [1] "data.frame"
    #> 
    #> $info
    #> [1] "head of cars dataset"
    

    subset has a select argument

    dplyr follows the tidyverse principle that each function should do one thing, so select is a separate function.

    identical(
    subset(starwars, species == "Wookiee", select = c("name", "height")),
    filter(starwars, species == "Wookiee") %>% select(name, height)
    )
    # [1] TRUE
    

    subset also has a drop argument, which mostly makes sense in combination with the select argument.
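
    As a minimal sketch of what drop does (using the built-in mtcars data; the names as_df and as_vec are just illustrative): when select leaves a single column, drop = TRUE collapses the result to a plain vector instead of a one-column data frame.

    ```r
    # Default (drop = FALSE): a single selected column stays a data frame
    as_df <- subset(mtcars, hp < 65, select = mpg)

    # drop = TRUE: the one-column result collapses to a plain vector
    as_vec <- subset(mtcars, hp < 65, select = mpg, drop = TRUE)

    is.data.frame(as_df)  # TRUE
    is.data.frame(as_vec) # FALSE
    is.numeric(as_vec)    # TRUE
    ```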

    subset recycles its condition argument

    half_iris <- subset(iris,c(TRUE,FALSE))
    dim(iris) # [1] 150   5
    dim(half_iris) # [1] 75  5
    

    filter doesn't

    half_iris <- filter(iris,c(TRUE,FALSE))
    

    Error in filter_impl(.data, quo) : Result must have length 150, not 2

    filter supports conditions as separate arguments

    Conditions are passed through ..., so we can supply several conditions as separate arguments. This is the same as combining them with &, but it can be more readable because it avoids operator-precedence pitfalls and benefits from automatic indentation.

    identical(
      subset(starwars, 
             (species == "Wookiee" | eye_color == "blue") &
               mass > 120),
      filter(starwars, 
             species == "Wookiee" | eye_color == "blue", 
             mass > 120)
    )
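
    The parentheses in the subset call matter because & binds more tightly than | in R. A small base R sketch with made-up flags (is_wookiee, blue_eyes and heavy are hypothetical values for a single row):

    ```r
    # Hypothetical flags for one row
    is_wookiee <- TRUE
    blue_eyes  <- FALSE
    heavy      <- FALSE

    # Intended meaning: (Wookiee or blue eyes) AND heavy
    with_parens <- (is_wookiee | blue_eyes) & heavy

    # Without parentheses R reads this as: Wookiee OR (blue eyes AND heavy)
    without_parens <- is_wookiee | blue_eyes & heavy

    with_parens     # FALSE
    without_parens  # TRUE
    ```

    Passing the two conditions to filter as separate arguments sidesteps this, since each argument is evaluated on its own before they are combined.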
    

    filter supports the .data pronoun

    mtcars %>% filter(.data[["hp"]] < 65)
    
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
    

    filter supports some rlang features

    x <- "hp"
    library(rlang)
    mtcars %>% filter(!!sym(x) < 65)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
    
    
    filter65 <- function(data,var){
      data %>% filter(!!enquo(var) < 65)
    }
    mtcars %>% filter65(hp)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
    

    filter supports grouping

    iris %>%
      group_by(Species) %>%
      filter(Petal.Length < quantile(Petal.Length,0.01))
    
    # # A tibble: 3 x 5
    # # Groups:   Species [3]
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    #          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
    # 1          4.6         3.6          1.0         0.2     setosa
    # 2          5.1         2.5          3.0         1.1 versicolor
    # 3          4.9         2.5          4.5         1.7  virginica
    
    iris %>%
      group_by(Species) %>%
      subset(Petal.Length < quantile(Petal.Length,0.01))
    
    # # A tibble: 2 x 5
    # # Groups:   Species [1]
    #     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #            <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
    #   1          4.3         3.0          1.1         0.1  setosa
    #   2          4.6         3.6          1.0         0.2  setosa
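
    If a per-group condition is needed without dplyr, one possible base R workaround (a sketch, not something the answer above proposes) is to compute the group-wise cutoff with ave() and pass it to subset. The names cutoff and per_group are illustrative:

    ```r
    # ave() returns, for every row, the statistic computed within that row's group
    cutoff <- ave(iris$Petal.Length, iris$Species,
                  FUN = function(x) quantile(x, 0.01))

    # subset() now compares each row against its own species' cutoff
    per_group <- subset(iris, Petal.Length < cutoff)

    nrow(per_group)  # 3, one row per species, like the grouped filter result
    ```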
    

    filter supports n() and row_number()

    filter(iris, row_number() < n()/30)
    # Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1          5.1         3.5          1.4         0.2  setosa
    # 2          4.9         3.0          1.4         0.2  setosa
    # 3          4.7         3.2          1.3         0.2  setosa
    # 4          4.6         3.1          1.5         0.2  setosa
    

    filter is stricter

    It triggers errors if the input is suspicious.

    filter(iris, Species = "setosa")
    # Error: `Species` (`Species = "setosa"`) must not be named, do you need `==`?
    
    identical(subset(iris, Species = "setosa"), iris)
    # [1] TRUE
    
    df1 <- setNames(data.frame(a = 1:3, b=5:7),c("a","a"))
    # df1
    # a a
    # 1 1 5
    # 2 2 6
    # 3 3 7
    
    filter(df1, a > 2)
    #Error: Column `a` must have a unique name
    subset(df1, a > 2)
    # a a.1
    # 3 3   7
    

    filter is a bit faster when it counts

    Borrowing the dataset that Benjamin built in his answer (153k rows), filter is about twice as fast, though this should rarely be a bottleneck.

    air <- lapply(1:1000, function(x) airquality) %>% bind_rows
    microbenchmark::microbenchmark(
      subset = subset(air, Temp>80 & Month > 5),
      filter = filter(air, Temp>80 & Month > 5)
    )
    
    # Unit: milliseconds
    #   expr      min        lq      mean    median        uq      max neval cld
    # subset 8.771962 11.551255 19.942501 12.576245 13.933290 108.0552   100   b
    # filter 4.144336  4.686189  8.024461  6.424492  7.499894 101.7827   100  a 
    

    subset has methods in other packages

    subset is an S3 generic, just as dplyr::filter is, but as a base function subset is more likely to have methods defined in other packages. One prominent example is zoo:::subset.zoo.
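
    To illustrate how a package can hook into subset, here is a minimal sketch of an S3 method for a toy class. The class "readings" and its values field are hypothetical and exist only for this example; real methods such as the zoo one are more involved.

    ```r
    # A toy S3 class (hypothetical) wrapping a numeric vector
    readings <- structure(list(values = c(3, 8, 12, 5)), class = "readings")

    # A subset() method for that class: keep the values where the condition holds
    subset.readings <- function(x, subset, ...) {
      # Evaluate the unquoted condition with the object's fields in scope
      keep <- eval(substitute(subset), x, parent.frame())
      x$values <- x$values[keep & !is.na(keep)]
      x
    }

    # The base generic dispatches to the method above
    kept <- subset(readings, values > 4)
    kept$values  # 8 12 5
    ```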

  • 2020-12-14 06:45

    One additional difference not yet mentioned is that filter discards rownames, while subset doesn't:

    filter(mtcars, gear == 5)
    
       mpg cyl  disp  hp drat    wt qsec vs am gear carb
    1 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
    2 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
    3 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
    4 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
    5 15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8

    subset(mtcars, gear == 5)
                    mpg cyl  disp  hp drat    wt qsec vs am gear carb
    Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
    Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
    Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
    Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
    Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
    
  • 2020-12-14 06:45

    Another difference is that subset does more than filter: it can also select and drop columns, whereas dplyr uses separate functions for filtering and selecting.

    subset(df, select=c("varA", "varD"))
    
    dplyr::select(df,varA, varD)
    