Difference between subset and filter from dplyr

Happy的楠姐 2020-12-14 06:03

It seems to me that subset and filter (from dplyr) produce the same result. But my question is: is there at some point a potential difference, e.g. speed, data sizes i

6 answers
  •  隐瞒了意图╮
    2020-12-14 06:35

    They are, indeed, producing the same result, and they are very similar in concept.
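    As a quick check of that claim, here is a minimal sketch that applies the question's condition with both functions and compares the values. It assumes dplyr is installed; the names df1, df2 and same_values are only illustrative. Row names are deliberately ignored in the comparison, because subset keeps the original row numbers and the two results can differ there.

    ```r
    library(dplyr)

    # Same condition as in the question, applied with both functions
    df1 <- subset(airquality, Temp > 80 & Month > 5)
    df2 <- filter(airquality, Temp > 80 & Month > 5)

    # Compare column values only; attributes such as row names are ignored
    same_values <- isTRUE(all.equal(df1, df2, check.attributes = FALSE))
    same_values
    ```

    Both functions also drop rows where the condition evaluates to NA, so missing values should not make the results diverge.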

    The advantage of subset is that it is part of base R and doesn't require any additional packages. With small data sets, it seems to be a bit faster than filter (roughly five times faster in your example, but that's measured in microseconds).

    As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 rows, filter outpaces subset by about 200 microseconds (median). And at 153,000 rows, filter is roughly 2.5 times faster by median (measured in milliseconds).
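    To see how the gap changes with size on your own machine, the comparison can be wrapped in a loop. This is only a sketch, assuming dplyr and microbenchmark are installed; the names sizes and ratios and the choice of times = 20 are illustrative, and the numbers will vary between machines and package versions.

    ```r
    library(dplyr)
    library(microbenchmark)

    # Number of stacked copies of airquality (153 rows each)
    sizes <- c(1, 100, 1000)

    ratios <- sapply(sizes, function(k) {
      air <- bind_rows(replicate(k, airquality, simplify = FALSE))
      mb <- microbenchmark(
        subset = subset(air, Temp > 80 & Month > 5),
        filter = filter(air, Temp > 80 & Month > 5),
        times = 20
      )
      s <- summary(mb)
      # Median time of subset divided by median time of filter
      s$median[s$expr == "subset"] / s$median[s$expr == "filter"]
    })
    names(ratios) <- paste0(sizes * nrow(airquality), " rows")
    ratios
    ```

    A ratio below 1 means subset was faster at that size; a ratio above 1 means filter was faster.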

    So in terms of human time, I don't think there's much difference between the two.

    The other advantage of filter (admittedly a niche one) is that, through the dbplyr backend, it can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
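    A minimal sketch of the database case, assuming the DBI, RSQLite and dbplyr packages are installed. An in-memory SQLite database stands in for a real server here, and the names con, air_db and hot are illustrative. The filter call is translated to SQL and only runs in the database when collect() is called.

    ```r
    library(dplyr)
    library(DBI)

    # In-memory SQLite database used as a stand-in for a remote SQL server
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "airquality", airquality)

    # A lazy reference to the table; no rows are pulled into R yet
    air_db <- tbl(con, "airquality")
    hot <- air_db %>% filter(Temp > 80 & Month > 5)

    # Inspect the SQL that dbplyr generates for the filter
    show_query(hot)

    # Run the query in the database and fetch only the matching rows
    result <- collect(hot)
    dbDisconnect(con)
    ```

    Only the filtered rows cross into R's memory, which is the point when the table is too large to load in full.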

    Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

    library(dplyr)
    library(microbenchmark)
    
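    # Original example: 153 rows (built-in airquality data set)
    # Expressions are named so the labels match the output below
    microbenchmark(
      subset = subset(airquality, Temp > 80 & Month > 5),
      filter = filter(airquality, Temp > 80 & Month > 5)
    )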
    
    Unit: microseconds
       expr     min       lq     mean   median      uq      max neval cld
     subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a 
     filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b
    
    
    # 15,300 rows: stack 100 copies of airquality
    air <- lapply(1:100, function(x) airquality) %>% bind_rows()
    
    microbenchmark(
      subset = subset(air, Temp > 80 & Month > 5),
      filter = filter(air, Temp > 80 & Month > 5)
    )
    
    Unit: microseconds
       expr      min        lq     mean   median       uq      max neval cld
     subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
     filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a 
    
    # 153,000 rows: stack 1,000 copies of airquality
    air <- lapply(1:1000, function(x) airquality) %>% bind_rows()
    
    microbenchmark(
      subset = subset(air, Temp > 80 & Month > 5),
      filter = filter(air, Temp > 80 & Month > 5)
    )
    
    Unit: milliseconds
       expr       min        lq     mean    median        uq      max neval cld
     subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
     filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a 
    
