Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

孤城傲影 2020-12-24 14:20

I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
library(microbenchmark)

set.seed(0)
dummy_data <- data_frame(id = floor(runif(100000, 1, 100000)), label = floor(runif(100000, 1, 4)))

microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))

4 Answers
  •  离开以前
    2020-12-24 15:06

    So, I finally got around to running the expressions in OP's question:

    set.seed(0)
    dummy_data <- dplyr::data_frame(
      id=floor(runif(100000, 1, 100000))
      , label=floor(runif(100000, 1, 4))
    )
    
    microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
    microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))
    

    This took so long that I thought I'd run into a bug, and force-interrupted R.

    Trying again, with the number of repetitions cut down, I got the following times:

    microbenchmark(
        b=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
        d=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
        times=2)
    
    #Unit: seconds
    # expr      min       lq     mean   median       uq      max neval
    #    b 2.091957 2.091957 2.162222 2.162222 2.232486 2.232486     2
    #    d 7.380610 7.380610 7.459041 7.459041 7.537471 7.537471     2
    

    The times are in seconds! So much for milliseconds or microseconds. No wonder it seemed like R had hung at first, with the default value of times=100.

    But why is it taking so long? First, because of the way the dataset is constructed, the id column contains about 63,000 distinct values:

    length(unique(dummy_data$id))
    #[1] 63052
    

    Second, the expression being summarised itself contains several pipes, and each group of data is going to be relatively small.
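
    With about 63,000 groups spread over 100,000 rows, the average group holds well under two rows. A quick sanity check (my own sketch, reusing the dummy_data defined above):

    nrow(dummy_data) / length(unique(dummy_data$id))
    # roughly 1.6 rows per id on average
    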

    This is essentially the worst-case scenario for a piped expression: it's being called very many times, and each time, it's operating over a very small set of inputs. This results in plenty of overhead, and not much computation for that overhead to be amortised over.
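
    To see that per-call overhead in isolation, here is a small sketch (not from the original answer) that times a direct call against the equivalent pipe on a tiny vector, which is roughly what summarise() does once per group:

    x <- c(1, 2, 2)
    microbenchmark(
        direct = list(unique(x)),
        piped  = x %>% unique %>% list
    )
    # Expect the piped version to show a higher per-call time: the pipe has to
    # build and evaluate extra function calls on every invocation, and with
    # ~63000 groups that fixed cost is paid ~63000 times.
    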

    By contrast, if we just switch the variables that are being grouped and summarised over:

    microbenchmark(
        b=dummy_data %>% group_by(label) %>% summarise(list(unique(id))),
        d=dummy_data %>% group_by(label) %>% summarise(id %>% unique %>% list),
        times=2)
    
    #Unit: milliseconds
    # expr      min       lq     mean   median       uq      max neval
    #    b 12.00079 12.00079 12.04227 12.04227 12.08375 12.08375     2
    #    d 10.16612 10.16612 12.68642 12.68642 15.20672 15.20672     2
    

    Now everything looks much more equal.
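
    That is because label has only three distinct values (floor(runif(n, 1, 4)) can only produce 1, 2 or 3), so the inner pipes now run just three times instead of ~63,000, and their fixed per-call cost disappears into the noise. A quick check, assuming the same dummy_data:

    length(unique(dummy_data$label))
    # should print 3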
