How to speed up subset by groups

Asked 2020-11-29 06:23

I used to do my data wrangling with dplyr, but some of the computations are "slow". In particular, subsetting by groups: I read that dplyr is slow when there are a lot of groups.

2 answers
  • 2020-11-29 06:24

    Great question!

    I'll assume df and dt to be the names of objects for easy/quick typing.
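
    (A minimal sketch of data comparable to the question's datas.tbl and datas.dt, so the snippets below are runnable. The column names come from the question; the sizes and value ranges are assumptions, so the timings won't match exactly.)

    library(data.table)
    library(dplyr)

    set.seed(1L)
    N <- 250000L                                  # row count referenced below
    datas <- data.frame(
      id1      = sample(1e4L, N, replace = TRUE), # assumed id ranges
      id2      = sample(1e3L, N, replace = TRUE),
      datetime = as.POSIXct("2015-01-01") + sample(1e7L, N, replace = TRUE)
    )
    datas.tbl <- as_tibble(datas)                 # tbl for dplyr
    datas.dt  <- as.data.table(datas)             # data.table copy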

    df = datas.tbl
    dt = datas.dt
    

    Comparison at -O3 level optimisation:

    First, here's the timing on my system with the current CRAN version of dplyr and the devel version of data.table. The devel version of dplyr seems to suffer from a performance regression (which is being fixed by Romain).

    system.time(df %>% group_by(id1, id2) %>% filter(datetime == max(datetime)))
    #  25.291   0.128  25.610 
    
    system.time(dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
    #  17.191   0.075  17.349 
    

    I ran this quite a few times, and the timings didn't seem to change. However, I compile all packages with the -O3 optimisation flag (by setting ~/.R/Makevars appropriately), and I've observed that at -O3 data.table's performance improves much more than that of the other packages I've compared it with.
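
    For reference, a ~/.R/Makevars along these lines enables -O3 for packages compiled from source (the exact flags here are illustrative, not necessarily the ones used for the timings above):

    # ~/.R/Makevars -- compile packages with -O3 (illustrative)
    CFLAGS   = -O3
    CXXFLAGS = -O3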

    Grouping speed comparison

    Second, it's important to understand the reason for the slowness. Let's first compare the time taken just to group.

    system.time(group_by(df, id1, id2))
    #   0.303   0.007   0.311 
    system.time(data.table:::forderv(dt, by = c("id1", "id2"), retGrp = TRUE))
    #   0.002   0.000   0.002 
    

    Even though there are a total of 250,000 rows, your data size is only around 38MB. At this size, it's unlikely to see a noticeable difference in grouping speed.
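
    You can check the in-memory size of your own table with, for example:

    print(object.size(dt), units = "Mb")   # rough in-memory footprint of the data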

    data.table's grouping is >100x faster here, so grouping is clearly not the reason for the slowness...

    Why is it slow?

    So what's the reason? Let's turn on the datatable.verbose option and check again:

    options(datatable.verbose = TRUE)
    dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
    # Detected that j uses these columns: datetime 
    # Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
    # lapply optimization is on, j unchanged as '.I[datetime == max(datetime)]'
    # GForce is on, left j unchanged
    # Old mean optimization is on, left j unchanged.
    # Starting dogroups ... 
    #   memcpy contiguous groups took 0.097s for 230000 groups
    #   eval(j) took 17.129s for 230000 calls
    # done dogroups in 17.597 secs
    

    So eval(j) alone took ~97% of the time! The expression we've provided in j is evaluated once for each group. Since you have 230,000 groups, and each eval() call carries a penalty, it all adds up.
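
    To get a feel for how per-call overhead adds up (purely illustrative, not the benchmark above), compare making ~230K tiny calls with making one vectorised call over the same data:

    x <- runif(230000)
    system.time(vapply(seq_along(x), function(i) max(x[i]), numeric(1)))  # ~230K tiny calls
    system.time(max(x))                                                   # one call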

    Avoiding the eval() penalty

    Since we're aware of this penalty, we've gone ahead and started implementing internal versions of some commonly used functions: sum, mean, min, max. This will/should be expanded to as many other functions as possible (when we find time).

    So, let's try computing the time for just obtaining max(datetime) first:

    dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
    # Detected that j uses these columns: datetime 
    # Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
    # lapply optimization is on, j unchanged as 'list(max(datetime))'
    # GForce optimized j to 'list(gmax(datetime))'
    

    And it's instant. Why? Because max() gets internally optimised to gmax() and there's no eval() call for each of the 230K groups.

    So why isn't datetime == max(datetime) instant? Because it's more complicated to parse such expressions and optimise internally, and we have not gotten to it yet.

    Workaround

    So now that we know the issue, and a way to get around it, let's use it.

    dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
    dt[dt.agg, on = c("id1", "id2", "datetime")] # v1.9.5+
    

    This takes ~0.14 seconds on my Mac.

    Note that this is only fast because the expression gets optimised to gmax(). Compare it with:

    dt[, .(datetime = base::max(datetime)), by = .(id1, id2)]
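
    To see what the optimiser did in each case, the verbose output makes it explicit (with options(datatable.verbose = TRUE) still set from above, or by passing verbose = TRUE directly):

    # max():       GForce rewrites j to 'list(gmax(datetime))', so no per-group eval()
    # base::max(): GForce doesn't recognise the call, so j is left unchanged and
    #              eval()ed once per group, which is where the time goes
    dt[, .(datetime = base::max(datetime)), by = .(id1, id2), verbose = TRUE]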
    

    I agree optimising more complicated expressions to avoid the eval() penalty would be the ideal solution, but we are not there yet.

  • 2020-11-29 06:24

    How about summarizing the data.table and joining it back to the original data?

    system.time({
      datas1 <- datas.dt[, list(datetime=max(datetime)), by = c("id1", "id2")] #summarize the data
      setkey(datas1, id1, id2, datetime)
      setkey(datas.dt, id1, id2, datetime)
      datas2 <- datas.dt[datas1]
    })
    #  user  system elapsed 
    # 0.083   0.000   0.084 
    

    which correctly filters the data, matching the slower .I-based approach:

    system.time(dat1 <- datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
    #   user  system elapsed 
    # 23.226   0.000  23.256 
    all.equal(dat1, datas2)
    # [1] TRUE
    

    Addendum

    The setkey calls are superfluous if you are using the devel version of data.table (thanks to @akrun for the pointer):

    system.time({
      datas1 <- datas.dt[, list(datetime=max(datetime)), by = c("id1", "id2")] #summarize the data
      datas2 <- datas.dt[datas1, on=c('id1', 'id2', 'datetime')]
    })
    