Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan

广开言路 2020-11-27 16:57

I recently discovered binary search in data.table. If the table is keyed on multiple columns, is it possible to search on the 2nd key column only?

DT = dat         


        
2 Answers
  •  生来不讨喜
    2020-11-27 17:35

    Based on this email thread I wrote the following functions:

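    # Build a lookup table: the chosen columns of dt plus the original row
    # number i, keyed on those columns so it can be binary searched.
    create_index = function(dt, ..., verbose = getOption("datatable.verbose")) {
      # unexported data.table helper; returns the names given in ... as character
      cols = data.table:::getdots()
      res = dt[, cols, with=FALSE]
      res[, i := seq_len(nrow(dt))]
      setkeyv(res, cols, verbose = verbose)
    }
    
    # Row numbers in the original table whose indexed columns match the values in ...
    JI = function(index, ...) {
      # Extracting i from the join result also works on data.table versions
      # where index[J(...), i] returns a plain vector rather than a data.table.
      index[J(...)][["i"]]
    }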
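
    The idea behind these two functions can be seen on a small table. This is an illustration only: the toy data, the column names x, y and v, and the variable names idx and rows are assumptions, and the lookup table is built directly rather than through create_index.

    ```r
    library(data.table)

    # Toy table keyed on (x, y); the column names mirror those in the timings.
    set.seed(1)
    DT <- data.table(x = sample(letters[1:3], 20, replace = TRUE),
                     y = sample(1:5, 20, replace = TRUE),
                     v = rnorm(20))
    setkey(DT, x, y)

    # Lookup table: y plus the row number of each row in DT, keyed on y alone.
    idx <- data.table(y = DT$y, i = seq_len(nrow(DT)))
    setkey(idx, y)

    # Binary search on idx gives the row numbers in DT where y == 3;
    # nomatch = 0L drops the lookup value if it does not occur at all.
    rows <- idx[J(3L), nomatch = 0L]$i
    via_index <- DT[rows]

    # The row numbers agree with a plain vector scan over y.
    stopifnot(identical(sort(rows), which(DT$y == 3L)))
    ```

    The lookup table stores only the indexed column and a row number, so DT itself keeps its (x, y) key and physical order.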
    

    Here are the results on my system with a larger DT (1e8 rows):

    > system.time(DT[J("c")])
       user  system elapsed 
      0.168   0.136   0.306 
    
    > system.time(DT[J(unique(x), 25)])
       user  system elapsed 
      2.472   1.508   3.980 
    > system.time(DT[y==25])
       user  system elapsed 
      4.532   2.149   6.674 
    
    > system.time(IDX_y <- create_index(DT, y))
       user  system elapsed 
      3.076   2.428   5.503 
    > system.time(DT[JI(IDX_y, 25)])
       user  system elapsed 
      0.512   0.320   0.831     
    

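    Building the index took about 5.5 s here, a little less than one vector scan (6.7 s), and each lookup through it took about 0.8 s. So it is worth it if you use the index more than once.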
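
    Newer data.table releases also offer secondary indices, which address the same need without a hand-built lookup table. A minimal sketch, assuming an installed version that provides setindex(), indices() and the on= argument; the toy data and the name res are made up for illustration, and no timings are claimed for it:

    ```r
    library(data.table)

    set.seed(1)
    DT <- data.table(x = sample(letters[1:3], 20, replace = TRUE),
                     y = sample(1:5, 20, replace = TRUE),
                     v = rnorm(20))
    setkey(DT, x, y)     # primary key: rows physically sorted by x, then y

    setindex(DT, y)      # secondary index on y; DT is not reordered
    indices(DT)          # names of the indices currently stored on DT

    # Join on y alone, without touching the (x, y) key.
    res <- DT[.(3L), on = "y", nomatch = 0L]

    # Same rows as a vector scan would select.
    stopifnot(all(res$y == 3L), nrow(res) == sum(DT$y == 3L))
    ```

    Unlike setkey, setindex leaves the row order and the existing key untouched, so several indices can coexist on one table.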
