Can we do binary search in data.table with OR select queries

后端 未结 1 1804
时光取名叫无心
时光取名叫无心 2020-12-29 00:24

Following the previous question using data.table

DT = data.table(x=sample(letters,1e7,T),y=sample(1:25,1e7,T),rnorm(1e7))
setkey(DT,x,y)
         


        
相关标签:
1条回答
  • 2020-12-29 01:06

    Yes:
    In order to do a binary serach, we need the appropriate indices.

      indx <- rbind(DT[y==25, list(y=25), by=x], DT[.("a"), list(x="a"), by=y], use.names=TRUE)
      indx <- setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE])))
      indx
    
      DT[.(indx)]
    

    Benchmarking:
    This gives us more than a 10x improvement over vectorized serach.

      identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
      # [1] TRUE
    
    
      library(microbenchmark)
      microbenchmark(UsingIndx = DT[.(indx)],  UsingVecSearch = DT[x=="a" | y == 25], times=100 )
    
      Unit: milliseconds
                  expr       min        lq    median        uq      max
      1      UsingIndx  34.27562  41.70119  48.13215  49.29752 231.1669
      2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842
    


    For convenience, we can wrap the "creating the index" portion of the code into a nice function, so that we can then call it in a single line. For example:

    DT[.(OrIndx("a", 25, DT))]
    

    Where OrIndx() is defined as follows:

    OrIndx <- function(xval, yval, DT)  {
      # TODO: Allow for arbitrary columns and column names
      if(!is.data.table(DT))
        stop("DT is not a data.table")
    
      # create all appropriate combinations
      indx <- rbind(DT[y==yval, list(y=yval), by=x], DT[.(xval), list(x=xval), by=y], use.names=TRUE)
    
      # take out any combinations in indx that are not actually present in DT and return 
      return( setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE]))) )
    }
    

    Explanation:

    The idea here is that performing an "or" serach requires some form of combination.
    In a standard vector search, this combination is of the results of each individual vector serach.

    data.table offers some great speed improvements by allowing seraches such as

     DT[.(c("cdf", "tmb"), c(25, 3))] 
    

    Therefore, a natural solution to the question would be to use:

     DT[.(c(<all values of x>, "a"), c(25, <all values of y>))] 
    

    The only problem is that the recycling would not line up properly.
    It would be ideal to have an option like

     DT[.(  list( c(unique(x), y=25), c(x="a", y=unique(y) )  )]
    

    But as far as I can tell that has not been implemented (yet!)
    So instead, we can take appropriate combinations.
    The function OrIndx above does exactly that. (it s quick & dirty and there are more efficient ways of creating the index)


    Update with additional benchmarks

    As per @Aruns suggestion, we include

    rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])
    rbindlist(list( DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)] ))
    

    Tested on 1e6 and 1e7 rows:

      ## Using 1 Million rows
      > microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_Rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]),  Using_VecSearch = DT[x=="a" | y == 25], times=70L )
      Unit: milliseconds
                   expr       min        lq    median        uq        max
      1      Using_Indx  4.865089  5.755615  5.813938  5.957352   6.880743
      2     Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670
      3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338
      4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946
    
    
      ## Using 10 Milliion rows
      Unit: milliseconds
                   expr       min       lq   median        uq      max
      1      Using_Indx  33.71108  47.5402  48.7574  50.75285 122.0950
      2     Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907
      3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639
      4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 765.3997
    
      # Making sure all the same results: 
      > identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
      [1] TRUE
      > identical(setkey(DT[.(indx)]), setkey(rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])))
      [1] TRUE
    

    Note that for SMALL tabbles (less than 15K rows), vector search is faster (for really small tables, about twice as fast)

      ## Using  100 Rows
      > microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]),  Using_VecSearch = DT[x=="a" | y == 25], times=150L )
      Unit: microseconds
                   expr      min       lq    median       uq      max
      1      Using_Indx  884.819  901.854  917.3715  933.642 9740.046
      2     Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637
      3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146
      4 Using_VecSearch  386.867  401.328  407.5730  420.647 2908.090
    

    This pattern holds until about 10,000 rows, at which point we start to see the gains:

      ## 10,000 Rows
      Unit: microseconds
                   expr      min       lq    median       uq      max
      1      Using_Indx  891.374  921.784  931.6585  956.737 3780.971
      4 Using_VecSearch  796.316  815.965  824.1480  845.151 2531.314
    
      ## 15,000 Rows
      Unit: microseconds
                   expr      min       lq   median       uq      max
      1      Using_Indx  913.963  939.198  954.518  986.609 2900.174
      4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470
    
      ## 30,000 Rows
      Unit: microseconds
                   expr      min       lq   median       uq       max
      1      Using_Indx  964.402  995.883 1018.535 1045.908  5999.390
      4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976  8868.470
    
      ## 100,000 Rows
      Unit: milliseconds
                   expr      min       lq   median       uq       max
      1      Using_Indx 1.142318 1.181023 1.198611 1.268417  3.611945
      4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510
    
      ## 10,000,000 Rows   (only ran 30 reps for this one)
      Unit: milliseconds
                   expr       min        lq    median        uq      max
      1      Using_Indx  33.95004  42.24995  48.90363  50.15424 177.0991
      2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465
    
    0 讨论(0)
提交回复
热议问题