R fast single item lookup from list vs data.table vs hash

无人及你 2020-12-14 23:34

One of the problems I often face is needing to look up an arbitrary row from a data.table. I ran into a problem yesterday where I was trying to speed up a loop and using

2 answers
  •  青春惊慌失措
    2020-12-15 00:27
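
    The objects used below (test_lookup_list, test_lookup_dt, lookup_hash and lookup_tests) come from the question's setup, which isn't reproduced here. A minimal sketch of what such a setup might look like, with hypothetical sizes and column names:

    library(data.table)
    library(hash)
    
    # hypothetical data: a keyed data.table, a named list and a hash, all keyed by product_id
    n   <- 1e5
    ids <- sprintf("P%06d", seq_len(n))
    
    test_lookup_dt <- data.table(product_id = ids, value = rnorm(n))
    setkey(test_lookup_dt, product_id)
    
    test_lookup_list <- setNames(as.list(test_lookup_dt$value), ids)
    lookup_hash      <- hash(keys = ids, values = test_lookup_dt$value)
    
    # batches of random ids to look up
    lookup_tests <- list(sample(ids, 100), sample(ids, 100))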

    For a non-vectorized access pattern, you might want to try the built-in environment objects:

    require(microbenchmark)
    
    # convert the question's lookup list into a hashed environment
    test_lookup_env <- list2env(test_lookup_list)
    
    x <- lookup_tests[[1]][1]
    microbenchmark(
        lookup_hash[[x]],
        test_lookup_list[[x]],
        test_lookup_dt[x],
        test_lookup_env[[x]]
    )
    

    Here you can see it's even zippier than hash:

    Unit: microseconds
                      expr      min        lq       mean    median        uq      max neval
          lookup_hash[[x]]   10.767   12.9070   22.67245   23.2915   26.1710   68.654   100
     test_lookup_list[[x]]  847.700  853.2545  887.55680  863.0060  893.8925 1369.395   100
         test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273   100
      test_lookup_env[[x]]    1.588    1.9450    4.61595    2.5255    6.6430   27.977   100
    

    EDIT:

    Stepping through data.table:::`[.data.table` shows why you are seeing the data.table lookup slow down. When you index with a character and a key is set, it does quite a bit of bookkeeping and then drops down into bmerge, which performs a binary search. Binary search is O(log n) and gets slower as n increases.

    Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.
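
    As a quick illustrative check of that claim (object names here are made up), compare a lookup in a small and a large hashed environment:

    # illustrative only: lookup cost in a hashed environment stays roughly flat as n grows
    make_env <- function(n) {
        keys <- sprintf("k%d", seq_len(n))
        list2env(setNames(as.list(seq_len(n)), keys), envir = new.env(hash = TRUE))
    }
    small_env <- make_env(1e3)
    big_env   <- make_env(1e6)
    
    microbenchmark::microbenchmark(
        small_env[["k500"]],
        big_env[["k500000"]]
    )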

    To work around this, you can manually build a map and index through it:

    x <- lookup_tests[[2]][2]
    
    # map each product_id to its row number, stored in a hashed environment
    e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))
    
    # example access:
    test_lookup_dt[e[[x]], ]
    

    However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:

    # plain data.frame copy of the table, with product_id as rownames
    test_lookup_df <- as.data.frame(test_lookup_dt)
    rownames(test_lookup_df) <- test_lookup_df$product_id
    

    If we are really paranoid, we could skip the [ methods altogether and lapply over the columns directly.
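
    That looks something like the following sketch (row_as_list and row_as_df are just illustrative names); the result is a plain list of the row's values rather than a data.frame, which is part of why it is so cheap:

    # pull row e[[x]] out of each column directly, skipping `[.data.frame` dispatch
    row_as_list <- lapply(test_lookup_df, `[`, e[[x]])
    
    # rebuild a one-row data.frame only if you actually need one
    row_as_df <- as.data.frame(row_as_list, stringsAsFactors = FALSE)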

    Here are some more timings (from a different machine than above):

    > microbenchmark(
    +   test_lookup_dt[x,],
    +   test_lookup_dt[x],
    +   test_lookup_dt[e[[x]],],
    +   test_lookup_df[x,],
    +   test_lookup_df[e[[x]],],
    +   lapply(test_lookup_df, `[`, e[[x]]),
    +   lapply(test_lookup_dt, `[`, e[[x]]),
    +   lookup_hash[[x]]
    + )
    Unit: microseconds
                                    expr       min         lq        mean     median         uq       max neval
                     test_lookup_dt[x, ]  1658.585  1688.9495  1992.57340  1758.4085  2466.7120  2895.592   100
                       test_lookup_dt[x]  1652.181  1695.1660  2019.12934  1764.8710  2487.9910  2934.832   100
                test_lookup_dt[e[[x]], ]  1040.869  1123.0320  1356.49050  1280.6670  1390.1075  2247.503   100
                     test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080   100
                test_lookup_df[e[[x]], ]   128.749   151.0940   190.74834   174.1320   218.6080   366.122   100
     lapply(test_lookup_df, `[`, e[[x]])    18.913    25.0925    44.53464    35.2175    53.6835   146.944   100
     lapply(test_lookup_dt, `[`, e[[x]])    37.483    50.4990    94.87546    81.2200   124.1325   241.637   100
                        lookup_hash[[x]]     6.534    15.3085    39.88912    49.8245    55.5680   145.552   100
    

    Overall, to answer your questions, you are not using data.table "wrong" but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and get most of the performance back.
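
    For instance, one way to package that workaround (make_row_lookup is a made-up helper, not part of data.table) is to build the row-number map once and close over it:

    # build the product_id -> row number map once, then reuse it for every single-row lookup
    make_row_lookup <- function(dt, key_col = "product_id") {
        idx <- list2env(setNames(as.list(seq_len(nrow(dt))), dt[[key_col]]))
        function(id) dt[idx[[id]], ]
    }
    
    lookup_row <- make_row_lookup(test_lookup_dt)
    lookup_row(x)   # one-row data.table, without the per-call `[.data.table` bookkeeping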
