R fast single item lookup from list vs data.table vs hash

无人及你 2020-12-14 23:34

One of the problems I often face is needing to look up an arbitrary row from a data.table. I ran into a problem yesterday where I was trying to speed up a loop and using
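
For context, a single keyed row lookup of the kind described looks roughly like this (a toy sketch, not taken from the original post):

    library(data.table)
    
    # Small keyed table; dt["B"] performs a binary-search lookup
    # of one arbitrary row by its key
    dt <- data.table(product_id = c("A", "B", "C"), val = 1:3)
    setkey(dt, product_id)
    dt["B"]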

2 Answers
  •  青春惊慌失措
    2020-12-15 00:01

    The approach you have taken seems very inefficient because you are querying the dataset for a single value many times over.

    It would be much more efficient to query all of them at once and then just loop over the whole batch, instead of running 1e4 queries one by one.

    See dt2 for a vectorized approach. Still, it is hard for me to imagine a use case for that.
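
    To make that concrete, the difference is roughly the following (a sketch reusing names defined in the benchmark code below):

    # One-by-one: pays the data.table subset/join overhead 1e4 times
    for (lookup in lookups) test_lookup_dt[lookup]
    
    # Batched: a single keyed join fetches all 1e4 rows at once;
    # any per-item work can then iterate over the in-memory result
    batch <- test_lookup_dt[lookups]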

    Another thing: 450K rows of data is quite small for a meaningful benchmark; you may get totally different results for 4M rows or more. With the hash approach you would probably also hit memory limits sooner.

    Additionally, Sys.time() may not be the best way to measure timing; read about the gc argument in ?system.time.
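
    For example, system.time() calls gc() before evaluating the expression by default (gcFirst = TRUE), so a garbage collection triggered by earlier allocations is not charged to the code being timed; a plain Sys.time() delta has no such protection. A small illustration (mine, not part of the benchmark):

    junk <- replicate(100, runif(1e5))              # allocate some garbage first
    system.time(sort(runif(1e6)))                   # gcFirst = TRUE is the default
    system.time(sort(runif(1e6)), gcFirst = FALSE)  # timing may absorb a GC pause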

    Here is the benchmark I've made using the system.nanotime() function from the microbenchmarkCore package.

    It is possible to speed up the data.table approach even further by collapsing test_lookup_list into a data.table and performing a merge with test_lookup_dt, but to compare it to the hash solution I would need to preprocess that side as well. A sketch of the idea follows.
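
    One reading of that idea (hypothetical and not benchmarked here; lookup_dt is a name I made up, and lookups comes from the benchmark below):

    # Put the batch of keys into a one-column data.table and answer
    # the whole batch with a single keyed join instead of 1e4 subsets
    lookup_dt <- data.table(product_id = lookups)
    merged <- test_lookup_dt[lookup_dt, on = "product_id"]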

    library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
    library(data.table)
    library(hash)
    
    # Set seed to 42 to ensure repeatability
    set.seed(42)
    
    # Setting up test ------
    
    # Generate product ids
    product_ids = as.vector(
        outer(LETTERS[seq(1, 26, 1)],
              outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep=""),
                    LETTERS[seq(1, 26, 1)], paste, sep = ""
              ), paste, sep = ""
        )
    )
    
    # Create test lookup data
    test_lookup_list = lapply(product_ids, function(id) list(
        product_id = id,
        val_1 = rnorm(1),
        val_2 = rnorm(1),
        val_3 = rnorm(1),
        val_4 = rnorm(1),
        val_5 = rnorm(1),
        val_6 = rnorm(1),
        val_7 = rnorm(1),
        val_8 = rnorm(1)
    ))
    
    # Set names of items in list
    names(test_lookup_list) = sapply(test_lookup_list, `[[`, "product_id")
    
    # Create lookup hash
    lookup_hash = hash(names(test_lookup_list), test_lookup_list)
    
    # Create data.table from list and set key of data.table to product_id field
    test_lookup_dt <- rbindlist(test_lookup_list)
    setkey(test_lookup_dt, product_id)
    
    # Generate sample of keys to be used for speed testing
    lookup_tests = lapply(1:10, function(x) sample(test_lookup_dt$product_id, 1e4))
    

    # base R list: lookup by name, one element at a time
    native = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_list[[lookup]]))
    # keyed data.table: binary-search subset, one row at a time
    dt1 = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_dt[lookup]))
    # hash: environment-backed lookup, one value at a time
    hash = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) lookup_hash[[lookup]]))
    # vectorized data.table: one batched join for all 1e4 keys, then a row-wise split via .SD
    dt2 = lapply(lookup_tests, function(lookups) system.nanotime(test_lookup_dt[lookups][, .SD, 1:length(product_id)]))
    
    summary(sapply(native, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #  27.65   28.15   28.47   28.97   28.78   33.45
    summary(sapply(dt1, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #  15.30   15.73   15.96   15.96   16.29   16.52
    summary(sapply(hash, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    # 0.1209  0.1216  0.1221  0.1240  0.1225  0.1426 
    summary(sapply(dt2, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    # 0.02421 0.02438 0.02445 0.02476 0.02456 0.02779
    
