R fast single item lookup from list vs data.table vs hash

无人及你 2020-12-14 23:34

One of the problems I often face is needing to look up an arbitrary row from a data.table. I ran into a problem yesterday where I was trying to speed up a loop and using
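
For context, a single keyed row lookup of the kind described looks roughly like this (a toy sketch, not taken from the original post):

    library(data.table)
    
    # Small keyed table; dt["B"] performs a binary-search lookup
    # of one arbitrary row by its key
    dt <- data.table(product_id = c("A", "B", "C"), val = 1:3)
    setkey(dt, product_id)
    dt["B"]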

2 Answers
  •  青春惊慌失措
    2020-12-15 00:01

    The approach you have taken seems very inefficient because you are querying the dataset for a single value many times over.

    It would be much more efficient to query all of them at once and then just loop over the whole batch, instead of running 1e4 queries one by one.

    See dt2 for a vectorized approach. Still, it is hard for me to imagine a use case for that.
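
    To make that concrete, the difference is roughly the following (a sketch reusing names defined in the benchmark code below):

    # One-by-one: pays the data.table subset/join overhead 1e4 times
    for (lookup in lookups) test_lookup_dt[lookup]
    
    # Batched: a single keyed join fetches all 1e4 rows at once;
    # any per-item work can then iterate over the in-memory result
    batch <- test_lookup_dt[lookups]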

    Another thing: 450K rows of data is quite small for a meaningful benchmark; you may get totally different results for 4M rows or more. With the hash approach you would probably also hit memory limits sooner.

    Additionally, Sys.time() may not be the best way to measure timing; read about the gc argument in ?system.time.
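
    For example, system.time() calls gc() before evaluating the expression by default (gcFirst = TRUE), so a garbage collection triggered by earlier allocations is not charged to the code being timed; a plain Sys.time() delta has no such protection. A small illustration (mine, not part of the benchmark):

    junk <- replicate(100, runif(1e5))              # allocate some garbage first
    system.time(sort(runif(1e6)))                   # gcFirst = TRUE is the default
    system.time(sort(runif(1e6)), gcFirst = FALSE)  # timing may absorb a GC pause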

    Here is the benchmark I've made using the system.nanotime() function from the microbenchmarkCore package.

    It is possible to speed up the data.table approach even further by collapsing test_lookup_list into a data.table and performing a merge with test_lookup_dt, but to compare it to the hash solution I would need to preprocess that side as well. A sketch of the idea follows.
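
    One reading of that idea (hypothetical and not benchmarked here; lookup_dt is a name I made up, and lookups comes from the benchmark below):

    # Put the batch of keys into a one-column data.table and answer
    # the whole batch with a single keyed join instead of 1e4 subsets
    lookup_dt <- data.table(product_id = lookups)
    merged <- test_lookup_dt[lookup_dt, on = "product_id"]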

    library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
    library(data.table)
    library(hash)
    
    # Set seed to 42 to ensure repeatability
    set.seed(42)
    
    # Setting up test ------
    
    # Generate product ids
    product_ids = as.vector(
        outer(LETTERS[seq(1, 26, 1)],
              outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep=""),
                    LETTERS[seq(1, 26, 1)], paste, sep = ""
              ), paste, sep = ""
        )
    )
    
    # Create test lookup data
    test_lookup_list = lapply(product_ids, function(id) list(
        product_id = id,
        val_1 = rnorm(1),
        val_2 = rnorm(1),
        val_3 = rnorm(1),
        val_4 = rnorm(1),
        val_5 = rnorm(1),
        val_6 = rnorm(1),
        val_7 = rnorm(1),
        val_8 = rnorm(1)
    ))
    
    # Set names of items in list
    names(test_lookup_list) = sapply(test_lookup_list, `[[`, "product_id")
    
    # Create lookup hash
    lookup_hash = hash(names(test_lookup_list), test_lookup_list)
    
    # Create data.table from list and set key of data.table to product_id field
    test_lookup_dt <- rbindlist(test_lookup_list)
    setkey(test_lookup_dt, product_id)
    
    # Generate sample of keys to be used for speed testing
    lookup_tests = lapply(1:10, function(x) sample(test_lookup_dt$product_id, 1e4))
    

    # base R list: lookup by name, one element at a time
    native = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_list[[lookup]]))
    # keyed data.table: binary-search subset, one row at a time
    dt1 = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_dt[lookup]))
    # hash: environment-backed lookup, one value at a time
    hash = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) lookup_hash[[lookup]]))
    # vectorized data.table: one batched join for all 1e4 keys, then a row-wise split via .SD
    dt2 = lapply(lookup_tests, function(lookups) system.nanotime(test_lookup_dt[lookups][, .SD, 1:length(product_id)]))
    
    summary(sapply(native, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #  27.65   28.15   28.47   28.97   28.78   33.45
    summary(sapply(dt1, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #  15.30   15.73   15.96   15.96   16.29   16.52
    summary(sapply(hash, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    # 0.1209  0.1216  0.1221  0.1240  0.1225  0.1426 
    summary(sapply(dt2, `[[`, 3L))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    # 0.02421 0.02438 0.02445 0.02476 0.02456 0.02779
    
