Question
This may look like an innocuously simple problem, but it takes a very long time to execute. Any ideas for speeding it up (vectorization, etc.) would be greatly appreciated.
I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame
A list of indices from that frame: IndexList (55,000 [numIndex] unique indices)
It's a time series, so there are ~5 million rows for the 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. Not all of the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present and construct a new data frame: FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach (i = 1:numIndex, .combine = "rbind") %dopar% {
  OriginalDataFrame[OriginalDataFrame$dataIndex == IndexList[i], ]
}
I run this on a machine with 24 cores and 128GB RAM and yet this takes around 6 hours to complete.
Am I doing something exceedingly silly or are there better ways in R to do this?
Answer 1:
Here's a little benchmark comparing data.table to data.frame. If you know the special data.table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note: the problem size is a little smaller than the original to make it easier to explore.)
library(data.table)
library(microbenchmark)
options(digits = 3)
# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))
# Data table, with index
dt <- data.table(df)
setkey(dt, "id")
ids <- sample(1e5, 1e4)
microbenchmark(
  df[df$id %in% ids, ],      # won't preserve order
  df[match(ids, df$id), ],
  dt[id %in% ids, ],
  dt[match(ids, id), ],
  dt[.(ids)]
)
# Unit: milliseconds
#                      expr   min    lq median    uq   max neval
#      df[df$id %in% ids, ] 13.61 13.99  14.69 17.26 53.81   100
#   df[match(ids, df$id), ] 16.62 17.03  17.36 18.10 21.22   100
#         dt[id %in% ids, ]  7.72  7.99   8.35  9.23 12.18   100
#      dt[match(ids, id), ] 16.44 17.03  17.36 17.77 61.57   100
#                dt[.(ids)]  1.93  2.16   2.27  2.43  5.77   100
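Applied to the question's data, the fastest form above maps roughly onto the following sketch (assuming, as in the question, that OriginalDataFrame has a dataIndex column and that IndexList is a vector of index values; nomatch = 0L drops indices that have no match):
library(data.table)
# Convert once and key on the lookup column (names taken from the question)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Single keyed lookup instead of 55K separate scans
FinalDataFrame <- OriginalDT[.(IndexList), nomatch = 0L]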
I had originally thought you might also be able to do this with rownames, which I thought built up a hash table and did the indexing efficiently. But that's obviously not the case:
df2 <- df
rownames(df2) <- as.character(df$id)
microbenchmark(
  df[df$id %in% ids, ],      # won't preserve order
  df2[as.character(ids), ],
  times = 1
)
# Unit: milliseconds
#                       expr    min     lq median     uq    max neval
#       df[df$id %in% ids, ]   15.3   15.3   15.3   15.3   15.3     1
#   df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8     1
Answer 2:
If you have 5M rows and you use == to identify rows to subset, then for each pass of your loop you are performing 5M comparisons. If you instead key your data (as it inherently is), then you can increase efficiency significantly:
library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Now inside your foreach:
OriginalDT[ .( IndexList[[i]] ) ]
Note that the setkey function uses a very fast implementation of radix sort. However, if your data is already guaranteed to be sorted, @eddi or @arun had posted a nice hack that simply sets the sorted attribute on the DT. (I can't find it right now, but perhaps someone can edit this answer and link to it.)
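For reference, that hack presumably boils down to marking the table as already sorted instead of re-sorting it, along these lines (a sketch; only safe if the data really is sorted by dataIndex):
# Assumes OriginalDT is genuinely sorted by dataIndex already;
# this sets the key attribute without paying for the radix sort
setattr(OriginalDT, "sorted", "dataIndex")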
You might try just collecting all the results into a list of data.tables, then using rbindlist, and compare the speed against using .combine=rbind (if you do, please feel free to post benchmark results). I've never tested .combine=rbindlist, but that might work as well and would be interesting to try.
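A minimal sketch of the list-then-rbindlist pattern, reusing the names from the question and the keyed OriginalDT from above:
library(data.table)
library(foreach)
# foreach without .combine returns a list; bind all the pieces in one call
resultList <- foreach(i = 1:numIndex) %dopar% {
  OriginalDT[.(IndexList[[i]])]
}
FinalDataFrame <- rbindlist(resultList)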
edit:
If the sole task is to index the data.frame, then simply use:
OriginalDT[ .( IndexList ) ]
No foreach necessary, and you still leverage the DT's key.
Answer 3:
Check the data.table package. It works just like data.frame, but faster.
Like this (where df is your data frame):
table <- data.table(df)
and use table instead of df.
Source: https://stackoverflow.com/questions/18101429/speeding-up-searching-for-indices-within-a-large-r-data-frame