Question
This may look like an innocuously simple problem, but it takes a very long time to execute. Any ideas for speeding it up (vectorization, etc.) would be greatly appreciated.
I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame
A list of indices from that frame: IndexList (55,000 [numIndex] unique indices)
It's a time series, so there are ~5 million rows for the 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. Not all of the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present and construct a new data frame: FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach (i = 1:numIndex, .combine = "rbind") %dopar% {
  OriginalDataFrame[OriginalDataFrame$dataIndex == IndexList[i], ]
}
I run this on a machine with 24 cores and 128GB RAM and yet this takes around 6 hours to complete.
Am I doing something exceedingly silly or are there better ways in R to do this?
Answer 1:
Here's a little benchmark comparing data.table to data.frame. If you know the special data.table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note: the problem size is a little smaller than the original to make it easier to explore.)
library(data.table)
library(microbenchmark)
options(digits = 3)
# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))
# Data table, with index
dt <- data.table(df)
setkey(dt, "id")
ids <- sample(1e5, 1e4)
microbenchmark(
  df[df$id %in% ids, ],      # won't preserve order
  df[match(ids, df$id), ],
  dt[id %in% ids, ],
  dt[match(ids, id), ],
  dt[.(ids)]
)
# Unit: milliseconds
#                      expr   min    lq median    uq   max neval
#      df[df$id %in% ids, ] 13.61 13.99  14.69 17.26 53.81   100
#   df[match(ids, df$id), ] 16.62 17.03  17.36 18.10 21.22   100
#         dt[id %in% ids, ]  7.72  7.99   8.35  9.23 12.18   100
#      dt[match(ids, id), ] 16.44 17.03  17.36 17.77 61.57   100
#                dt[.(ids)]  1.93  2.16   2.27  2.43  5.77   100
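Applied to the question's data, the fastest form above maps roughly onto the following sketch (assuming, as in the question, that OriginalDataFrame has a dataIndex column and that IndexList is a vector of index values; nomatch = 0L drops indices that have no match):
library(data.table)
# Convert once and key on the lookup column (names taken from the question)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Single keyed lookup instead of 55K separate scans
FinalDataFrame <- OriginalDT[.(IndexList), nomatch = 0L]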
I had originally thought you might also be able to do this with rownames, which I thought built up a hash table and did the indexing efficiently. But that's obviously not the case:
df2 <- df
rownames(df2) <- as.character(df$id)
microbenchmark(
  df[df$id %in% ids, ],      # won't preserve order
  df2[as.character(ids), ],
  times = 1
)
# Unit: milliseconds
#                       expr    min     lq median     uq    max neval
#       df[df$id %in% ids, ]   15.3   15.3   15.3   15.3   15.3     1
#   df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8     1
Answer 2:
If you have 5M rows and you use == to identify rows to subset, then for each pass of your loop you are performing 5M comparisons. If you instead key your data (as it inherently is), then you can increase efficiency significantly:
library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Now inside your foreach:
OriginalDT[ .( IndexList[[i]] ) ]
Note that the setkey function uses a very fast implementation of radix sort. However, if your data is already guaranteed to be sorted, @eddi or @arun had posted a nice hack that simply sets the sorted attribute on the DT. (I can't find it right now, but perhaps someone can edit this answer and link to it.)
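For reference, that hack presumably boils down to marking the table as already sorted instead of re-sorting it, along these lines (a sketch; only safe if the data really is sorted by dataIndex):
# Assumes OriginalDT is genuinely sorted by dataIndex already;
# this sets the key attribute without paying for the radix sort
setattr(OriginalDT, "sorted", "dataIndex")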
You might try just collecting all the results into a list of data.tables, then using rbindlist, and compare the speed against using .combine=rbind (if you do, please feel free to post benchmark results). I've never tested .combine=rbindlist, but that might work as well and would be interesting to try.
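A minimal sketch of the list-then-rbindlist pattern, reusing the names from the question and the keyed OriginalDT from above:
library(data.table)
library(foreach)
# foreach without .combine returns a list; bind all the pieces in one call
resultList <- foreach(i = 1:numIndex) %dopar% {
  OriginalDT[.(IndexList[[i]])]
}
FinalDataFrame <- rbindlist(resultList)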
edit:
If the sole task is to index the data.frame, then simply use:
OriginalDT[ .( IndexList ) ]
No foreach necessary, and you still leverage the DT's key.
Answer 3:
Check the data.table package. It works just like data.frame, but faster.
Like this (where df is your data frame):
table <- data.table(df)
and use table instead of df.
Source: https://stackoverflow.com/questions/18101429/speeding-up-searching-for-indices-within-a-large-r-data-frame