I am working on a large dataframe in R of 2,3 Million records that contain transactions of users at locations with starting and stop times. My goal is to create a new datafr
(I think posting this as an answer could benefit future readers who have an interest in efficient coding.)
R is a vectorized language, so row-wise operations are among the most costly ones, especially if you are evaluating lots of functions, dispatching methods, converting classes and creating a new data set while you're at it.
Hence, the first step is to reduce the "by" operations. By looking at your code, it seems that you are enlarging the size of your data set according to userID
, start
and end
- all the rest of the operations could come afterwards (and hence be vectorized). Also, running seq
(which isn't a very efficient function by itself) twice by row adds nothing. Lastly, explicitly calling seq.POSIXt
on a POSIXt
class will save you the overhead of method dispatching.
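To illustrate that last point, here's a minimal sketch (the start_demo and end_demo timestamps are made up for the example) showing that going through the seq generic and calling seq.POSIXt directly produce the same hourly sequence; the direct call just skips the dispatch step.

```r
# Hypothetical timestamps, only for illustration
start_demo <- as.POSIXct("2015-01-01 10:00:00", tz = "UTC")
end_demo   <- as.POSIXct("2015-01-01 13:00:00", tz = "UTC")

# Via the generic: R first has to work out which seq method applies
via_generic <- seq(start_demo, end_demo, by = "hour")

# Calling the method directly; a numeric `by` is interpreted as seconds
via_method <- seq.POSIXt(start_demo, end_demo, by = 3600)

# Both approaches should give the same time points
print(all(via_generic == via_method))
```

The saving per call is tiny, but it adds up when the call is repeated once per group over millions of rows.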
I'm not sure how to do this efficiently with dplyr
, because mutate
can't handle it and the do
function (IIRC) has always proved itself to be highly inefficient. Hence, let's try the data.table
package, which can handle this task easily:
library(data.table)
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]
Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.
Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.
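If you want to see what the expansion step produces before running it on millions of rows, here's a small sketch on made-up data. The toy data frame is hypothetical; only the column names userID, start and end are taken from the code above.

```r
library(data.table)

# Hypothetical toy data standing in for df.Sessions
toy <- data.frame(
  userID = c(1L, 2L),
  start  = as.POSIXct(c("2015-01-01 10:00:00", "2015-01-01 22:00:00"), tz = "UTC"),
  end    = as.POSIXct(c("2015-01-01 12:00:00", "2015-01-02 01:00:00"), tz = "UTC")
)

# One seq.POSIXt call per (userID, start, end) group;
# the unnamed result of j ends up in a column called V1
toy_res <- setDT(toy)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]
print(toy_res)
```

Each session becomes one row per hour between start and end (both included), and V1 holds those hourly time stamps.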
Though, vectorizing isn't the end of the story. We also need to take into consideration class conversions, method dispatching, etc. For instance, we can create both the hourlydate
and hournr
using either different Date
class functions or using format
or maybe even substr
. The trade-off that needs to be taken into account is that, for instance, substr
might be the fastest, but the result will be a character
vector rather than a Date
one - it's up to you to decide whether you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr
variable:
library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")
microbenchmark("format" = format(test, "%H"),
               "substr" = substr(test, 12L, 13L),
               "data.table::hour" = hour(test))
# Unit: microseconds
#              expr        min         lq        mean    median        uq       max neval cld
#            format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88   100   b
#            substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27   100   c
#  data.table::hour      5.121      7.681     23.9746     27.84     33.44     55.36   100   a
data.table::hour
is the clear winner on both speed and quality (the results are in an integer vector rather than a character one), while improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).
Now let's try 3 different ways of calculating hourlydate:
microbenchmark("as.Date" = as.Date(test),
               "substr" = substr(test, 1L, 10L),
               "data.table::as.IDate" = as.IDate(test))
# Unit: milliseconds
#                  expr       min        lq      mean    median        uq       max neval cld
#               as.Date  19.56285  20.09563  23.77035  20.63049  21.16888  50.04565   100   a
#                substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895   100   b
#  data.table::as.IDate  19.91964  20.44250  27.50989  21.34551  31.79939 145.65133   100   a
The first and third options are pretty much the same speed-wise, but I prefer as.IDate
because of the integer
storage mode.
Now that we know where both efficiency and quality lie, we can simply finish the task by running
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
(You can then easily remove the unnecessary columns using similar syntax, res[, yourcolname := NULL]
which I'll leave to you)
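For completeness, here's a rough sketch of that last step on a tiny made-up table. res_demo and its values are hypothetical stand-ins for res; I'm only assuming the V1 column that data.table creates for an unnamed result.

```r
library(data.table)

# Hypothetical stand-in for `res`, with two hourly time stamps around midnight
res_demo <- data.table(
  userID = 1L,
  V1 = as.POSIXct(c("2015-01-01 23:00:00", "2015-01-02 00:00:00"), tz = "UTC")
)

# Add both derived columns by reference in a single vectorized call
res_demo[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]

# Drop the helper column by reference once it is no longer needed
res_demo[, V1 := NULL]
print(res_demo)
```

Both := calls modify the table in place, so no copy of the (potentially huge) data set is made.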
There are probably more efficient ways of solving this, but this demonstrates one possible way to make your code more efficient.
As a side note, if you want to investigate data.table
syntax/features further, here's a good read:
https://github.com/Rdatatable/data.table/wiki/Getting-started