Why are my functions on lubridate dates so slow?

问题

I wrote this function which I use all the time:

# Give the previous day, or Friday if the previous day is Saturday or Sunday.
previous_business_date_if_weekend = function(my_date) {
    if (length(my_date) == 1) {
        if (weekdays(my_date) == "Sunday") { my_date = lubridate::as_date(my_date) - 2 }
        if (weekdays(my_date) == "Saturday") { my_date = lubridate::as_date(my_date) - 1 }
        return(lubridate::as_date(my_date))
    } else if (length(my_date) > 1) {
        my_date = lubridate::as_date(sapply(my_date, previous_business_date_if_weekend))
        return(my_date)
    }
}

Problems arise when I apply it to a date column of a dataframe with thousands of rows. It's ridiculously slow. Any thoughts as to why?

回答1:

You're looping over every single row. It's not surprising it is slow. You could essentially do one replacement operation instead where you take a fixed difference from each date: 0 for M-F, -1 for Sat and -2 for Sun.

# 'big' sample data
x <- Sys.Date() + 0:100000

bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)

# since `weekdays()` is locale-specific, you could also be defensive and do:
bizdays <- function(x) x - match(format(x, "%w"), c("6","0"), nomatch=0)

system.time(bizdays(x))
#   user  system elapsed 
#   0.36    0.00    0.35 

system.time(previous_business_date_if_weekend(x))
#   user  system elapsed 
#  45.45    0.00   45.57 

identical(bizdays(x), previous_business_date_if_weekend(x))
#[1] TRUE

回答2:

OP's question Why are my functions on lubridate dates so slow? and some generalizing statements like Lubridate is just kind of slow in my experience suggest that a particular package might be the cause for low performance.

I want to verify this with some benchmarks.

Penalty of using the double colon operator `::`

Frank mentioned in his comment that there is a penalty in using the double colon operator :: to access exported variables or functions in a namespace.

# creating data
n <- 10^1L
fmt <- "%F"
chr_dates <- format(Sys.Date() + seq_len(n), "%F")

# loading lubridate into namespace
library(lubridate) 
microbenchmark::microbenchmark(
  base1 = r1 <- as.Date(chr_dates),
  base2 = r2 <- base::as.Date(chr_dates),
  lubr1 = r3 <- as_date(chr_dates),
  lubr2 = r4 <- lubridate::as_date(chr_dates),
  times = 100L
)

Unit: microseconds
  expr     min       lq      mean  median       uq     max neval cld
 base1  87.977  89.1100  92.03587  89.865  90.9980 128.756   100 a  
 base2  94.018  95.7175 100.64848  97.039  99.3045 179.351   100  b 
 lubr1  92.508  94.2070  98.21307  95.151  97.7940 175.954   100  b 
 lubr2 101.569 103.0800 109.98974 104.024 107.9885 258.643   100   c

The penalty for using the double colon operator :: is about 10 microseconds.

This only matters if a function is called repeatedly (as it happens in OP's code using sapply()). IMHO, the pain of debugging namespace conflicts or maintaining code where the origin of functions is unclear is much higher. Your mileage may vary, of course.

The timings can be verified for n = 100,

Unit: microseconds
  expr     min       lq     mean   median       uq      max neval cld
 base1 556.933 561.0855 580.3382 562.9730 590.7250  812.176   100   a
 base2 564.483 568.2600 588.5695 570.9030 596.2010  989.262   100   a
 lubr1 562.596 565.9935 587.4443 568.4480 594.8790 1039.480   100   a
 lubr2 572.036 575.9995 597.1557 578.4545 601.1085 1230.159   100   a

Converting character dates to class Date

There is a number of packages which deal with the conversion of character dates given in different formats to class Date or POSIXct. Some of them aim at performance, others at convenience.

Here, base, lubridate, anytime, fasttime, and data.table (because it was mentioned in one of the answers) are compared.

Input are character dates in the standard unambiguous format YYYY-MM-DD. Time zones are ignored.

fasttime accepts only dates between 1970 and 2199, so the creation of sample data had to be modified in order to create a sample data set of 100 K dates.

n <- 10^5L
fmt <- "%F"
set.seed(123L)
chr_dates <- format(
  sample(
    seq(as.Date("1970-01-01"), as.Date("2199-12-31"), by = 1L), 
    n, replace = TRUE),
  "%F")

Because Frank had suspected that guessing formats could add a penalty, the functions are called with and without given format where possible. All functions are called using the double colon operator ::.

microbenchmark::microbenchmark(
  base_ = r1 <- base::as.Date(chr_dates),
  basef = r1 <- base::as.Date(chr_dates, fmt),
  lub1_ = r2 <- lubridate::as_date(chr_dates),
  lub1f = r2 <- lubridate::as_date(chr_dates, fmt),
  lub2_ = r3 <- lubridate::ymd(chr_dates),
  anyt_ = r4 <- anytime::anydate(chr_dates),
  idat_ = r5 <- data.table::as.IDate(chr_dates),
  idatf = r5 <- data.table::as.IDate(chr_dates, fmt),
  fast_ = r6 <- fasttime::fastPOSIXct(chr_dates),
  fastd = r6 <- as.Date(fasttime::fastPOSIXct(chr_dates)),
  times = 5L
)
# check results
all.equal(r1, r2)
all.equal(r1, r3)
all.equal(r1, c(r4)) # remove tzone attribute
all.equal(r1, as.Date(r5)) # convert IDate to Date
all.equal(r1, as.Date(r6)) # convert POSIXct to Date

Unit: milliseconds
  expr        min         lq       mean     median         uq        max neval  cld
 base_ 641.799082 645.008517 648.128466 648.791875 649.149444 655.893411     5    d
 basef  69.377419  69.937371  73.888828  71.403139  76.022083  82.704127     5  b  
 lub1_ 644.199361 645.217696 680.542327 649.855896 652.887492 810.551189     5    d
 lub1f  69.769726  69.947943  70.944605  70.795234  71.365759  72.844364     5  b  
 lub2_  18.672495  27.025711  26.990218  28.180730  29.944409  31.127747     5 ab  
 anyt_ 381.870316 384.513758 386.211134 384.992152 385.159043 394.520400     5   c 
 idat_ 643.386808 644.312259 649.385356 648.204359 651.666396 659.356958     5    d
 idatf  69.844109  71.188673  75.319481  77.142365  78.156923  80.265334     5  b  
 fast_   4.994637   5.363533   5.748137   5.601031   5.760370   7.021112     5 a   
 fastd   5.230625   6.296157   6.686500   6.345998   6.538941   9.020780     5 a

The timings show that

Frank's suspicion is correct. Guessing formats is costly. Passing the format as parameter to as.Date(), as_date(), and as.IDate() is ten times faster than calling without.
fasttime::fastPOSIXct() is the fastest, indeed. Even with the additional conversion from POSIXct to Date it is four times faster than the second fastest lubridate::ymd().

回答3:

Lubridate is just kind of slow in my experience. I suggest working with data.table and iDate.

Something like this should be pretty robust:

library(data.table)

#Make data.table of dates in string format
x = data.table(date = format(Sys.Date() + 0:100000,format='%d/%m/%Y'))

#Convert to IDate (by reference)
set(x, j = "date", value = as.IDate(strptime(x[,date], "%d/%m/%Y")))

#Day zero was a Thursday
originDate = as.IDate(strptime("01/01/1970", "%d/%m/%Y"))
as.integer(originDate)
#[1] 0
weekdays(originDate)
#[1] "Thursday"

previous_business_date_if_weekend_dt = function(x) {

  #Adjust dates so that Sat is 1, Sun is 2, and subtract by reference
  x[,adjustedDate := date]
  x[(as.integer(x[,date]-2) %% 7 + 1)<=2, adjustedDate := adjustedDate - (as.integer(date-2) %% 7 + 1)]

}

bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)

system.time(bizdays(y))
# user  system elapsed 
# 0.22    0.00    0.22 

system.time(previous_business_date_if_weekend_dt(x))
# user  system elapsed 
# 0       0       0

Also note that the part that takes the most time in this solution is probably pulling the dates from a string, you could reformat them to an integer format if you're concerned about that.

回答4:

Just to add another possibility: A pure R implementation is in the datetimetutils package (of which I am the author). The function previous_businessday converts to POSIXlt in order to extract the weekday. (The code compares the function's results with the function bizdays suggested by thelatemail.)

library("datetimeutils")

x <- Sys.Date() + 0:100000

system.time(bizdays(x))
## user  system elapsed 
## 0.25    0.00    0.25 

system.time(previous_businessday(x, shift = 0))
## user  system elapsed 
## 0.03    0.00    0.03 

identical(bizdays(x), previous_businessday(x, shift = 0))
## TRUE

A slightly-simplified version of previous_businessday would look as follows; it assumes that x is of class Date.

previous_bd <- function(x) {
    tmp <- as.POSIXlt(x)
    tmpi <- tmp$wday == 6L
    x[tmpi] <- x[tmpi] - 1L
    tmpi <- tmp$wday == 0L
    x[tmpi] <- x[tmpi] - 2L
    x
}

system.time(previous_bd(x))
## user  system elapsed 
## 0.03    0.00    0.03 


identical(bizdays(x), previous_bd(x))
## TRUE

来源：https://stackoverflow.com/questions/46046714/why-are-my-functions-on-lubridate-dates-so-slow

标签

date

lubridate