Why is as.Date slow on a character vector?

囚心锁ツ 2020-11-27 16:36

I started using the data.table package in R to boost the performance of my code. I am using the following code:

sp500 <- read.csv('../rawdata/GMTSP.csv')
days &         


        
5 answers
  •  一整个雨季
    2020-11-27 17:25

    I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

    To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

    > as.Date
    function (x, ...) 
    UseMethod("as.Date")
    
    
    
    > methods(as.Date)
    [1] as.Date.character as.Date.date      as.Date.dates     as.Date.default  
    [5] as.Date.factor    as.Date.IDate*    as.Date.numeric   as.Date.POSIXct  
    [9] as.Date.POSIXlt  
       Non-visible functions are asterisked
    
    > as.Date.character
    function (x, format = "", ...) 
    {
        charToDate <- function(x) {
            xx <- x[1L]
            if (is.na(xx)) {
                j <- 1L
                while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
                if (is.na(xx)) 
                    f <- "%Y-%m-%d"
            }
            if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d", 
                tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d", 
                tz = "GMT"))) 
                return(strptime(x, f))
            stop("character string is not in a standard unambiguous format")
        }
        res <- if (missing(format)) 
            charToDate(x)
        else strptime(x, format, tz = "GMT")       ####  slow part, I think  ####
        as.Date(res)
    }
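
    A minimal sketch of that equivalence, on a made-up three-element vector (the values are illustrative, not from the question): calling as.Date with a format should give the same result as calling strptime with tz = "GMT" and then converting the resulting POSIXlt.

    ```r
    # Illustrative input; the NA checks that missing values pass through.
    x <- c("11/27/2020", "01/02/2012", NA)

    # Route 1: let as.Date dispatch to as.Date.character.
    via_as_date <- as.Date(x, format = "%m/%d/%Y")

    # Route 2: do by hand what the method body printed above does.
    lt <- strptime(x, "%m/%d/%Y", tz = "GMT")   # intermediate POSIXlt object
    via_strptime <- as.Date(lt)

    inherits(lt, "POSIXlt")                      # the intermediate really is POSIXlt
    isTRUE(all.equal(via_as_date, via_strptime)) # both routes agree
    ```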
    
    
    

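    Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through. (The listing prints the as.POSIXct generic and its methods; the method that matters here is as.POSIXlt.Date.)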

    > as.POSIXct
    function (x, tz = "", ...) 
    UseMethod("as.POSIXct")
    
    
    
    > methods(as.POSIXct)
    [1] as.POSIXct.date    as.POSIXct.Date    as.POSIXct.dates   as.POSIXct.default
    [5] as.POSIXct.IDate*  as.POSIXct.ITime*  as.POSIXct.numeric as.POSIXct.POSIXlt
       Non-visible functions are asterisked
    
    > as.POSIXlt.Date
    function (x, ...) 
    {
        y <- .Internal(Date2POSIXlt(x))
        names(y$year) <- names(x)
        y
    }
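
    For example, assuming a small Date vector d (made up for illustration), the year comes straight out of the POSIXlt year field with no string parsing involved:

    ```r
    # Two illustrative dates; any Date vector would do.
    d <- as.Date(c("2000-01-01", "2012-06-15"))

    lt <- as.POSIXlt(d)        # dispatches to as.POSIXlt.Date

    years  <- lt$year + 1900   # year is stored as years since 1900
    months <- lt$mon + 1       # mon is zero-based (0 = January)

    years
    months
    ```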
    
    
    

    Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.

    ~/R/Rtrunk/src/main$ grep Date2POSIXlt *
    names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
    $
    

    Now we know we need to look for D2POSIXlt:

    ~/R/Rtrunk/src/main$ grep D2POSIXlt *
    datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
    names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
    $
    

    Oh, we could have guessed datetime.c. Anyway, look at the latest live copy of datetime.c.

    Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see that POSIXlt is one real vector (8 bytes, sec) plus eight integer vectors (4 bytes each: min, hour, mday, mon, year, wday, yday, isdst). That's 40 bytes per date!
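
    The per-date storage can be checked from R itself. This is a rough sketch that only counts the nine classic date-time fields by storage type; the field list in core is written out by hand, and newer R versions may carry extra components (such as a zone name) that are ignored here.

    ```r
    lt <- as.POSIXlt(as.Date("2012-01-01"))

    # The nine classic POSIXlt fields, listed by hand.
    core <- c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst")

    # Storage type of each field: "double" for sec, "integer" for the rest.
    types <- vapply(unclass(lt)[core], typeof, character(1))

    # 8 bytes per double element, 4 bytes per integer element.
    bytes_per_date <- sum(ifelse(types == "double", 8L, 4L))

    table(types)
    bytes_per_date
    ```

    A plain Date, by comparison, is a single double: 8 bytes per date.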

    So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.
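
    One possible workaround, which is not from the discussion above but a common trick when the input has many repeated values: parse each distinct string once and map the results back with match. The helper name fast_as_date is hypothetical, and this still calls strptime internally, only on far fewer elements.

    ```r
    # Hypothetical helper: convert only the unique strings, then expand.
    fast_as_date <- function(x, format) {
      u <- unique(x)                        # distinct strings (NA included, if any)
      parsed <- as.Date(u, format = format) # strptime runs once per distinct value
      parsed[match(x, u)]                   # map back to the original positions
    }

    # Small illustrative input with repeats and a missing value.
    x <- c("01/02/2012", "11/27/2020", "01/02/2012", NA, "11/27/2020")
    res <- fast_as_date(x, "%m/%d/%Y")
    res
    ```

    The gain depends entirely on how repetitive the data is: with daily dates spanning a dozen years there are only a few thousand distinct strings, whatever the length of the vector.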


    Here's a reproducible example using the number of items stated in the question (3,000,000):

    > Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
    > Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
    > system.time(as.Date(Date, "%m/%d/%Y"))
       user  system elapsed 
     21.681   0.060  21.760 
    > system.time(strptime(Date, "%m/%d/%Y"))
       user  system elapsed 
     29.594   8.633  38.270 
    > system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
       user  system elapsed 
     19.785   0.000  19.802 
    

    Passing tz appears to speed up strptime, and as.Date.character already passes tz = "GMT". So the difference may depend on your time zone or locale settings. Either way, strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
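
    To compare machines, it may help to print the settings that could matter alongside the timings. The sketch below is a scaled-down variant of the example (30,000 items rather than 3,000,000, with a fixed seed, both of which are choices made here), so it finishes quickly; the timings will differ from machine to machine.

    ```r
    # Settings that may influence strptime on a given machine.
    Sys.timezone()
    Sys.getlocale("LC_TIME")

    set.seed(1)  # make the sample repeatable
    Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
    Date <- format(sample(Range, 30000, replace = TRUE), "%m/%d/%Y")

    # Time strptime with the default (local) time zone and with tz = "GMT".
    t_local <- system.time(strptime(Date, "%m/%d/%Y"))
    t_gmt   <- system.time(strptime(Date, "%m/%d/%Y", tz = "GMT"))

    # Elapsed seconds for each variant.
    c(local = t_local[["elapsed"]], GMT = t_gmt[["elapsed"]])
    ```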
