Efficiently convert a date column in data.table

I have a large data set with many columns containing dates in two different formats:

"1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" "1996-01-10"


                      
4 Answers

半阙折子戏, 2020-12-16 21:55

According to this benchmark, the fastest way to convert character dates in the standard unambiguous format (YYYY-MM-DD) to class Date is as.Date(fasttime::fastPOSIXct()).
Unfortunately, this requires testing the format beforehand, because your other format, DD/MM/YYYY, is misinterpreted by fasttime::fastPOSIXct().
So, if you don't want to check the format of each date column, you can use the anytime::anydate() function:
# sample data
df <- data.frame(
    X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), 
    X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), 
    stringsAsFactors = FALSE)

library(data.table)
# convert date columns
date_cols <- c("X1", "X2")
setDT(df)[, (date_cols) := lapply(.SD, anytime::anydate), .SDcols = date_cols]
df


           X1         X2
1: 1996-01-04 1996-02-01
2: 1996-01-05 1996-03-01
3: 1996-01-08 1996-04-01
4: 1996-01-09 1996-05-01
5: 1996-01-10 1996-08-01
6: 1996-01-11 1996-09-01



Note that in the output above anydate() has read X2 month-first (MM/DD/YYYY), not as DD/MM/YYYY.
The benchmark timings show that there is a trade-off between the convenience offered by the anytime package and performance. So if speed is crucial, there is no other way than to test the format of each column and to use the fastest conversion method available for that format.
The OP has used the try() function for this purpose. The solution below uses regular expressions to find all columns that match a given format (only row 1 is inspected, to save time). This has the additional benefit that the names of the relevant columns are determined automatically and need not be typed in.
# enhanced sample data with additional columns
df <- data.frame(
    X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), 
    X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), 
    X3 = "other data",
    X4 = 1:6,
    stringsAsFactors = FALSE)

library(data.table)
options(datatable.print.class = TRUE)

# coerce to data.table
setDT(df)[]
# convert date columns in standard unambiguous format YYYY-MM-DD
date_cols1 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{4}-\\d{2}-\\d{2}"),]])
# use fasttime package
df[, (date_cols1) := lapply(.SD, function(x) as.Date(fasttime::fastPOSIXct(x))), 
   .SDcols = date_cols1]
# convert date columns in DD/MM/YYYY format
date_cols2 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{2}/\\d{2}/\\d{4}"),]])
# use lubridate package
df[, (date_cols2) := lapply(.SD, lubridate::dmy), .SDcols = date_cols2]
df


           X1         X2         X3    X4
       <Date>     <Date>     <char> <int>
1: 1996-01-04 1996-01-02 other data     1
2: 1996-01-05 1996-01-03 other data     2
3: 1996-01-08 1996-01-04 other data     3
4: 1996-01-09 1996-01-05 other data     4
5: 1996-01-10 1996-01-08 other data     5
6: 1996-01-11 1996-01-09 other data     6


Caveat
If one of the date columns contains NA in the first row, that column is not detected and stays unconverted. To handle such cases, the code above needs to be amended.
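
One possible amendment, sketched under the assumption that each column holds a single format, is to test the first non-NA value of each column instead of row 1. The helper first_value_matches() and the sample data are illustrative and not part of the original answer, and base as.Date() stands in here for the faster converters used above.

```r
library(data.table)

# Hypothetical helper: TRUE if the first non-NA value of a character
# column matches `pattern`; FALSE for non-character or all-NA columns.
first_value_matches <- function(x, pattern) {
  if (!is.character(x)) return(FALSE)
  first <- x[!is.na(x)][1L]   # NA when the column is entirely NA
  !is.na(first) && grepl(pattern, first)
}

# sample data with NA in the first row(s)
dt <- data.table(
  X1 = c(NA, "1996-01-05", "1996-01-08"),
  X2 = c(NA, NA, "04/01/1996"),
  X3 = "other data"
)

# detect the columns by format
is_iso <- sapply(dt, first_value_matches, pattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2}$")
is_dmy <- sapply(dt, first_value_matches, pattern = "^[0-9]{2}/[0-9]{2}/[0-9]{4}$")
date_cols1 <- names(dt)[is_iso]
date_cols2 <- names(dt)[is_dmy]

# convert each group with its own format (X2 is assumed to be DD/MM/YYYY)
dt[, (date_cols1) := lapply(.SD, as.Date, format = "%Y-%m-%d"), .SDcols = date_cols1]
dt[, (date_cols2) := lapply(.SD, as.Date, format = "%d/%m/%Y"), .SDcols = date_cols2]
dt
```

Scanning for the first non-NA value costs more than reading row 1, but it touches each column only once and is still cheap compared with the conversion itself.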
                                                                        
                                                        
            
            
              
                
执笔经年, 2020-12-16 21:55

Since you know beforehand there are only two date formats, this is easy. The format argument to as.Date is vectorized:

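as_date_either <- function(x) {
    # start with the ISO format for every element
    format_vec <- rep_len("%Y-%m-%d", length(x))
    # elements containing a slash get the slash format
    # (month-first here; use "%d/%m/%Y" if the data are DD/MM/YYYY)
    format_vec[grep("/", x, fixed = TRUE)] <- "%m/%d/%Y"
    as.Date(x, format = format_vec)
}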


Edited: replaced ifelse with subset assignment, which is faster
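
A usage sketch for this approach, applied both to a single mixed vector and to data.table columns. The slash_format parameter is a hypothetical addition that makes the day/month order explicit; its default of "%d/%m/%Y" is an assumption that follows the DD/MM/YYYY reading used in the first answer.

```r
library(data.table)

# Variant of the function above; `slash_format` is an added, hypothetical
# parameter so that the day/month order is stated explicitly.
as_date_either <- function(x, slash_format = "%d/%m/%Y") {
  format_vec <- rep_len("%Y-%m-%d", length(x))
  format_vec[grep("/", x, fixed = TRUE)] <- slash_format
  as.Date(x, format = format_vec)
}

# a single vector mixing both formats; NA stays NA
mixed <- c("1996-01-04", "02/01/1996", NA, "1996-01-10")
converted <- as_date_either(mixed)

# the same function applied to several columns by reference
dt <- data.table(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08"),
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996")
)
date_cols <- c("X1", "X2")
dt[, (date_cols) := lapply(.SD, as_date_either), .SDcols = date_cols]
dt
```

Because the format is chosen per element, the columns do not have to be classified first, at the cost of one grep() pass over each column.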
                                                                        
                                                        
            
            
              
                
天命终不由人, 2020-12-16 22:00

If your dataset contains many duplicated date values, one option is to build a de-duplicated reference table, convert only its unique values, and map the results back to the full column. This will be faster than converting the date field on every record.

Data

df <- data.frame(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", rep("1996-01-11", 100)), 
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", rep("09/01/1996", 100)), 
  stringsAsFactors = FALSE)


Create unique Date rows for mapping

date_mapping <- function(date_col){

  ref_df <- data.frame(date1 = unique(date_col), stringsAsFactors = FALSE)

  if(all(grepl("/", ref_df$date1))) {
    ref_df$date2 <- as.Date(ref_df$date1, format = "%d/%m/%Y")

  } else {
    ref_df$date2 <- as.Date(ref_df$date1)  
  }

  date_col_mapped <- ref_df[match(date_col, ref_df$date1), "date2"]

  return(date_col_mapped)

}


date_mapping(df$X1)
date_mapping(df$X2)
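
The same idea can be sketched more compactly for data.table. The helper convert_unique() is hypothetical and not part of the answer above; it takes the converter as an argument, and the formats passed to as.Date() assume that X2 is DD/MM/YYYY.

```r
library(data.table)

# Hypothetical helper: run `converter` on the unique values only,
# then expand the result back to full length with match().
convert_unique <- function(x, converter, ...) {
  u <- unique(x)
  converter(u, ...)[match(x, u)]
}

# sample data with heavy duplication
dt <- data.table(
  X1 = rep(c("1996-01-04", "1996-01-05", "1996-01-11"), times = c(1, 1, 100)),
  X2 = rep(c("02/01/1996", "03/01/1996", "09/01/1996"), times = c(1, 1, 100))
)

# each column is converted from 3 unique strings instead of 102 rows
dt[, X1 := convert_unique(X1, as.Date, format = "%Y-%m-%d")]
dt[, X2 := convert_unique(X2, as.Date, format = "%d/%m/%Y")]
str(dt)
```

The gain depends on how many duplicates there are: if almost every value is distinct, the extra unique() and match() calls add overhead instead of saving time.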

                                                                        
                                                        
            
            
              
                
庸人自扰, 2020-12-16 22:15

Your data

df <- data.frame(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"),
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"),
  stringsAsFactors = FALSE)

str(df)

'data.frame':   6 obs. of  2 variables:
 $ X1: chr  "1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" ...
 $ X2: chr  "02/01/1996" "03/01/1996" "04/01/1996" "05/01/1996" ...


Solution

library(dplyr)
library(lubridate)
# mdy() reads X2 as month/day/year; use dmy(X2) if the column is DD/MM/YYYY
ans <- df %>%
         mutate(X1 = ymd(X1), X2 = mdy(X2))

          X1         X2
1 1996-01-04 1996-02-01
2 1996-01-05 1996-03-01
3 1996-01-08 1996-04-01
4 1996-01-09 1996-05-01
5 1996-01-10 1996-08-01
6 1996-01-11 1996-09-01

str(ans)

'data.frame':   6 obs. of  2 variables:
 $ X1: Date, format: "1996-01-04" "1996-01-05" ...
 $ X2: Date, format: "1996-02-01" "1996-03-01" ...
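
Since the question is about data.table, a comparable conversion by reference can be sketched with set(). The formats lookup vector is an illustrative assumption, and unlike the mdy() call above it reads X2 as DD/MM/YYYY; base as.Date() is used only to keep the sketch free of extra packages.

```r
library(data.table)

dt <- data.table(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08"),
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996")
)

# column name -> strptime format (assumed; X2 is read as DD/MM/YYYY here)
formats <- c(X1 = "%Y-%m-%d", X2 = "%d/%m/%Y")

# replace each column in place, one at a time
for (col in names(formats)) {
  set(dt, j = col, value = as.Date(dt[[col]], format = formats[[col]]))
}
str(dt)
```

Swapping as.Date() for one of the faster converters mentioned in the first answer only changes the value expression inside the loop.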

                                                                        
                                                        
            
            
              
                