Parsing dates with different formats

问题

I have a csv file with a column containing dates, but the dates are in two different formats: "m/d/y H:M" and "y m d H:M:S". I want to make a new column with these dates all in one format (I don't care which one). I tried the parse_date_time function but it would only work for one of the formats, not both. How can I go about doing this?

Here is the code I was trying to use:

newdata <- mutate(data,
                        newcolumn = parse_date_time(x = data$date_column,
                                                        orders = c("m/d/y H:M", "y m d H:M:S"),
                                                        locale = "eng") )

Here are some example dates from the column:

x <- c("6/21/2006 0:00",
       "1889-06-13 00:00:00",
       "6/28/2012 0:00",
       "5/19/2015 0:00",
       "6/6/2016 0:00",
       "1884-05-24 00:00:00",
       "7/28/2013 0:00")

回答1:

The anytime package does just that -- heuristically evaluating plausible formats:

R> library(anytime)
R> x <- c("6/21/2006 0:00",
+        "1889-06-13 00:00:00",
+        "6/28/2012 0:00",
+        "5/19/2015 0:00",
+        "6/6/2016 0:00",
+        "1884-05-24 00:00:00",
+        "7/28/2013 0:00")
R> anytime(x)
[1] "2006-06-21 CDT" "1889-06-13 CST" "2012-06-28 CDT"
[4] "2015-05-19 CDT" NA               "1884-05-24 CST"
[7] "2013-07-28 CDT"
R>

It uses Boost's date_time library parser by default, and that one does not do single digit month/day, hence the NA on element six. But we also added R's parser as a fallback:

R> anytime(x, useR=TRUE)
[1] "2006-06-21 CDT" "1889-06-13 CST" "2012-06-28 CDT"
[4] "2015-05-19 CDT" "2016-06-06 CDT" "1884-05-24 CST"
[7] "2013-07-28 CDT"
R>

So here is all just works without a single format specification.

回答2:

Using lubridate::parse_date_time():

library(lubridate)
library(dplyr)

x <- c("6/21/2006 0:00",
       "1889-06-13 00:00:00",
       "6/28/2012 0:00",
       "5/19/2015 0:00",
       "6/6/2016 0:00",
       "1884-05-24 00:00:00",
       "7/28/2013 0:00")

df <- data_frame(date_column = x)
df_new <- df %>% 
  mutate(new_column = parse_date_time(date_column, orders = c('ymdHMS', "mdyHM")))

df_new
# A tibble: 7 x 2
  date_column         new_column         
  <chr>               <dttm>             
1 6/21/2006 0:00      2006-06-21 00:00:00
2 1889-06-13 00:00:00 1889-06-13 00:00:00
3 6/28/2012 0:00      2012-06-28 00:00:00
4 5/19/2015 0:00      2015-05-19 00:00:00
5 6/6/2016 0:00       2016-06-06 00:00:00
6 1884-05-24 00:00:00 1884-05-24 00:00:00
7 7/28/2013 0:00      2013-07-28 00:00:00

回答3:

So we start by separating the two:

x <- c("03/20/2018 10:42", "2018-03-20 10:37:02")
DF <- data.frame(x = x, stringsAsFactors = FALSE)
slash_index <- grep("/", DF$x)
slash <- DF$x[slash_index]
dash <- DF$x[-slash_index]

Then we convert them. I like lubridate, but you can use your method if you'd like

library(lubridate)
slash <- mdy_hm(slash)
dash <- ymd_hms(dash)

Then we put them into a date vector:

date_times <- integer(0)
date_times[slash_index] <- slash
date_times[seq_along(DF$x)[-slash_index]] <- dash
DF$x <- as.POSIXct(date_times, origin = "1970-01-01 00:00:00")
DF
#                     x
# 1 2018-03-20 03:42:02
# 2 2018-03-20 03:37:02

Note:
The tricky part here was re-assigning parts of a vector to a vector according to their index. When a portion of a vector was assigned to a POSIXct object, it had its attributes stripped, turning it into the internal integer code for the date time. This was resolved by stripping the attributes at the beginning, and then re-assigning the class at the end.

Here's the full thing with your example:

install.packages("lubridate")
library(lubridate)
x <- c("6/21/2006 0:00",
       "1889-06-13 00:00:00",
       "6/28/2012 0:00",
       "5/19/2015 0:00",
       "6/6/2016 0:00",
       "1884-05-24 00:00:00",
       "7/28/2013 0:00")
DF <- data.frame(x = x, stringsAsFactors = FALSE)
slash_index <- grep("/", DF$x)
slash <- DF$x[slash_index]
dash <- DF$x[-slash_index]


slash <- mdy_hm(slash)
dash <- ymd_hms(dash)


date_times <- integer(0)
date_times[slash_index] <- slash
date_times[seq_along(DF$x)[-slash_index]] <- dash
DF$x <- as.POSIXct(date_times, origin = "1970-01-01 00:00:00", tz = "UTC")
DF
#            x
# 1 2006-06-21
# 2 1889-06-13
# 3 2012-06-28
# 4 2015-05-19
# 5 2016-06-06
# 6 1884-05-24
# 7 2013-07-28

Because the times for these are all "00:00:00", they've been truncated. You can display them with the "00:00:00" using the method described in answers to this question.

来源：https://stackoverflow.com/questions/49390707/parsing-dates-with-different-formats

标签

date

parsing