I want to calculate length in different time dimensions but I have problems dealing with the two slightly different time formats in my data frame column.
The origin
or you can also use:
time<- c("2018-07-29T15:02:05Z",
"2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
length<-c(15.8,132.1,12.5,33.2)
df<-data.frame(time,length)
library(lubridate)
# df$time2<-as_datetime(df$time)
df$time2 <-parse_date_time(df$time, "ymd_HMS")
df
We may use %OS
instead of %S
to account for decimals in seconds.
help("strptime")
Specific to R is
%OSn
, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0).
as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"
This base R code is considerably faster than the package solutions, try it yourself.
time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")
This one is trickier. ?strptime
says we should use %z
for offsets from UTC, but somehow it won't work with as.POSIXct
. Instead we could do this,
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
{os <- as.numeric(el(strsplit(substring(time2, 24), "\\:")))
(os[1]*60 + os[2])*60}
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"
which cuts the unreadable part from the string, converts it to seconds and adds it to the "POSIXct"
object.
If there are only hours as in time2
, we could also say:
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"
That the code is slightly longer now should not obscure the fact that it runs practically as fast as the one at top of the answer.
You could wrap the current three variants into a function with if (nchar(x) == 29) ... else
structure, such as this one:
fixDateTime <- function(x) {
s <- split(x, nchar(x))
if ("20" %in% names(s))
s$`20` <- as.POSIXct(s$`20` , format="%Y-%m-%dT%H:%M:%SZ")
else if ("24" %in% names(s))
s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ")
else if ("29" %in% names(s))
s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
{os <- as.numeric(el(strsplit(substring(s[[3]], 24), "\\:")))
(os[1]*60 + os[2])*60}
return(unsplit(s, nchar(x)))
}
res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 CEST" "2018-10-04 00:00:00 CEST" "2018-10-01 00:00:00 CEST"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 00:00:00" "2018-10-01 00:00:00"
Compared to the packages only fixDateTime
can handle all three defined date-time types. According to the concluding benchmark the function is still very fast.
Note: The function logically fails if different date formats have the same nchar
, and it should be customized in the case (e.g. by another split
condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct
.
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# fixDateTime 35.46387 35.94761 40.07578 36.05923 39.54706 68.46211 10 c
# as.POSIXct 20.32820 20.45985 21.00461 20.62237 21.16019 23.56434 10 b # to compare
# lubridate 11.59311 11.68956 12.88880 12.01077 13.76151 16.54479 10 a # produces NAs!
# anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272 10 d # produces NAs!
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z")
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
"2018-10-01T11:42:37.000+03:00")
n <- 1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)
library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
lubridate=parse_date_time(t2, "ymd_HMS"),
anytime=anytime(t2),
times=10L)
You can use library anytime
library(anytime)
time<- c("2018-07-29T15:02:05Z",
"2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
anytime(time)
#[1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST" "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"