R, inconsistent date format

試著忘記壹切 submitted on 2021-02-11 14:01:11

Question


I have a date variable that originally comes from an Excel file, and it is heterogeneous. Although every value looks like yyyy/mm/dd in Excel, once read into R the variable looks like this:

person_1  39257
person_2  2015/2/20
person_3  NA

How can I clean up the date variable so that every value is shown in yyyy/mm/dd format?


Answer 1:


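An option that combines anydate (from anytime) with excel_numeric_to_date (from janitor):

library(janitor)
library(anytime)
library(dplyr)
# Try the Excel serial-number conversion first. as.numeric() returns NA for the
# date strings (the coercion warning is suppressed), and coalesce() then falls
# back to anydate() for those entries.
coalesce(excel_numeric_to_date(suppressWarnings(as.numeric(dat$V2))), anydate(dat$V2))
#[1] "2007-06-24" "2015-02-20" NA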

data

dat <- structure(list(V1 = c("person_1", "person_2", "person_3"), V2 = c("39257", 
"2015/2/20", NA)), class = "data.frame", row.names = c(NA, -3L
))
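
For reference, the serial-number conversion can also be sketched in base R. This is a minimal illustration, not what janitor does internally; it assumes the workbook uses the Windows 1900 date system, for which the examples in ?as.Date give "1899-12-30" as the origin (the offset absorbs Excel treating 1900 as a leap year). The variable names are made up for the example.

```r
# Excel serial number taken from the question's data
serial <- 39257

# Windows Excel (1900 date system): origin is 1899-12-30
win_date <- as.Date(serial, origin = "1899-12-30")
win_date
# [1] "2007-06-24"

# Older Mac Excel (1904 date system) would need a different origin
mac_date <- as.Date(serial, origin = "1904-01-01")
```

If the file came from a workbook using the 1904 date system, the second form would be the one to use, so it is worth checking one known date by hand.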



Answer 2:


An iterative approach, similar to how packages like lubridate try candidate formats until one matches. It tries several formats, including the Excel serial-number model (for dates from Windows Excel, the origin to pass to as.Date is "1899-12-30", not "1900-01-01"). The order of the formats matters: each value takes the first format that parses it. In the face of ambiguity, a better heuristic would find the format with the most matches and use that for all values, but that's over to you.

dat <- read.table(header=FALSE, stringsAsFactors=FALSE, text="
person_1  39257
person_2  2015/2/20
person_3  NA")

conv_dates <- function(dates, origin = "1899-12-30") {
  # Start with a Date vector of NAs, one element per input
  out <- Sys.Date()[rep(NA, length(dates))]
  notna0 <- !is.na(dates)
  # Purely numeric strings are treated as Excel serial numbers
  allnum <- notna0 & grepl("^[.0-9]+$", dates)
  if (any(allnum)) out[allnum] <- suppressWarnings(as.Date(as.numeric(dates[allnum]), origin = origin))
  # Candidate string formats, tried in order on whatever is still unparsed
  fmts <- c("%Y/%m/%d", "%d/%m/%Y", "%m/%d/%Y")
  for (fmt in fmts) {
    isna <- notna0 & is.na(out)
    if (!any(isna)) break
    out[isna] <- as.Date(dates[isna], format = fmt)
  }
  out
}

str(conv_dates(dat$V2))
#  Date[1:3], format: "2007-06-24" "2015-02-20" NA
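
The "most matches" heuristic mentioned above could be sketched as follows. This is one possible reading of the idea, not the answerer's code: best_format is a hypothetical helper, and the regular expressions paired with each format are an added assumption. They are there because as.Date() can be lenient on input (for instance, %Y may accept a one- or two-digit year), so the shape of the string is checked as well.

```r
# Hypothetical helper: choose the one format that parses the most values,
# then use it for the whole vector.
best_format <- function(dates) {
  dates <- as.character(dates)
  candidates <- data.frame(
    fmt = c("%Y/%m/%d", "%d/%m/%Y", "%m/%d/%Y"),
    pattern = c("^[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}$",
                "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$",
                "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$"),
    stringsAsFactors = FALSE
  )
  # For each candidate, count values that have the right shape AND parse
  hits <- vapply(seq_len(nrow(candidates)), function(i) {
    shape_ok <- grepl(candidates$pattern[i], dates)
    parsed <- as.Date(dates, format = candidates$fmt[i])
    sum(shape_ok & !is.na(parsed))
  }, integer(1))
  # which.max() returns the first maximum, so ties (including "no matches
  # at all") fall back to the earliest format in the list
  candidates$fmt[which.max(hits)]
}

x <- c("03/04/2021", "25/12/2020", "01/02/2019")
fmt <- best_format(x)
fmt
# [1] "%d/%m/%Y"
as.Date(x, format = fmt)
# [1] "2021-04-03" "2020-12-25" "2019-02-01"
```

Here "25/12/2020" cannot be month/day/year, so day/month/year wins for the whole vector, including the values that would be ambiguous on their own.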



Answer 3:


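You can first convert the values that are already in year/month/day form, then convert the remaining numeric Excel dates using the Excel origin.

# Strings such as "2015/2/20" parse; everything else becomes NA
dat$date <- as.Date(dat$V2, '%Y/%m/%d')
#Can also use
#dat$date <- lubridate::ymd(dat$V2)
# Fill the gaps by treating the original values as Excel serial numbers
inds <- is.na(dat$date)
dat$date[inds] <- as.Date(as.numeric(dat$V2[inds]),origin = "1899-12-30")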
dat

#        V1        V2       date
#1 person_1     39257 2007-06-24
#2 person_2 2015/2/20 2015-02-20
#3 person_3      <NA>       <NA>
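
A Date column prints as yyyy-mm-dd. If the slash-separated yyyy/mm/dd form from the question is needed for display or export, one way is format(), sketched below on stand-in values. Note that the result is a character vector, not a Date, so it is best applied as the last step.

```r
# Stand-in for a cleaned Date column
d <- as.Date(c("2007-06-24", "2015-02-20", NA))

# Render as yyyy/mm/dd text; missing dates stay NA
out <- format(d, "%Y/%m/%d")
out
# [1] "2007/06/24" "2015/02/20" NA
```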


Source: https://stackoverflow.com/questions/61689061/r-inconsistent-date-format
