Dealing with Messy Dates

南方客 2021-01-31 08:18

I hope you didn't think I was asking for relationship advice.

Infrequently, I have to offer survey respondents the ability to specify when an event occurred. What resu

5 Answers
  •  青春惊慌失措
    2021-01-31 09:20

    I'm not going to try to write the function right now, but I have an idea that might work.

    Search each string for a 4-digit number and treat it as the year.
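
    A minimal sketch of the year step, assuming the years of interest fall in the 1900s or 2000s. The `test` vector here is made-up sample data standing in for the real responses, and the names `m` and `years` are only illustrative.

    ```r
    # Hypothetical sample strings standing in for the survey responses
    test <- c("May 3 2011 4:30pm", "sometime in April",
              "Jan 12, 2012 around 10.15", "12/25/2011")

    # Find the first standalone 4-digit number that starts with 19 or 20
    m <- regexpr("\\b(19|20)[0-9]{2}\\b", test, perl = TRUE)

    # NA marks strings where no year was found
    years <- rep(NA_integer_, length(test))
    years[m > 0] <- as.integer(regmatches(test, m))
    years
    ```

    Restricting the pattern to 19xx/20xx is a design choice to avoid picking up other 4-digit numbers; loosen it if your data needs that.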

    Use grepl to search each string for the 3-letter abbreviation of each month. Almost all of your data (at least what's shown above) seems to contain an identifier like that. I'd store the month that is found in a "months" vector and put a 0 wherever no month is found. Here's a really ugly version of the code (I'll make it more efficient later and handle the case where the month isn't capitalized!):

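    # `test` is the character vector of raw date strings from the question
    mos <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    # One logical vector per month: TRUE where that abbreviation appears (case-sensitive)
    blah <- lapply(seq_along(mos), function(i) grepl(mos[i], test))
    # Optional: inspect which strings matched each month
    lapply(blah, which)
    # Month number for each string; 0 means no abbreviation was found
    months <- integer(length(test))
    for (i in seq_along(mos)) {
      months[blah[[i]]] <- i
    }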
    
    
       months
      [1]  5  0  0  4  0  4  4  4  4  4  4  4  4  4  4  4  4  0  4  4  4  3  3  0  0  0  0  2  0  1
     [31]  1  1  1  0  0  0  0  0  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0 12 12 12 12  0
     [61]  0  0  0  0 12 12 12 12  0 12 12 12 12 12 12 12 12 12  0  0  0 12 12 12 12 11 11  0 11 11
     [91] 11  0 11  0 11  0 11  0  0 11 11 11  0 11  0 11 11 11  0 11 11 11 11  0 11  0  0  0 10 10
    [121] 10  0 10 10 10  0  0 10 10 10  0  0  0  0  0 10 10  0  0 10 10 10 10  0 10  0 10  0  0  0
    [151] 10  0 10 10 10 10 10  9  9  9  9  8  0  0 
    

    The "day" most commonly follows the month word immediately. So if a 1- or 2-digit number comes right after the month (which is text), extract that number and call it the day.
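
    The day step could be sketched like this, assuming the day is separated from the month name by whitespace. Again, `test` is made-up sample data, and `pat`, `hit`, and `days` are illustrative names.

    ```r
    # Hypothetical sample strings standing in for the survey responses
    test <- c("May 3 2011 4:30pm", "sometime in April",
              "Jan 12, 2012 around 10.15", "12/25/2011")

    # Month abbreviation, any remaining letters of the month name, an optional
    # period, whitespace, then a standalone 1- or 2-digit number (captured)
    pat <- paste0("^.*?\\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)",
                  "[a-z]*\\.?\\s+([0-9]{1,2})\\b.*$")

    hit <- grepl(pat, test, ignore.case = TRUE, perl = TRUE)

    # NA marks strings where no day follows a month name
    days <- rep(NA_integer_, length(test))
    days[hit] <- as.integer(sub(pat, "\\1", test[hit],
                                ignore.case = TRUE, perl = TRUE))
    days
    ```

    The trailing `\\b` keeps a string like "May 2011" from yielding a day of 20.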

    Times most commonly contain a ":" or "." symbol, so search each string for that character. If it is found, fill a "Time" vector with the digits immediately before and after it (in theory, taking up to 2 before and 2 after should not cause a problem). Put blanks wherever the symbol is not present. It would be nice if all of the data were confined to a period of less than 12 hours, because then you wouldn't have to worry about AM and PM. If not, maybe search the string for "AM" and "PM" as well.
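
    A rough sketch of the time step, including the AM/PM check. The `test` vector is made-up sample data and every variable name is illustrative. Note the risk: a date written like "12.25.2011" would be picked up as the time "12.25", so the results still need checking.

    ```r
    # Hypothetical sample strings standing in for the survey responses
    test <- c("May 3 2011 4:30pm", "sometime in April",
              "Jan 12, 2012 around 10.15", "12/25/2011")

    # 1-2 digits, ":" or ".", exactly 2 digits, then an optional am/pm marker
    pat <- "\\b[0-9]{1,2}[:.][0-9]{2}(?![0-9])(\\s*[ap]m\\b)?"
    m <- regexpr(pat, test, ignore.case = TRUE, perl = TRUE)

    # NA marks strings with no time-like piece
    times <- rep(NA_character_, length(test))
    times[m > 0] <- regmatches(test, m)

    # Split into hour and minute
    hour   <- as.integer(sub("[:.].*$", "", times))
    minute <- as.integer(sub("^[0-9]+[:.]([0-9]{2}).*$", "\\1", times))

    # Shift to a 24-hour clock when an am/pm marker is present
    pm <- grepl("pm$", times, ignore.case = TRUE)
    am <- grepl("am$", times, ignore.case = TRUE)
    hour24 <- ifelse(pm & hour < 12, hour + 12L,
                     ifelse(am & hour == 12, 0L, hour))
    ```

    Times without a marker are left as written, which is only safe if the data really does sit inside one half of the day.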

    Then, try to convert the strings that have all four of the above to POSIXct. The ones that don't convert you'll have to enter manually, of course. I think it would take me a few hours to code the function described above, and depending on the variability and size of your dataset it may or may not be worth the effort. Also, there is some risk of incorrect outputs, so adding an acceptable time range would help to catch those.
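
    The final step might look something like the sketch below. The `parts` data frame is made up for illustration (it stands in for the year, month, day, and time pieces extracted earlier), and the acceptable range of 2011 to 2012 and the UTC time zone are assumptions you would replace with your own.

    ```r
    # Hypothetical extracted components; NA means the piece was not found
    parts <- data.frame(
      year   = c(2011L, 2012L, NA,  1911L),
      month  = c(5L,    1L,    4L,  2L),
      day    = c(3L,    12L,   7L,  14L),
      hour   = c(16L,   10L,   9L,  8L),
      minute = c(30L,   15L,   0L,  0L)
    )

    # Only build a timestamp when every piece is present
    complete <- complete.cases(parts)
    stamp <- with(parts, sprintf("%04d-%02d-%02d %02d:%02d",
                                 year, month, day, hour, minute))
    stamp[!complete] <- NA

    when <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M", tz = "UTC")

    # Assumed acceptable range; anything missing or outside it gets hand-checked
    lo <- as.POSIXct("2011-01-01", tz = "UTC")
    hi <- as.POSIXct("2012-12-31", tz = "UTC")
    needs_review <- is.na(when) | when < lo | when > hi
    needs_review
    ```

    Rows flagged in `needs_review` are the ones to enter by hand.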

    In summary, it sounds like you're going to have to code a function with a whole lot of exceptions and then end up hand-coding a good portion of the times anyway. I hope someone can provide a better solution for you, though.

    Good Luck!
