Extracting dates following a specific word from a column of strings using dplyr

十年热恋 提交于 2021-01-29 14:31:03

问题


I am trying to extract the most recent date that a report was added in an R dataframe of reports. The text always looks like Date Ordered: M/DD/YYYY and may contain 0 many times in a given report. If it's repeating, I want the most recent (usually the last) instance, and I'm trying to convert it to a date in a mutated dplyr column.

Using the code below on my actual dataframe, I get the error:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed

However, it runs fine on a single item making me think that it's trying to concatenate the entire column.

The test code doesn't give an error, but actually pulls the last date from the last report for all instances:

     lastdate
1 1999-04-15
2 1999-04-15
dataset=data.frame(cbind(ID=c(001,002),
                         Report=c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
                                  "Meh Date Ordered: 4/15/1999")),
                   stringsAsFactors = F)`

dataset %>% 
  mutate(lastdate = as.Date(last(gsub("Date Ordered:\\s+", "",
                                      strapplyc(Report, 
                                                "Date Ordered:\\s*\\d+/\\d+/\\d+", simplify = TRUE))),
                            "%m/%d/%Y"))

Desired output should be:

2005-4-2
1999-4-15

Actual for dataset:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

Actual for test data:

    lastdate
1 1999-04-15
2 1999-04-15

回答1:


I suggest a gsub like

dataset$lastsdate <- as.Date(gsub(".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*","\\1", dataset$Report),"%m/%d/%Y")

See the regex in action.

The regex matches:

  • .* - any 0+ chars as many as possible
  • Date Ordered: - a literal substring
  • \s* - 0+ whitespaces
  • (\d{1,2}/\d{1,2}/\d{4}) - Capturing group 1 (\1): 1 or 2 digits, /, 1 or 2 digits, /, 4 digits
  • .* - the rest of the string
  • | - or
  • .* - the entire string.


来源:https://stackoverflow.com/questions/58148676/extracting-dates-following-a-specific-word-from-a-column-of-strings-using-dplyr

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!