Parsing Dates from Text in R

Question

I repeatedly run into the problem of parsing dates from relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. An example text:

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to a format like, for example, 2015-07-01 UTC (step 2). Step 2 can be performed using, for example, parse_date_time from package lubridate (which is nice for multiple applicable date formats):

Case 1:

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

In some cases, parse_date_time also works on longer strings that contain the date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

However, as far as I understand it, step 2 does not work directly on the full example text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA

Apparently, some of the additional information in the text prevents the date from being parsed directly from the full text. I can think of an approach where step 1 uses a regex to extract a reduced string (similar to Case 1 or Case 2) that includes the date and on which parse_date_time works. However, using regex for dates always seems a bit dirty, because a regex does not know whether what it extracts is a valid date.

Is there a way to directly perform step 2 (i.e., without a workaround based on regex) on unstructured texts as in the above example (Case 3)?

Any input is very much appreciated!

Answer 1:

Using this website, we can construct some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it doesn't work in R... :(
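One likely reason the pattern above fails is that a bracket expression matches a single character, so [1-31] is not a numeric range and [th, st] is not a choice between suffixes. The following small check, written only as an illustration (the variable names are made up for this sketch), shows the difference between the bracket form and an explicit alternation:

```r
# "[1-31]" is the range 1-3 plus a literal 1, i.e. it matches ONE of 1, 2, 3.
bracket_day <- grepl("^[1-31]$", c("1", "3", "7", "31"))

# "[th, st]" matches ONE character out of t, h, comma, space, s.
bracket_suffix <- grepl("^[th, st]$", c("t", "th", "st"))

# An explicit alternation is needed to accept the days 1 to 31.
alternation_day <- grepl("^([1-9]|[12][0-9]|3[01])$", c("1", "3", "7", "31", "32"))

print(bracket_day)
print(bracket_suffix)
print(alternation_day)
```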

It does work if corrected.

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
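To go from the extracted string to an actual date (step 2 in the question), the two steps can be combined. What follows is only a minimal base-R sketch, not the answerer's method: extract_date is a hypothetical helper name, and it assumes English month names written in full and the "Month Dth, YYYY" layout from the example text. Building an ISO string from the captured pieces avoids depending on the session locale for month names.

```r
# Hypothetical helper: return the first "Month Dth, YYYY" date found in `text`
# as a Date, or NA if no such pattern occurs.
extract_date <- function(text) {
  months  <- paste(month.name, collapse = "|")
  pattern <- paste0("\\b(", months, ") ",
                    "([1-9]|[12][0-9]|3[01])(st|nd|rd|th)?, ",
                    "([12][0-9]{3})\\b")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) == 0) return(as.Date(NA))
  # m[1] = full match, m[2] = month name, m[3] = day, m[4] = suffix, m[5] = year
  iso <- sprintf("%s-%02d-%02d", m[5], match(m[2], month.name), as.integer(m[3]))
  # as.Date with an explicit format does the final interpretation of the string
  as.Date(iso, format = "%Y-%m-%d")
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
d <- extract_date(x)
print(format(d))
```

Because the pattern requires a day and a comma before the year, "November 2011" in the same text is skipped. Other layouts (abbreviated months, day-first dates) would need their own patterns.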

Source: https://stackoverflow.com/questions/34088964/parsing-dates-from-text-in-r

Tags

regex

date

parsing

lubridate