Parsing Dates from Text in R

Submitted by ぐ巨炮叔叔 on 2019-12-11 00:05:55

Question


I repeatedly come across the problem of parsing dates from relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. Some example text is:

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to a format like, for example, 2015-07-01 UTC (step 2). Step 2 can be performed using, for example, parse_date_time from package lubridate (which is nice for multiple applicable date formats):

Case 1:

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

For some cases parse_date_time also works on larger strings which include the date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

However, as far as I understand it, step 2 does not work directly on the full example text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA

Apparently, some of the additional information in the text prevents the date from being parsed directly from the full text. I can think of an approach where step 1 is performed using regex to extract a reduced string (similar to Case 1 or Case 2) that includes the date and for which parse_date_time works. However, using regex in connection with dates always seems a bit dirty, as regex does not know whether it has extracted a valid date.

Is there a way to directly perform step 2 (i.e., without a workaround based on regex) on unstructured texts as in the above example (Case 3)?

Any input is very much appreciated!


Answer 1:


Using this website, we can construct some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it doesn't work in R... :(
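A minimal sketch of why that pattern fails, assuming base R's default regex engine: a bracket expression matches exactly one character from a set, so [1-31] means "one character that is 1, 2 or 3" rather than "a number from 1 to 31". The vector and variable names below are made up for illustration.

```r
# Each bracket expression matches exactly one character from its set.
days <- c("1", "3", "7", "31")

# [1-31] is the range 1-3 plus a literal 1, so it only accepts "1", "2" or "3".
single_char_class <- grepl("^[1-31]$", days)

# Alternation spells out the real day numbers 1..31.
alternation <- grepl("^([1-9]|[12][0-9]|3[01])$", days)

print(single_char_class)
print(alternation)
```

The same applies to [th, st], which matches a single t, h, s, comma or space, and to [0-2100], which matches a single digit from 0 to 2.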

It does work once the character classes are replaced with proper alternations and quantifiers:

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
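Building on the corrected pattern, the two steps from the question could be chained in base R. This is only a sketch under some assumptions: the helper name extract_dates is made up for illustration, the pattern only covers English month names followed by an ordinal day and a four-digit year, and month names are looked up in the built-in month.name vector so the result does not depend on the session locale. Impossible dates such as "February 31st, 2015" come back as NA, which goes some way toward the concern that the regex alone cannot tell whether a date is valid.

```r
# Hypothetical helper: find every "Month Dth, YYYY" string and convert it to Date.
extract_dates <- function(text) {
  pattern <- "[JFMASOND][a-z]+ ([1-9]|[12][0-9]|3[01])(st|nd|rd|th), [12][0-9]{3}"

  # Step 1: pull out all candidate date strings.
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]
  if (length(hits) == 0) {
    return(as.Date(character(0), format = "%Y-%m-%d"))
  }

  # Step 2: split each hit into month name, day and year.
  parts <- regmatches(hits, regexec("^([A-Za-z]+) ([0-9]+)[a-z]+, ([0-9]{4})$", hits))
  iso <- vapply(parts, function(p) {
    month <- match(p[2], month.name)  # NA if not a full English month name
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))

  # With an explicit format, as.Date() returns NA for impossible dates.
  as.Date(iso, format = "%Y-%m-%d")
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011."
result <- extract_dates(x)
print(result)
```

Because "November 2011" has no day, the pattern skips it and only the full date is returned. The resulting Date vector can be passed on to lubridate functions if a date-time with a time zone is needed.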


Source: https://stackoverflow.com/questions/34088964/parsing-dates-from-text-in-r
