Question
Input:
- Valid for ticketing and travelling Starting from Mar 27 2016 to Dec 31 2016
- Effective Period Tickets must be issued on before 18 FEB 16
- Effective Period Ticket must be issued on before 29 FEB 2016
- TRAVELING DATES NOW - FEB 10 2016 FEB 22 2016 - MAY 12 2016
- Ticketing Effective Period on before 31 Jan 2016
(Note: The input has already been preprocessed to this stage with some Python code so that it is easier to handle with Python packages.)
Expected output:
- from 2016-03-27 to 2016-12-31
- on before 2016-02-18
- on before 2016-02-29
- now - 2016-02-10 2016-02-22 - 2016-05-12
- on before 2016-01-31
I have tried dateutil. However, it can only extract one date per string, right? Even in that case, extracting both the preposition and the date is still a problem.
I also looked at dateparser and datefinder. It seems they both use dateutil.
The output dates can be YYYY-MM-DD, DDMMYYYY, etc., as long as they are all in the same format.
The output doesn't have to be identical to the one above, as long as it conveys the same information accurately.
Finally, thanks for your time and thoughts. I will also keep trying.
Answer 1:
This is a typical use case for the excellent dateparser library. Just read the docs and you should be able to do it.
Answer 2:
After a few days of research, I came up with the following approaches, which solve the extraction problem.
- Recognize the prepositions and then recognize months and do the extraction.
- Recognize '-' and then recognize months and do the extraction.
Part of the code is shown below. (It is an excerpt that depends on variables defined elsewhere in the full script.)
new_w = new_s.split()
for j in range(len(new_w)):
    if new_w[j] in prepositions and (new_w[j+1].isdecimal() or new_w[j+1].lower() in months):
        # Process case like "Starting from Mar27, 2016 to Dec31, 2016"
        if j+7 in range(len(new_w)) and new_w[j+4] in prepositions:
            if new_w[j+5].isdecimal() or new_w[j+5].lower() in months:
                u = ' '.join(new_w[j:j+8])
                print(label_class[i] + ': ' + u)
                break
        # Process case like "Ticket must be issued on/before 29FEB, 2016"
        elif new_w[j-1] in prepositions:
            u = ' '.join(new_w[j-1:j+4])
            print(label_class[i] + ': ' + u)
            break
        # Process case like "Ticketing valid until 18FEB16"
        else:
            u = ' '.join(new_w[j:j+4])
            print(label_class[i] + ': ' + u)
            break
    # Process case like "TICKETING PERIOD: NOW - FEB 02, 2016"
    # Process case like "TRAVELING DATES: NOW - FEB 10,2016 FEB 22,2016 - MAY 12,2016"
    if new_w[j] in ['-'] and (new_w[j+1].lower() in months or new_w[j+2].lower() in months):
        if new_w[j-1].lower() == 'now':
            u = released_date + ' - ' + ' '.join(new_w[j+1:j+4])
            print(label_class[i] + ': ' + u)
        elif new_w[j-3].lower() in months or new_w[j-2].lower() in months:
            u = ' '.join(new_w[j-3:j+4])
            print(label_class[i] + ': ' + u)
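For readers who want something they can run without the surrounding script, here is a minimal standard-library sketch of the same idea (recognize prepositions, "now" and "-", then recognize month/day/year triples), applied to the sample input from the question. It is not the excerpt above: the names `MONTHS`, `CONNECTORS`, `parse_date` and `extract_periods` are made up for illustration, the list of connector words is an assumption, two-digit years are assumed to mean 20xx, and dates are assumed to arrive as three whitespace-separated tokens (so "Mar27," would not be recognized).

```python
from datetime import date

_MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
                "august", "september", "october", "november", "december"]
# Map both full names and three-letter abbreviations to month numbers.
MONTHS = {}
for _num, _name in enumerate(_MONTH_NAMES, start=1):
    MONTHS[_name] = _num
    MONTHS[_name[:3]] = _num

# Assumed set of words worth keeping when they lead directly up to a date.
CONNECTORS = {"from", "to", "on", "before", "after", "until", "till", "by", "-"}


def parse_date(window):
    """Return 'YYYY-MM-DD' for ['Mar', '27', '2016'] or ['18', 'FEB', '16'], else None."""
    if len(window) != 3:
        return None
    a, b, c = (t.lower().strip(",.") for t in window)
    if a in MONTHS and b.isdecimal():
        month, day = MONTHS[a], int(b)
    elif b in MONTHS and a.isdecimal():
        month, day = MONTHS[b], int(a)
    else:
        return None
    if not c.isdecimal():
        return None
    year = int(c)
    if year < 100:  # assumption: '16' means 2016
        year += 2000
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. day 31 in February
        return None


def extract_periods(sentence, released_date=None):
    """Keep dates (as ISO strings), 'now', and the connectors directly before a date."""
    tokens = sentence.split()
    out, pending = [], []
    i = 0
    while i < len(tokens):
        iso = parse_date(tokens[i:i + 3])
        if iso is not None:
            out.extend(pending + [iso])
            pending = []
            i += 3
            continue
        word = tokens[i].lower()
        if word == "now":
            # Optionally replace 'now' with a known release date.
            out.extend(pending + [released_date or "now"])
            pending = []
        elif word in CONNECTORS:
            pending.append(word)
        else:
            pending = []  # an unrelated word breaks the run of connectors
        i += 1
    return " ".join(out)


samples = [
    "Valid for ticketing and travelling Starting from Mar 27 2016 to Dec 31 2016",
    "Effective Period Tickets must be issued on before 18 FEB 16",
    "TRAVELING DATES NOW - FEB 10 2016 FEB 22 2016 - MAY 12 2016",
]
for s in samples:
    print(extract_periods(s))
```

Connectors are buffered and only emitted once a date actually follows them, so a stray "to" or "on" elsewhere in the sentence is dropped. Passing `released_date="2016-01-15"` replaces "now" with that string, mirroring the `released_date` substitution in the excerpt.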
Source: https://stackoverflow.com/questions/44040268/how-to-extract-time-date-period-information-from-raw-sentences-in-python