Extraction of some date formats failed when using Dateutil in Python

≯℡__Kan透↙ submitted on 2019-12-06 15:40:33

This kind of problem is always going to need tweaking as new edge cases turn up, but the following approach is fairly robust:

from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re


def get_date_part(x):
    # Month names (full or abbreviated) are returned unchanged
    if x.lower() in month_list:
        return x

    # Day or year numbers, with an optional ordinal suffix such as "2nd"
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)

    if day:
        return day.group(1)

    return False


def month_full(month):
    # Despite its name, this normalises a month to its abbreviation, e.g. "March" -> "Mar"
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
    ]

# Year assumed when the text does not contain one
cur_year = 2017

month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]

    days = []
    months = []
    years = []

    # Sorting on isdigit() puts the month names first, so the months group is always seen before the numbers
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = list(map(month_full, values))
        else:
            for v in values:
                # Numbers in this range are treated as years, everything else as a day
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

        if days and months:
            if years:
                dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
            else:
                dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in zip_longest(months, days, fillvalue=months[0])]
                years = [cur_year]

            # Fix for jumps in year
            dates = []
            start_date = datetime(years[0], 1, 1)
            next_year = years[0] + 1

            for d in dates_raw:
                if d < start_date:
                    d = d.replace(year=next_year)
                    next_year += 1
                start_date = d
                dates.append(d)

            print("{}  ->  {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))

This converts the test strings as follows:

I want to visit from May 16-May 18  ->  16/05/2017, 18/05/2017
I want to visit from May 16-18  ->  16/05/2017, 18/05/2017
I want to visit from May 6 May 18  ->  06/05/2017, 18/05/2017
May 6,7,8,9,10  ->  06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June  ->  08/05/2017, 10/06/2017
July 10/20/30  ->  10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please  ->  01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January  ->  02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan  ->  15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017  ->  01/11/2017
27th Oct 2010 until 1st jan  ->  27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012  ->  27/10/2010, 01/01/2012
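
The year jumps in results such as "15 march, 10 feb, 5 jan" come from the final loop of the script. As a rough illustration, that loop can be pulled out into a small Python 3 helper. The name roll_years and its first_year parameter are hypothetical and not part of the original answer; the logic is intended to mirror the loop as written.

```python
from datetime import datetime


def roll_years(dates_raw, first_year):
    """Move any date that falls before its predecessor into a later year."""
    dates = []
    start_date = datetime(first_year, 1, 1)
    next_year = first_year + 1

    for d in dates_raw:
        if d < start_date:
            # Out of order: assume the writer meant a later year
            d = d.replace(year=next_year)
            next_year += 1
        start_date = d
        dates.append(d)

    return dates


raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
rolled = roll_years(raw, 2017)
print(', '.join(d.strftime('%d/%m/%Y') for d in rolled))
# 15/03/2017, 10/02/2018, 05/01/2019
```

Dates that are already in ascending order pass through unchanged, which is why most of the test strings stay within a single year.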

This works as follows:

  1. First create a list of valid month names, i.e. both full and abbreviated.

  2. Make a translation table to make it easy to quickly remove any punctuation from the text.

  3. Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.

  4. Sort the list based on whether or not each part is a digit. This groups the months at the front and the numbers at the end.

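  5. Pair the months with the days in order. When the text gives no year, the first month is reused if there are more days than months. Month names are normalised to their abbreviated form, e.g. August to Aug, and each pair is converted into a datetime object, using a year found in the text (a number from 1900 to 2100) or cur_year otherwise.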

  6. If a date appears to be before the previous one, add a whole year.
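
Steps 1 to 4 can also be tried in isolation. The following is a minimal Python 3 sketch, not the answer's exact code: the names MONTHS, REMOVE_PUNC and split_date_parts are hypothetical, and instead of sorting and grouping it collects month words and numbers into two separate lists, which should give the same separation while keeping the original order within each list.

```python
import calendar
import re
import string

# Full and abbreviated month names, lower-cased; the empty entry at index 0 is skipped
MONTHS = {m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m}

# Translation table mapping every punctuation character to a space
REMOVE_PUNC = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def split_date_parts(text):
    """Return (month words, number strings) found in text, each in original order."""
    months, numbers = [], []
    for word in text.translate(REMOVE_PUNC).split():
        if word.lower() in MONTHS:
            months.append(word)
            continue
        # Same pattern as get_date_part: digits with an optional ordinal suffix
        match = re.match(r'(\d+)(\b|st|nd|rd|th)', word, re.I)
        if match:
            numbers.append(match.group(1))
    return months, numbers


print(split_date_parts('from June 1, july 5 to aug 5 please'))
# (['June', 'july', 'aug'], ['1', '5', '5'])
print(split_date_parts('2nd March to the 3rd January'))
# (['March', 'January'], ['2', '3'])
```

Note that any word matching a month name is kept, so the verb "may" or "march" in ordinary text would be picked up as a month. This is one of the edge cases that would need extra handling.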
