How to determine appropriate strftime format from a date string?

误落风尘

The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.

We are processing files in which each file

4 Answers

暖寄归人

2020-12-09 09:00

You can try the dateinfer module. It handles all of your cases easily. If you need finer control and are prepared to write the code from scratch, regular expressions look like the best option.

遥遥无期

2020-12-09 09:05

You can write your own parser:

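import datetime

class DateFormatFinder:
    """Tries a list of candidate strptime formats against a date string."""

    def __init__(self):
        self.fmts = []

    def add(self, fmt):
        # Candidates are tried in the order they were added.
        self.fmts.append(fmt)

    def find(self, ss):
        # Return the first format that parses ss, or None if none does.
        for fmt in self.fmts:
            try:
                datetime.datetime.strptime(ss, fmt)
                return fmt
            except ValueError:
                # strptime raises ValueError when ss does not match fmt.
                pass
        return None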

You can use it as follows:

>>> df = DateFormatFinder()
>>> df.add('%m/%d/%y %H:%M')
>>> df.add('%m/%d/%y')
>>> df.add('%H:%M')

>>> df.find("01/02/06 16:30")
'%m/%d/%y %H:%M'
>>> df.find("01/02/06")
'%m/%d/%y'
>>> df.find("16:30")
'%H:%M'
>>> df.find("01/02/2006")  # four-digit year: no registered format matches, so None is returned

However, it is not that simple: dates can be ambiguous, and their format cannot always be determined without some context.

>>> datetime.datetime.strptime("01/02/06 16:30", "%m/%d/%y %H:%M") # US format
datetime.datetime(2006, 1, 2, 16, 30)
>>> datetime.datetime.strptime("01/02/06 16:30", "%d/%m/%y %H:%M") # European format
datetime.datetime(2006, 2, 1, 16, 30)
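
One way to surface this ambiguity, rather than silently returning the first hit, is to collect every candidate format that parses the string. The following is only a minimal sketch and not part of the answer above; `matching_formats` and the candidate list are hypothetical names chosen for illustration.

```python
from datetime import datetime


def matching_formats(date_string, candidates):
    """Return (format, parsed datetime) pairs for every candidate that parses date_string."""
    matches = []
    for fmt in candidates:
        try:
            matches.append((fmt, datetime.strptime(date_string, fmt)))
        except ValueError:
            # This candidate does not fit the string; try the next one.
            continue
    return matches


# Hypothetical candidate list: month-first and day-first readings.
candidates = ["%m/%d/%y %H:%M", "%d/%m/%y %H:%M"]

# Both formats fit, but they disagree about the date,
# so this string alone cannot settle the format.
ambiguous = matching_formats("01/02/06 16:30", candidates)

# A day value above 12 rules out the month-first reading.
unambiguous = matching_formats("21/11/06 16:30", candidates)
```

If more than one pair comes back and the parsed values differ, the caller knows it needs more context (for example, further samples from the same file) before committing to a format.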

北荒

2020-12-09 09:09

This is a tricky one. My approach uses regular expressions and the (?(DEFINE)...) syntax, which is supported only by the third-party regex module, not by the standard library's re.

Essentially, DEFINE lets us define subroutines before matching them, so first we define all the building blocks ("bricks") needed for the date-guessing function:

bricks = re.compile(r"""
    (?(DEFINE)
        (?P<year_def>[12]\d{3})
        (?P<year_short_def>\d{2})
        (?P<month_def>January|February|March|April|May|June|
        July|August|September|October|November|December)
        (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
        (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
        (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
        (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
        (?P<hms_def>T?\d{2}:\d{2}:\d{2})
        (?P<hm_def>T?\d{2}:\d{2})
        (?P<ms_def>\d{5,6})
        (?P<delim_def>([-/., ]+|(?<=\d|^)T))
    )
    # actually match them
    (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
    (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
    """, re.VERBOSE)

After this, we need to think of possible delimiters:

# delim
delim = re.compile(r'([-/., ]+|(?<=\d)T)')

Format mapping:

# formats
formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

The function GuessFormat() splits the parts with the help of the delimiters, tries to match them and outputs the corresponding code for strftime():

def GuessFormat(datestring):

    # define the bricks
    bricks = re.compile(r"""
            (?(DEFINE)
                (?P<year_def>[12]\d{3})
                (?P<year_short_def>\d{2})
                (?P<month_def>January|February|March|April|May|June|
                July|August|September|October|November|December)
                (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
                (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                (?P<hm_def>T?\d{2}:\d{2})
                (?P<ms_def>\d{5,6})
                (?P<delim_def>([-/., ]+|(?<=\d|^)T))
            )
            # actually match them
            (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
            (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
            """, re.VERBOSE)

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')

    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

    parts = delim.split(datestring)
    out = []
    for index, part in enumerate(parts):
        try:
            brick = dict(filter(lambda x: x[1] is not None, bricks.match(part).groupdict().items()))
            key = next(iter(brick))

            # ambiguities
            if key == 'day' and index == 2:
                key = 'month_dec'

            item = part if key == 'delim' else formats[key]
            out.append(item)
        except AttributeError:
            out.append(part)

    return "".join(out)

A test at the end (the regex module must be imported as re before GuessFormat() is called):

from datetime import datetime
import regex as re

datestrings = [datetime.now().isoformat(), '2006-11-02', 'Thursday, 10 August 2006 08:42:51', 'August 9, 1995', 'Aug 9, 1995', 'Thu, 01 Jan 1970 00:00:00', '21/11/06 16:30',
               '06 Jun 2017 20:33:10']

# test
for dt in datestrings:
    print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))

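This yields the following. Note that the positional heuristic is not reliable: the day of the month is reported as %m wherever it happens to be the third token (for example in 'August 9, 1995'), and the two-digit year in '21/11/06' is reported as %d: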

Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
Date: 2006-11-02, Format: %Y-%m-%d
Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
Date: August 9, 1995, Format: %B %m, %Y
Date: Aug 9, 1995, Format: %b %m, %Y
Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S
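
Some of the formats in this output, such as %d/%m/%d or %B %m, fill the same field twice, which a real date format would not do. A cheap sanity check can flag such guesses before they are used. The sketch below is an illustration only and not part of the answer above; `FIELD_OF` and `conflicting_fields` are hypothetical names, and the table covers only the directives that appear in this thread.

```python
import re

# Directives grouped by the date/time field they fill
# (only the subset used in this thread).
FIELD_OF = {
    "%Y": "year", "%y": "year",
    "%m": "month", "%B": "month", "%b": "month",
    "%d": "day",
    "%A": "weekday", "%a": "weekday",
    "%H": "hour", "%M": "minute", "%S": "second", "%f": "microsecond",
}


def conflicting_fields(fmt):
    """Return the sorted names of fields that more than one directive in fmt fills."""
    counts = {}
    # "%(.)" consumes "%%" as a single escaped percent sign, which is then skipped
    # because "%%" is not in FIELD_OF.
    for char in re.findall(r"%(.)", fmt):
        field = FIELD_OF.get("%" + char)
        if field is not None:
            counts[field] = counts.get(field, 0) + 1
    return sorted(name for name, count in counts.items() if count > 1)


suspicious = conflicting_fields("%d/%m/%d %H:%M")         # day appears twice
also_suspicious = conflicting_fields("%B %m, %Y")         # two month directives
clean = conflicting_fields("%Y-%m-%dT%H:%M:%S.%f")        # every field filled once
```

An empty result does not prove that a guess is right; it only rules out one class of obviously wrong formats.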

梦如初夏

2020-12-09 09:13
I don't have a ready-made solution, but this is a very tricky problem. Since many brain-hours have already gone into dateutil, rather than trying to replace it, I'll propose an approach that incorporates it:
1. Read the first N records and parse each date using dateutil
2. For each date part, note where in the string the value shows up
3. If all (or >90%) date part locations match (like "YYYY is always after DD, separated by a comma and space"), turn that info into a strptime format string
4. Switch to using datetime.strptime() with a relatively good level of confidence that it will work with the rest of the file
Since you stated that "each file uses only one date/time format", this approach should work (assuming you have different dates in each file so that mm/dd ambiguity can be resolved by comparing multiple date values).
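
The sample-then-commit idea can be sketched with the standard library alone. Note the substitution: instead of parsing with dateutil and locating each date part in the string (steps 1 to 3), this version scores a fixed list of candidate formats against the first N records, so it only works when the right format is in that list. `choose_format`, its parameters, and the sample records are all hypothetical.

```python
from datetime import datetime


def _parses(value, fmt):
    """Return True if value can be parsed with fmt."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def choose_format(records, candidates, sample_size=100, threshold=0.9):
    """Pick the candidate that parses the largest share of the first sample_size records.

    Returns None if the sample is empty or no candidate reaches threshold.
    On a tie, the candidate listed first wins.
    """
    sample = records[:sample_size]
    if not sample:
        return None
    best_fmt, best_share = None, 0.0
    for fmt in candidates:
        share = sum(_parses(value, fmt) for value in sample) / len(sample)
        if share > best_share:
            best_fmt, best_share = fmt, share
    return best_fmt if best_share >= threshold else None


# Made-up records from a file that uses a single day-first format.
records = ["01/02/06 16:30", "03/04/06 09:15", "21/11/06 08:00", "30/12/06 23:59"]
candidates = ["%m/%d/%y %H:%M", "%d/%m/%y %H:%M"]

fmt = choose_format(records, candidates)

# Step 4: once a format is chosen, parse the rest of the file with plain strptime.
parsed = [datetime.strptime(value, fmt) for value in records] if fmt else []
```

The first two records fit both candidates; only the records with a day above 12 separate them, which is why the sample needs to contain varied dates.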