How to determine appropriate strftime format from a date string?

误落风尘 2020-12-09 08:33

The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.

We are processing files in which each file

4 Answers
  • 2020-12-09 09:00

    You can try the dateinfer module. It is pretty nice and handles all your cases easily. If you need finer control over the code and are ready to write it from scratch, though, regular expressions look like the best option.

  • 2020-12-09 09:05

    You can write your own parser:

    import datetime
    
    class DateFormatFinder:
        def __init__(self):
            self.fmts = []
    
        def add(self,fmt):
            self.fmts.append(fmt)
    
        def find(self, ss):
            for fmt in self.fmts:            
                try:
                    datetime.datetime.strptime(ss, fmt)
                    return fmt
                except ValueError:
                    # strptime raises ValueError when ss does not match fmt
                    pass
            return None
    

    You can use it as follows:

    >>> df = DateFormatFinder()
    >>> df.add('%m/%d/%y %H:%M')
    >>> df.add('%m/%d/%y')
    >>> df.add('%H:%M')
    
    >>> df.find("01/02/06 16:30")
    '%m/%d/%y %H:%M'
    >>> df.find("01/02/06")
    '%m/%d/%y'
    >>> df.find("16:30")
    '%H:%M'
    >>> df.find("01/02/2006")  # no registered format has a four-digit year, so None is returned and nothing is printed
    

    However, it is not that simple: dates can be ambiguous, and their format cannot be determined without some context.

    >>> datetime.datetime.strptime("01/02/06 16:30", "%m/%d/%y %H:%M") # US format
    datetime.datetime(2006, 1, 2, 16, 30)
    >>> datetime.datetime.strptime("01/02/06 16:30", "%d/%m/%y %H:%M") # European format
    datetime.datetime(2006, 2, 1, 16, 30)
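
One way to supply that context, when several date strings are known to share a single format, is to keep only the candidate formats that parse every sample. The following is a minimal sketch of that idea, not part of the class above; the function name `formats_matching_all` and the candidate list are illustrative assumptions.

```python
from datetime import datetime

def formats_matching_all(samples, candidates):
    """Return the candidate formats that successfully parse every sample."""
    surviving = []
    for fmt in candidates:
        try:
            for s in samples:
                datetime.strptime(s, fmt)
        except ValueError:
            continue  # at least one sample does not fit this format
        surviving.append(fmt)
    return surviving

# Hypothetical candidates: the US and European readings of the same layout.
candidates = ["%m/%d/%y %H:%M", "%d/%m/%y %H:%M"]

# A single ambiguous sample leaves both readings possible.
print(formats_matching_all(["01/02/06 16:30"], candidates))

# A second sample with a day above 12 rules out the month-first reading.
print(formats_matching_all(["01/02/06 16:30", "25/12/06 09:00"], candidates))
```

This only narrows the choice when the samples actually contain a distinguishing value, such as a day greater than 12; otherwise both candidates survive and the ambiguity remains.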
    
  • 2020-12-09 09:09

    This is a tricky one. My approach uses regular expressions and the (?(DEFINE)...) syntax, which is supported only by the newer regex module.

    Essentially, DEFINE lets us define subroutines before matching them, so we first define all the bricks needed for our date-guessing function:

    bricks = re.compile(r"""
        (?(DEFINE)
            (?P<year_def>[12]\d{3})
            (?P<year_short_def>\d{2})
            (?P<month_def>January|February|March|April|May|June|
            July|August|September|October|November|December)
            (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
            (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
            (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
            (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
            (?P<hms_def>T?\d{2}:\d{2}:\d{2})
            (?P<hm_def>T?\d{2}:\d{2})
            (?P<ms_def>\d{5,6})
            (?P<delim_def>([-/., ]+|(?<=\d|^)T))
        )
        # actually match them
        (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
        (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
        """, re.VERBOSE)
    

    After this, we need to think of possible delimiters:

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')
    

    Format mapping:

    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}
    

    The function GuessFormat() splits the parts with the help of the delimiters, tries to match them and outputs the corresponding code for strftime():

    def GuessFormat(datestring):
    
        # define the bricks
        bricks = re.compile(r"""
                (?(DEFINE)
                    (?P<year_def>[12]\d{3})
                    (?P<year_short_def>\d{2})
                    (?P<month_def>January|February|March|April|May|June|
                    July|August|September|October|November|December)
                    (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                    (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                    (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
                    (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                    (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                    (?P<hm_def>T?\d{2}:\d{2})
                    (?P<ms_def>\d{5,6})
                    (?P<delim_def>([-/., ]+|(?<=\d|^)T))
                )
                # actually match them
                (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
                (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
                """, re.VERBOSE)
    
        # delim
        delim = re.compile(r'([-/., ]+|(?<=\d)T)')
    
        # formats
        formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}
    
        parts = delim.split(datestring)
        out = []
        for index, part in enumerate(parts):
            try:
                brick = dict(filter(lambda x: x[1] is not None, bricks.match(part).groupdict().items()))
                key = next(iter(brick))
    
                # ambiguities
                if key == 'day' and index == 2:
                    key = 'month_dec'
    
                item = part if key == 'delim' else formats[key]
                out.append(item)
            except AttributeError:
                out.append(part)
    
        return "".join(out)
    

    Finally, a test:

    import regex as re  # the third-party regex module, required for (?(DEFINE)...)
    from datetime import datetime
    
    datestrings = [datetime.now().isoformat(), '2006-11-02', 'Thursday, 10 August 2006 08:42:51', 'August 9, 1995', 'Aug 9, 1995', 'Thu, 01 Jan 1970 00:00:00', '21/11/06 16:30', 
    '06 Jun 2017 20:33:10']
    
    # test
    for dt in datestrings:
        print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))
    

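    This yields the output below. Note that several of the guessed formats are not correct: the index == 2 rule reports the day of the month as %m in strings such as "Thursday, 10 August 2006" and "August 9, 1995", and the two-digit year in "21/11/06" comes out as %d, so the ambiguity handling still needs refining: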

    Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
    Date: 2006-11-02, Format: %Y-%m-%d
    Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
    Date: August 9, 1995, Format: %B %m, %Y
    Date: Aug 9, 1995, Format: %b %m, %Y
    Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
    Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
    Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S
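
A guessed format can be sanity-checked by parsing the original string with it and formatting the result back. The sketch below is an assumption-laden illustration rather than part of GuessFormat(): the helper name `round_trips` is hypothetical, and `re` here is the standard-library module, not the regex module aliased as `re` above.

```python
import re
from datetime import datetime

def round_trips(datestring, fmt):
    """True if fmt parses datestring and formatting the result reproduces it exactly."""
    try:
        parsed = datetime.strptime(datestring, fmt)
    except (ValueError, re.error):
        # ValueError: the string does not fit the format.
        # re.error: may surface when a format repeats a directive such as %d.
        return False
    return parsed.strftime(fmt) == datestring

print(round_trips("2006-11-02", "%Y-%m-%d"))
print(round_trips("21/11/06 16:30", "%d/%m/%d %H:%M"))
print(round_trips("21/11/06 16:30", "%d/%m/%y %H:%M"))
```

Because strftime() always zero-pads, a string such as "August 9, 1995" would fail this strict comparison even with a sensible format, so the check is best treated as a conservative filter.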
    
  • 2020-12-09 09:13

    I don't have a ready-made solution, but this is a very tricky problem, and since so many brain-hours have already been spent on dateutil, I'll propose an approach that incorporates it rather than trying to replace it:

    1. Read the first N records and parse each date using dateutil
    2. For each date part, note where in the string the value shows up
    3. If all (or >90%) date part locations match (like "YYYY is always after DD, separated by a comma and space"), turn that info into a strptime format string
    4. Switch to using datetime.strptime() with a relatively good level of confidence that it will work with the rest of the file

    Since you stated that "each file uses only one date/time format", this approach should work (assuming you have different dates in each file so that mm/dd ambiguity can be resolved by comparing multiple date values).
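
The steps above can be sketched roughly as follows. This is one possible reading of the approach, not a tested implementation of it: `infer_format`, `candidates`, `DIRECTIVES` and `TOKEN` are hypothetical names, only a small set of zero-padded numeric directives is considered, and every sample must agree (no 90% threshold). The `parse` argument is any callable that returns a datetime; `dateutil.parser.parse` could be passed there, while the example uses a strptime-based stand-in so that it runs with the standard library alone.

```python
import re
from datetime import datetime

# Illustrative subset of strptime directives; all render zero-padded values.
DIRECTIVES = ["%Y", "%y", "%m", "%d", "%H", "%M", "%S"]

# Split a string into digit runs, letter runs, and runs of everything else.
TOKEN = re.compile(r"\d+|[A-Za-z]+|[^\dA-Za-z]+")

def candidates(sample, dt):
    """Pair each token of sample with the directives whose rendering of dt equals it."""
    return [(tok, {d for d in DIRECTIVES if dt.strftime(d) == tok})
            for tok in TOKEN.findall(sample)]

def infer_format(samples, parse):
    """Return one strptime format consistent with all samples, or None."""
    merged = None
    for sample in samples:
        row = candidates(sample, parse(sample))
        if merged is None:
            merged = row
            continue
        if len(row) != len(merged):
            return None  # the samples do not share a token layout
        for i, (tok, cands) in enumerate(row):
            prev_tok, prev_cands = merged[i]
            if prev_cands and cands:
                narrowed = prev_cands & cands  # keep directives valid for both
                if not narrowed:
                    return None
                merged[i] = (prev_tok, narrowed)
            elif prev_cands or cands or prev_tok != tok:
                return None  # field/literal mismatch, or differing literals
    if merged is None:
        return None  # no samples given
    parts = []
    for tok, cands in merged:
        if not cands:
            parts.append(tok.replace("%", "%%"))  # literal text
        elif len(cands) == 1:
            parts.append(next(iter(cands)))
        else:
            return None  # still ambiguous, e.g. day equals month in every sample
    return "".join(parts)

# Stand-in for dateutil.parser.parse (an assumption made for this example).
def parse(s):
    return datetime.strptime(s, "%d/%m/%y")

print(infer_format(["01/01/06"], parse))              # one ambiguous sample
print(infer_format(["01/01/06", "21/11/06"], parse))  # second sample disambiguates
```

Once a format is returned, the rest of the file can be parsed with datetime.strptime() and that format, as in step 4.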
