How to determine appropriate strftime format from a date string?

误落风尘 2020-12-09 08:33

The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.

We are processing files in which each file

4 answers
  •  北荒 (original poster)
    2020-12-09 09:09

    This is a tricky one. My approach uses regular expressions together with the (?(DEFINE)...) syntax, which Python's built-in re module does not support but the third-party regex module does.


    Essentially, DEFINE lets us define subroutines before matching them, so first we define all the building blocks ("bricks") needed by our date-guessing function:

    bricks = re.compile(r"""
        (?(DEFINE)
            (?P<year_def>[12]\d{3})
            (?P<year_short_def>\d{2})
            (?P<month_def>January|February|March|April|May|June|
            July|August|September|October|November|December)
            (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
            (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
            (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
            (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
            (?P<hms_def>T?\d{2}:\d{2}:\d{2})
            (?P<hm_def>T?\d{2}:\d{2})
            (?P<ms_def>\d{5,6})
            (?P<delim_def>([-/., ]+|(?<=\d|^)T))
        )
        # actually match them
        (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
        (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
        """, re.VERBOSE)
    

    After this, we need to think of possible delimiters:

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')
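
    To make the role of the capturing group concrete, here is a small sketch that uses only the standard-library re module (this particular pattern needs nothing from regex). Because the whole delimiter pattern sits inside parentheses, split() returns the delimiters together with the tokens, so they can later be copied into the format string unchanged. The sample strings are illustrative.

```python
import re

# Same pattern as in the answer; the capturing parentheses make split()
# return the delimiters alongside the tokens.
delim = re.compile(r'([-/., ]+|(?<=\d)T)')

iso_parts = delim.split('2006-11-02T08:42:51')
text_parts = delim.split('Aug 9, 1995')

print(iso_parts)   # ['2006', '-', '11', '-', '02', 'T', '08:42:51']
print(text_parts)  # ['Aug', ' ', '9', ', ', '1995']

# A 'T' that does not follow a digit is not treated as a delimiter.
print(delim.split('Thursday'))  # ['Thursday']
```

    As long as the string does not start with a delimiter, tokens land on even indices and delimiters on odd ones, so index 2 is the second token.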
    

    Format mapping:

    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}
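
    As a quick illustration of what each directive in the mapping produces, the sketch below formats one fixed datetime with every non-empty entry. The sample value is an assumption chosen to match one of the test strings further down; weekday and month names come out in English only under the default C locale.

```python
from datetime import datetime

# Same mapping as in the answer.
formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
           'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
           'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
           'delim': ''}

# Illustrative sample: 10 August 2006, 08:42:51.001811
sample = datetime(2006, 8, 10, 8, 42, 51, 1811)

for key, directive in formats.items():
    if directive:  # 'delim' maps to '' and has nothing to render
        print(f"{key:14} {directive:10} {sample.strftime(directive)}")
```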
    

    The function GuessFormat() splits the date string into parts at the delimiters, tries to match each part against the bricks, and emits the corresponding strftime() directive for every part it recognizes:

    def GuessFormat(datestring):
        # NOTE: `re` must be the third-party regex module (import regex as re);
        # the standard-library re supports neither (?(DEFINE)...) nor (?&name).

        # define the bricks
        bricks = re.compile(r"""
                (?(DEFINE)
                    (?P<year_def>[12]\d{3})
                    (?P<year_short_def>\d{2})
                    (?P<month_def>January|February|March|April|May|June|
                    July|August|September|October|November|December)
                    (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                    (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                    (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
                    (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                    (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                    (?P<hm_def>T?\d{2}:\d{2})
                    (?P<ms_def>\d{5,6})
                    (?P<delim_def>([-/., ]+|(?<=\d|^)T))
                )
                # actually match them
                (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
                (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
                """, re.VERBOSE)

        # delim (the capturing group keeps the delimiters in the split result)
        delim = re.compile(r'([-/., ]+|(?<=\d)T)')

        # formats
        formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

        parts = delim.split(datestring)
        out = []
        for index, part in enumerate(parts):
            match = bricks.match(part)
            if match is None:
                # unknown token: keep it verbatim
                out.append(part)
                continue

            # name of the first group that actually matched
            key = next(name for name, value in match.groupdict().items() if value is not None)

            # ambiguity heuristic: a day-like number in the second token
            # position (index 2) is treated as a numeric month
            if key == 'day' and index == 2:
                key = 'month_dec'

            # delimiters are copied through unchanged
            out.append(part if key == 'delim' else formats[key])

        return "".join(out)
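
    For readers who cannot install the regex module, the same split-and-classify idea can be sketched with the standard library alone. This is a simplified variant, not the code above: the names guess_format_stdlib and TOKEN_RULES are hypothetical, an ordered list of patterns replaces the DEFINE block, and the index-2 heuristic is copied as-is along with its limitations.

```python
import re

# Same delimiter pattern as the answer; the group keeps the delimiters.
DELIM = re.compile(r'([-/., ]+|(?<=\d)T)')

# (pattern, strftime directive) pairs, tried in order; first full match wins.
TOKEN_RULES = [
    (r'\d{2}:\d{2}:\d{2}', '%H:%M:%S'),
    (r'\d{2}:\d{2}', '%H:%M'),
    (r'[12]\d{3}', '%Y'),
    (r'January|February|March|April|May|June|July|August|'
     r'September|October|November|December', '%B'),
    (r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec', '%b'),
    (r'(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day', '%A'),
    (r'Mon|Tue|Wed|Thu|Fri|Sat|Sun', '%a'),
    (r'0[1-9]|[12][0-9]|3[01]|[1-9]', '%d'),
    (r'\d{5,6}', '%f'),
]


def guess_format_stdlib(datestring):
    """Guess a strftime format by classifying each token of the string."""
    out = []
    for index, part in enumerate(DELIM.split(datestring)):
        for pattern, directive in TOKEN_RULES:
            if re.fullmatch(pattern, part):
                # Same heuristic as the answer: a day-like number in the
                # second token position is assumed to be a numeric month.
                if directive == '%d' and index == 2:
                    directive = '%m'
                out.append(directive)
                break
        else:
            # Delimiters and unrecognized tokens pass through unchanged.
            out.append(part)
    return ''.join(out)


print(guess_format_stdlib('2006-11-02'))            # %Y-%m-%d
print(guess_format_stdlib('06 Jun 2017 20:33:10'))  # %d %b %Y %H:%M:%S
```

    The ordered list trades the named subroutines for simplicity; rule order matters in the same way the alternation order does in the regex version.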
    

    A test at the end:

    import regex as re
    from datetime import datetime
    
    datestrings = [
        datetime.now().isoformat(), '2006-11-02',
        'Thursday, 10 August 2006 08:42:51', 'August 9, 1995', 'Aug 9, 1995',
        'Thu, 01 Jan 1970 00:00:00', '21/11/06 16:30', '06 Jun 2017 20:33:10',
    ]
    
    # test
    for dt in datestrings:
        print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))
    

    This yields:

    Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
    Date: 2006-11-02, Format: %Y-%m-%d
    Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
    Date: August 9, 1995, Format: %B %m, %Y
    Date: Aug 9, 1995, Format: %b %m, %Y
    Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
    Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
    Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S
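
    Some of these results are not usable as they stand: `%A, %m %B %Y %H:%M:%S` names the month twice (the 10 is really the day), and `%d/%m/%d %H:%M` repeats `%d` (the trailing 06 is presumably a two-digit year). Both follow from the index-based rule in the function. A minimal sanity check for such cases might look like the sketch below; find_conflicts is a hypothetical helper, not part of the answer, and it only flags repeated or overlapping directives rather than proving a format correct.

```python
import re
from collections import Counter

# Directives that all describe the month.
MONTH_DIRECTIVES = ('m', 'B', 'b')


def find_conflicts(fmt):
    """Return a list of human-readable problems found in a strftime format."""
    # Collect directive letters; '%%' is a literal percent sign, so skip it.
    letters = [c for c in re.findall(r'%(.)', fmt) if c != '%']
    counts = Counter(letters)

    problems = [f'%{c} appears {n} times' for c, n in counts.items() if n > 1]

    months = [m for m in MONTH_DIRECTIVES if m in counts]
    if len(months) > 1:
        problems.append('month given more than once: '
                        + ', '.join('%' + m for m in months))
    return problems


print(find_conflicts('%Y-%m-%d'))               # []
print(find_conflicts('%d/%m/%d %H:%M'))         # ['%d appears 2 times']
print(find_conflicts('%A, %m %B %Y %H:%M:%S'))  # ['month given more than once: %m, %B']
```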
    
