The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.
We are processing files in which each file
You can try the dateinfer module. It handles all of your cases easily. If you need finer control and are willing to write the code from scratch, regular expressions look like the best option.
You can write your own parser:

    import datetime

    class DateFormatFinder:
        def __init__(self):
            self.fmts = []

        def add(self, fmt):
            self.fmts.append(fmt)

        def find(self, ss):
            # Return the first registered format that parses ss, or None.
            for fmt in self.fmts:
                try:
                    datetime.datetime.strptime(ss, fmt)
                    return fmt
                except ValueError:
                    pass
            return None

You can use it as follows:

    >>> df = DateFormatFinder()
    >>> df.add('%m/%d/%y %H:%M')
    >>> df.add('%m/%d/%y')
    >>> df.add('%H:%M')
    >>> df.find("01/02/06 16:30")
    '%m/%d/%y %H:%M'
    >>> df.find("01/02/06")
    '%m/%d/%y'
    >>> df.find("16:30")
    '%H:%M'
    >>> df.find("01/02/2006")  # no registered format accepts a four-digit year, so this returns None

However, it is not that simple: dates can be ambiguous, and their format cannot be determined without some context.

    >>> datetime.datetime.strptime("01/02/06 16:30", "%m/%d/%y %H:%M")  # US format
    datetime.datetime(2006, 1, 2, 16, 30)
    >>> datetime.datetime.strptime("01/02/06 16:30", "%d/%m/%y %H:%M")  # European format
    datetime.datetime(2006, 2, 1, 16, 30)

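To make the ambiguity visible rather than silently returning the first hit, a finder could report every format that parses the string. The following is only a minimal sketch under that assumption; `matching_formats` and the `candidates` list are hypothetical names introduced for illustration and are not part of the answer above.

```python
from datetime import datetime

def matching_formats(text, formats):
    """Return every format in `formats` that parses `text` without error."""
    matches = []
    for fmt in formats:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit the string
        matches.append(fmt)
    return matches

# Hypothetical candidate list: month-first and day-first variants.
candidates = ["%m/%d/%y %H:%M", "%d/%m/%y %H:%M"]

print(matching_formats("01/02/06 16:30", candidates))  # both formats parse
print(matching_formats("21/11/06 16:30", candidates))  # only the day-first format parses
```

When more than one format comes back, the string alone cannot settle the question and some outside context is needed.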
This is a tricky one. My approach makes use of regular expressions and the (?(DEFINE)...) syntax, which is only supported by the newer regex module. DEFINE lets us define subroutines prior to matching them, so first of all we define all the bricks needed for our date-guessing function:

    import regex as re  # third-party "regex" module; the stdlib "re" lacks (?(DEFINE)...)

    bricks = re.compile(r"""
        (?(DEFINE)
            (?P<year_def>[12]\d{3})
            (?P<year_short_def>\d{2})
            (?P<month_def>January|February|March|April|May|June|
                          July|August|September|October|November|December)
            (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
            (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
            (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
            (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
            (?P<hms_def>\d{2}:\d{2}:\d{2})
            (?P<hm_def>\d{2}:\d{2})
            (?P<ms_def>\d{5,6})
            (?P<delim_def>([-/., ]+|(?<=\d|^)T))
        )
        # actually match them
        (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
        (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
        """, re.VERBOSE)

After this, we need to think of possible delimiters:

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')

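Because the delimiter pattern is wrapped in a capturing group, splitting on it keeps the delimiters in the result, which is what later allows the format string to be reassembled piece by piece. A small sketch of that behaviour follows; it uses the standard library `re` module, which is enough for this particular pattern (only the DEFINE block needs the third-party regex module).

```python
import re  # stdlib module; this pattern uses no regex-only features

delim = re.compile(r'([-/., ]+|(?<=\d)T)')

# The capturing group makes split() return the separators as well.
print(delim.split('2006-11-02'))
# ['2006', '-', '11', '-', '02']

# A "T" only counts as a delimiter when a digit precedes it,
# so the "T" in "Thu" is left alone.
print(delim.split('Thu, 01 Jan 1970 00:00:00'))
# ['Thu', ', ', '01', ' ', 'Jan', ' ', '1970', ' ', '00:00:00']
```

Tokens sit at the even indices and delimiters at the odd ones, so the second date field always lands at index 2.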
Format mapping:

    # formats
    formats = {
        'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
        'day': '%d', 'weekday': '%A', 'weekday_short': '%a',
        'month_short': '%b', 'hms': '%H:%M:%S', 'hm': '%H:%M', 'delim': '',
    }

The function GuessFormat() splits the string into parts with the help of the delimiters, tries to match each part, and outputs the corresponding format code for strftime():

    def GuessFormat(datestring):

        # define the bricks
        bricks = re.compile(r"""
            (?(DEFINE)
                (?P<year_def>[12]\d{3})
                (?P<year_short_def>\d{2})
                (?P<month_def>January|February|March|April|May|June|
                              July|August|September|October|November|December)
                (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                (?P<weekday_def>(?:Mon|Tue|Wednes|Thurs|Fri|Satur|Sun)day)
                (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                (?P<hm_def>T?\d{2}:\d{2})
                (?P<ms_def>\d{5,6})
                (?P<delim_def>([-/., ]+|(?<=\d|^)T))
            )
            # actually match them
            (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
            (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
            """, re.VERBOSE)

        # delim
        delim = re.compile(r'([-/., ]+|(?<=\d)T)')

        # formats
        formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

        # the capturing group in delim keeps the delimiters in the list
        parts = delim.split(datestring)
        out = []
        for index, part in enumerate(parts):
            try:
                # keep only the named group that actually matched
                brick = dict(filter(lambda x: x[1] is not None, bricks.match(part).groupdict().items()))
                key = next(iter(brick))

                # ambiguities: a day-like number in the second field
                # (index 2 after splitting) is treated as a month
                if key == 'day' and index == 2:
                    key = 'month_dec'

                item = part if key == 'delim' else formats[key]
                out.append(item)
            except AttributeError:
                # no brick matched (bricks.match returned None): keep the text as-is
                out.append(part)

        return "".join(out)

A test at the end:

    from datetime import datetime
    import regex as re

    datestrings = [datetime.now().isoformat(), '2006-11-02', 'Thursday, 10 August 2006 08:42:51',
                   'August 9, 1995', 'Aug 9, 1995', 'Thu, 01 Jan 1970 00:00:00', '21/11/06 16:30',
                   '06 Jun 2017 20:33:10']

    # test
    for dt in datestrings:
        print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))

This yields:

    Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
    Date: 2006-11-02, Format: %Y-%m-%d
    Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
    Date: August 9, 1995, Format: %B %m, %Y
    Date: Aug 9, 1995, Format: %b %m, %Y
    Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
    Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
    Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S

Note that the positional rule is only a heuristic. Any day-like number in the second field is labelled %m, so the 10 in "10 August 2006" and the 9 in "August 9, 1995" come out as months, and because the two-digit year brick is defined but never used in the matching alternation, the 06 in "21/11/06" is labelled %d. Check the guessed format before relying on it.
I don't have a ready-made solution. This is a very tricky problem, and since so many brain-hours have already gone into dateutil, I'll propose an approach that incorporates it rather than trying to replace it:
Since you stated that "each file uses only one date/time format", this approach should work (assuming you have different dates in each file so that mm/dd ambiguity can be resolved by comparing multiple date values).
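The idea of comparing several values from the same file can be sketched as follows. This is only an illustration under stated assumptions: it uses `datetime.strptime` with a hand-written candidate list instead of dateutil, and `formats_for_all`, `candidates` and `samples` are hypothetical names that do not come from the answer.

```python
from datetime import datetime

def formats_for_all(samples, candidates):
    """Return the candidate formats that parse every sample string."""
    surviving = []
    for fmt in candidates:
        try:
            for sample in samples:
                datetime.strptime(sample, fmt)
        except ValueError:
            continue  # at least one sample contradicts this format
        surviving.append(fmt)
    return surviving

# Hypothetical candidates: month-first versus day-first.
candidates = ["%m/%d/%y", "%d/%m/%y"]

# Values assumed to come from one file that uses a single format.
samples = ["01/02/06", "05/06/06", "21/11/06"]

print(formats_for_all(samples, candidates))
# ['%d/%m/%y']  ("21/11/06" cannot be month-first)
```

A single value such as "01/02/06" leaves both candidates standing; the ambiguity only goes away once the file contains a value with a day greater than 12.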