Question
I am trying to write a regular expression to catch different date formats.
The sentences are in a series, and each sample in the series contains exactly one date but may contain other numbers.
The dates appear in formats like these:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
For years that have only two digits, we assume a 20th-century year (i.e. 19nn).
Here is my regular expression:
df_dates = df.str.extract(r'((?:\d{1,2})?[-/\s,]{0,2}(?:\d{1,2})?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[-/\s,]{0,2}(?:19|20)?\d{2})')
My regex produces these results:
input1
Lab: B12 969 2007\n
found1
12,969
input2
Contemplating jumping off building - 1973 - difficulty writing paper.\n
found2
1973
Question
How do I change my regex to obtain the desired results?
Answer 1:
I strongly believe that you should use several regular expressions to process your data instead of trying to do everything with a single one. That gives you a much more flexible system: adding a new date format is far easier than editing a hard-to-read regex and making it even more obscure.
Given that you're using regex from a programming language, you can generate the regexes with code, so you don't duplicate strings. As an example, consider this quick, incomplete and dirty snippet:
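import re
# Reusable pieces; only two months are listed to keep the example short.
monthsShort = "Jan|Feb"
monthsLong = "January|February"
# The parentheses keep the alternation confined to the month names.
months = "(" + monthsShort + "|" + monthsLong + ")"
separators = "[/-]"
# Raw strings keep the backslash in \d from being treated as a string escape.
days = r"\d{2}"
years = r"\d{4}"  # not used below; available for formats that include a year
regex1 = months + separators + days  # month first, e.g. Jan/01
regex2 = days + separators + months  # day first, e.g. 01/Jan
print(re.search(regex1, "Jan/01"))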
In the end, I have a couple of regexes I can use to match two date formats. Completing the regular expressions is trivial, and adding more formats is easy. The whole thing is easier to read. Of course, you have to be careful when concatenating pieces of regex (it is easy to forget things like parentheses), but I think that's much easier than dealing with obscure regular expressions.
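To illustrate the warning about forgotten parentheses, here is a minimal sketch of what goes wrong when the month alternation is concatenated without a group. The variable names `ungrouped` and `grouped` are made up for this illustration.

```python
import re

months = "Jan|Feb"
separators = "[/-]"
days = r"\d{2}"

# Without a group, "|" splits the whole pattern into
# "Jan"  OR  "Feb[/-]\d{2}", so a bare "Jan" is accepted.
ungrouped = months + separators + days

# With a non-capturing group, the alternation covers only the month names.
grouped = "(?:" + months + ")" + separators + days

ungrouped_match = re.search(ungrouped, "Jan/01").group(0)
grouped_match = re.search(grouped, "Jan/01").group(0)

print(ungrouped_match)  # Jan
print(grouped_match)    # Jan/01
```

Wrapping every alternation in a group at the point where it is defined is probably the simplest way to avoid this when pieces are combined later.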
EDIT: I forgot to mention something: after generating your regular expressions, you can add them to a list, for example, so you can iterate over them and apply them to your text within a single loop. Or, if you really want to, you can combine them all into a single regex (using parentheses and vertical bars) and apply it with a single statement.
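Both options could look roughly like the sketch below. It is only an illustration under some assumptions: the names `PATTERNS`, `first_date` and `COMBINED` are hypothetical, the patterns cover the formats listed in the question but have not been tested against real data, and the ordering (most specific format first, bare year last) is one possible design choice rather than the only one.

```python
import re

# Building blocks (hypothetical names); long month names come before short ones.
MONTH_NAMES = ("January|February|March|April|May|June|July|August|September|"
               "October|November|December|"
               "Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec")
MONTH = r"\b(?:" + MONTH_NAMES + r")\.?"   # optional trailing dot: "Mar."
DAY = r"\d{1,2}(?:st|nd|rd|th)?"           # 20, 20th, 21st, 22nd
YEAR = r"\d{4}"

# Most specific formats first, so a full date wins over a bare year.
PATTERNS = [
    r"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})",          # 04/20/2009, 4/3/09
    MONTH + r"[-\s]" + DAY + r",?[-\s]" + YEAR,  # Mar-20-2009, Mar 20th, 2009
    DAY + r"\s" + MONTH + r",?\s" + YEAR,        # 20 Mar 2009, 20 March, 2009
    MONTH + r"\s" + YEAR,                        # Feb 2009
    r"\d{1,2}/" + YEAR,                          # 6/2008
    r"\b(?:19|20)\d{2}\b",                       # 2009
]
COMPILED = [re.compile(p) for p in PATTERNS]


def first_date(text):
    """Return the match of the first pattern (in list order) that hits, else None."""
    for pattern in COMPILED:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


# Option 2: one big alternation, each piece wrapped in its own group.
COMBINED = re.compile("|".join("(?:" + p + ")" for p in PATTERNS))

print(first_date("Lab: B12 969 2007\n"))                # 2007
print(first_date("seen on Mar 20th, 2009 at clinic"))   # Mar 20th, 2009
print(COMBINED.search("20 Mar. 2009").group(0))         # 20 Mar. 2009
```

The two options are not quite equivalent: the loop returns the match of the highest-priority pattern anywhere in the text, while the combined regex returns whichever alternative matches at the leftmost position. With pandas, a function like `first_date` could be applied to each sample with `Series.apply`.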
Source: https://stackoverflow.com/questions/47877816/regular-expression-of-different-format-of-dates-in-python