Question
I have a list of tokenized text (list_of_words) that looks something like this:
list_of_words =
['08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complet',
...]
and I'm trying to strip out all the instances of dates and times from this list. I've tried using the .remove() method, to no avail. I've tried adding wildcard patterns, such as '../../....', to a list of stopwords I was filtering with, but that didn't work. I finally tried writing the following code:
for line in list_of_words:
    if re.search('[0-9]{2}/[09]{2}/[0-9]{4}',line):
        list_of_words.remove(line)
but that doesn't work either. How can I strip out everything formatted like a date or time from my list?
Answer 1:
Description
^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$
This regular expression will do the following:
- find strings which look like dates (12/23/2016) and times (12:34:56)
- find strings which are am or pm, which are probably part of the preceding time in the source list
Example
Live Demo
- Regex: https://regex101.com/r/yE8oB9/2
- Python: http://codepad.org/X9D3pd7s
Sample List
08/20/2014
10:04:27
pm
complete
vendor
per
mfg/recommend
08/20/2014
10:04:27
pm
complete
List After Processing
complete
vendor
per
mfg/recommend
complete
Sample Python Script
import re
SourceList = ['08/20/2014',
'10:04:27',
'pm',
'complete',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complete']
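# Keep only the words that do not match the date/time pattern.
# On Python 3, filter() returns an iterator, which the loop below consumes.
OutputList = filter(
    lambda ThisWord: not re.match(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$', ThisWord),
    SourceList)
for ThisValue in OutputList:
    print(ThisValue)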
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture (2 times):
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
[:\/,] any character of: ':', '\/', ','
----------------------------------------------------------------------
){2} end of grouping
----------------------------------------------------------------------
[0-9]{2,4} any character of: '0' to '9' (between 2
and 4 times (matching the most amount
possible))
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
am 'am'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
pm 'pm'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
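The node-by-node breakdown above can also be carried inside the pattern itself. The following is a minimal sketch, assuming Python 3 and the standard `re.VERBOSE` flag; the names `DATE_TIME_TOKEN`, `words`, and `kept` are made up for illustration and are not part of the original answer.

```python
import re

# The same pattern as above, spread over several lines with re.VERBOSE
# so that each node can carry its own comment.
DATE_TIME_TOKEN = re.compile(r"""
    ^                          # beginning of the string
    (?:                        # group, but do not capture
        (?:[0-9]{2}[:/,]){2}   # two digits plus a separator, two times
        [0-9]{2,4}             # between two and four closing digits
      | am                     # OR the literal 'am'
      | pm                     # OR the literal 'pm'
    )
    $                          # end of the string
""", re.VERBOSE)

# Hypothetical sample, taken from the list used in this answer.
words = ['08/20/2014', '10:04:27', 'pm', 'complete',
         'vendor', 'per', 'mfg/recommend']

# Keep every word that does not look like a date, a time, or am/pm.
kept = [w for w in words if not DATE_TIME_TOKEN.match(w)]
print(kept)  # ['complete', 'vendor', 'per', 'mfg/recommend']
```

Compiling the pattern once also avoids repeating the pattern string in every call when the list is long.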
Answer 2:
If you want to match the time and date strings in your list, you can try the regex below:
[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}
Here is the Python code:
import re
list_of_words = [
'08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complet'
]
new_list = [item for item in list_of_words if not re.search(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
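Note that `re.search` looks for the pattern anywhere inside a token, so a longer token that merely contains something date-like is dropped as well, and `pm` is left in place because this pattern has no am/pm branch. A small sketch of the difference, assuming Python 3.4+ for `fullmatch`; the token `'rev12/05/2014a'` is a made-up example, not taken from the question.

```python
import re

pattern = re.compile(r'[0-9]{2}[/,:][0-9]{2}[/,:][0-9]{2,4}')

# 'rev12/05/2014a' is a hypothetical token that only contains a date.
tokens = ['08/20/2014', '10:04:27', 'rev12/05/2014a', 'pm', 'vendor']

# search() matches anywhere in the token, so the embedded date is enough.
loose = [t for t in tokens if not pattern.search(t)]

# fullmatch() requires the whole token to be date- or time-shaped.
strict = [t for t in tokens if not pattern.fullmatch(t)]

print(loose)   # ['pm', 'vendor']
print(strict)  # ['rev12/05/2014a', 'pm', 'vendor']
```

Which behaviour is preferable depends on whether tokens with embedded dates should survive the cleanup.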
Answer 3:
Try this:
import re
list_of_words = ['08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm', 'complet']
# list() is needed on Python 3, where filter() returns an iterator.
list_of_words = list(filter(
    lambda x: not re.match(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', x),
    list_of_words))
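For context on why the loop in the question fails, two things seem to be going on. The middle character class there is `[09]`, which accepts only the digits 0 and 9, so `08/20/2014` never matches. And even with a corrected pattern, calling `.remove()` while iterating over the same list makes the loop skip the element right after each removal. A minimal sketch of both effects, using the illustrative names `date_like`, `words`, and `buggy`:

```python
import re

# The question's pattern: [09] accepts only '0' or '9', so '20' fails.
typo_hit = re.search('[0-9]{2}/[09]{2}/[0-9]{4}', '08/20/2014')
print(typo_hit)  # None

# A corrected pattern covering both dates and times.
date_like = re.compile(r'[0-9]{2}[/:][0-9]{2}[/:][0-9]{2,4}')

words = ['08/20/2014', '10:04:27', 'pm', 'vendor']

# Removing while iterating: after the date is removed, the time shifts
# into the slot the loop has already visited and is never checked.
buggy = list(words)
for w in buggy:
    if date_like.match(w):
        buggy.remove(w)
print(buggy)  # ['10:04:27', 'pm', 'vendor']

# Building a new list and assigning it back through a slice avoids the
# skipping and still updates the original list object in place.
words[:] = [w for w in words if not date_like.match(w)]
print(words)  # ['pm', 'vendor']
```

This is why all three answers build a new list with `filter` or a list comprehension instead of removing items inside the loop.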
Source: https://stackoverflow.com/questions/37473219/how-to-remove-dates-from-a-list-in-python