Question

I have a list of tokenized text (list_of_words) that looks something like this:

list_of_words = 
['08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 'vendor',
 'per',
 'mfg/recommend',
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 ...]

and I'm trying to strip out every date and time from this list. I've tried using the .remove() method, to no avail. I've also tried passing wildcard patterns, such as '../../....', to a list of stopwords I was sorting with, but that didn't work. Finally, I tried writing the following code:

for line in list_of_words:
    if re.search('[0-9]{2}/[09]{2}/[0-9]{4}',line):
        list_of_words.remove(line)

but that doesn't work either. How can I strip out everything formatted like a date or time from my list?

Answer 1:

Description

^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$

This regular expression will do the following:

find strings that look like dates (12/23/2016) or times (12:34:56)
find strings that are exactly am or pm, which are probably part of the preceding time in the source list
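As a quick check of the two points above, here is a minimal sketch that runs the pattern against a few tokens. The sample tokens are illustrative; '8/20/2014' is an assumed extra case, included to show that a single-digit month is not matched because the pattern requires exactly two digits before each separator.

```python
import re

# Pattern copied from the description above; the raw string keeps the
# backslash literal instead of treating it as a string escape.
DATE_TIME_TOKEN = re.compile(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$')

# Illustrative tokens; '8/20/2014' is a hypothetical edge case.
samples = ['08/20/2014', '10:04:27', 'pm', 'complet', 'mfg/recommend', '8/20/2014']

# Map each token to whether the pattern matches it.
results = {token: bool(DATE_TIME_TOKEN.match(token)) for token in samples}

for token, matched in results.items():
    print(token, matched)
```

Note that the separator class is checked independently at each position, so a mixed token such as '08/20:2014' would also match. That is probably harmless for this data, but it is worth knowing.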

Example

Live Demo

Regex: https://regex101.com/r/yE8oB9/2
Python: http://codepad.org/X9D3pd7s

Sample List

08/20/2014
10:04:27
pm
complete
vendor
per
mfg/recommend
08/20/2014
10:04:27
pm
complete

List After Processing

complete
vendor
per
mfg/recommend
complete

Sample Python Script

import re

SourceList = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complete',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 
                 'complete']

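# Keep only the words that do NOT look like a date, a time, or am/pm.
# The raw string (r'...') stops Python from treating '\/' as a string escape.
OutputList = filter(
    lambda ThisWord: not re.match(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$', ThisWord),
    SourceList)

# In Python 3, filter() returns a lazy iterator; looping over it consumes it.
for ThisValue in OutputList:
    print(ThisValue)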

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture (2 times):
----------------------------------------------------------------------
      [0-9]{2}                 any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
      [:\/,]                   any character of: ':', '/', ','
----------------------------------------------------------------------
    ){2}                     end of grouping
----------------------------------------------------------------------
    [0-9]{2,4}               any character of: '0' to '9' (between 2
                             and 4 times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    am                       'am'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    pm                       'pm'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
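The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which lets whitespace and comments sit inside the regex. This is only a sketch of an alternative way to write the same expression; the name DATE_TIME_TOKEN is made up for illustration, and the escaped '\/' is written as a plain '/' because a slash needs no escaping in Python.

```python
import re

# Same expression as above, laid out with re.VERBOSE so each node is annotated.
DATE_TIME_TOKEN = re.compile(r"""
    ^                      # beginning of the string
    (?:
        (?:
            [0-9]{2}       # exactly two digits
            [:/,]          # one separator: colon, slash, or comma
        ){2}               # the digits-plus-separator pair, twice
        [0-9]{2,4}         # two to four digits
      | am                 # or the literal token am
      | pm                 # or the literal token pm
    )
    $                      # end of the string
""", re.VERBOSE)

tokens = ['08/20/2014', '10:04:27', 'pm', 'complete', 'vendor']

# Keep only the tokens the pattern does not match.
kept = [t for t in tokens if not DATE_TIME_TOKEN.match(t)]
print(kept)  # ['complete', 'vendor']
```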

Answer 2:

If you want to match the date and time strings in your list, you can try the regex below:

[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}

The Python code:

import re

list_of_words = [
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 'vendor',
 'per',
 'mfg/recommend',
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet'
]
new_list = [item for item in list_of_words if not re.search(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
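Building a new list, as above, also avoids the two problems in the loop from the question. First, the character class [09] matches only the characters 0 and 9, so a middle field such as 20 never matches. Second, even with that corrected, calling .remove() while iterating shifts the remaining items left, so the item that slides into the current slot is skipped. The sketch below illustrates both; the four-item list is a shortened, made-up sample.

```python
import re

# Shortened, illustrative sample with two dates back to back.
words = ['08/20/2014', '08/21/2014', 'pm', 'complet']

# 1) The pattern from the question: [09] means "0 or 9", not "0 through 9",
#    so neither date here is matched.
typo_hits = [w for w in words if re.search(r'[0-9]{2}/[09]{2}/[0-9]{4}', w)]

# 2) Corrected pattern, but still removing while iterating: after the first
#    date is removed, the second date moves to index 0 and is never visited.
buggy = list(words)
for word in buggy:
    if re.search(r'[0-9]{2}/[0-9]{2}/[0-9]{4}', word):
        buggy.remove(word)

# 3) Building a new list examines every item exactly once.
fixed = [w for w in words if not re.search(r'[0-9]{2}/[0-9]{2}/[0-9]{4}', w)]

print(typo_hits)  # []
print(buggy)      # ['08/21/2014', 'pm', 'complet']
print(fixed)      # ['pm', 'complet']
```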

Answer 3:

Try this:

import re

list_of_words = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complet',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 'complet']

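# Wrap filter() in list() so the result is a real list in Python 3
# (filter() alone returns a one-shot iterator). The raw string keeps
# '\/' from being read as a string escape.
list_of_words = list(filter(
    lambda x: not re.match(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', x),
    list_of_words))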

Source: https://stackoverflow.com/questions/37473219/how-to-remove-dates-from-a-list-in-python

Tags

python

regex

nltk

How to remove dates from a list in Python