How to remove list of words from a list of strings

情深已故 2020-12-13 11:19

Sorry if the question is a bit confusing. This is similar to this question

I think the above question is close to what I want, but it is in Clojure.

There is a

4 Answers
  • 2020-12-13 11:57
    >>> import re
    >>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
    >>> phrases = ['of New York', 'of the New York']
    >>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
    >>> [noise_re.sub('',p) for p in phrases]
    ['New York', 'New York']
    
  • 2020-12-13 12:02

    Here is my stab at it. This uses regular expressions.

    import re
    pattern = re.compile(r"(of|the|in|for|at)\W", re.I)
    phrases = ['of New York', 'of the New York']
    list(map(lambda phrase: pattern.sub("", phrase), phrases)) # ['New York', 'New York']
    

    Sans lambda:

    [pattern.sub("", phrase) for phrase in phrases]
    

    Update

    Fix for the bug pointed out by gnibbler (thanks!):

    pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
    phrases = ['of New York', 'of the New York', 'Spain has rain']
    [pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']
    

    @prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify, run both versions of the regular expression against the phrase "Spain has rain".
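
    The comparison suggested above can be sketched as follows. This is only an illustration: the variable names `unanchored`, `anchored`, `without_boundary` and `with_boundary` are made up for the example, and the two patterns are the ones from this answer written as raw strings.

```python
import re

phrase = "Spain has rain"

# First version: no word boundary, so the "in" inside "Spain" can match.
unanchored = re.compile(r"(of|the|in|for|at)\W", re.I)
# Updated version: \b requires the noise word to start at a word boundary.
anchored = re.compile(r"\b(of|the|in|for|at)\W", re.I)

without_boundary = unanchored.sub("", phrase)
with_boundary = anchored.sub("", phrase)

print(without_boundary)  # Spahas rain
print(with_boundary)     # Spain has rain
```

    Note that both patterns require a non-word character after the noise word, so a noise word at the very end of a string (like the "in" of "rain" here) is never matched by either version.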

  • 2020-12-13 12:06

    Without a regexp, you could do it like this:

    places = ['of New York', 'of the New York']
    
    noise_words_set = {'of', 'the', 'at', 'for', 'in'}
    stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
             for place in places
             ]
    print(stuff)
    
  • 2020-12-13 12:15

    Since you would like to know what you are doing wrong, this line:

    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
    

    takes place and loops over the noise words. First it checks "of": your place (e.g. "of the New York") is tested to see whether it starts with "of". It does, so it is transformed (the calls to replace and strip) and the outcome is added to the result list. The crucial point is that this result is never examined again. Each word in the comprehension is tested against the original place, and every match adds a new entry to the result list. The next word is "the", and your place ("of the New York") doesn't start with "the", so no new result is added.
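
    To make that concrete, here is a small sketch of the same comprehension run on a single place. The values of `noise_words_list` and `place` are assumed from the examples elsewhere in this thread, since the question's own data is not shown here.

```python
# Assumed inputs, taken from the other answers in this thread.
noise_words_list = ['of', 'the', 'in', 'for', 'at']
place = 'of the New York'

# Every word is tested against the ORIGINAL place, never the shortened one.
stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

print(stuff)  # ['the New York']
```

    Only "of" passes the startswith test, so the list has a single entry and the "the" is still there.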

    I assume the result you got eventually is the concatenation of your place variables. A procedural version that is simpler to read and understand would be (untested):

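    noise_words_list = ['of', 'the', 'in', 'for', 'at']
    places = ['of New York', 'of the New York']

    results = []
    for place in places:
        for word in noise_words_list:
            # Strip a leading noise word, then test the shortened string
            # against the remaining words.
            if place.startswith(word):
                place = place.replace(word, "").strip()
        results.append(place)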
    

    Keep in mind that replace() will remove the word anywhere in the string, even if it occurs as a simple substring. You can avoid this by using regexes with a pattern something like ^the\b.
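
    One possible way to apply the anchored-regex idea is sketched below. It is an assumption-laden illustration, not a tested drop-in: the names `alternatives` and `leading_noise` are invented for the example, and the pattern generalises ^the\b to any run of leading noise words.

```python
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']
places = ['of New York', 'of the New York', 'Theodore Roosevelt Park']

# Build "of|the|in|for|at", escaping each word in case it contains regex syntax.
alternatives = '|'.join(map(re.escape, noise_words_list))

# ^ anchors at the start of the string; \b ensures a whole word matched;
# the outer group repeats so several leading noise words go in one pass.
leading_noise = re.compile(r'^(?:(?:%s)\b\s*)+' % alternatives, re.I)

results = [leading_noise.sub('', place) for place in places]
print(results)  # ['New York', 'New York', 'Theodore Roosevelt Park']
```

    Because of the \b, "Theodore" is left alone even though it starts with "The", and noise words later in the string are not touched.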
