I am trying to remove stopwords from a string of text:
from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])
I am processing 6 million such strings, so speed is important. Profiling my code, the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster methods.
Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.
Thank you.
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")

def testFuncOld():
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

def testFuncNew():
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in cachedStopWords])

if __name__ == "__main__":
    for i in xrange(10000):
        testFuncOld()
        testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
ncalls   cumtime   filename:lineno(function)
10000    7.723     words.py:7(testFuncOld)
10000    0.140     words.py:11(testFuncNew)
So, caching the stopwords instance gives a ~55x speedup.
Use a regexp to remove all words which match:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)
This will probably be way faster than looping yourself, especially for large input strings.
If the last word in the text gets deleted by this, you may be left with trailing whitespace; I suggest handling that separately.
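For instance, a minimal sketch of how the compiled pattern could be reused across many strings, stripping the leftover whitespace afterwards (the helper name remove_stopwords is illustrative, not part of NLTK or re):

import re
from nltk.corpus import stopwords

# Build the alternation of stop words once, up front.
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')

def remove_stopwords(text):
    # Substitute matched stop words with nothing, then strip any trailing space.
    return pattern.sub('', text).strip()

print(remove_stopwords('hello bye the the hi'))  # hello bye hi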
First, you're creating the stop-word list for each string. Create it once; a set would indeed be a great fit here.
forbidden_words = set(stopwords.words('english'))
Later, get rid of the [] inside join and use a generator instead. Replace
' '.join([x for x in ['a', 'b', 'c']])
with
' '.join(x for x in ['a', 'b', 'c'])
The next thing to deal with would be making .split() yield values instead of returning an array. I believe regex would be a good replacement here; see this thread for why s.split() is actually fast.
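Putting the first two points together, a minimal sketch (the helper name clean_text is just illustrative) might look like this:

from nltk.corpus import stopwords

# Build the set once; membership tests against a set are O(1).
forbidden_words = set(stopwords.words('english'))

def clean_text(text):
    # The generator expression avoids building an intermediate list before the join.
    return ' '.join(word for word in text.split() if word not in forbidden_words)

clean_text('hello bye the the hi')  # 'hello bye hi'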
Lastly, do such a job in parallel (removing stop words in 6m strings). That is a whole different topic.
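As a rough sketch of what that parallel step could look like with the standard library's multiprocessing module (the worker count and chunk size below are arbitrary assumptions):

from multiprocessing import Pool
from nltk.corpus import stopwords

forbidden_words = set(stopwords.words('english'))

def clean_text(text):
    return ' '.join(word for word in text.split() if word not in forbidden_words)

if __name__ == '__main__':
    texts = ['hello bye the the hi'] * 1000  # stand-in for the real 6M strings
    with Pool(processes=4) as pool:
        cleaned = pool.map(clean_text, texts, chunksize=100)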
Sorry for the late reply; this may prove useful for new users.
- Create a dictionary of stopwords using the collections library.
- Use that dictionary for very fast lookups (time = O(1)) rather than searching the list (time = O(number of stopwords)).
from collections import Counter
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
Source: https://stackoverflow.com/questions/19560498/faster-way-to-remove-stop-words-in-python