Faster way to remove stop words in Python

情歌与酒 2020-12-04 09:34

I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in


        
4 Answers
  •  无人及你
    2020-12-04 09:49

    Use a regexp to remove every word that matches a stop word:

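    import re
    from nltk.corpus import stopwords

    # One alternation of all stop words, anchored at word boundaries; \s* also eats the following whitespace.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords.words('english'))) + r')\b\s*')
    text = pattern.sub('', text)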
    

    This will probably be way faster than looping yourself, especially for large input strings.

    If the last word in the text gets deleted by this, you may be left with trailing whitespace. I suggest handling that separately, for example with text.rstrip().
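
The regex approach and the trailing-whitespace cleanup can be sketched together as below. This is only a minimal illustration: `STOP_WORDS` is a small hypothetical stand-in for `stopwords.words('english')` so that the snippet runs without the NLTK corpus, and `remove_stop_words` is a helper name made up for this example.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), so the sketch runs
# without downloading the NLTK corpus.
STOP_WORDS = ["the", "a", "an", "is", "bye"]

# One alternation of all stop words, anchored at word boundaries; the trailing
# \s* also consumes the whitespace that follows a removed word.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b\s*")


def remove_stop_words(text):
    # Drop the stop words, then strip the whitespace left behind when the
    # last word of the text was removed.
    return PATTERN.sub("", text).rstrip()


result = remove_stop_words("hello bye the the hi")
last_removed = remove_stop_words("hello hi the")
print(result)
print(last_removed)
```

Compiling the pattern once and reusing it is what should make this pay off on large inputs. Note that the match is case-sensitive as written, so "The" would survive unless `re.IGNORECASE` is passed to `re.compile`.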
