Faster way to remove stop words in Python

Asked by 情歌与酒 · 2020-12-04 09:34 · 4 answers · 721 views

I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split()
                 if word not in stopwords.words('english')])


        
4 Answers
  •  无人及你
     2020-12-04 09:52

    First, you're rebuilding the stop-word list for every string. Build it once, and use a set so membership tests are O(1):

    forbidden_words = set(stopwords.words('english'))
    

    Next, drop the square brackets inside join and pass a generator expression instead:

    ' '.join([x for x in ['a', 'b', 'c']])
    

    becomes

    ' '.join(x for x in ['a', 'b', 'c'])
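    Putting these two points together, a minimal sketch might look like the following. The hardcoded set is a stand-in for `set(stopwords.words('english'))` so the example runs without downloading the NLTK corpus:

    ```python
    # Stand-in for set(stopwords.words('english')); swap in the real set
    # once the NLTK corpus is available.
    forbidden_words = {'the', 'a', 'an', 'in', 'of'}

    def remove_stopwords(text):
        # The set is built once at module level; the generator expression
        # feeds join without allocating an intermediate list.
        return ' '.join(word for word in text.split()
                        if word not in forbidden_words)

    print(remove_stopwords('hello bye the the hi'))  # hello bye hi
    ```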
    

    The next thing to tackle is making .split() yield values instead of returning a list. A regex would be a good replacement here; see this thread for why s.split() is actually fast.
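    One way to iterate over words lazily is re.finditer, which yields matches one at a time instead of materializing a full list the way str.split() does. A sketch, where the pattern \S+ (whitespace-delimited words) and the tiny stop-word set are assumptions for illustration:

    ```python
    import re

    # Stand-in stop-word set; in practice use set(stopwords.words('english')).
    forbidden_words = {'the', 'a', 'an'}

    # Assumes words are separated by whitespace, like str.split().
    word_pattern = re.compile(r'\S+')

    def remove_stopwords_lazy(text):
        # finditer yields Match objects lazily, so no intermediate
        # word list is ever built.
        return ' '.join(m.group() for m in word_pattern.finditer(text)
                        if m.group() not in forbidden_words)
    ```

    Whether this actually beats str.split() depends on your input sizes, so benchmark both on your data.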

    Lastly, for a job this size (removing stop words from 6M strings), do the work in parallel. That is a whole different topic.
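    A rough sketch of the parallel approach with multiprocessing.Pool, again with a hardcoded stand-in for the NLTK set. The stop-word set and worker function must live at module level so worker processes can find them:

    ```python
    from multiprocessing import Pool

    forbidden_words = {'the', 'a', 'an'}  # stand-in for the NLTK set

    def clean(text):
        return ' '.join(w for w in text.split() if w not in forbidden_words)

    if __name__ == '__main__':
        texts = ['hello the world', 'a quick the test'] * 3  # imagine ~6M strings
        with Pool() as pool:
            # chunksize batches strings per task to cut inter-process overhead
            cleaned = pool.map(clean, texts, chunksize=2)
        print(cleaned[0])  # hello world
    ```

    Tune chunksize so each task carries enough work to amortize the pickling cost between processes.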
