I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in stopwords.words('english')])
First, you're building the stop word list anew for every string. Create it once, and use a set, since membership tests against a set are O(1):
forbidden_words = set(stopwords.words('english'))
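A minimal sketch of the full filter with the set built once (the hardcoded set below is a stand-in for `set(stopwords.words('english'))`, kept small so the example runs without the NLTK corpus downloaded):

```python
# Stand-in for set(stopwords.words('english')); build the real set
# once, outside any loop over your strings.
forbidden_words = {'the', 'a', 'an', 'is', 'in', 'to'}

def remove_stopwords(text):
    # Membership test against a set is O(1), vs O(n) for a list.
    return ' '.join(word for word in text.split() if word not in forbidden_words)

print(remove_stopwords('hello bye the the hi'))  # hello bye hi
```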
Next, get rid of the [] inside join and use a generator expression instead:
' '.join([x for x in ['a', 'b', 'c']])
replace with
' '.join(x for x in ['a', 'b', 'c'])
The next thing to deal with would be making .split() yield values instead of returning a list; a regex iterator could be a good replacement here, though see this thread for why s.split() is actually fast.
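A sketch of the lazy variant using re.finditer, which yields one match at a time instead of building an intermediate list the way str.split() does (the stop word set is again an illustrative stand-in for the NLTK one):

```python
import re

forbidden_words = {'the', 'a', 'an'}  # stand-in for set(stopwords.words('english'))
word_re = re.compile(r'\S+')          # same tokenization as whitespace split

def remove_stopwords_lazy(text):
    # re.finditer yields match objects lazily, so no full word list
    # is materialized before filtering.
    return ' '.join(m.group(0) for m in word_re.finditer(text)
                    if m.group(0) not in forbidden_words)

print(remove_stopwords_lazy('hello bye the the hi'))  # hello bye hi
```

Whether this beats str.split() in practice is worth measuring; str.split() is implemented in C and is very fast for short strings.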
Lastly, doing such a job in parallel (removing stop words from 6M strings) is a whole different topic.
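For completeness, one common way to parallelize an embarrassingly parallel job like this is multiprocessing.Pool.map; the sketch below assumes a picklable top-level function and uses a tiny stand-in set and corpus:

```python
from multiprocessing import Pool

forbidden_words = {'the', 'a', 'an'}  # stand-in for set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join(w for w in text.split() if w not in forbidden_words)

if __name__ == '__main__':
    texts = ['hello bye the the hi'] * 4  # stand-in for your 6M strings
    with Pool() as pool:
        # chunksize amortizes inter-process communication overhead;
        # tune it for your real corpus size.
        cleaned = pool.map(remove_stopwords, texts, chunksize=1000)
    print(cleaned[0])  # hello bye hi
```

Note that each worker process re-imports the module, so build the stop word set at module level rather than passing it per call.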