What is the best way to add/remove stop words with spaCy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!
Answer 1:
You can edit them before processing your text like this (see this post):
    >>> import spacy
    >>> nlp = spacy.load("en")
    >>> nlp.vocab["the"].is_stop = False
    >>> nlp.vocab["definitelynotastopword"].is_stop = True
    >>> sentence = nlp("the word is definitelynotastopword")
    >>> sentence[0].is_stop
    False
    >>> sentence[3].is_stop
    True

Note: This seems to work for versions <= v1.8. For newer versions, see the other answers.
Answer 2:
For version 2.0 I used this:
    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS

    nlp = spacy.load("en")

    print(STOP_WORDS)  # <- set of spaCy's default stop words
    STOP_WORDS.add("your_additional_stop_word_here")

    # Flag every word in the set as a stop word in the vocabulary
    for word in STOP_WORDS:
        lexeme = nlp.vocab[word]
        lexeme.is_stop = True

STOP_WORDS is a set holding all the default stop words, and the loop flags each of them in the vocabulary.
You can add your own stop words to STOP_WORDS, or use your own list from the start.
Answer 3:
For 2.0 use the following:
    for word in nlp.Defaults.stop_words:
        lex = nlp.vocab[word]
        lex.is_stop = True
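The answers above could be combined into one small helper, sketched below. The function name update_stop_words and its add/remove parameters are made up for illustration and are not part of spaCy. It assumes only the two things the answers already rely on: that nlp.Defaults.stop_words is a set, and that nlp.vocab[word].is_stop can be assigned. Behaviour may differ between spaCy versions, so treat it as a starting point rather than a definitive recipe.

```python
def update_stop_words(nlp, add=(), remove=()):
    """Hypothetical helper: add and remove stop words on a loaded pipeline.

    `nlp` is expected to expose `Defaults.stop_words` (a set) and
    `vocab[word].is_stop`, as used in the answers above.
    """
    for word in add:
        # Keep the default set and the lexeme flag in sync
        nlp.Defaults.stop_words.add(word)
        nlp.vocab[word].is_stop = True
    for word in remove:
        # discard() does not raise if the word was never in the set
        nlp.Defaults.stop_words.discard(word)
        nlp.vocab[word].is_stop = False
    return nlp


# Possible usage (requires spaCy and a model, as in the answers above):
#   import spacy
#   nlp = spacy.load("en")
#   update_stop_words(nlp, add=["definitelynotastopword"], remove=["the"])
```

Since lexemes are looked up by their exact text, capitalized variants such as "The" may need to be passed in separately.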