Add/remove custom stop words with spacy

悲哀的现实 2020-12-07 13:04

What is the best way to add/remove stop words with spaCy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I was looking at the docum

6 Answers
  • 2020-12-07 13:07

    For version 2.0 I used this:

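    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS
    
    nlp = spacy.load("en")  # "en" shortcut as used with 2.0; substitute your installed model
    
    print(STOP_WORDS)  # set of spaCy's default stop words
    
    STOP_WORDS.add("your_additional_stop_word_here")
    
    # Flag every stop word in the vocabulary so token.is_stop reflects the change
    for word in STOP_WORDS:
        lexeme = nlp.vocab[word]
        lexeme.is_stop = True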
    

    This loads all stop words into a set.

    You can add your own stop words to STOP_WORDS, or start from your own list in the first place.

  • 2020-12-07 13:16

    You can edit them before processing your text like this (see this post):

    >>> import spacy
    >>> nlp = spacy.load("en")
    >>> nlp.vocab["the"].is_stop = False
    >>> nlp.vocab["definitelynotastopword"].is_stop = True
    >>> sentence = nlp("the word is definitelynotastopword")
    >>> sentence[0].is_stop
    False
    >>> sentence[3].is_stop
    True
    

    Note: This seems to work for versions up to and including 1.8. For newer versions, see the other answers.
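
The same per-word toggle can be sketched without a downloaded model. This is a minimal sketch, assuming a spaCy release that provides `spacy.blank` (the blank English pipeline only tokenizes, which is enough to inspect `is_stop`); it is not the original answer's setup, and `definitelynotastopword` is the placeholder word from the session above.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")

# Flip the stop-word flag directly on the vocabulary entries.
nlp.vocab["the"].is_stop = False
nlp.vocab["definitelynotastopword"].is_stop = True

doc = nlp("the word is definitelynotastopword")

# Collect (text, is_stop) pairs to inspect the result.
flags = [(token.text, token.is_stop) for token in doc]
print(flags)
```

Setting the flag on `nlp.vocab[...]` changes one exact spelling; a capitalized variant such as "The" is a separate vocabulary entry and would need its own assignment.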

  • 2020-12-07 13:18

    This collects the stop words too :)

    import spacy.lang.en.stop_words  # makes the submodule available as spacy.lang.en.stop_words
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

  • 2020-12-07 13:23

    Using Spacy 2.0.11, you can update its stopwords set using one of the following:

    To add a single stopword:

    import spacy    
    nlp = spacy.load("en")
    nlp.Defaults.stop_words.add("my_new_stopword")
    

    To add several stopwords at once:

    import spacy    
    nlp = spacy.load("en")
    nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}
    

    To remove a single stopword:

    import spacy    
    nlp = spacy.load("en")
    nlp.Defaults.stop_words.remove("whatever")
    

    To remove several stopwords at once:

    import spacy    
    nlp = spacy.load("en")
    nlp.Defaults.stop_words -= {"whatever", "whenever"}
    

    Note: To see the current set of stopwords, use:

    print(nlp.Defaults.stop_words)
    

    Update: As noted in the comments, this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
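
One detail of the removal calls in this answer is worth spelling out: they behave differently when the word is not in the set. The sketch below uses a plain Python set as a stand-in for `nlp.Defaults.stop_words` (an assumption made only so the example runs without spaCy); `missing_word` is a made-up entry.

```python
# A plain set stands in for nlp.Defaults.stop_words (illustration only).
stop_words = {"whatever", "whenever", "the"}

# Set difference silently ignores words that are not present.
stop_words -= {"whatever", "missing_word"}

# discard() is also silent when the word is absent.
stop_words.discard("missing_word")

# remove() raises KeyError when the word is absent.
try:
    stop_words.remove("missing_word")
    removed = True
except KeyError:
    removed = False

print(sorted(stop_words), removed)
```

If you are not sure a word is in the defaults, `discard` or `-=` avoids the exception.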

  • 2020-12-07 13:26

    For 2.0 use the following:

    import spacy
    nlp = spacy.load("en")
    # Re-apply the stop-word flag to every word in the defaults
    for word in nlp.Defaults.stop_words:
        lex = nlp.vocab[word]
        lex.is_stop = True
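
The loop above and the `nlp.Defaults.stop_words` edits from the earlier answer can be combined so the defaults and the vocabulary flags stay in step. This is a hedged sketch, not part of spaCy: `set_stop_words` is a hypothetical helper, `btw` is a made-up stop word, and a blank English pipeline is assumed so that no model needs to be installed.

```python
import spacy


def set_stop_words(nlp, add=(), remove=()):
    """Hypothetical helper: update the language defaults and the vocab flags together."""
    for word in add:
        nlp.Defaults.stop_words.add(word)
        nlp.vocab[word].is_stop = True
    for word in remove:
        # discard() does not raise if the word was never a stop word
        nlp.Defaults.stop_words.discard(word)
        nlp.vocab[word].is_stop = False


# Assumption: blank pipeline (tokenizer only) instead of a trained model.
nlp = spacy.blank("en")
set_stop_words(nlp, add=["btw"], remove=["whatever"])

doc = nlp("whatever btw")
flags = {token.text: token.is_stop for token in doc}
print(flags)
```

Note that `nlp.Defaults.stop_words` is shared by every English pipeline in the process, so the change is not limited to this one `nlp` object.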
    
  • 2020-12-07 13:26

    In the latest version, the following removes a word from the list:

    import spacy.lang.en.stop_words
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
    spacy_stopwords.remove('not')
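
Because `STOP_WORDS` is a module-level set, removing a word from it changes the list for everything in the process that uses it. An alternative, offered here as an assumption rather than as this answer's method, is to work on a copy and filter with it yourself; `drop_stop_words` is a hypothetical helper and the word list is made up.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Copy the defaults so the shared module-level set is left untouched.
custom_stop_words = set(STOP_WORDS) - {"not"}


def drop_stop_words(words, stop_words):
    """Hypothetical helper: keep only the words that are not in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


kept = drop_stop_words(["this", "is", "not", "bad"], custom_stop_words)
print(kept)
```

With this approach `token.is_stop` is unchanged; only your own filtering uses the custom set.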
    