Adding words to scikit-learn's CountVectorizer's stop list

离开以前 2020-12-08 04:23

Scikit-learn's CountVectorizer class lets you pass the string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me

1 answer

攒了一身酷 (original poster)

2020-12-08 05:04
According to the source code for sklearn.feature_extraction.text, ENGLISH_STOP_WORDS (the full list, actually a frozenset, from stop_words) is exposed through __all__. So if you want to use that list plus some more items, you could do something like:
```
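from sklearn.feature_extraction import text

# Placeholder extra words for illustration; substitute your own.
my_additional_stop_words = ["foo", "bar"]
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)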
```
(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
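To see the combined list in action, here is a minimal end-to-end sketch. The extra words and the two sample documents are made up for illustration. The union is wrapped in list(...) on the assumption that some newer scikit-learn releases validate stop_words more strictly and may reject a frozenset; a plain list should be accepted by older and newer versions alike.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, chosen only for this illustration.
my_additional_stop_words = ["scikit", "learn"]

# Built-in English stop words plus the extras. Converting to a list is a
# precaution (assumption: newer releases may not accept a frozenset here).
stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_words))

# Made-up sample documents.
docs = [
    "I want to learn the scikit API",
    "the API is simple",
]

vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(docs)

# The learned vocabulary should contain neither the built-in stop words
# nor the custom ones.
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
```

Tokens are lowercased before the stop list is applied (the default lowercase=True), so the additional stop words should be given in lowercase as well.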