Adding words to scikit-learn's CountVectorizer's stop list

…衆ロ難τιáo~ submitted on 2019-11-28 05:52:00

According to the source code for sklearn.feature_extraction.text, ENGLISH_STOP_WORDS (the full English stop word list, actually a frozenset imported from stop_words) is exposed through __all__. So if you want to use that list plus some extra items, you could do something like:

from sklearn.feature_extraction import text 

stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

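(where my_additional_stop_words is any iterable of strings) and pass the result as the stop_words argument. This argument to CountVectorizer.__init__ is checked by _check_stop_list, which passes the new frozenset straight through. Newer scikit-learn releases validate constructor parameters more strictly and may reject a frozenset here; wrapping the result in list(...) avoids that.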
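For a fuller picture, here is a minimal end-to-end sketch of the same idea. The extra stop words and the two sample documents are made up purely for illustration, and the list(...) conversion is an assumption on my part so that the snippet works on both older releases and newer ones that expect a list for stop_words:

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, chosen only for this illustration.
my_additional_stop_words = ["vectorizer", "token"]

# frozenset.union accepts any iterable and returns a new frozenset;
# ENGLISH_STOP_WORDS itself is left unchanged.
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Made-up sample documents.
docs = [
    "the vectorizer splits each document into a token stream",
    "stop words are dropped before counting",
]

# Converting to a list is an assumption for cross-version compatibility.
vectorizer = CountVectorizer(stop_words=list(stop_words))
vectorizer.fit(docs)

# The learned vocabulary should contain neither the built-in
# nor the additional stop words.
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
```

Because union returns a new frozenset, the built-in list is not modified, so other vectorizers that use stop_words='english' are unaffected.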
