Understanding min_df and max_df in scikit CountVectorizer

Asked by 生来不讨喜 on 2020-12-04 06:41

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what do the min/max document frequency mean, exactly? Is it the frequency of a word within its particular text file, or the frequency of the word across the entire corpus (all five files)?

5 Answers
  •  挽巷 (OP)
     2020-12-04 07:08

    The defaults for min_df and max_df are 1 and 1.0, respectively. Together these defaults disable the frequency filtering entirely: no token is pruned.

    That being said, I believe the currently accepted answer by @Ffisegydd isn't quite correct.

    For example, run the following with the defaults to see that when min_df=1 and max_df=1.0:

    1) all tokens that appear in at least one document are used (i.e., all tokens!)

    2) all tokens that appear in all documents are used (we'll test with one candidate: 'everywhere').

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(min_df=1, max_df=1.0, lowercase=True)
    # a simple corpus of three documents; 'everywhere' appears in all of them
    corpus = ['one two three everywhere',
              'four five six everywhere',
              'seven eight nine everywhere']
    # fit_transform builds the vocabulary and returns the document-term matrix
    X = cv.fit_transform(corpus)
    print(cv.get_feature_names_out())
    print(X.toarray())
    print(cv.stop_words_)   # terms pruned by the min_df/max_df filtering
    

    We get:

    ['eight' 'everywhere' 'five' 'four' 'nine' 'one' 'seven' 'six' 'three' 'two']
    [[0 1 0 0 0 1 0 0 1 1]
     [0 1 1 1 0 0 0 1 0 0]
     [1 1 0 0 1 0 1 0 0 0]]
    set()
    

    All ten tokens are kept, and stop_words_ (the set of terms removed by the document-frequency filtering) is empty.

    Experimenting further with min_df and max_df will clarify the other configurations; a minimal sketch follows below.
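    For instance, here is a sketch of both directions of the filtering (reusing the same three-document corpus; the names cv_min and cv_max are just illustrative). Keep in mind that an integer value for these parameters is an absolute document count, while a float is a proportion of documents:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['one two three everywhere',
              'four five six everywhere',
              'seven eight nine everywhere']

    # min_df=2 (an absolute count): drop tokens that appear in fewer than
    # 2 documents. Only 'everywhere' (present in all 3 documents) survives.
    cv_min = CountVectorizer(min_df=2)
    cv_min.fit(corpus)
    print(cv_min.get_feature_names_out())   # ['everywhere']
    print(sorted(cv_min.stop_words_))       # the other nine tokens

    # max_df=0.9 (a proportion): drop tokens that appear in more than 90%
    # of the documents. 'everywhere' (in 100% of them) is pruned this time.
    cv_max = CountVectorizer(max_df=0.9)
    cv_max.fit(corpus)
    print(cv_max.get_feature_names_out())   # everything except 'everywhere'
    print(cv_max.stop_words_)               # {'everywhere'}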

    For fun and insight, I'd also recommend playing around with stop_words='english' and seeing that, peculiarly, all the words except 'seven' are removed, including 'everywhere'!
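    A quick sketch of that experiment (again with the same corpus; which tokens survive depends on scikit-learn's built-in English stop-word list):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['one two three everywhere',
              'four five six everywhere',
              'seven eight nine everywhere']

    cv = CountVectorizer(stop_words='english')
    cv.fit(corpus)
    print(cv.get_feature_names_out())            # ['seven']
    print('everywhere' in cv.get_stop_words())   # True: it's on the built-in list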
