Understanding min_df and max_df in scikit CountVectorizer

生来不讨喜 2020-12-04 06:41

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly

5 answers

温柔的废话 (original poster)

2020-12-04 07:09

The goal of min_df is to ignore words that have too few occurrences to be considered meaningful. For example, your text may contain names of people that appear in only one or two documents. In some applications, this may qualify as noise and can be eliminated from further analysis. Similarly, you can ignore words that are too common with max_df.
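As a rough illustration of the idea above, here is a minimal sketch with five made-up documents (the `docs` list is invented for this example, not the asker's files). It uses absolute thresholds: min_df=2 drops words found in only one document, and max_df=4 drops words found in more than four.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five toy documents, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
    "the alice fed the cat",
]

# min_df=2: ignore words that appear in fewer than 2 documents.
# max_df=4: ignore words that appear in more than 4 documents.
vectorizer = CountVectorizer(min_df=2, max_df=4)
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each surviving word to its column index.
kept = sorted(vectorizer.vocabulary_)
print(kept)
print(X.shape)
```

Here "the" occurs in all five documents and is removed by max_df, while one-off words such as "alice" or "mat" are removed by min_df.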

Instead of using a minimum/maximum term frequency (total occurrences of a word) to eliminate words, min_df and max_df look at how many documents contain a term, better known as document frequency. The threshold values can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.25: as min_df it means ignore words that appear in fewer than 25% of the documents; as max_df it means ignore words that appear in more than 25% of the documents).
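To make the difference between term frequency and document frequency concrete, here is a small pure-Python sketch. It is not scikit-learn's code: the helper names (`document_frequency`, `to_doc_count`, `filter_terms`) and the toy `docs` are assumptions made for this example, and tokenization is a plain whitespace split. It only mimics the rule described above, where an integer threshold is a document count and a float threshold is a proportion of the corpus.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() ensures a word counts once per document, however often it repeats.
        df.update(set(doc.lower().split()))
    return df

def to_doc_count(threshold, n_docs):
    """Integer thresholds are absolute counts; floats are proportions of n_docs."""
    return threshold if isinstance(threshold, int) else threshold * n_docs

def filter_terms(docs, min_df=1, max_df=1.0):
    """Keep words whose document frequency lies within [min_df, max_df]."""
    df = document_frequency(docs)
    low = to_doc_count(min_df, len(docs))
    high = to_doc_count(max_df, len(docs))
    return sorted(word for word, count in df.items() if low <= count <= high)

# Toy corpus, invented for illustration.
docs = [
    "spam spam spam eggs",
    "eggs and ham",
    "ham and toast",
    "eggs and toast",
]

df = document_frequency(docs)
# "spam" occurs three times in total but in only one document.
print(df["spam"])

at_least_two_docs = filter_terms(docs, min_df=2)    # absolute count
at_most_half = filter_terms(docs, max_df=0.5)       # proportion: 0.5 * 4 = 2 documents
print(at_least_two_docs)
print(at_most_half)
```

Note that under this rule the integer 1 and the float 1.0 mean different things: max_df=1 keeps only words found in a single document, while max_df=1.0 allows words found in 100% of documents.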

See some usage examples here.
