Understanding min_df and max_df in scikit CountVectorizer

Asked by 生来不讨喜 on 2020-12-04 06:41

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what do the min/max document frequency mean, exactly? Is it the frequency of a word within its particular text file, or the frequency of the word across the entire corpus (all five files)?

5 Answers
  •  挽巷 (OP)
     2020-12-04 07:08

    The defaults for min_df and max_df are 1 and 1.0, respectively. Together these defaults disable the frequency filtering entirely: no token is pruned.

    That being said, I believe the currently accepted answer by @Ffisegydd isn't quite correct.

    For example, run the following with the defaults to see that when min_df=1 and max_df=1.0:

    1) all tokens that appear in at least one document are used (i.e., all tokens!)

    2) all tokens that appear in all documents are used (we'll test with one candidate: 'everywhere').

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(min_df=1, max_df=1.0, lowercase=True)
    # a simple corpus of three documents; 'everywhere' appears in all of them
    corpus = ['one two three everywhere',
              'four five six everywhere',
              'seven eight nine everywhere']
    # fit_transform builds the vocabulary and returns the document-term matrix
    X = cv.fit_transform(corpus)
    print(cv.get_feature_names_out())
    print(X.toarray())
    print(cv.stop_words_)   # terms pruned by the min_df/max_df filtering
    

    We get:

    ['eight' 'everywhere' 'five' 'four' 'nine' 'one' 'seven' 'six' 'three' 'two']
    [[0 1 0 0 0 1 0 0 1 1]
     [0 1 1 1 0 0 0 1 0 0]
     [1 1 0 0 1 0 1 0 0 0]]
    set()
    

    All ten tokens are kept, and stop_words_ (the set of terms removed by the document-frequency filtering) is empty.

    Experimenting further with min_df and max_df will clarify the other configurations; a minimal sketch follows below.
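    For instance, here is a sketch of both directions of the filtering (reusing the same three-document corpus; the names cv_min and cv_max are just illustrative). Keep in mind that an integer value for these parameters is an absolute document count, while a float is a proportion of documents:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['one two three everywhere',
              'four five six everywhere',
              'seven eight nine everywhere']

    # min_df=2 (an absolute count): drop tokens that appear in fewer than
    # 2 documents. Only 'everywhere' (present in all 3 documents) survives.
    cv_min = CountVectorizer(min_df=2)
    cv_min.fit(corpus)
    print(cv_min.get_feature_names_out())   # ['everywhere']
    print(sorted(cv_min.stop_words_))       # the other nine tokens

    # max_df=0.9 (a proportion): drop tokens that appear in more than 90%
    # of the documents. 'everywhere' (in 100% of them) is pruned this time.
    cv_max = CountVectorizer(max_df=0.9)
    cv_max.fit(corpus)
    print(cv_max.get_feature_names_out())   # everything except 'everywhere'
    print(cv_max.stop_words_)               # {'everywhere'}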

    For fun and insight, I'd also recommend playing around with stop_words='english' and seeing that, peculiarly, all the words except 'seven' are removed, including 'everywhere'!
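    A quick sketch of that experiment (again with the same corpus; which tokens survive depends on scikit-learn's built-in English stop-word list):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['one two three everywhere',
              'four five six everywhere',
              'seven eight nine everywhere']

    cv = CountVectorizer(stop_words='english')
    cv.fit(corpus)
    print(cv.get_feature_names_out())            # ['seven']
    print('everywhere' in cv.get_stop_words())   # True: it's on the built-in list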
