Understanding min_df and max_df in scikit CountVectorizer

生来不讨喜 2020-12-04 06:41

I have five text files that I pass to a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what exactly do the minimum and maximum document frequency mean?

5 answers
  •  时光取名叫无心
    2020-12-04 06:59

    As per the CountVectorizer documentation:

    When min_df or max_df is a float in the range [0.0, 1.0], it is a document frequency expressed as a proportion: the fraction of documents that contain the term.

    When it is an int, it is an absolute count: the number of documents that contain the term.

    Consider the example where you have 5 text files (or documents). If you set max_df = 0.6, that translates to 0.6*5=3 documents, so terms that appear in more than 3 documents are ignored. If you set max_df = 2, the threshold is simply 2 documents.
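
The threshold arithmetic above can be sketched in plain Python. This is an illustrative sketch, not scikit-learn's own code: the five toy documents and the helper names `document_frequency` and `max_doc_count` are assumptions made for the example, and tokenization is a simple whitespace split. It follows the documented rule that a term is ignored when its document frequency is strictly higher than the max_df threshold.

```python
import numbers
from collections import Counter

# Five toy documents (hypothetical data standing in for the five text files).
docs = [
    "apple banana cherry",
    "apple banana",
    "apple banana date",
    "apple cherry",
    "apple elderberry",
]

def document_frequency(documents):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))  # set(): count each term once per document
    return df

def max_doc_count(max_df, n_doc):
    """Turn max_df into a document count: ints are used as-is, floats are scaled."""
    return max_df if isinstance(max_df, numbers.Integral) else max_df * n_doc

df = document_frequency(docs)
n_doc = len(docs)

# Keep only terms whose document frequency does not exceed the threshold.
kept_float = sorted(t for t, c in df.items() if c <= max_doc_count(0.6, n_doc))
kept_int = sorted(t for t, c in df.items() if c <= max_doc_count(2, n_doc))

print(kept_float)  # ['banana', 'cherry', 'date', 'elderberry']
print(kept_int)    # ['cherry', 'date', 'elderberry']
```

Here "apple" occurs in all 5 documents and is dropped in both cases, while "banana" occurs in exactly 3 documents, so it survives max_df = 0.6 but not max_df = 2.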

    The snippet below is taken from the scikit-learn source on GitHub and shows how max_doc_count is derived from max_df. The code for min_df is analogous and can be found in the same source file.

    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else max_df * n_doc)
    

    The defaults for min_df and max_df are 1 and 1.0, respectively. Because a term is ignored only when its document frequency is strictly below min_df or strictly above max_df, these defaults prune nothing: every term in the vocabulary occurs in at least 1 document, and no term can occur in more than 100% of the documents.

    max_df and min_df are used internally to compute max_doc_count and min_doc_count, the largest and smallest number of documents in which a term may appear and still be kept. These are then passed to self._limit_features as the high and low arguments, respectively. The docstring for self._limit_features is:

    """Remove too rare or too common features.
    
    Prune features that are non zero in more samples than high or less
    documents than low, modifying the vocabulary, and restricting it to
    at most the limit most frequent.
    
    This does not prune samples with zero features.
    """
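
The behaviour described in that docstring can be approximated with a small pure-Python function. This is a simplified sketch under stated assumptions, not the library's implementation: the function `limit_features`, its dictionary inputs, and the toy counts are hypothetical, whereas the real method works on a sparse count matrix.

```python
def limit_features(doc_freq, term_counts, high=None, low=None, limit=None):
    """Return the sorted terms that survive pruning (hypothetical helper).

    doc_freq:    term -> number of documents containing the term
    term_counts: term -> total occurrences across all documents
    """
    # Drop terms found in more than `high` or fewer than `low` documents.
    kept = [
        term for term, df in doc_freq.items()
        if (high is None or df <= high) and (low is None or df >= low)
    ]
    if limit is not None and len(kept) > limit:
        # Restrict to the `limit` terms with the largest total counts.
        kept = sorted(kept, key=lambda term: -term_counts[term])[:limit]
    return sorted(kept)

# Toy statistics for a corpus of 5 documents (assumed values).
doc_freq = {"apple": 5, "banana": 3, "cherry": 2, "date": 1, "elderberry": 1}
term_counts = {"apple": 7, "banana": 3, "cherry": 2, "date": 1, "elderberry": 1}

# Defaults min_df=1, max_df=1.0 correspond to low=1, high=1.0*5: nothing is pruned.
print(limit_features(doc_freq, term_counts, high=1.0 * 5, low=1))
# min_df=2, max_df=0.6 correspond to low=2, high=3.
print(limit_features(doc_freq, term_counts, high=3, low=2))  # ['banana', 'cherry']
```

The comparisons are inclusive (`<=` and `>=`), which is why a term sitting exactly on a threshold is kept.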
    
