I have five text files that I input to a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what do the min/max document frequencies mean exactly?
As per the CountVectorizer documentation here, a float in the range [0.0, 1.0] refers to the document frequency as a proportion: the fraction of documents that contain the term. An int refers to the absolute number of documents that contain the term.
Consider the example where you have 5 text files (or documents). If you set max_df = 0.6, that translates to 0.6 * 5 = 3 documents. If you set max_df = 2, that simply translates to 2 documents.
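A quick way to see this in practice is a sketch like the one below (assuming scikit-learn is installed; the five-document corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five hypothetical documents; "the" appears in all 5 of them.
docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
    "the dog ran",
    "the bird flew",
]

# max_df=0.6 -> threshold of 0.6 * 5 = 3 documents: terms found in
# strictly more than 3 documents are dropped from the vocabulary.
vectorizer = CountVectorizer(max_df=0.6)
vectorizer.fit(docs)

print(sorted(vectorizer.vocabulary_))
# "the" (document frequency 5 > 3) is gone; every other term survives.
```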
The source code example below is copied from GitHub here and shows how max_doc_count is constructed from max_df. The code for min_df is similar and can be found on the GH page.
max_doc_count = (max_df
if isinstance(max_df, numbers.Integral)
else max_df * n_doc)
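The same conversion can be reproduced in isolation. Note that doc_count_threshold below is a hypothetical helper name for this sketch, not part of scikit-learn's API:

```python
import numbers

def doc_count_threshold(df_param, n_doc):
    # Mirrors the snippet above: an integral value is taken as an absolute
    # document count, while a float is scaled by the corpus size.
    return (df_param
            if isinstance(df_param, numbers.Integral)
            else df_param * n_doc)

print(doc_count_threshold(0.6, 5))  # 3.0 -> a fraction of the 5 documents
print(doc_count_threshold(2, 5))    # 2   -> an absolute document count
```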
The defaults for min_df and max_df are 1 and 1.0, respectively. These defaults do no filtering at all: min_df = 1 keeps any term that appears in at least 1 document, and max_df = 1.0 keeps any term that appears in up to 100% of the documents. A term is ignored only if its document frequency is strictly lower than min_df or strictly higher than max_df.
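A small sanity check (assuming scikit-learn; the toy corpus is made up) shows that the defaults leave the vocabulary untouched, while raising min_df starts pruning rare terms:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple cherry", "apple date"]

# Defaults (min_df=1, max_df=1.0): nothing is filtered out, even though
# "apple" appears in 100% of the documents.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
print(default_vocab)  # ['apple', 'banana', 'cherry', 'date']

# min_df=2: terms found in fewer than 2 documents are now dropped.
pruned_vocab = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)
print(pruned_vocab)   # ['apple']
```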
max_df and min_df are both used internally to calculate max_doc_count and min_doc_count, the maximum and minimum number of documents a term must be found in. These are then passed to self._limit_features as the keyword arguments high and low, respectively. The docstring for self._limit_features is:
"""Remove too rare or too common features.
Prune features that are non zero in more samples than high or less
documents than low, modifying the vocabulary, and restricting it to
at most the limit most frequent.
This does not prune samples with zero features.
"""