TfidfVectorizer uses a lot of memory: vectorizing 470 MB of 100k documents takes over 6 GB, so at 21 million documents it will not fit in the 60 GB of RAM we have.
I would strongly recommend using the HashingVectorizer when fitting models on large datasets.
The HashingVectorizer is data-independent: only the parameters from `vectorizer.get_params()` matter. Hence (un)pickling a `HashingVectorizer` instance should be very fast.
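For illustration, here is a minimal sketch of what this looks like in practice (the `n_features` value and the toy documents are just placeholders):

```python
import pickle
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer keeps no vocabulary_ attribute, so its memory footprint
# does not grow with the corpus; n_features fixes the output dimensionality.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)

docs = ["first document", "second document"]  # stand-in for the real corpus
X = vectorizer.transform(docs)                # stateless: no fit() required

# Because the object holds only its parameters, pickling it is cheap and fast.
payload = pickle.dumps(vectorizer)
restored = pickle.loads(payload)
assert restored.get_params() == vectorizer.get_params()
```

If you still need idf weighting, one common pattern is to chain a `HashingVectorizer` with a `TfidfTransformer` in a `Pipeline`, which avoids storing a vocabulary while keeping the tf-idf reweighting.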
The vocabulary-based vectorizers are better suited for exploratory analysis on small datasets.