How can I reduce the memory usage of Scikit-Learn vectorizers?

孤街浪徒 2020-12-18 11:05

TfidfVectorizer uses a lot of memory: vectorizing 100k documents (470 MB) takes over 6 GB. If we go to 21 million documents, it will not fit in the 60 GB of RAM we have.

So w

2 Answers
  •  春和景丽
    2020-12-18 11:41

    I would strongly recommend using the HashingVectorizer when fitting models on large datasets.

    The HashingVectorizer is data-independent: only the parameters from vectorizer.get_params() matter. Hence (un)pickling a `HashingVectorizer` instance should be very fast.

    The vocabulary-based vectorizers are better suited for exploratory analysis on small datasets.
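
A minimal sketch of the suggestion above, assuming the corpus can be read in batches. The tiny `docs` list, the `batches` helper, the batch size, and `n_features=2**18` are illustrative assumptions, not values from the question. Because the original goal was TF-IDF features, the sketch also shows one possible way to apply IDF weighting afterwards with `TfidfTransformer`; that step is optional.

```python
import pickle

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical stand-in for a large corpus that would be streamed from disk.
docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "memory usage of vectorizers",
    "hashing avoids storing a vocabulary",
]

# Stateless vectorizer: no vocabulary is stored, so its own memory footprint
# does not grow with the number of documents.
vectorizer = HashingVectorizer(
    n_features=2**18,      # assumed size; fewer features means more hash collisions
    alternate_sign=False,  # keep counts non-negative for the IDF step below
    norm=None,             # leave raw counts; TfidfTransformer normalizes later
)


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# transform() needs no prior fit, so each batch is vectorized independently.
parts = [vectorizer.transform(batch) for batch in batches(docs, 2)]
X_counts = sp.vstack(parts, format="csr")

# Optional: apply IDF weighting on top of the hashed counts.
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# The pickle holds only the parameters, not any data-dependent state.
restored = pickle.loads(pickle.dumps(vectorizer))
```

The trade-off is that hashed features cannot be mapped back to the original tokens, and collisions become more likely as `n_features` shrinks.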
