How can I reduce the memory usage of scikit-learn vectorizers?

孤街浪徒 2020-12-18 11:05

TfidfVectorizer uses a lot of memory: vectorizing 100k documents (470 MB) takes over 6 GB. If we scale to 21 million documents, it will not fit in the 60 GB of RAM we have.

So w

2 Answers
  •  余生分开走
    2020-12-18 11:45

    One way to overcome HashingVectorizer's inability to account for IDF is to index your data into Elasticsearch or Lucene and retrieve the term vectors from there. From those term vectors you can calculate TF-IDF yourself.
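
As a rough illustration of that last step, here is a minimal sketch of computing TF-IDF weights for one document from its term vector. The input shape (a dict mapping each term to its `term_freq` and `doc_freq`) is an assumption, loosely modeled on what a term-vector lookup with term statistics might return; the function name `tfidf_from_term_vector` is hypothetical. The smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 normalization are one common choice, not the only one.

```python
import math


def tfidf_from_term_vector(terms, n_docs):
    """Compute L2-normalized TF-IDF weights for a single document.

    terms:  {term: {"term_freq": int, "doc_freq": int}}  (assumed shape)
    n_docs: total number of documents in the index
    """
    weights = {}
    for term, stats in terms.items():
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        idf = math.log((1 + n_docs) / (1 + stats["doc_freq"])) + 1.0
        weights[term] = stats["term_freq"] * idf

    # L2-normalize so documents of different lengths are comparable.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return weights
    return {term: w / norm for term, w in weights.items()}


# Hypothetical term statistics for one document in a 1000-document index.
example_terms = {
    "memory": {"term_freq": 3, "doc_freq": 10},
    "the": {"term_freq": 5, "doc_freq": 1000},
}
vector = tfidf_from_term_vector(example_terms, n_docs=1000)
```

Because only one document's term statistics are held in memory at a time, the vocabulary and document frequencies stay in the index rather than in the Python process.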
