TfidfVectorizer uses a lot of memory: vectorizing 470 MB of 100k documents takes over 6 GB, so at 21 million documents it will not fit in the 60 GB of RAM we have.
I would strongly recommend using the HashingVectorizer when fitting models on large datasets.
The HashingVectorizer is data-independent: only the parameters from `vectorizer.get_params()` matter. Hence (un)pickling a `HashingVectorizer` instance should be very fast.
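For illustration, here is a minimal sketch of what this looks like in practice (the `n_features` value and the toy documents are just placeholders):

```python
import pickle
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer keeps no vocabulary_ attribute, so its memory footprint
# does not grow with the corpus; n_features fixes the output dimensionality.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)

docs = ["first document", "second document"]  # stand-in for the real corpus
X = vectorizer.transform(docs)                # stateless: no fit() required

# Because the object holds only its parameters, pickling it is cheap and fast.
payload = pickle.dumps(vectorizer)
restored = pickle.loads(payload)
assert restored.get_params() == vectorizer.get_params()
```

If you still need idf weighting, one common pattern is to chain a `HashingVectorizer` with a `TfidfTransformer` in a `Pipeline`, which avoids storing a vocabulary while keeping the tf-idf reweighting.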
The vocabulary-based vectorizers are better suited for exploratory analysis on small datasets.