Question
I have a corpus consisting of 3,500,000 text documents. I want to construct a tf-idf matrix of size (3,500,000 × 5,000), where the 5,000 columns correspond to 5,000 distinct features (words).
I am using scikit-learn in Python, specifically TfidfVectorizer. I have built a dictionary with 5,000 entries (one per feature) and pass it to the vocabulary parameter when initializing the TfidfVectorizer. But when I call fit_transform, it prints something about a memory map and then crashes with a core dump.
- Does TfidfVectorizer perform well with a fixed vocabulary and a large corpus?
- If not, what are the other options?
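For reference, a minimal sketch of the setup described above, assuming the documents are streamed from a file and using a made-up vocabulary (the original code and data are not shown in the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical fixed vocabulary of 5,000 features (term -> column index).
vocabulary = {"word%d" % i: i for i in range(5000)}

def iter_documents(path="corpus.txt"):
    # Hypothetical generator that streams one document per line, so the
    # 3.5M raw strings never have to sit in memory at the same time.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

vectorizer = TfidfVectorizer(vocabulary=vocabulary)

# fit_transform accepts any iterable of raw documents and returns a
# scipy.sparse matrix of shape (n_documents, 5000). The sparse result is
# compact, but peak memory can still blow up if all documents are loaded
# into a list beforehand.
X = vectorizer.fit_transform(iter_documents())
print(X.shape)
```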
Answer 1:
Another option is gensim; it is very memory-efficient and fast. See its tf-idf tutorial for how to apply it to your corpus.
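A rough sketch of the gensim route with the same fixed 5,000-word vocabulary; the streaming corpus class and file name below are assumptions, not taken from the tutorial:

```python
from gensim import corpora, models

class StreamedCorpus:
    """Yields one bag-of-words vector per document, so the full corpus
    never has to fit in memory."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

# Hypothetical fixed vocabulary of 5,000 words.
vocab_words = ["word%d" % i for i in range(5000)]
dictionary = corpora.Dictionary([vocab_words])

bow_corpus = StreamedCorpus("corpus.txt", dictionary)
tfidf = models.TfidfModel(bow_corpus)   # one pass to collect document frequencies
tfidf_corpus = tfidf[bow_corpus]        # weighting is applied lazily, per document

for doc in tfidf_corpus:
    # each doc is a sparse list of (term_id, tfidf_weight) pairs
    pass
```

Because both the corpus reader and the transformed corpus are iterated lazily, memory use stays roughly constant regardless of how many documents there are.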
Source: https://stackoverflow.com/questions/23015246/how-to-get-tf-idf-matrix-of-a-large-size-corpus-where-features-are-pre-specifie