Question
I have a corpus consisting of 3,500,000 text documents. I want to construct a tf-idf matrix of size (3,500,000 × 5,000), where the 5,000 columns correspond to 5,000 distinct features (words).
I am using scikit-learn in Python, specifically TfidfVectorizer. I have built a dictionary with 5,000 entries (one per feature) and pass it to the vocabulary parameter when initializing the TfidfVectorizer. But when I call fit_transform, it prints something about a memory map and then crashes with a core dump.
- Does TfidfVectorizer perform well with a fixed vocabulary and a large corpus?
- If not, what are the other options?
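For reference, a minimal sketch of the setup described above, assuming the documents are streamed from a file and using a made-up vocabulary (the original code and data are not shown in the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical fixed vocabulary of 5,000 features (term -> column index).
vocabulary = {"word%d" % i: i for i in range(5000)}

def iter_documents(path="corpus.txt"):
    # Hypothetical generator that streams one document per line, so the
    # 3.5M raw strings never have to sit in memory at the same time.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

vectorizer = TfidfVectorizer(vocabulary=vocabulary)

# fit_transform accepts any iterable of raw documents and returns a
# scipy.sparse matrix of shape (n_documents, 5000). The sparse result is
# compact, but peak memory can still blow up if all documents are loaded
# into a list beforehand.
X = vectorizer.fit_transform(iter_documents())
print(X.shape)
```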
Answer 1:
Another option is gensim; it is very memory-efficient and fast. See its tf-idf tutorial for how to apply it to your corpus.
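A rough sketch of the gensim route with the same fixed 5,000-word vocabulary; the streaming corpus class and file name below are assumptions, not taken from the tutorial:

```python
from gensim import corpora, models

class StreamedCorpus:
    """Yields one bag-of-words vector per document, so the full corpus
    never has to fit in memory."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

# Hypothetical fixed vocabulary of 5,000 words.
vocab_words = ["word%d" % i for i in range(5000)]
dictionary = corpora.Dictionary([vocab_words])

bow_corpus = StreamedCorpus("corpus.txt", dictionary)
tfidf = models.TfidfModel(bow_corpus)   # one pass to collect document frequencies
tfidf_corpus = tfidf[bow_corpus]        # weighting is applied lazily, per document

for doc in tfidf_corpus:
    # each doc is a sparse list of (term_id, tfidf_weight) pairs
    pass
```

Because both the corpus reader and the transformed corpus are iterated lazily, memory use stays roughly constant regardless of how many documents there are.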
Source: https://stackoverflow.com/questions/23015246/how-to-get-tf-idf-matrix-of-a-large-size-corpus-where-features-are-pre-specifie