I am using sklearn in Python to do some clustering. I've trained it on 200,000 documents, and the code below works well.
corpus = open("token_from_xml.txt")
vectorizer =
A simpler solution: just use the joblib library, as the documentation suggests:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import joblib  # older scikit-learn: from sklearn.externals import joblib (removed in 0.23)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
feature_name = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
tfidf = TfidfTransformer()
tfidf.fit(X)
# save your model to disk ('transformer' was undefined; the fitted object is 'tfidf')
joblib.dump(tfidf, 'tfidf.pkl')
# load your model
tfidf = joblib.load('tfidf.pkl')
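Note that the TF-IDF transformer alone cannot vectorize raw text, so in practice you need to persist the fitted CountVectorizer as well. A minimal end-to-end sketch (the filenames and the tiny sample corpus are illustrative, not from the original question):

```python
import joblib  # in older scikit-learn: from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["the cat sat", "the dog ran", "the cat ran"]  # toy corpus for illustration

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
tfidf = TfidfTransformer().fit(X)

# Persist BOTH fitted objects: the vocabulary and the idf weights.
joblib.dump(vectorizer, "vectorizer.pkl")
joblib.dump(tfidf, "tfidf.pkl")

# Later, e.g. in a fresh process:
vectorizer = joblib.load("vectorizer.pkl")
tfidf = joblib.load("tfidf.pkl")

new_docs = ["the cat and the dog"]
counts = vectorizer.transform(new_docs)  # reuse the trained vocabulary
weights = tfidf.transform(counts)        # apply the trained idf weights
```

`weights` is then a sparse matrix with one row per new document and one column per feature learned during training, ready to feed into your clustering step.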