I am using sklearn in Python to do some clustering. I've trained on 200,000 samples, and the code below works well.
corpus = open("token_from_xml.txt")
vectorizer =
I successfully saved the feature list by pickling vectorizer.vocabulary_, and I reuse it later with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).
Code below:
import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ffffd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_ (the token -> column index mapping)
with open("feature.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)
# Load it later and rebuild a vectorizer with the same fixed vocabulary
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as f:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))
That works. tfidf will have the same feature length as the training data.
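To double-check the feature-length claim, here is a small sketch that compares the column counts directly. It reuses the toy corpus from above, and as an assumption for illustration it round-trips the vocabulary through pickle in memory instead of writing feature.pkl. The names vocab, counts, n_train_features and n_new_features are mine, not part of sklearn.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ffffd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)

# In-memory pickle round trip; stands in for saving/loading feature.pkl.
vocab = pickle.loads(pickle.dumps(vectorizer.vocabulary_))

# With a fixed vocabulary the columns are already defined,
# so transform() can be called without fitting first.
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=vocab)
counts = loaded_vec.transform(np.array(["aaa ccc eee"]))

# Note: fit_transform here learns IDF weights from the new document only.
tfidf = TfidfTransformer().fit_transform(counts)

n_train_features = vec_train.shape[1]
n_new_features = tfidf.shape[1]
print(n_train_features, n_new_features)
print(counts.toarray())
```

Both counts should match, and the token "eee" gets no column because it is not in the saved vocabulary.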