I am using sklearn on Python to do some clustering. I\'ve trained 200,000 data, and code below works well.
corpus = open(\"token_from_xml.txt\")
vectorizer =
you can do the vectorization and tfidf transformation in one stage:
vec =TfidfVectorizer()
then fit and transform on the training data
tfidf = vec.fit_transform(training_data)
and use the tfidf model to transform
unseen_tfidf = vec.transform(unseen_data)
km = KMeans(30)
kmresult = km.fit(tfidf).predict(unseen_tfid)