Keep TFIDF result for predicting new content using Scikit for Python

有刺的猬 2020-12-07 21:22

I am using sklearn in Python to do some clustering. I've trained on 200,000 samples, and the code below works well.

corpus = open("token_from_xml.txt")
vectorizer =         


        
5 answers
  •  遥遥无期
    2020-12-07 21:46

    I successfully saved the feature list by pickling vectorizer.vocabulary_, and reused it later by passing it to the constructor: CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_)

    Code below:

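    import pickle
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    corpus = np.array(["aaa bbb ccc", "aaa bbb ffffd"])
    vectorizer = CountVectorizer(decode_error="replace")
    vec_train = vectorizer.fit_transform(corpus)
    # Save vectorizer.vocabulary_ (the term -> column index mapping)
    with open("feature.pkl", "wb") as f:
        pickle.dump(vectorizer.vocabulary_, f)

    # Load it later; the fixed vocabulary pins the feature columns
    transformer = TfidfTransformer()
    with open("feature.pkl", "rb") as f:
        loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
    # Note: the transformer is refit here, so the IDF weights come from the new text
    tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))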
    

    That works. tfidf will have the same number of features as the training data.
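
The answer above persists only the vocabulary and refits TfidfTransformer on the new text, so the feature columns match but the IDF weights are recomputed from the new documents. If the goal is to also keep the IDF weights learned during training, one possible variant (not the answerer's method) is to pickle a fitted TfidfVectorizer as a whole and call transform() after loading. This is a minimal sketch under that assumption; the toy corpus and the variable names are illustrative, and the in-memory `blob` stands in for a file on disk.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real training data (assumption).
train_docs = ["aaa bbb ccc", "aaa bbb ffffd"]

# Fit once: this learns both the vocabulary and the IDF weights.
tfidf_vectorizer = TfidfVectorizer(decode_error="replace")
train_tfidf = tfidf_vectorizer.fit_transform(train_docs)

# Serialize the whole fitted object (write these bytes to a file in practice).
blob = pickle.dumps(tfidf_vectorizer)

# Later: restore it and call transform(), not fit_transform(),
# so neither the vocabulary nor the IDF weights are relearned.
restored = pickle.loads(blob)
new_tfidf = restored.transform(["aaa ccc eee"])

# Terms unseen during training ("eee") are ignored.
same_width = new_tfidf.shape[1] == train_tfidf.shape[1]
same_idf = np.allclose(restored.idf_, tfidf_vectorizer.idf_)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so this trades the portability of a plain vocabulary dict for keeping the full fitted state.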
