Bringing a classifier to production

Posted by 冷眼眸甩不掉的悲伤 on 2019-12-07 05:00:31

Question


I've saved my classifier pipeline using joblib:

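import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train, y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)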

Now I'm trying to use it in a production environment:

def classify(title):

  #load classifier and predict
  classifier = joblib.load('class.pkl')

  #vectorize/transform the new title then predict
  vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
  X_test = vectorizer.transform(title)
  predict = classifier.predict(X_test)
  return predict

The error I'm getting is: ValueError: Vocabulary wasn't fitted or is empty! I guess I should load the vocabulary from the joblib file, but I can't get it to work.


Answer 1:


Just replace:

  #load classifier and predict
  classifier = joblib.load('class.pkl')

  #vectorize/transform the new title then predict
  vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
  X_test = vectorizer.transform(title)
  predict = classifier.predict(X_test)
  return predict

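with:

  # load the saved pipeline, which includes both the vectorizer
  # and the classifier, and predict directly on the raw text
  classifier = joblib.load('class.pkl')
  # predict expects an iterable of documents, so a single title string is wrapped in a list
  predict = classifier.predict([title])
  return predict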

class.pkl contains the full pipeline, so there is no need to create a new vectorizer instance. As the error message says, you need to reuse the vectorizer that was fitted in the first place, because the feature mapping from tokens (string n-grams) to column indices is stored in the vectorizer itself. This mapping is called the "vocabulary".
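To make the point about the vocabulary concrete, here is a minimal end-to-end sketch. The training data, the temporary file location, and the variable names (model_path, loaded, fresh) are made up for illustration and are not from the question; it also assumes joblib is installed as a standalone package. It fits the pipeline, saves and reloads it, shows that the fitted vocabulary travels with the pickle, and shows that a freshly constructed vectorizer fails on transform.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Toy training data, invented for this sketch only.
X_train = ["cheap pills online", "win money now",
           "meeting at noon", "lunch with the team"]
y_train = ["spam", "spam", "ham", "ham"]

vec_clf = Pipeline([
    ("vectorizer", TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                   ngram_range=(1, 3))),
    ("pac", PassiveAggressiveClassifier(C=1)),
])
vec_clf.fit(X_train, y_train)

# Save to a temporary directory (hypothetical location) and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(vec_clf, model_path, compress=9)
loaded = joblib.load(model_path)

# The fitted token -> column index mapping is stored inside the pipeline.
vocabulary = loaded.named_steps["vectorizer"].vocabulary_


def classify(title, classifier=loaded):
    """Predict the label for one raw title using the loaded pipeline."""
    # The pipeline vectorizes internally; pass raw text wrapped in a list.
    return classifier.predict([title])[0]


# A brand-new vectorizer has no vocabulary, so transform() raises an error
# (a ValueError subclass in scikit-learn).
fresh = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
try:
    fresh.transform(["win money now"])
    fresh_failed = False
except ValueError:
    fresh_failed = True

print(classify("win money now"), len(vocabulary), fresh_failed)
```

In a long-running service it is usually preferable to load the pipeline once, as done here at module level, rather than on every call to classify.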



Source: https://stackoverflow.com/questions/25788151/bringing-a-classifier-to-production
