How to serialize a CountVectorizer with a custom tokenize function with joblib


Question


I use a CountVectorizer with a custom tokenize method. When I serialize it and then unserialize it, I get the following error message:

AttributeError: module '__main__' has no attribute 'tokenize'

How can I "serialize" the tokenize method?

Here is a small example:

import nltk
import joblib
from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

stemmer = FrenchStemmer()

def stem_tokens(tokens, stemmer):
    # stem every token with the French Snowball stemmer
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

tfidf_vec = TfidfVectorizer(tokenizer=tokenize, stop_words=stopwords.words('french'), ngram_range=(1, 1))

clf = MLPClassifier(solver='lbfgs', alpha=0.02, hidden_layer_sizes=(400, 50))

pipeline = Pipeline([("tfidf", tfidf_vec),
                     ("MLP", clf)])

joblib.dump(pipeline, "../models/classifier.pkl")

Answer 1:


joblib (and pickle, which it uses under the hood) serializes functions by reference: it stores only the path to import the function from, i.e. its module and name, not the function's code. So if you define a function in an interactive session or in the `__main__` script, there is no place to import it from later; the definition is gone as soon as the process exits.
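You can see this reference-only behavior with plain pickle; a minimal sketch (the exact bytes vary across Python versions):

import pickle

def tokenize(text):
    return text.split()

payload = pickle.dumps(tokenize)
print(payload)
# the payload contains only the names '__main__' and 'tokenize',
# not the function body; unpickling it in a fresh process, where
# __main__ has no attribute tokenize, raises exactly the
# AttributeError shown in the question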

To make serialization work, put this code in a Python module (save it as a .py file), and make sure this module is available (importable) when you call joblib.load.
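A minimal sketch of that fix, using a hypothetical module name tokenizer_utils.py (any name works, as long as the file is importable in both the dumping and the loading process):

# tokenizer_utils.py -- a regular module, so the function gets a stable import path
import nltk
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()

def tokenize(text):
    # stem every token produced by nltk's word tokenizer
    return [stemmer.stem(token) for token in nltk.word_tokenize(text)]

The training script then imports the function instead of defining it inline, so pickle records "tokenizer_utils.tokenize" rather than "__main__.tokenize":

# train.py
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from tokenizer_utils import tokenize  # imported, not defined in __main__

pipeline = Pipeline([("tfidf", TfidfVectorizer(tokenizer=tokenize)),
                     ("MLP", MLPClassifier(solver='lbfgs', alpha=0.02,
                                           hidden_layer_sizes=(400, 50)))])
joblib.dump(pipeline, "../models/classifier.pkl")

# later, in any process that can import tokenizer_utils:
pipeline = joblib.load("../models/classifier.pkl")  # no AttributeError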



Source: https://stackoverflow.com/questions/47017808/how-to-serialize-a-countvectorizer-with-a-custom-tokenize-function-with-joblib
