Question
I use a CountVectorizer with a custom tokenize method. When I serialize it and then deserialize it, I get the following error message:
AttributeError: module '__main__' has no attribute 'tokenize'
How can I "serialize" the tokenize method?
Here is a small example:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
import joblib

stemmer = FrenchStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

tfidf_vec = TfidfVectorizer(tokenizer=tokenize, stop_words=stopwords.words('french'), ngram_range=(1, 1))
clf = MLPClassifier(solver='lbfgs', alpha=0.02, hidden_layer_sizes=(400, 50))
pipeline = Pipeline([("tfidf", tfidf_vec),
                     ("MLP", clf)])
joblib.dump(pipeline, "../models/classifier.pkl")
Answer 1:
joblib (and pickle, which it uses under the hood) serializes functions by reference: it only remembers the path needed to import the function - the module name and the function name. So if you define a function in an interactive session, there is no place to import it from; the function is destroyed as soon as the process exits.
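This by-reference behavior can be seen with a minimal sketch using a standard-library function (any module-level function works the same way):

```python
import math
import pickle

# pickle serializes a module-level function by reference, not by value:
# the payload only records where to import it from (module + qualified
# name), here "math" and "sqrt". Unpickling simply re-imports it.
payload = pickle.dumps(math.sqrt)
restored = pickle.loads(payload)

# The deserialized object is the very same function, re-imported.
assert restored is math.sqrt
```

If the function lives only in `__main__` of a throwaway session, that import step is exactly what fails at load time, producing the AttributeError above.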
To make serialization work, put this code in a Python module (save it to a .py file), and make sure that module is available (importable) when you call joblib.load.
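A self-contained sketch of the fix: write the tokenizer into an importable module (here a hypothetical my_tokenizer.py created in a temporary directory, with a simplified split-based tokenize standing in for the NLTK one) and pickle the imported function. As long as the same module is importable when loading, the round trip succeeds:

```python
import importlib
import pathlib
import pickle
import sys
import tempfile

# Hypothetical module file standing in for the real tokenizer code.
code = "def tokenize(text):\n    return text.split()\n"

with tempfile.TemporaryDirectory() as d:
    # Save the function to a .py file so it has an importable home.
    pathlib.Path(d, "my_tokenizer.py").write_text(code)
    sys.path.insert(0, d)
    mod = importlib.import_module("my_tokenizer")

    # The payload now records "my_tokenizer.tokenize", not "__main__.tokenize".
    payload = pickle.dumps(mod.tokenize)
    restored = pickle.loads(payload)
    sys.path.remove(d)

print(restored("un deux trois"))
```

The same rule applies to the whole pipeline above: run the training script as a module whose file will still exist (and be on sys.path) in the process that calls joblib.load.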
Source: https://stackoverflow.com/questions/47017808/how-to-serialize-a-countvectorizer-with-a-custom-tokenize-function-with-joblib