Question
I'm having trouble pickling a vectorizer after I customize it.
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
tfidf_vectorizer = TfidfVectorizer(analyzer=str.split)
with open('test.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
This results in "TypeError: can't pickle method_descriptor objects".
However, if I don't customize the analyzer, it pickles fine. Any ideas how I can work around this? I need to persist the vectorizer if I'm going to use it more widely.
By the way, I've found that using a simple string split as the analyzer, and pre-processing the corpus to remove out-of-vocabulary tokens and stop words, is essential for decent run speed. Otherwise most of the vectorizer's run time is spent in "text.py:114(_word_ngrams)". The same goes for HashingVectorizer.
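That pre-processing pass can be sketched as follows; the stop-word set and helper name are hypothetical, just to illustrate stripping unwanted tokens once up front so the analyzer can stay a plain whitespace split:

```python
# Hypothetical one-time cleaning pass over the corpus.
STOP_WORDS = {"the", "a", "an"}  # assumed stop-word list, for illustration only

def preprocess(doc):
    # Lowercase, split on whitespace, and drop stop words.
    return " ".join(w for w in doc.lower().split() if w not in STOP_WORDS)

corpus = ["The quick brown fox", "A lazy dog"]
clean = [preprocess(d) for d in corpus]
# clean == ["quick brown fox", "lazy dog"]
```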
This is related to "Persisting data in sklearn" and http://scikit-learn.org/0.10/tutorial.html#model-persistence (by the way, sklearn.externals.joblib.dump doesn't help here either).
Thanks!
Answer 1:
This is not so much a scikit-learn problem as a general Python problem:
>>> pickle.dumps(str.split)
Traceback (most recent call last):
File "<ipython-input-7-7d3648c78b22>", line 1, in <module>
pickle.dumps(str.split)
File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
Pickler(file, protocol).dump(obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle method_descriptor objects
The solution is to use a pickleable analyzer:
>>> def split(s):
... return s.split()
...
>>> pickle.dumps(split)
'c__main__\nsplit\np0\n.'
>>> tfidf_vectorizer = TfidfVectorizer(analyzer=split)
>>> type(pickle.dumps(tfidf_vectorizer))
<type 'str'>
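For completeness, here is a minimal round-trip sketch of the same idea (the variable names and toy corpus are illustrative): fit the vectorizer with a module-level split function, pickle it, restore it, and check that the restored object produces identical output.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def split(s):
    # Module-level function: pickled by reference (module + name),
    # unlike the built-in method descriptor str.split.
    return s.split()

corpus = ["the quick brown fox", "the lazy dog"]
vec = TfidfVectorizer(analyzer=split)
X = vec.fit_transform(corpus)

blob = pickle.dumps(vec)        # no TypeError now
restored = pickle.loads(blob)   # needs `split` importable at load time
X2 = restored.transform(corpus)
assert np.allclose(X.toarray(), X2.toarray())
```

Because pickle stores functions by reference, keep `split` as a top-level function in an importable module wherever the vectorizer is later loaded; a lambda or nested function would fail for the same reason str.split does.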
Source: https://stackoverflow.com/questions/21717076/how-to-pickle-customized-vectorizer