How do I store a TfidfVectorizer for future use in scikit-learn?

不思量自难忘° 2020-12-08 03:21

I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection.

vectorizer = TfidfVectorizer()
X_train = vectoriz


        
3 Answers
  • 2020-12-08 03:36

    You can simply use the built-in pickle library:

    import pickle
    pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
    pickle.dump(selector, open("selector.pickle", "wb"))
    

    and load it with:

    vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
    selector = pickle.load(open("selector.pickle", "rb"))
    

    Pickle serializes the objects to disk and loads them back into memory when you need them.
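The one-liners above leave the file handles open until they are garbage collected. A minimal sketch of the same round trip using `with` blocks might look like the following. The helper names `save_object` and `load_object` are hypothetical, and plain dicts stand in for the fitted vectorizer and selector so that the snippet runs without scikit-learn installed; the same calls should apply to the real fitted objects.

```python
import os
import pickle
import tempfile

# Stand-ins (assumption): with scikit-learn installed these would be the
# fitted TfidfVectorizer and SelectKBest instances from the question.
vectorizer = {"vocabulary": {"hello": 0, "world": 1}}
selector = {"selected_indices": [0]}


def save_object(obj, path):
    """Pickle obj to path; the file is closed even if dump raises."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Write into a temporary directory so the sketch does not touch the cwd.
workdir = tempfile.mkdtemp()
vec_path = os.path.join(workdir, "vectorizer.pickle")
sel_path = os.path.join(workdir, "selector.pickle")

save_object(vectorizer, vec_path)
save_object(selector, sel_path)

restored_vectorizer = load_object(vec_path)
restored_selector = load_object(sel_path)
```

The restored objects are equal copies of the originals, not the same objects in memory.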

    pickle lib docs

  • 2020-12-08 03:38

    Here is my answer using joblib:

    import joblib
    joblib.dump(vectorizer, 'vectorizer.pkl')
    joblib.dump(selector, 'selector.pkl')
    

    Later, I can load them and be ready to go:

    vectorizer = joblib.load('vectorizer.pkl')
    selector = joblib.load('selector.pkl')
    
    test = selector.transform(vectorizer.transform(['this is test']))
    
  • "Making an object persistent" basically means serializing the in-memory representation of the object to a file on disk, so that later in your program, or in any other program, the object can be reloaded from that file into memory.

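    Either joblib (formerly bundled with scikit-learn) or the stdlib pickle and cPickle would do the job. I tend to prefer cPickle because it is significantly faster. Note that cPickle exists only in Python 2; in Python 3 the standard pickle module already uses the C implementation when it is available. Using IPython's %timeit command: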

    >>> import pickle, cPickle  # cPickle is Python 2 only
    >>> import joblib
    >>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
    >>> t = TFIDF()
    >>> t.fit_transform(['hello world', 'this is a test'])
    
    # generic serializer - deserializer test
    >>> def dump_load_test(tfidf, serializer):
    ...:    with open('vectorizer.bin', 'wb') as f:  # pickles must be written in binary mode
    ...:        serializer.dump(tfidf, f)
    ...:    with open('vectorizer.bin', 'rb') as f:
    ...:        return serializer.load(f)
    
    # joblib has a slightly different interface
    >>> def joblib_test(tfidf):
    ...:    joblib.dump(tfidf, 'tfidf.bin')
    ...:    return joblib.load('tfidf.bin')
    
    # Now, time it!
    >>> %timeit joblib_test(t)
    100 loops, best of 3: 3.09 ms per loop
    
    >>> %timeit dump_load_test(t, pickle)
    100 loops, best of 3: 2.16 ms per loop
    
    >>> %timeit dump_load_test(t, cPickle)
    1000 loops, best of 3: 879 µs per loop
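Outside IPython, a similar measurement can be made with the stdlib `timeit` module. The sketch below is an illustration only, under several assumptions: it runs on Python 3 (where cPickle no longer exists), it times in-memory round trips at two pickle protocols rather than comparing libraries, and the `payload` dict is a made-up stand-in for the state of a fitted vectorizer. Absolute numbers will differ by machine and will not match the timings above.

```python
import pickle
import timeit

# Stand-in payload (assumption): a vocabulary-like dict, loosely resembling
# the state a fitted vectorizer carries around.
payload = {"vocabulary": {f"term{i}": i for i in range(1000)}}


def round_trip(obj, protocol):
    """Serialize obj to bytes and back using the given pickle protocol."""
    return pickle.loads(pickle.dumps(obj, protocol=protocol))


timings = {}
for protocol in (0, pickle.HIGHEST_PROTOCOL):
    # Best of 3 repeats, 100 round trips per repeat, stored per round trip.
    best = min(
        timeit.repeat(lambda: round_trip(payload, protocol), number=100, repeat=3)
    )
    timings[protocol] = best / 100

for protocol, seconds in timings.items():
    print(f"protocol {protocol}: {seconds * 1e6:.1f} µs per round trip")
```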
    

    If you want to store multiple objects in a single file, you can put them in a data structure and dump the data structure itself. This works with a tuple, list or dict. Using the example from your question:

    # train
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    import cPickle  # Python 2; on Python 3 use "import pickle" instead

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(corpus)
    selector = SelectKBest(chi2, k=5000)
    X_train_sel = selector.fit_transform(X_train, y_train)
    
    # dump as a dict
    data_struct = {'vectorizer': vectorizer, 'selector': selector}
    # use the 'with' keyword to automatically close the file after the dump
    with open('storage.bin', 'wb') as f:
        cPickle.dump(data_struct, f)
    

    Later or in another program, the following statements will bring back the data structure in your program's memory:

    # reload
    with open('storage.bin', 'rb') as f:
        data_struct = cPickle.load(f)
        vectorizer, selector = data_struct['vectorizer'], data_struct['selector']
    
    # do stuff...
    vectors = vectorizer.transform(...)
    vec_sel = selector.transform(vectors)
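On Python 3 the same single-file approach works with the plain `pickle` module. The following is a rough sketch, not a definitive implementation: `save_bundle` and `load_bundle` are hypothetical helper names, and simple dicts again stand in for the fitted vectorizer and selector so the snippet runs without scikit-learn.

```python
import os
import pickle
import tempfile


def save_bundle(path, **objects):
    """Pickle several named objects into one file as a single dict."""
    with open(path, "wb") as f:
        pickle.dump(objects, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_bundle(path):
    """Return the dict of named objects stored by save_bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-ins (assumption) for the fitted TfidfVectorizer and SelectKBest.
vectorizer = {"vocabulary": {"this": 0, "is": 1, "test": 2}}
selector = {"selected_indices": [0, 2]}

path = os.path.join(tempfile.mkdtemp(), "storage.bin")
save_bundle(path, vectorizer=vectorizer, selector=selector)

# Later, or in another program: bring both objects back in one read.
bundle = load_bundle(path)
vectorizer_loaded, selector_loaded = bundle["vectorizer"], bundle["selector"]
```

Keeping both objects in one file makes it harder to load a vectorizer and a selector that were fitted in different training runs.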
    