What are the pitfalls of using Dill to serialise scikit-learn/statsmodels models?

灰色年华 2021-01-31 09:24

I need to serialise scikit-learn/statsmodels models such that all the dependencies (code + data) are packaged in an artefact, and this artefact can be used to initialise the model.

3 Answers
  •  野性不改
    2021-01-31 09:59

    I'm the dill author. dill was built to do exactly what you are doing: to persist numerical fits within class instances for statistics, where these objects can then be distributed to different resources and run in an embarrassingly parallel fashion. So the answer is yes -- I have run code like yours, using mystic and/or sklearn.

    Note that many of the authors of sklearn use cloudpickle, not dill, to enable parallel computing on sklearn objects. dill can pickle more types of objects than cloudpickle; however, cloudpickle is slightly better (at the time of writing) at pickling objects that make references to the global dictionary as part of a closure -- by default, dill handles these by reference, while cloudpickle physically stores the dependencies. However, dill has a "recurse" mode that acts like cloudpickle, so the difference when using this mode is minor. (To enable "recurse" mode, do dill.settings['recurse'] = True, or pass recurse=True as a flag to dill.dump.) Another minor difference is that cloudpickle contains special support for things like scikits.timeseries and PIL.Image, while dill does not.
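
    (A minimal sketch I'm adding to illustrate the point above -- the names SCALE and make_scaler are mine, not from the question. The inner function closes over a module-level name, which is exactly the case where "recurse" mode matters:)

        import dill

        SCALE = 2.5  # a global referenced from inside the closure below

        def make_scaler():
            def scale(x):
                return x * SCALE
            return scale

        # enable "recurse" mode globally...
        dill.settings['recurse'] = True
        # ...or per call, as a flag to dump/dumps
        payload = dill.dumps(make_scaler(), recurse=True)
        scale = dill.loads(payload)
        print(scale(4))  # 10.0

    With "recurse" enabled, dill traverses and stores the referenced global as part of the pickle rather than recording it by reference, mirroring cloudpickle's behaviour.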

    On the plus side, dill does not pickle classes by reference, so by pickling a class instance it serializes the class object itself -- which is a big advantage, as it serializes instances of derived classes of classifiers, models, and so on from sklearn in their exact state at the time of pickling… so if you later make modifications to the class object, the instance still unpickles correctly. There are other advantages of dill over cloudpickle, aside from the broader range of objects (and typically a smaller pickle); however, I won't list them here. You asked for pitfalls, so differences are not pitfalls.
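
    (Another minimal sketch I'm adding; MyClassifier is a hypothetical derived class, and the behaviour shown assumes the class is defined in the pickling session, e.g. in __main__:)

        import dill
        from sklearn.linear_model import LogisticRegression

        class MyClassifier(LogisticRegression):  # hypothetical derived class
            def predict_label(self, X):
                return self.predict(X)

        X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
        model = MyClassifier(max_iter=1000).fit(X, y)

        # the pickle carries the class definition itself, not just a
        # reference to an importable name
        payload = dill.dumps(model)
        restored = dill.loads(payload)
        print(restored.predict_label([[1.5]]))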

    Major pitfalls:

    • You should have anything your classes refer to installed on the remote machine, just in case dill (or cloudpickle) pickles it by reference.

    • You should try to make your classes and class methods as self-contained as possible (e.g. don't refer to objects defined in the global scope from your classes).

    • sklearn objects can be big, so saving many of them to a single pickle is not always a good idea… you might want to use klepto, which has a dict interface to caching and archiving, and enables you to configure the archive interface to store each key-value pair individually (e.g. one entry per file), as sketched below.
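
    (A minimal sketch of the klepto suggestion -- the directory name 'models' and the key 'linreg' are mine. dir_archive keeps each key-value pair in its own file under a directory, pickling values when serialized=True:)

        from klepto.archives import dir_archive
        from sklearn.linear_model import LinearRegression

        X, y = [[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0]

        # one file per key under the 'models' directory
        archive = dir_archive('models', cached=True, serialized=True)
        archive['linreg'] = LinearRegression().fit(X, y)
        archive.dump()  # flush the in-memory cache to the on-disk archive

        # later, or on another machine: load only the entry you need
        fresh = dir_archive('models', cached=True, serialized=True)
        fresh.load('linreg')
        print(fresh['linreg'].predict([[1.5]]))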
