What is the recommended way to distribute a scikit-learn classifier in Spark?

Submitted by 不羁的心 on 2019-12-21 22:18:50

Question


I have built a classifier using scikit-learn, and now I would like to use Spark to run predict_proba on a large dataset. I currently pickle the classifier once using:

import pickle

with open('classifier.pickle', 'wb') as f:  # the with block closes the file after writing
    pickle.dump(clf, f)

and then in my Spark code I broadcast this pickle with sc.broadcast, so that it gets loaded on each node of the cluster.
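For reference, the broadcast-and-predict step described above could be sketched roughly as follows. This is a minimal sketch, not the exact code from the question: predict_partition, score_rdd and batch_size are hypothetical names, sc is assumed to be an existing SparkContext, and feature_rdd is assumed to be an RDD whose elements are feature rows (lists of numbers).

```python
from itertools import islice


def predict_partition(rows, model, batch_size=10000):
    """Yield one probability list per feature row, scoring rows in batches.

    `model` is anything with a predict_proba method, e.g. a fitted
    scikit-learn classifier. Batching avoids calling it once per row.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        for probs in model.predict_proba(batch):
            yield list(probs)


def score_rdd(sc, clf, feature_rdd, batch_size=10000):
    """Broadcast the classifier once and score every partition with it.

    Assumption: `sc` is a SparkContext and `feature_rdd` an RDD of rows.
    sc.broadcast ships the pickled classifier to each executor once,
    instead of once per task.
    """
    bc = sc.broadcast(clf)
    return feature_rdd.mapPartitions(
        lambda rows: predict_partition(rows, bc.value, batch_size)
    )
```

Scoring per partition rather than per row means each task reads the broadcast value once and calls predict_proba on whole batches, which is usually much cheaper than a row-by-row map.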

This works but the pickle is large (about 0.5GB) and it seems very inefficient.

Is there a better way to do this?


Answer 1:


This works but the pickle is large (about 0.5GB)

Note that the size of the forest will be O(M*N*Log(N)), where M is the number of trees and N is the number of samples. (source)

Is there a better way to do this?

There are several options you can try to reduce the size of either your RandomForestClassifier model or the serialized file:

  • reduce the size of the model by optimizing hyperparameters, in particular max_depth, max_leaf_nodes, min_samples_split as these parameters influence the size of the trees used in the ensemble

  • compress the pickle, e.g. with gzip as follows. Note there are several compression options and one might fit you better, so you'll need to try:

    import gzip
    import pickle
    with gzip.open('classifier.pickle', 'wb') as f:
        pickle.dump(clf, f)
    
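  • use joblib instead of pickle; it handles objects carrying large NumPy arrays efficiently, can compress its output, and is also the recommended approach.

    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
    joblib.dump(clf, 'filename.pkl', compress=3)  # compress takes a level from 0 to 9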
    

    The caveat here is that older joblib releases could write multiple files to a directory, in which case you'll have to zip these up for transport; newer releases write a single file.

  • last but not least, you can also try reducing the size of the input through dimensionality reduction before you fit/predict with the RandomForestClassifier, as mentioned in the practical tips on decision trees.
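To check whether compression actually helps for a given model, one option is to write it both ways and compare file sizes. The sketch below uses only the standard library; dump_both and load_gzipped are hypothetical helper names, and the dictionary is a stand-in for the fitted classifier (pass clf instead in real use). Note that a gzipped pickle has to be read back through gzip.open as well.

```python
import gzip
import os
import pickle
import tempfile


def dump_both(obj, directory):
    """Write obj as a plain pickle and as a gzipped pickle; return both paths."""
    raw_path = os.path.join(directory, "model.pickle")
    gz_path = os.path.join(directory, "model.pickle.gz")
    with open(raw_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return raw_path, gz_path


def load_gzipped(path):
    """Load an object that was pickled through gzip.open."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Stand-in for a fitted classifier; replace with clf in real use.
    stand_in = {"thresholds": [0.5] * 100000, "features": list(range(100000))}
    with tempfile.TemporaryDirectory() as tmp:
        raw_path, gz_path = dump_both(stand_in, tmp)
        print("raw bytes: ", os.path.getsize(raw_path))
        print("gzip bytes:", os.path.getsize(gz_path))
        # Round trip: the decompressed object equals the original.
        print("round trip ok:", load_gzipped(gz_path) == stand_in)
```

How much a real forest shrinks depends on its contents, so the ratio seen with the stand-in says nothing about a particular model.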

YMMV



Source: https://stackoverflow.com/questions/39672114/what-is-the-recommended-way-to-distribute-a-scikit-learn-classifier-in-spark
