Multiprocessing using chunks does not work with predict_proba

Submitted by 妖精的绣舞 on 2020-01-01 19:59:12

Question


When I run predict_proba on a dataframe without multiprocessing I get the expected behavior. The code is as follows:

probabilities_data = classname.perform_model_prob_predictions_nc(prediction_model, vectorized_data)

where perform_model_prob_predictions_nc is:

def perform_model_prob_predictions_nc(model, dataFrame): 
    try:
        return model.predict_proba(dataFrame)
    except AttributeError:
        logging.error("AttributeError occurred", exc_info=True)

But when I try to run the same function using chunks and multiprocessing:

probabilities_data = classname.perform_model_prob_predictions(prediction_model, chunks, cores)

where perform_model_prob_predictions is:

def perform_model_prob_predictions(model, dataFrame, cores=4): 
    try:
        with Pool(processes=cores) as pool:
            result = pool.map(model.predict_proba, dataFrame)
            return result
    except Exception:
        logging.error("Error occurred", exc_info=True)

I get the following error:

PicklingError: Can't pickle <function OneVsRestClassifier.predict_proba at 0x14b1d9730>: it's not the same object as sklearn.multiclass.OneVsRestClassifier.predict_proba

For reference:

cores = 4
vectorized_data = pd.DataFrame(...)
chunk_size = len(vectorized_data) // cores + cores
chunks = [df_chunk for g, df_chunk in vectorized_data.groupby(np.arange(len(vectorized_data)) // chunk_size)]

Answer 1:


Pool internally uses a Queue, and anything that goes through it needs to be pickled. The error tells you that the function OneVsRestClassifier.predict_proba cannot be pickled.
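To illustrate the point about pickling, here is a minimal standard-library sketch. The helper name can_pickle is hypothetical and only reports whether pickle.dumps succeeds. Functions are pickled by reference (module plus qualified name), so an object that cannot be found again under its own name, such as a lambda, is rejected.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialise obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A built-in function is stored by name, so it pickles fine.
builtin_ok = can_pickle(len)

# A lambda has no importable name, so pickle refuses it.
lambda_ok = can_pickle(lambda x: x)
```

The predict_proba in the error message appears to fall into the same category: the function object handed to the pool is not the one pickle finds when it looks up sklearn.multiclass.OneVsRestClassifier.predict_proba.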

You have several options; some are described in this SO post. Another option is to use joblib with the loky backend. It uses cloudpickle, which can serialise constructs that the default pickle does not support.

The code will look more or less like this:

from joblib import Parallel, delayed

Parallel(n_jobs=4, backend='loky')(delayed(model.predict_proba)(chunk) for chunk in chunks)

Note that pickling methods of objects like this with the standard pickle module is generally not a good idea. dill could also work well here.
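If you would rather stay with multiprocessing.Pool, one commonly used workaround is to send the model object to a module-level helper instead of sending the method. The sketch below assumes the fitted model itself pickles cleanly. TinyModel is a hypothetical stand-in (not a scikit-learn class) so the example runs with the standard library only, and predict_chunk and predict_in_parallel are illustrative names.

```python
from functools import partial
from multiprocessing import Pool


class TinyModel:
    """Hypothetical stand-in for a fitted classifier (not a scikit-learn class)."""

    def predict_proba(self, rows):
        # Toy rule: P(class 1) is the first feature clipped to [0, 1].
        probs = []
        for row in rows:
            p = min(max(row[0], 0.0), 1.0)
            probs.append([1.0 - p, p])
        return probs


def predict_chunk(model, chunk):
    """Module-level worker: the pool pickles this function by name and the
    model as an object, so the method itself is never pickled."""
    return model.predict_proba(chunk)


def predict_in_parallel(model, chunks, cores=4):
    """Run predict_chunk on every chunk in a process pool and flatten the results."""
    with Pool(processes=cores) as pool:
        per_chunk = pool.map(partial(predict_chunk, model), chunks)
    return [row for chunk_result in per_chunk for row in chunk_result]
```

Call predict_in_parallel(model, chunks, cores) from under an if __name__ == "__main__": guard. With DataFrame chunks and a real estimator, the per-chunk arrays would more likely be stacked (for example with np.vstack) than flattened into a list.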



Source: https://stackoverflow.com/questions/54815305/multiprocessing-using-chunks-does-not-work-with-predict-proba
