Question
When I run predict_proba on a DataFrame without multiprocessing, I get the expected behavior. The code is as follows:
probabilities_data = classname.perform_model_prob_predictions_nc(prediction_model, vectorized_data)
where perform_model_prob_predictions_nc is:
def perform_model_prob_predictions_nc(model, dataFrame):
    try:
        return model.predict_proba(dataFrame)
    except AttributeError:
        logging.error("AttributeError occurred", exc_info=True)
But when I try to run the same function using chunks and multiprocessing:
probabilities_data = classname.perform_model_prob_predictions(prediction_model, chunks, cores)
where perform_model_prob_predictions is:
def perform_model_prob_predictions(model, dataFrame, cores=4):
    try:
        with Pool(processes=cores) as pool:
            result = pool.map(model.predict_proba, dataFrame)
            return result
    except Exception:
        logging.error("Error occurred", exc_info=True)
I get the following error:
PicklingError: Can't pickle <function OneVsRestClassifier.predict_proba at 0x14b1d9730>: it's not the same object as sklearn.multiclass.OneVsRestClassifier.predict_proba
For reference:
cores = 4
vectorized_data = pd.DataFrame(...)
chunk_size = len(vectorized_data) // cores + cores
chunks = [df_chunk for g, df_chunk in vectorized_data.groupby(np.arange(len(vectorized_data)) // chunk_size)]
Answer 1:
Pool internally uses a Queue, and anything that goes through it needs to be pickled. The error tells you that OneVsRestClassifier.predict_proba cannot be pickled.
You have several options; some are described in this SO post. Another option is to use joblib with the loky backend, which uses cloudpickle and can serialise constructs that the default pickle does not support.
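To see where a message like "it's not the same object as ..." comes from, here is a minimal sketch using only the standard library. pickle stores functions by reference (module plus qualified name), so it fails when that name no longer resolves to the very same object. The function name scale is made up for illustration, and this is only one common way to trigger the error; it may not be the exact mechanism inside scikit-learn.

```python
import pickle


def scale(x):
    return x * 2


first_scale = scale  # keep a handle on the first function object


def scale(x):  # rebinding the name creates a different function object
    return x * 3


try:
    # pickle looks the function up by name and finds the second object,
    # which is not the one we asked it to serialise.
    pickle.dumps(first_scale)
    pickling_failed = False
except pickle.PicklingError as exc:
    pickling_failed = True
    print(exc)
```

Anything wrapped or replaced after definition (for example by a decorator) can end up in the same situation.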
The code will look more or less like this:
from joblib import Parallel, delayed
# Pass each chunk to predict_proba; the results come back in chunk order.
Parallel(n_jobs=4, backend='loky')(delayed(model.predict_proba)(chunk) for chunk in chunks)
Keep in mind that pickling methods of objects like this with the standard pickle module is generally not a good idea. dill could also work well here.
Source: https://stackoverflow.com/questions/54815305/multiprocessing-using-chunks-does-not-work-with-predict-proba
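If you would rather stay with multiprocessing.Pool, one commonly used workaround is to send the model object itself to a module-level helper instead of sending the bound method. The sketch below assumes the model instance is picklable. ToyModel, predict_chunk and parallel_predict are hypothetical names, and ToyModel is only a stand-in for a fitted estimator, not a scikit-learn class.

```python
from functools import partial
from multiprocessing import Pool


class ToyModel:
    """Hypothetical stand-in for a fitted estimator."""

    def predict_proba(self, rows):
        # Return a fake two-class probability pair for every row.
        return [[0.25, 0.75] for _ in rows]


def predict_chunk(model, chunk):
    # Module-level function, so pickle can reference it by name.
    return model.predict_proba(chunk)


def parallel_predict(model, chunks, cores=4):
    # The model instance travels to the workers inside the partial.
    with Pool(processes=cores) as pool:
        return pool.map(partial(predict_chunk, model), chunks)
```

Call parallel_predict(model, chunks, cores) from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Note that the model is pickled and shipped to the workers, which can be costly for large models.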