PySpark got TypeError: can't pickle _abc_data objects

Question

I'm trying to generate predictions from a pickled model with PySpark. I load the model with the following command:

model = deserialize_python_object(filename)

where deserialize_python_object(filename) is defined as:

import pickle

def deserialize_python_object(filename):
    # Load a pickled object from disk; return None if loading fails.
    try:
        with open(filename, 'rb') as f:
            obj = pickle.load(f)
    except Exception:
        obj = None
    return obj

The error log looks like this:

  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 189, in wrapper
    return self(*args)
  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 167, in __call__
    judf = self._judf
  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 151, in _judf
    self._judf_placeholder = self._create_judf()
  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 160, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 35, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/serializers.py", line 600, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _abc_data objects

Answer 1:

It seems you are hitting the same problem as in this issue: https://github.com/cloudpipe/cloudpickle/issues/180

What is happening is that the cloudpickle library bundled with PySpark is outdated for Python 3.7. Until PySpark updates that module, you can work around the problem by swapping in a newer cloudpickle yourself.
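To see where the message comes from, here is a minimal, standard-library-only sketch. It assumes CPython 3.7 or later, where classes built on abc.ABC carry a private _abc_impl attribute of type _abc_data; as far as I understand the linked issue, the older cloudpickle trips over that attribute when it serializes such a class by value. Base and try_pickle_abc_state are hypothetical names used only for illustration, and _abc_impl is an implementation detail, not a public API.

```python
import abc
import pickle


class Base(abc.ABC):
    """Hypothetical abstract class, used only for illustration."""


def try_pickle_abc_state(cls):
    """Return the error message raised when pickling cls's internal ABC state.

    Returns None if the attribute is missing or pickling unexpectedly succeeds.
    """
    # _abc_impl is a private CPython detail (assumption: C-accelerated abc module).
    state = getattr(cls, "_abc_impl", None)
    if state is None:
        return None
    try:
        pickle.dumps(state)
    except TypeError as exc:
        return str(exc)
    return None


print(try_pickle_abc_state(Base))
```

On Python 3.7 the printed message should match the one at the bottom of the traceback; newer versions word it a little differently but still name _abc_data.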

Try using this workaround:

1. Install cloudpickle: pip install cloudpickle
2. Add this to your code:

import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle

Monkeypatch credit: https://github.com/cloudpipe/cloudpickle/issues/305
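If you want the patch to be safe to run on machines where one of the packages might be missing, it can be wrapped in a small guard. This is only a sketch: patch_pyspark_cloudpickle is a hypothetical helper name, not part of PySpark or cloudpickle, and it does nothing more than the three-line monkeypatch above.

```python
import importlib


def patch_pyspark_cloudpickle():
    """Swap PySpark's bundled cloudpickle for the pip-installed one.

    Returns True if the swap happened, False if either package is missing.
    """
    try:
        cloudpickle = importlib.import_module("cloudpickle")
        serializers = importlib.import_module("pyspark.serializers")
    except ImportError:
        return False
    # Same assignment as the monkeypatch above.
    serializers.cloudpickle = cloudpickle
    return True


patched = patch_pyspark_cloudpickle()
print("patched" if patched else "cloudpickle or pyspark not installed; nothing changed")
```

Judging by the traceback in the question, the function is serialized when the UDF is first called (the wrapper reaches _create_judf and then ser.dumps), so the patch should run before that point, ideally right after your imports.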

Source: https://stackoverflow.com/questions/59058588/pyspark-got-typeerror-can-t-pickle-abc-data-objects

Tags

python

pyspark