Question
I am trying to serialize a Spark RDD by pickling it, and read the pickled file directly into Python.
a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')
I then copy the test_pkl files to my local machine. How can I read them directly into Python? When I try the standard pickle package, it fails when I attempt to read the first pickled part of 'test_pkl':
pickle.load(open('part-00000','rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "/usr/lib64/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
    raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
I assume that the pickling method Spark uses is different from the Python pickle method (correct me if I am wrong). Is there any way for me to pickle data from Spark and read the pickled objects directly into Python from the file?
Answer 1:
It is possible using the sparkpickle project. It is as simple as:
import sparkpickle

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
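If you copied the whole test_pkl output directory to your machine, a minimal sketch for reading every part file with sparkpickle (assuming load() returns the objects stored in one part file) could be:
import os
import sparkpickle

path = "test_pkl"  # local copy of the RDD output directory
data = []
for name in sorted(os.listdir(path)):
    if name.startswith("part-"):  # skip metadata files such as "_SUCCESS"
        with open(os.path.join(path, name), "rb") as f:
            data.extend(sparkpickle.load(f))
print(data)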
Answer 2:
A better method might be to pickle the data in each partition, base64-encode it, and write it to a text file. The base64 step keeps each partition's pickled bytes on a single line, so the newline-delimited text format stays intact:
import cPickle
import base64

def partition_to_encoded_pickle_object(partition):
    p = [i for i in partition]        # convert the RDD partition to a list
    p = cPickle.dumps(p, protocol=2)  # pickle the list
    return [base64.b64encode(p)]      # base64-encode the pickled bytes, and return it in an iterable

my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")
After you download the file(s) to your local directory, you can use the following code segment to read them in:
import os
import base64
import cPickle

# you first need to download the files; that step is not shown
path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if part[0] != "_":  # this prevents system-generated files from getting read - e.g. "_SUCCESS"
        data += cPickle.loads(base64.b64decode(open(os.path.join(path, part), 'rb').read()))
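If your local environment is Python 3 rather than the Python 2.6 shown in the question, a minimal sketch of the same reader (cPickle was merged into pickle; this assumes the part files were written as described above) would be:
import os
import base64
import pickle

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if not part.startswith("_"):  # skip system-generated files such as "_SUCCESS"
        with open(os.path.join(path, part), "rb") as f:
            data += pickle.loads(base64.b64decode(f.read()))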
Answer 3:
The problem is that the format isn't a pickle file. It is a Hadoop SequenceFile of pickled objects. The SequenceFile can be opened within Hadoop and Spark environments, but it isn't meant to be consumed directly in Python: the container itself uses JVM-based serialization to wrap what, in this case, is a list of strings.
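For completeness, the intended way to read this format back is inside a Spark session, via SparkContext.pickleFile:
# read the pickled RDD back within Spark
a = sc.pickleFile('test_pkl')
print(a.collect())  # ['1', '2', '3', '4', '5']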
Source: https://stackoverflow.com/questions/33808481/pickling-a-spark-rdd-and-reading-it-into-python