Pickling a Spark RDD and reading it into Python


This is possible using the sparkpickle project. It is as simple as:

import sparkpickle

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
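
Note that Spark writes its output as a directory of part-* files rather than a single file, so in practice you may loop over the parts. A minimal sketch, assuming the directory was produced by rdd.saveAsPickleFile and that sparkpickle.load returns the list of values in one part file:

import os
import sparkpickle

path = "/path/to/rdd_dir"  # directory written by saveAsPickleFile (assumed)
values = []
for name in sorted(os.listdir(path)):
    if name.startswith("part-"):  # skip marker files such as "_SUCCESS"
        with open(os.path.join(path, name), "rb") as f:
            values.extend(sparkpickle.load(f))
print(values)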

A better method might be to pickle the data in each partition, encode it, and write it to a text file:

import cPickle
import base64

def partition_to_encoded_pickle_object(partition):
    p = list(partition) # convert the RDD partition iterator to a list
    p = cPickle.dumps(p, protocol=2) # pickle the list
    return [base64.b64encode(p)] # base64 encode the pickled bytes; return it in an iterable so each partition becomes one line

my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")
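
The base64 step matters because saveAsTextFile writes one text line per element, and raw pickled bytes can contain newline characters that would corrupt those lines. For context, a hypothetical end-to-end run (the SparkContext setup and sample data are assumptions, not from the original):

from pyspark import SparkContext

sc = SparkContext(appName="pickle-export")  # assumed setup
my_rdd = sc.parallelize([{"id": 1}, {"id": 2}, {"id": 3}], 2)  # toy data, two partitions
my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")

Because mapPartitions yields one element per partition here, each part-* output file holds exactly one base64 line.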

After you download the file(s) to your local directory, you can use the following code segment to read it in:

# you first need to download the files; that step is not shown
import base64
import os
import cPickle

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if part[0] != "_": # skip system-generated marker files, e.g. "_SUCCESS"
        with open(os.path.join(path, part), "rb") as f: # join the directory path, not just the file name
            data += cPickle.loads(base64.b64decode(f.read()))
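
The snippet above is Python 2 (cPickle). On Python 3, a roughly equivalent sketch, assuming the same directory layout, would be:

import base64
import os
import pickle

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if not part.startswith("_"):  # skip marker files such as "_SUCCESS"
        with open(os.path.join(path, part), "rb") as f:
            data += pickle.loads(base64.b64decode(f.read()))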

The problem is that the output format isn't a pickle file; it is a SequenceFile of pickled objects. The SequenceFile can be opened in Hadoop and Spark environments, but it isn't meant to be consumed directly from Python: it uses JVM-based serialization to serialize what is, in this case, a list of strings.
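
If you can run Spark locally, the saved data can be loaded back through Spark itself, which handles the JVM-side deserialization for you. A minimal sketch using SparkContext.pickleFile (the path is an assumption):

from pyspark import SparkContext

sc = SparkContext(appName="read-picklefile")  # assumed setup
rdd = sc.pickleFile("/path/to/saved_rdd")  # reads the SequenceFile of pickled objects
print(rdd.collect())  # materialize as a plain Python list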
