Pickling a Spark RDD and reading it into Python


This is possible using the sparkpickle project. It is as simple as:

import sparkpickle

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
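
Note that Spark writes its output as a directory of part-* files rather than a single file, so in practice you may loop over the parts. A minimal sketch, assuming the directory was produced by rdd.saveAsPickleFile and that sparkpickle.load returns the list of values in one part file:

import os
import sparkpickle

path = "/path/to/rdd_dir"  # directory written by saveAsPickleFile (assumed)
values = []
for name in sorted(os.listdir(path)):
    if name.startswith("part-"):  # skip marker files such as "_SUCCESS"
        with open(os.path.join(path, name), "rb") as f:
            values.extend(sparkpickle.load(f))
print(values)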

A better method might be to pickle the data in each partition, encode it, and write it to a text file:

import cPickle
import base64

def partition_to_encoded_pickle_object(partition):
    p = list(partition) # convert the RDD partition iterator to a list
    p = cPickle.dumps(p, protocol=2) # pickle the list
    return [base64.b64encode(p)] # base64 encode the pickled bytes; return it in an iterable so each partition becomes one line

my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")
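
The base64 step matters because saveAsTextFile writes one text line per element, and raw pickled bytes can contain newline characters that would corrupt those lines. For context, a hypothetical end-to-end run (the SparkContext setup and sample data are assumptions, not from the original):

from pyspark import SparkContext

sc = SparkContext(appName="pickle-export")  # assumed setup
my_rdd = sc.parallelize([{"id": 1}, {"id": 2}, {"id": 3}], 2)  # toy data, two partitions
my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")

Because mapPartitions yields one element per partition here, each part-* output file holds exactly one base64 line.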

After you download the file(s) to your local directory, you can use the following code segment to read it in:

# you first need to download the files; that step is not shown
import base64
import os
import cPickle

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if part[0] != "_": # skip system-generated marker files, e.g. "_SUCCESS"
        with open(os.path.join(path, part), "rb") as f: # join the directory path, not just the file name
            data += cPickle.loads(base64.b64decode(f.read()))
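
The snippet above is Python 2 (cPickle). On Python 3, a roughly equivalent sketch, assuming the same directory layout, would be:

import base64
import os
import pickle

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if not part.startswith("_"):  # skip marker files such as "_SUCCESS"
        with open(os.path.join(path, part), "rb") as f:
            data += pickle.loads(base64.b64decode(f.read()))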

The problem is that the output format isn't a pickle file; it is a SequenceFile of pickled objects. The SequenceFile can be opened in Hadoop and Spark environments, but it isn't meant to be consumed directly from Python: it uses JVM-based serialization to serialize what is, in this case, a list of strings.
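
If you can run Spark locally, the saved data can be loaded back through Spark itself, which handles the JVM-side deserialization for you. A minimal sketch using SparkContext.pickleFile (the path is an assumption):

from pyspark import SparkContext

sc = SparkContext(appName="read-picklefile")  # assumed setup
rdd = sc.pickleFile("/path/to/saved_rdd")  # reads the SequenceFile of pickled objects
print(rdd.collect())  # materialize as a plain Python list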
