Saving RDD as sequence file in pyspark

Submitted by 孤街浪徒 on 2019-12-06 15:08:53

Sequence files are used to store key-value pairs so you cannot simply store RDD[String]. Given your data I guess you're looking for something like this:

rdd = sc.parallelize([
    "2,Fitness", "3,Footwear", "4,Apparel"
])
rdd.map(lambda x: tuple(x.split(",", 1))).saveAsSequenceFile("testSeq")
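The mapping function above is worth a closer look: passing maxsplit=1 to split means only the first comma separates key from value, so any commas inside the value survive. A standalone sketch of just that logic (to_pair is a hypothetical name for the lambda):

```python
# Split on the first comma only; later commas stay in the value.
def to_pair(line):
    return tuple(line.split(",", 1))

print(to_pair("2,Fitness"))        # ('2', 'Fitness')
print(to_pair("5,Shoes,Running"))  # ('5', 'Shoes,Running')
```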

If you want to keep whole strings just use None keys:

rdd.map(lambda x: (None, x)).saveAsSequenceFile("testSeqNone")

To write a sequence file through the Hadoop API directly, the data must map to Hadoop Writable types:

String as Text
Int as IntWritable

In Python:

data = [(1, ""),(1, "a"),(2, "bcdf")]
sc.parallelize(data).saveAsNewAPIHadoopFile(
    path,
    "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    "org.apache.hadoop.io.IntWritable",
    "org.apache.hadoop.io.Text")