Question:

I can run this script to save the file in text format, but when I call saveAsSequenceFile it fails with an error. If anyone knows how to save an RDD as a sequence file, please let me know the process. I looked for a solution in "Learning Spark" as well as the official Spark documentation.

This runs successfully

dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsTextFile("/user/cloudera/pyspark/departments")

This fails

dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq")

Error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsSequenceFile.
: org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used

Here is the data:

2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
8,TESTING
8000,TESTING

Answer 1:

Sequence files store key-value pairs, so you cannot simply save an RDD[String]. Given your data, I guess you're looking for something like this:

rdd = sc.parallelize([
    "2,Fitness", "3,Footwear", "4,Apparel"
])
rdd.map(lambda x: tuple(x.split(",", 1))).saveAsSequenceFile("testSeq")

If you want to keep the whole strings as values, just use None as the key:

rdd.map(lambda x: (None, x)).saveAsSequenceFile("testSeqNone")
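The splitting step in this answer can be tried without a cluster. Below is a minimal sketch in plain Python; the helper name `to_pair` is hypothetical (not part of the original answer), and the sample lines are taken from the question's data.

```python
def to_pair(line):
    """Split an 'id,name' line into a (key, value) tuple.

    Only the first comma is used as the separator, so a value that
    itself contains commas stays intact. A line with no comma raises
    ValueError here, because the unpacking expects two parts.
    """
    key, value = line.split(",", 1)
    return key, value


# Sample records from the question's data.
lines = ["2,Fitness", "3,Footwear", "7,Fan Shop"]
pairs = [to_pair(line) for line in lines]
print(pairs)  # [('2', 'Fitness'), ('3', 'Footwear'), ('7', 'Fan Shop')]
```

With a live SparkContext, the same function could presumably be passed to map in place of the lambda, e.g. dataRDD.map(to_pair).saveAsSequenceFile(...), since both keys and values are then plain strings.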

Answer 2:

To write a sequence file, the data must map onto the Hadoop API's Writable types:

String as Text
Int as IntWritable

In Python:

data = [(1, ""), (1, "a"), (2, "bcdf")]
sc.parallelize(data).saveAsNewAPIHadoopFile(
    path,
    "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    "org.apache.hadoop.io.IntWritable",
    "org.apache.hadoop.io.Text")
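The example above uses integer keys, whereas lines read with sc.textFile arrive as strings. A minimal sketch of the conversion, assuming every line has the form "id,name" with a numeric id as in the question's data; the helper name `to_int_pair` is hypothetical.

```python
def to_int_pair(line):
    """Turn an 'id,name' line into an (int, str) tuple.

    The key is converted to int so that it lines up with an
    IntWritable key class; the value stays a string (Text).
    Raises ValueError if the id is not numeric or the comma is missing.
    """
    key, value = line.split(",", 1)
    return int(key), value


# Sample records from the question's data.
records = [to_int_pair(line) for line in ["2,Fitness", "8000,TESTING"]]
print(records)  # [(2, 'Fitness'), (8000, 'TESTING')]
```

The resulting tuples have the same shape as the `data` list above, so an RDD mapped through this function should be usable with the same saveAsNewAPIHadoopFile call.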

Source: Saving RDD as sequence file in pyspark

Tags

rdd

spark

apache