How to save a spark dataframe as a text file without Rows in pyspark?

Question

I have a DataFrame "df" with the columns ['name', 'age']. I saved it as an RDD using df.rdd.saveAsTextFile(".."). When I load the saved file, collect() gives me the following result:

a = sc.textFile("\mee\sample")
a.collect()
Output:
    [u"Row(name=u'Alice', age=1)",
     u"Row(name=u'Alice', age=2)",
     u"Row(name=u'Joe', age=3)"]

This is not an RDD of Rows, so accessing a field fails:

a.map(lambda g:g.age).collect()
AttributeError: 'unicode' object has no attribute 'age'

Is there any way to save the DataFrame as a normal RDD, without column names and Row keywords? I want to save it so that loading the file and calling collect() gives me the following:

a.collect()   
[(Alice,1),(Alice,2),(Joe,3)]

Answer 1:

df.rdd is a normal RDD[Row]. The problem is that when you save with saveAsTextFile and load with textFile, what you get back is a bunch of strings. If you want to save objects, you should use some form of serialization, for example a pickle file (saveAsPickleFile and pickleFile):

# Assumes a PySpark shell where sc and sqlContext are already defined

df = sqlContext.createDataFrame(
   [('Alice', 1), ('Alice', 2), ('Joe', 3)],
   ("name", "age")
)

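# Convert each Row to a plain tuple, then save the RDD in pickle format
df.rdd.map(tuple).saveAsPickleFile("foo")

# Loading deserializes the tuples; the order of elements is not guaranteed
sc.pickleFile("foo").collect()

## [('Joe', 3), ('Alice', 1), ('Alice', 2)]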
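
If the file really has to be plain text, one possible alternative (not part of the answer above) is to format each tuple as a delimited line before saving and to parse each line back after loading. The sketch below assumes Python 3 and a two-column (name, age) layout; the helper names to_line and from_line are hypothetical, and plain lists stand in for the RDD so the snippet runs without Spark.

```python
import csv
import io


def to_line(record):
    """Format one (name, age) tuple as a single CSV line without a newline."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(record)
    return buf.getvalue()


def from_line(line):
    """Parse one CSV line back into a (name, age) tuple, with age as an int."""
    name, age = next(csv.reader([line]))
    return (name, int(age))


# Plain lists stand in for the RDD contents here
rows = [("Alice", 1), ("Alice", 2), ("Joe", 3)]
lines = [to_line(r) for r in rows]            # what would be written to disk
restored = [from_line(l) for l in lines]      # what would be read back
```

With Spark, the same helpers could be applied as df.rdd.map(tuple).map(to_line).saveAsTextFile(path) when saving and sc.textFile(path).map(from_line) when loading, where path is a placeholder. Unlike the pickle approach, the types have to be restored by hand (here, int(age)), since a text file only stores strings.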

Source: https://stackoverflow.com/questions/34083871/how-to-save-a-spark-dataframe-as-a-text-file-without-rows-in-pyspark

Tags

python

apache-spark

pyspark
