Can I convert a pandas DataFrame to a Spark RDD?

Submitted by 心不动则不痛 on 2019-12-21 20:24:50

Question


Problem:

a) Read a local file into a pandas DataFrame, say PD_DF. b) Manipulate/massage PD_DF and add columns to the DataFrame. c) Write PD_DF to HDFS using Spark. How do I do it?


Answer 1:


You can use the SQLContext object's createDataFrame method, which accepts a pandas DataFrame as its input data.




Answer 2:


Let's say the dataframe is of type pandas.core.frame.DataFrame. In Spark 2.1 (PySpark) I did this:

rdd_data = spark.createDataFrame(dataframe)\
                .rdd

If you want to rename any columns or select only a few of them, do that before calling .rdd.

Hope it works for you too.




Answer 3:


I use Spark 1.6.0. First transform the pandas DataFrame into a Spark DataFrame, then the Spark DataFrame into a Spark RDD:

sparkDF = sqlContext.createDataFrame(pandasDF)
sparkRDD = sparkDF.rdd.map(list)
type(sparkRDD)
# pyspark.rdd.PipelinedRDD


Source: https://stackoverflow.com/questions/29635776/can-i-convert-pandas-dataframe-to-spark-rdd
