How to convert Spark RDD to pandas dataframe in ipython?

浪尽此生 提交于 2019-11-29 15:58:44

问题


I have a RDD and I want to convert it to pandas dataframe. I know that to convert and RDD to a normal dataframe we can do

df = rdd1.toDF()

But I want to convert the RDD to pandas dataframe and not a normal dataframe. How can I do it?


回答1:


You can use function toPandas():

Returns the contents of this DataFrame as Pandas pandas.DataFrame.

This is only available if Pandas is installed and available.

>>> df.toPandas()  
   age   name
0    2  Alice
1    5    Bob



回答2:


You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

For example, let's say I have a text file, flights.csv, that has been read in to an RDD:

flights = sc.textFile('flights.csv')

You can check the type:

type(flights)
<class 'pyspark.rdd.RDD'>

If you just use toPandas() on the RDD, it won't work. Depending on the format of the objects in your RDD, some processing may be necessary to go to a Spark DataFrame first. In the case of this example, this code does the job:

# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()

#Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()

You can check the type:

type(pdsDF)
<class 'pandas.core.frame.DataFrame'>



回答3:


I recommend a fast version of toPandas by joshlk

<script src="https://gist.github.com/joshlk/871d58e01417478176e7.js"></script>


来源:https://stackoverflow.com/questions/34817549/how-to-convert-spark-rdd-to-pandas-dataframe-in-ipython

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!