I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal (Spark) DataFrame we can do:
df = rdd1.toDF()
But I want to convert the RDD to a pandas DataFrame, not a normal DataFrame. How can I do it?
You can call toPandas() on the Spark DataFrame you get from rdd1.toDF(). Its docstring reads:
Returns the contents of this DataFrame as Pandas pandas.DataFrame.
This is only available if Pandas is installed and available.
>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob
You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.
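As a rough sketch of that two-step route, the helper below wraps both calls. The name rdd_to_pandas and its columns parameter are my own and not part of PySpark. It assumes a SparkSession (or SQLContext) has already been created, since toDF() is only attached to RDDs once one exists, and that the collected data fits in the driver's memory.

```python
def rdd_to_pandas(rdd, columns=None):
    """Convert an RDD of rows (tuples, lists, or Row objects) to a pandas DataFrame.

    Hypothetical helper, not a PySpark API. Assumes an active
    SparkSession/SQLContext so that RDD.toDF() is available.
    """
    # Step 1: RDD -> Spark DataFrame (passing column names is optional)
    spark_df = rdd.toDF(columns) if columns else rdd.toDF()
    # Step 2: Spark DataFrame -> pandas DataFrame (collects every row to the driver)
    return spark_df.toPandas()
```

With the question's rdd1 this would be called as rdd_to_pandas(rdd1), or rdd_to_pandas(rdd1, ['age', 'name']) to name the columns.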
For example, let's say I have a text file, flights.csv, that has been read into an RDD:
flights = sc.textFile('flights.csv')
You can check the type:
type(flights)
<class 'pyspark.rdd.RDD'>
If you just call toPandas() on the RDD, it won't work, because an RDD has no toPandas() method. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In this example, the following code does the job:
# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()
# Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()
You can check the type:
type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
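The plain split(',') above leaves the columns named _1, _2, ... and breaks on quoted fields that contain commas. A possible variation, assuming the file's first line is a header row (the original example does not say whether it is), is sketched below. The function names parse_csv_line and csv_rdd_to_pandas are made up for illustration; the RDD calls used are first(), filter(), map() and toDF().

```python
import csv


def parse_csv_line(line):
    """Split one CSV line into a list of fields, honouring quoted commas."""
    return next(csv.reader([line]))


def csv_rdd_to_pandas(lines_rdd):
    """Convert an RDD of raw CSV lines to a pandas DataFrame.

    Hypothetical helper. Assumes the first line is a header and that an
    active SparkSession/SQLContext exists so that toDF() is available.
    """
    header_line = lines_rdd.first()
    columns = parse_csv_line(header_line)
    # Drop the header (and any line identical to it), then parse each row
    rows = lines_rdd.filter(lambda line: line != header_line).map(parse_csv_line)
    # RDD -> Spark DataFrame with named columns -> pandas DataFrame
    return rows.toDF(columns).toPandas()
```

Usage would be pdsDF = csv_rdd_to_pandas(flights). Every column still comes back as strings, so numeric columns need casting afterwards.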
I recommend a fast version of toPandas by joshlk: https://gist.github.com/joshlk/871d58e01417478176e7
Source: https://stackoverflow.com/questions/34817549/how-to-convert-spark-rdd-to-pandas-dataframe-in-ipython