How to convert Spark RDD to pandas dataframe in ipython?

迷失自我 2020-12-15 15:34

I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal dataframe

3 Answers
  • 2020-12-15 16:16

    I recommend a fast version of toPandas by joshlk

    https://gist.github.com/joshlk/871d58e01417478176e7

  • 2020-12-15 16:29

    You can use the Spark DataFrame method toPandas(). From its documentation:

    Returns the contents of this DataFrame as Pandas pandas.DataFrame.

    This is only available if Pandas is installed and available.

    >>> df.toPandas()  
       age   name
    0    2  Alice
    1    5    Bob
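
A note on the answer above: toPandas() collects the entire DataFrame into the driver's memory, so it is best suited to results known to be small. Below is a minimal sketch of one way to cap the size first, assuming `df` is any Spark DataFrame. The helper name `preview_as_pandas` and its `max_rows` default are hypothetical and not part of PySpark.

```python
def preview_as_pandas(df, max_rows=1000):
    """Return at most max_rows rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper for illustration. limit() and toPandas() are
    standard Spark DataFrame methods; the row cap is an arbitrary default.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is applied on the Spark side, so only the capped rows
    # are collected to the driver by toPandas().
    return df.limit(max_rows).toPandas()
```

Calling `preview_as_pandas(df, max_rows=100)` would return a pandas DataFrame with no more than 100 rows, which is usually enough for inspection in IPython.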
    
  • 2020-12-15 16:35

    You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

    For example, let's say I have a text file, flights.csv, that has been read in to an RDD:

    flights = sc.textFile('flights.csv')
    

    You can check the type:

    type(flights)
    <class 'pyspark.rdd.RDD'>
    

    Calling toPandas() directly on the RDD won't work, because toPandas() is a method of Spark DataFrames, not of RDDs. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In this example, the following code does the job:

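    # RDD to Spark DataFrame
    # (toDF() is only available on RDDs once a SparkSession or SQLContext exists;
    # every column comes out as a string, named _1, _2, ...)
    sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()
    
    # Spark DataFrame to Pandas DataFrame
    pdsDF = sparkDF.toPandas()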
    

    You can check the type:

    type(pdsDF)
    <class 'pandas.core.frame.DataFrame'>
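
The two steps above can be wrapped in a small helper. This is a sketch under assumptions, not the answerer's code: `parse_csv_line` and `rdd_to_pandas` are hypothetical names, and the standard-library csv module stands in for split(',') so that quoted fields containing commas stay intact.

```python
import csv


def parse_csv_line(line):
    """Split one CSV line into a list of string fields.

    Uses csv.reader instead of str.split(',') so that a quoted field
    such as "New York, NY" is kept as a single value.
    """
    return next(csv.reader([line]), [])


def rdd_to_pandas(rdd, columns=None):
    """Convert an RDD of CSV text lines to a pandas DataFrame.

    Hypothetical helper. Assumes a SparkSession is active (toDF() is only
    available on RDDs then) and that the result fits in driver memory.
    """
    rows = rdd.map(parse_csv_line)
    # Pass column names when known; otherwise Spark names them _1, _2, ...
    spark_df = rows.toDF(columns) if columns else rows.toDF()
    return spark_df.toPandas()
```

For example, `rdd_to_pandas(flights, ['carrier', 'origin', 'dest'])` would name three columns. Those names are placeholders, since the layout of flights.csv isn't given here. A header line, if the file has one, would come through as an ordinary row.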
    