How to convert Spark RDD to pandas dataframe in ipython?

迷失自我 2020-12-15 15:34

I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal dataframe

3 answers
  •  陌清茗 (original poster)
    2020-12-15 16:35

    You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

    For example, let's say I have a text file, flights.csv, that has been read into an RDD:

    flights = sc.textFile('flights.csv')
    

    You can check the type:

    type(flights)
    <class 'pyspark.rdd.RDD'>


    Calling toPandas() directly on the RDD won't work, because an RDD has no toPandas() method; only Spark DataFrames do. Depending on the format of the objects in your RDD, some processing may be needed to get to a Spark DataFrame first. For this example, the following code does the job:

    # RDD to Spark DataFrame (toDF() needs an active SparkSession or SQLContext)
    sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()
    
    # Spark DataFrame to pandas DataFrame (collects all rows to the driver)
    pdsDF = sparkDF.toPandas()
    

    You can check the type:

    type(pdsDF)
    <class 'pandas.core.frame.DataFrame'>
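
The two steps above can also be wrapped in a small helper. The following is only a sketch under assumptions: the function names (parse_csv_line, rdd_to_pandas, demo), the column names, and the sample rows are hypothetical, since the real layout of flights.csv is not shown. It uses Python's csv module instead of a plain split(',') so that quoted fields containing commas stay intact, and it passes column names to toDF() so the result does not end up with the default _1, _2, ... headers.

```python
import csv


def parse_csv_line(line):
    """Split one CSV text line into a list of string fields.

    Unlike line.split(','), this keeps quoted fields that contain commas intact.
    """
    return next(csv.reader([line]))


def rdd_to_pandas(lines_rdd, column_names):
    """Convert an RDD of CSV text lines into a pandas DataFrame.

    Assumes an active SparkSession, because RDD.toDF() is only available
    once one has been created. Every column comes back as a string.
    """
    rows = lines_rdd.map(parse_csv_line)
    spark_df = rows.toDF(column_names)  # RDD -> Spark DataFrame
    return spark_df.toPandas()          # Spark DataFrame -> pandas (collects to the driver)


def demo():
    """Hypothetical usage; needs pyspark and pandas installed."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-to-pandas").getOrCreate()
    # Made-up sample rows standing in for the contents of flights.csv.
    flights = spark.sparkContext.parallelize(["AA,JFK,LAX", "UA,SFO,ORD"])
    pds_df = rdd_to_pandas(flights, ["carrier", "origin", "dest"])
    print(type(pds_df))
    spark.stop()
```

Calling demo() in an environment with pyspark installed runs the whole round trip on the sample rows. Keep in mind that toPandas() pulls the entire dataset into the driver's memory, so this only suits data that fits on one machine.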
    
    
