Pyspark: display a spark data frame in a table format

Asked by 粉色の甜心 · 2020-12-25 09:50

I am using pyspark to read a parquet file like below:

my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')

Then, when I inspect the contents, the rows come back as a list of Row objects rather than a readable table. How can I display the data frame in a table format, like a pandas data frame?

4 Answers
  •  萌比男神i
    2020-12-25 10:38

    Let's say we have the following Spark DataFrame:

    df = sqlContext.createDataFrame(
        [
            (1, "Mark", "Brown"), 
            (2, "Tom", "Anderson"), 
            (3, "Joshua", "Peterson")
        ], 
        ('id', 'firstName', 'lastName')
    )
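
    To sanity-check the frame before printing it, you can inspect its schema (printSchema() is a standard DataFrame method; this check is not part of the original answer):

    >>> df.printSchema()
    root
     |-- id: long (nullable = true)
     |-- firstName: string (nullable = true)
     |-- lastName: string (nullable = true)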
    

    There are typically three ways to print the contents of a dataframe:

    Print Spark DataFrame

    The most common way is to use the show() function:

    >>> df.show()
    +---+---------+--------+
    | id|firstName|lastName|
    +---+---------+--------+
    |  1|     Mark|   Brown|
    |  2|      Tom|Anderson|
    |  3|   Joshua|Peterson|
    +---+---------+--------+
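
    By default, show() prints up to 20 rows and truncates long cell values; both behaviours are controlled by its optional n and truncate parameters:

    >>> df.show(n=2)             # print only the first two rows
    >>> df.show(truncate=False)  # do not truncate long values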
    

    Print Spark DataFrame vertically

    Say you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically. For example, the following command prints the top two rows vertically, without any truncation:

    >>> df.show(n=2, truncate=False, vertical=True)
    -RECORD 0-------------
     id        | 1        
     firstName | Mark     
     lastName  | Brown    
    -RECORD 1-------------
     id        | 2        
     firstName | Tom      
     lastName  | Anderson 
    only showing top 2 rows
    

    Convert to Pandas and print Pandas DataFrame

    Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and then print() it.

    >>> df_pd = df.toPandas()
    >>> print(df_pd)
       id firstName  lastName
    0   1      Mark     Brown
    1   2       Tom  Anderson
    2   3    Joshua  Peterson
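
    If you only need a small preview, one common pattern (not part of the original answer) is to limit the rows before converting, so that only a handful of rows are collected to the driver:

    >>> df_preview = df.limit(2).toPandas()  # collect at most two rows to the driver
    >>> print(df_preview)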
    

    Note that this is not recommended for fairly large dataframes, as toPandas() collects all the data into the driver's memory. If you do need to convert a large Spark dataframe to a Pandas one, the following configuration will help:

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
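
    A minimal end-to-end sketch, assuming a SparkSession named spark as in a standard PySpark shell (note that on Spark 2.x the key is spark.sql.execution.arrow.enabled instead):

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    >>> df_pd = df.toPandas()  # now uses Arrow to transfer data instead of row-by-row serialization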
    

    For more details, you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames.
