PySpark: show histogram of a data frame column

情书的邮戳 2020-12-14 01:04

With a pandas DataFrame, I use the following code to plot a histogram of a column:

my_df.hist(column='field_1')

Is there something that

5 answers
  •  予麋鹿 (original poster)
    2020-12-14 01:56

    Another solution, which needs no extra imports and should also be efficient, uses a window function. First, define a window partitioned by the column:

    import pyspark.sql.functions as F
    from pyspark.sql import Window

    # One window partition per distinct value in the column
    win = Window.partitionBy('column_of_values')

    Then all you need is a count aggregation over that window:

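    # Keep each value next to its count; distinct() leaves one row per value
    df.select('column_of_values', F.count('column_of_values').over(win).alias('histogram')).distinct()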

    The aggregation runs on each partition across the cluster and does not require an extra round trip to the driver.
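
To make concrete what a count over a partitioned window returns, here is a small plain-Python emulation (no Spark needed). It is only a sketch of the semantics, not how Spark executes the query, and the sample `values` list is made up for illustration.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'column_of_values' column.
values = [1, 1, 2, 3, 3, 3]

# Emulates partitioning by the column: one group per distinct value.
partition_sizes = Counter(values)

# Emulates a count over the window: every input row receives the size
# of its own partition, so the result has as many rows as the input.
windowed = [(v, partition_sizes[v]) for v in values]

# Dropping duplicate (value, count) pairs leaves one row per distinct value,
# which is the histogram table.
histogram = sorted(set(windowed))

print(windowed)
print(histogram)
```

In PySpark, getting one row per value generally means selecting `column_of_values` alongside the count and calling `.distinct()`; `df.groupBy('column_of_values').count()` produces the same value/count table.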
