PySpark: show histogram of a data frame column

情书的邮戳 2020-12-14 01:04

With a pandas data frame, I use the following code to plot a histogram of a column:

my_df.hist(column='field_1')

Is there something that

5 Answers

予麋鹿 (original poster)

2020-12-14 01:56
Another solution, which needs no imports beyond PySpark itself and should also be efficient. First, define a window partitioned by the column:
```
import pyspark.sql.functions as F
from pyspark.sql import Window

# One window partition per distinct value in the column
win = Window.partitionBy('column_of_values')
```
Then all you need is a count aggregation over that window:

df.select('column_of_values', F.count('column_of_values').over(win).alias('histogram')).distinct()

The aggregation runs on each partition of the cluster and does not require an extra round trip to the host.
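To make the semantics of the window count concrete, here is a minimal plain-Python model of it (no Spark required). It is only a sketch: the function names `window_count` and `histogram` and the sample values are hypothetical, not part of PySpark. The point it illustrates is that a window aggregate keeps one output row per input row, so each value's count is repeated once per occurrence; pairing the count with its value and dropping duplicates yields one row per histogram bin.

```python
from collections import Counter


def window_count(values):
    """Model of a count over a window partitioned by the value itself.

    Returns one (value, count) pair per input row, as a window aggregate does.
    Input order is kept here for readability; Spark does not guarantee row order.
    """
    counts = Counter(values)
    return [(v, counts[v]) for v in values]


def histogram(values):
    """Collapse the per-row counts into one (value, count) pair per distinct value."""
    return sorted(set(window_count(values)))


# Hypothetical contents of 'column_of_values'
sample = ["a", "b", "a", "c", "a", "b"]

per_row = window_count(sample)  # same length as the input
bins = histogram(sample)        # one entry per distinct value

print(per_row)
print(bins)
```

The `bins` pairs are what you would hand to a plotting library as bar positions and heights once the (much smaller) result has been collected.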