In a pandas data frame, I am using the following code to plot a histogram of a column:
my_df.hist(column='field_1')
Is there something that can achieve the same goal conveniently in a PySpark data frame?
Another solution, which needs no extra imports and should also be efficient. First, define a window partitioned by the column's values:
import pyspark.sql.functions as F
from pyspark.sql import Window
win = Window.partitionBy('column_of_values')
Then all you need is a count aggregation over that window; selecting the value alongside its count and dropping duplicates yields one row per distinct value:
df.select('column_of_values', F.count('column_of_values').over(win).alias('count')).distinct()
The aggregation happens on each partition of the cluster and does not require an extra round-trip to the driver.
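Conceptually, the window count above builds a per-value frequency table. A minimal plain-Python sketch of the same semantics (the sample values are hypothetical, standing in for 'column_of_values'):

```python
from collections import Counter

# Hypothetical sample of the column's values
values = ['a', 'b', 'a', 'c', 'b', 'a']

# Counter mirrors count().over(Window.partitionBy(...)) followed by
# .distinct(): one (value, count) pair per distinct value
histogram = dict(Counter(values))
print(histogram)  # {'a': 3, 'b': 2, 'c': 1}
```

The resulting pairs can then be passed to any local plotting tool (e.g. a bar chart) once collected to the driver.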