Percentage count per group and pivot with PySpark


You can pivot with a count aggregation and then divide each count by its row total. First some imports:

from pyspark.sql.functions import col, lit, coalesce
from itertools import chain
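
To make the walkthrough concrete, here is a minimal, hypothetical sample DataFrame; the from/to column names match the question, but the rows are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: two "from" groups, three distinct "to" values
df = spark.createDataFrame(
    [("A", "x"), ("A", "x"), ("A", "y"), ("B", "y"), ("B", "z")],
    ["from", "to"],
)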

Find the pivot levels, i.e. the distinct values of the to column:

# collect the distinct "to" values up front; passing them to pivot()
# explicitly spares Spark an extra job to infer them
levels = [x for x in chain(*df.select("to").distinct().collect())]

Pivot:

pivoted = df.groupBy("from").pivot("to", levels).count()
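
With the sample data above (and assuming levels came back as ['x', 'y', 'z']), pivoted holds raw counts, with null wherever a from/to combination never occurs:

+----+----+----+----+
|from|   x|   y|   z|
+----+----+----+----+
|   A|   2|   1|null|
|   B|null|   1|   1|
+----+----+----+----+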

Compute the row-total expression (coalesce maps the nulls produced by pivot to 0 before summing):

row_count = sum(coalesce(col(x), lit(0)) for x in levels)

Create a list of adjusted columns, again coalescing nulls to 0 so that absent combinations come out as 0 rather than null:

adjusted = [(coalesce(col(c), lit(0)) / row_count).alias(c) for c in levels]

Finally, select:

pivoted.select(col("from"), *adjusted)
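
Run against the hypothetical sample data, with the level order assumed above, the pipeline gives one row per from value, each row's fractions summing to 1 (row order may vary):

pivoted.select(col("from"), *adjusted).show()

+----+------------------+------------------+---+
|from|                 x|                 y|  z|
+----+------------------+------------------+---+
|   A|0.6666666666666666|0.3333333333333333|0.0|
|   B|               0.0|               0.5|0.5|
+----+------------------+------------------+---+

These are fractions in [0, 1]; multiply by lit(100) inside the adjusted expressions if you want values on a 0-100 percentage scale.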