Pyspark Spark DataFrame - Aggregate and filter columns in map type column


You can use a window function to generate the count, then use inbuilt functions to get the final dataframe you desire by doing the following

from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window.partitionBy("c1")

(df.withColumn("cnt_orig", F.count("c1").over(windowSpec)).orderBy("c3")
   .groupBy("c1", "c2", "cnt_orig").agg(F.first("c3").alias("c3"))
   .withColumn("c2", F.regexp_replace(
       F.regexp_replace(F.array("c2", "c3").cast("string"), "[\\[\\]]", ""), ",", " : "))
   .groupBy("c1", "cnt_orig").agg(F.collect_list("c2").alias("map_category_room_date")))

You should get the following result

+---+--------+----------------------+
|c1 |cnt_orig|map_category_room_date|
+---+--------+----------------------+
|A  |4       |[b : 09:00, c : 22:00]|
|b  |1       |[c : 09:00]           |
+---+--------+----------------------+

Scala way

Working code to get the desired output in Scala is

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._  // for the $"col" syntax

val windowSpec = Window.partitionBy("c1")
df.withColumn("cnt_orig", count("c1").over(windowSpec)).orderBy("c3")
  .groupBy("c1", "c2", "cnt_orig").agg(first("c3").as("c3"))
  .withColumn("c2", regexp_replace(regexp_replace(array($"c2", $"c3").cast(StringType), "[\\[\\]]", ""), ",", " : "))
  .groupBy("c1", "cnt_orig").agg(collect_list("c2").as("map_category_room_date"))