My DataFrame looks like:
| c1 | c2 | c3    |
|----|----|-------|
| A  | b  | 22:00 |
| A  | b  | 23:00 |
| A  | b  | 09:00 |
| A  | c  | 22:00 |
| B  | c  | 09:30 |
I would like to perform some aggregations and create a second DataFrame with 3 columns:
c1: the column I want to group by.
map_category_room_date: a map-type column whose keys are the values of c2 and whose values are the minimum value of c3 for that key.
cnt_orig: the number of rows the original group had.
Result
| c1  | map_category_room_date   | cnt_orig |
|-----|--------------------------|----------|
| 'A' | {'b': 09:00, 'c': 22:00} | 4        |
| 'B' | {'c': 09:30}             | 1        |
What aggregate functions can I use to achieve this in the simplest way?
Thanks
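For reference, the sample input above can be reproduced as a small PySpark DataFrame like this (a minimal sketch; keeping c3 as a plain string is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Sample data from the question; c3 is kept as a plain string
df = spark.createDataFrame(
    [("A", "b", "22:00"), ("A", "b", "23:00"), ("A", "b", "09:00"),
     ("A", "c", "22:00"), ("B", "c", "09:30")],
    ["c1", "c2", "c3"])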
You can use a window function to generate the count, then use built-in functions to get the final DataFrame you want by doing the following:
from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window.partitionBy("c1")
result = (df.withColumn("cnt_orig", F.count("c1").over(windowSpec))
    # sorting by c3 makes first("c3") pick the minimum c3 per (c1, c2)
    .orderBy("c3").groupBy("c1", "c2", "cnt_orig").agg(F.first("c3").alias("c3"))
    # turn the (c2, c3) pair into a "key : value" string such as "b : 09:00"
    .withColumn("c2", F.regexp_replace(
        F.regexp_replace(F.array("c2", "c3").cast("string"), "[\\[\\]]", ""), ",\\s*", " : "))
    .groupBy("c1", "cnt_orig").agg(F.collect_list("c2").alias("map_category_room_date")))
Calling result.show(truncate=False) should give the following result:
+---+--------+----------------------+
|c1 |cnt_orig|map_category_room_date|
+---+--------+----------------------+
|A |4 |[b : 09:00, c : 22:00]|
|B  |1       |[c : 09:30]           |
+---+--------+----------------------+
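Note that this gives an array of "key : value" strings rather than a true map column. If you actually need a MapType column and are on Spark 2.4 or later (an assumption), a sketch using only groupBy aggregations and map_from_entries would be:

from pyspark.sql import functions as F

result = (df.groupBy("c1", "c2")
    # minimum c3 and row count per (c1, c2) pair
    .agg(F.min("c3").alias("c3"), F.count(F.lit(1)).alias("cnt"))
    .groupBy("c1")
    # assemble a real map<string,string> from the (c2, min c3) pairs
    .agg(F.map_from_entries(F.collect_list(F.struct("c2", "c3"))).alias("map_category_room_date"),
         F.sum("cnt").alias("cnt_orig")))

Since c3 holds zero-padded HH:mm strings, min("c3") orders them lexicographically, which matches chronological order here.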
Scala way
The working code to get the desired output in Scala is:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("c1")
df.withColumn("cnt_orig", count("c1").over(windowSpec)).orderBy("c3")
  .groupBy("c1", "c2", "cnt_orig").agg(first("c3").as("c3"))
  .withColumn("c2", regexp_replace(regexp_replace(array(col("c2"), col("c3")).cast("string"), "[\\[\\]]", ""), ",\\s*", " : "))
  .groupBy("c1", "cnt_orig").agg(collect_list("c2").as("map_category_room_date"))
Source: https://stackoverflow.com/questions/45445077/pyspark-spark-dataframe-aggregate-and-filter-columns-in-map-type-column