My DataFrame looks like:
| c1 | c2 | c3    |
|----|----|-------|
| A  | b  | 22:00 |
| A  | b  | 23:00 |
| A  | b  | 09:00 |
| A  | c  | 22:00 |
| B  | c  | 09:30 |
I would like to perform some aggregations and create a second DataFrame with 3 columns:
c1: the column I want to group by.
map_category_room_date: a map-type column whose keys are the values of c2 and whose values are the minimum value of c3 for that key.
cnt_orig: the number of rows the original group had.
Result
| c1  | map_category_room_date   | cnt_orig |
|-----|--------------------------|----------|
| 'A' | {'b': 09:00, 'c': 22:00} | 4        |
| 'B' | {'c': 09:30}             | 1        |
What aggregate functions can I use to achieve this in the simplest way?
Thanks
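For reference, the sample input above can be reproduced as a small PySpark DataFrame like this (a minimal sketch; keeping c3 as a plain string is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Sample data from the question; c3 is kept as a plain string
df = spark.createDataFrame(
    [("A", "b", "22:00"), ("A", "b", "23:00"), ("A", "b", "09:00"),
     ("A", "c", "22:00"), ("B", "c", "09:30")],
    ["c1", "c2", "c3"])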
You can use a window function to generate the count, then use built-in functions to get the final DataFrame you want by doing the following:
from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window.partitionBy("c1")
result = (df.withColumn("cnt_orig", F.count("c1").over(windowSpec))
    # sorting by c3 makes first("c3") pick the minimum c3 per (c1, c2)
    .orderBy("c3").groupBy("c1", "c2", "cnt_orig").agg(F.first("c3").alias("c3"))
    # turn the (c2, c3) pair into a "key : value" string such as "b : 09:00"
    .withColumn("c2", F.regexp_replace(
        F.regexp_replace(F.array("c2", "c3").cast("string"), "[\\[\\]]", ""), ",\\s*", " : "))
    .groupBy("c1", "cnt_orig").agg(F.collect_list("c2").alias("map_category_room_date")))
Calling result.show(truncate=False) should give the following result:
+---+--------+----------------------+
|c1 |cnt_orig|map_category_room_date|
+---+--------+----------------------+
|A |4 |[b : 09:00, c : 22:00]|
|B  |1       |[c : 09:30]           |
+---+--------+----------------------+
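Note that this gives an array of "key : value" strings rather than a true map column. If you actually need a MapType column and are on Spark 2.4 or later (an assumption), a sketch using only groupBy aggregations and map_from_entries would be:

from pyspark.sql import functions as F

result = (df.groupBy("c1", "c2")
    # minimum c3 and row count per (c1, c2) pair
    .agg(F.min("c3").alias("c3"), F.count(F.lit(1)).alias("cnt"))
    .groupBy("c1")
    # assemble a real map<string,string> from the (c2, min c3) pairs
    .agg(F.map_from_entries(F.collect_list(F.struct("c2", "c3"))).alias("map_category_room_date"),
         F.sum("cnt").alias("cnt_orig")))

Since c3 holds zero-padded HH:mm strings, min("c3") orders them lexicographically, which matches chronological order here.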
Scala way
The working code to get the desired output in Scala is:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("c1")
df.withColumn("cnt_orig", count("c1").over(windowSpec)).orderBy("c3")
  .groupBy("c1", "c2", "cnt_orig").agg(first("c3").as("c3"))
  .withColumn("c2", regexp_replace(regexp_replace(array(col("c2"), col("c3")).cast("string"), "[\\[\\]]", ""), ",\\s*", " : "))
  .groupBy("c1", "cnt_orig").agg(collect_list("c2").as("map_category_room_date"))
Source: https://stackoverflow.com/questions/45445077/pyspark-spark-dataframe-aggregate-and-filter-columns-in-map-type-column