Question
My DataFrame looks like:
| c1 | c2 | c3    |
|----+----+-------|
| A  | b  | 22:00 |
| A  | b  | 23:00 |
| A  | b  | 09:00 |
| A  | c  | 22:00 |
| B  | c  | 09:30 |
I would like to perform some aggregations and create a second DataFrame with 3 columns:
c1: the column I want to group by.
map_category_room_date: a map-type column whose keys are the values of c2 and whose values are the lowest (min) value of c3 for that key.
cnt_orig: the count of how many rows the original group had.
Result
| c1 | map_category_room_date | cnt_orig |
|----------+-------------------------+----------|
| 'A' | {'b': 09:00, 'c': 22:00} | 4        |
| 'B' | {'c': 09:30}             | 1        |
What aggregate functions can I use to achieve this in the simplest way?
Thanks
Answer 1:
You can use a window function to generate the count, then use built-in functions to get the final DataFrame you want by doing the following:
from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window.partitionBy("c1")

# count rows per c1 group, take the earliest c3 per (c1, c2), format "key : value" strings, then collect them per c1
df.withColumn("cnt_orig", F.count("c1").over(windowSpec)) \
    .orderBy("c3") \
    .groupBy("c1", "c2", "cnt_orig").agg(F.first("c3").alias("c3")) \
    .withColumn("c2", F.regexp_replace(F.regexp_replace(F.array("c2", "c3").cast("string"), "[\\[\\]]", ""), ",", " : ")) \
    .groupBy("c1", "cnt_orig").agg(F.collect_list("c2").alias("map_category_room_date")) \
    .show(truncate=False)
You should get the following result
+---+--------+----------------------+
|c1 |cnt_orig|map_category_room_date|
+---+--------+----------------------+
|A  |4       |[b : 09:00, c : 22:00]|
|B  |1       |[c : 09:30]           |
+---+--------+----------------------+
Scala way
Working code to get the desired output in Scala is:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val windowSpec = Window.partitionBy("c1")
df.withColumn("cnt_orig", count("c1").over(windowSpec)).orderBy("c3").groupBy("c1", "c2", "cnt_orig").agg(first("c3").as("c3"))
  .withColumn("c2", regexp_replace(regexp_replace(array(col("c2"), col("c3")).cast(StringType), "[\\[\\]]", ""), ",", " : "))
  .groupBy("c1", "cnt_orig").agg(collect_list("c2").as("map_category_room_date"))
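
If you want map_category_room_date to be an actual map-type column rather than an array of formatted strings, a minimal alternative PySpark sketch (not part of the original answer; it assumes Spark 2.4+ for map_from_entries, and that c3 holds zero-padded HH:mm strings so min compares them correctly) could look like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", "b", "22:00"), ("A", "b", "23:00"), ("A", "b", "09:00"),
     ("A", "c", "22:00"), ("B", "c", "09:30")],
    ["c1", "c2", "c3"])

result = (df.groupBy("c1", "c2")
    .agg(F.min("c3").alias("c3"), F.count("*").alias("cnt"))  # earliest c3 and row count per (c1, c2)
    .groupBy("c1")
    .agg(F.map_from_entries(F.collect_list(F.struct("c2", "c3"))).alias("map_category_room_date"),
         F.sum("cnt").alias("cnt_orig")))  # total rows per c1 group
result.show(truncate=False)

This yields a real map<string,string> column, so group A ends up with the entries b -> 09:00 and c -> 22:00 with cnt_orig = 4.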
Source: https://stackoverflow.com/questions/45445077/pyspark-spark-dataframe-aggregate-and-filter-columns-in-map-type-column