How to aggregate over rolling time window with groups in Spark

陌清茗 2020-12-03 01:54

I have some data that I want to group by a certain column, then aggregate a series of fields based on a rolling time window from the group.

Here is some example data
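
For illustration, a toy DataFrame with the column names used in the answer below (group_by, date, get_avg, plus an assumed get_first field) could be built as follows; the rows here are hypothetical, chosen only so that they are consistent with the output shown in the answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical rows: column names follow the answer below; get_first is an
    # assumed field name and the values are made up for illustration only.
    df = spark.createDataFrame(
        [("group1", "2016-01-10", 5.0, 1),
         ("group2", "2016-01-20", 15.0, 2),
         ("group2", "2016-02-10", 25.0, 3),
         ("group2", "2016-04-05", 8.0, 4)],
        ["group_by", "date", "get_avg", "get_first"],
    )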

1 Answer
  • 2020-12-03 02:29

    Revised answer:

    You can use a simple window function trick here. A bunch of imports:

    from pyspark.sql.functions import coalesce, col, datediff, lag, lit, sum as sum_
    from pyspark.sql.window import Window
    

    Window definition:

    w = Window.partitionBy("group_by").orderBy("date")
    

    Cast date to DateType:

    df_ = df.withColumn("date", col("date").cast("date"))
    

    Define the following expressions:

    # Difference from the previous record or 0 if this is the first one
    diff = coalesce(datediff("date", lag("date", 1).over(w)), lit(0))
    
    # 0 if diff <= 30, 1 otherwise
    indicator = (diff > 30).cast("integer")
    
    # Cumulative sum of indicators over the window
    subgroup = sum_(indicator).over(w).alias("subgroup")
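
    # A quick illustrative check (using the df_, diff, indicator and subgroup
    # defined above): show the per-row values of these expressions before
    # aggregating.
    df_.select(
        "group_by", "date",
        diff.alias("diff"),
        indicator.alias("indicator"),
        subgroup,
    ).show()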
    

    Add the subgroup expression to the table, then group and aggregate:

    df_.select("*", subgroup).groupBy("group_by", "subgroup").avg("get_avg").show()
    
    +--------+--------+------------+
    |group_by|subgroup|avg(get_avg)|
    +--------+--------+------------+
    |  group1|       0|         5.0|
    |  group2|       0|        20.0|
    |  group2|       1|         8.0|
    +--------+--------+------------+
    

    first is not meaningful with aggregations, but if the column is monotonically increasing you can use min instead. Otherwise you'll have to use window functions as well.
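
    A sketch of that combined aggregation, assuming the question's other field is called get_first (a hypothetical column name; substitute your own):

    from pyspark.sql.functions import avg, min as min_

    # Average get_avg and take the minimum of get_first per (group, subgroup).
    # get_first is an assumed column name, used here for illustration only.
    (df_
        .select("*", subgroup)
        .groupBy("group_by", "subgroup")
        .agg(avg("get_avg").alias("avg_get_avg"),
             min_("get_first").alias("first_get_first"))
        .show())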

    Tested with Spark 2.1. It may require subqueries and a Window instance when used with an earlier Spark release.

    The original answer (not relevant in the specified scope)

    Since Spark 2.0 you should be able to use a window function:

    Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).

    from pyspark.sql.functions import window
    
    df.groupBy(window("date", windowDuration="30 days")).count()
    

    but as you can see from the result,

    +---------------------------------------------+-----+
    |window                                       |count|
    +---------------------------------------------+-----+
    |[2016-01-30 01:00:00.0,2016-02-29 01:00:00.0]|1    |
    |[2015-12-31 01:00:00.0,2016-01-30 01:00:00.0]|2    |
    |[2016-03-30 02:00:00.0,2016-04-29 02:00:00.0]|1    |
    +---------------------------------------------+-----+
    

    you'll have to be a bit careful when it comes to timezones.
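
    For instance, pinning the session time zone to UTC keeps the displayed bucket boundaries predictable (a sketch; the spark.sql.session.timeZone setting assumes Spark 2.2 or later):

    from pyspark.sql.functions import window

    # Compute and display window boundaries in UTC instead of the JVM default
    # time zone (spark.sql.session.timeZone is available from Spark 2.2 on).
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    df.groupBy(window("date", windowDuration="30 days")).count().show(truncate=False)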
