pyspark: rolling average using timeseries data

时光取名叫无心 2020-12-02 11:38

I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I

4 Answers
  •  攒了一身酷
    2020-12-02 12:09

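    It's worth noting that if you don't care about exact dates and only need the average over the most recent rows, you can use rowsBetween instead:

    from pyspark.sql import Window, functions as F

    # Frame: the 6 preceding rows plus the current row, i.e. 7 rows in timestamp order.
    w = Window.orderBy('timestampGMT').rowsBetween(-6, 0)

    df = eurPrices.withColumn('rolling_average', F.avg('dollars').over(w))

    Because the window is ordered by timestamp, each average covers the last 7 rows. That matches the last 7 days only when there is exactly one row per day. You also avoid any casting of the timestamp column.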
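To see when a row-based frame like the one above stops matching a true weekly average, here is a minimal plain-Python sketch (no Spark needed) that mimics both semantics on a tiny made-up sample. The helper names `rolling_avg_rows` and `rolling_avg_range` and the sample values are assumptions for illustration only. In PySpark, the time-based variant would roughly correspond to ordering by the timestamp cast to seconds and using rangeBetween.

```python
from datetime import datetime, timedelta

def rolling_avg_rows(rows, preceding):
    """Average over the current row and up to `preceding` rows before it
    (row-count frame, like rowsBetween(-preceding, 0))."""
    rows = sorted(rows)
    out = []
    for i, (ts, _) in enumerate(rows):
        frame = [v for _, v in rows[max(0, i - preceding): i + 1]]
        out.append((ts, sum(frame) / len(frame)))
    return out

def rolling_avg_range(rows, span):
    """Average over rows whose timestamp lies in [ts - span, ts]
    (time-based frame, like a range frame over the timestamp)."""
    rows = sorted(rows)
    out = []
    for ts, _ in rows:
        frame = [v for t, v in rows if ts - span <= t <= ts]
        out.append((ts, sum(frame) / len(frame)))
    return out

# Hypothetical sample: three daily rows, then a 27-day gap.
sample = [
    (datetime(2020, 1, 1), 10.0),
    (datetime(2020, 1, 2), 20.0),
    (datetime(2020, 1, 3), 30.0),
    (datetime(2020, 1, 30), 100.0),
]

by_rows = rolling_avg_rows(sample, preceding=6)
by_range = rolling_avg_range(sample, span=timedelta(days=7))

for (ts, a), (_, b) in zip(by_rows, by_range):
    print(ts.date(), a, b)
```

On the last row the row-based average still includes the three January values from weeks earlier, while the time-based average only sees the row inside the 7-day span. If your data has gaps or several rows per day, the two will generally disagree.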
