What does the 'pyspark.sql.functions.window' function's 'startTime' argument do?

庸人自扰 2021-01-02 18:00

In the official documentation there is just a simple example:

The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start win

2 Answers
  •  猫巷女王i
    2021-01-02 18:21

    Let's go step by step.

    • Your data starts at 2017-01-09 09:00:10:

      df.orderBy("dt").show(3, False)
      
      +---------------------+---+
      |dt                   |val|
      +---------------------+---+
      |2017-01-09 09:00:10.0|1  |
      |2017-01-09 09:00:11.0|1  |
      |2017-01-09 09:00:12.0|1  |
      +---------------------+---+
      
    • The first full hour is 2017-01-09 09:00:00.0:

      from pyspark.sql.functions import min as min_, date_format
      # Truncate the earliest timestamp to the start of its hour.
      (df
         .groupBy()
         .agg(date_format(min_("dt"), "yyyy-MM-dd HH:00:00"))
         .show(1, False))
      
      +-----------------------------------------+
      |date_format(min(dt), yyyy-MM-dd HH:00:00)|
      +-----------------------------------------+
      |2017-01-09 09:00:00                      |
      +-----------------------------------------+
      
    • Therefore the first window starts at 2017-01-09 09:00:03, which is 2017-01-09 09:00:00 + startTime (3 seconds), and ends at 2017-01-09 09:00:08 (2017-01-09 09:00:00 + startTime + windowDuration).

      This window is empty: there is no data in the range [09:00:03, 09:00:08).

    • The first (and the second) data point fall into the next window, [09:00:07.0, 09:00:12.0), which starts at 2017-01-09 09:00:00 + startTime + 1 * slideDuration.

      win.orderBy("window.start").show(3, False)
      
      +---------------------------------------------+---+
      |window                                       |sum|
      +---------------------------------------------+---+
      |[2017-01-09 09:00:07.0,2017-01-09 09:00:12.0]|2  |
      |[2017-01-09 09:00:11.0,2017-01-09 09:00:16.0]|5  |
      |[2017-01-09 09:00:15.0,2017-01-09 09:00:20.0]|5  |
      +---------------------------------------------+---+
      

      Subsequent windows start at 2017-01-09 09:00:00 + startTime + n * slideDuration for n = 1, 2, ...
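
The arithmetic above can be checked without Spark. The following is a minimal plain-Python sketch, not Spark's implementation. It assumes windowDuration = 5 seconds and slideDuration = 4 seconds (inferred from the windows in the output above) and startTime = 3 seconds, and it uses the first full hour of the data as the reference point, as this answer does. The helper names `nth_window` and `windows_containing` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed values, inferred from the output shown above (not taken from Spark itself).
WINDOW = timedelta(seconds=5)      # windowDuration
SLIDE = timedelta(seconds=4)       # slideDuration
START_TIME = timedelta(seconds=3)  # startTime

# First full hour of the data, used as the reference point in this answer.
BASE = datetime(2017, 1, 9, 9, 0, 0)


def nth_window(n):
    """Return (start, end) of the n-th window counted from BASE (hypothetical helper)."""
    start = BASE + START_TIME + n * SLIDE
    return start, start + WINDOW


def windows_containing(ts, max_n=10):
    """Return the half-open windows [start, end), among the first max_n, that contain ts."""
    return [w for w in map(nth_window, range(max_n)) if w[0] <= ts < w[1]]


first_point = datetime(2017, 1, 9, 9, 0, 10)
print(nth_window(0))                    # the empty first window
print(windows_containing(first_point))  # window(s) holding the first data point
```

Under these assumptions the first window is [09:00:03, 09:00:08), and the first data point at 09:00:10 lands only in [09:00:07, 09:00:12), which matches the first row of the output.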
