In the official docs there is just a simple example:
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals.
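As an illustration only (a sketch, not the docs' own snippet), hourly tumbling windows that start 15 minutes past the hour could be expressed roughly like this, reusing the dt / val columns from the data below:

from pyspark.sql.functions import window, sum as sum_

# Sketch: 1-hour tumbling windows shifted by startTime = 15 minutes,
# i.e. 12:15-13:15, 13:15-14:15, ...
hourly = (df
    .groupBy(window("dt", "1 hour", startTime="15 minutes"))
    .agg(sum_("val").alias("sum")))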
Let's go step by step.
Your data starts at 2017-01-09 09:00:10:
df.orderBy("dt").show(3, False)
+---------------------+---+
|dt |val|
+---------------------+---+
|2017-01-09 09:00:10.0|1 |
|2017-01-09 09:00:11.0|1 |
|2017-01-09 09:00:12.0|1 |
+---------------------+---+
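(If you want to reproduce this, a DataFrame like the one above can be sketched as below. Only the three rows shown here are included; the actual data clearly contains more rows, given the sums further down.)

from pyspark.sql.functions import col

# Hypothetical reconstruction of the sample data shown above.
df = spark.createDataFrame(
    [("2017-01-09 09:00:10", 1),
     ("2017-01-09 09:00:11", 1),
     ("2017-01-09 09:00:12", 1)],
    ("dt", "val")
).withColumn("dt", col("dt").cast("timestamp"))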
The first full hour is 2017-01-09 09:00:00.0:
from pyspark.sql.functions import min as min_, date_format
(df
.groupBy()
.agg(date_format(min_("dt"), "yyyy-MM-dd HH:00:00"))
.show(1, False))
+-----------------------------------------+
|date_format(min(dt), yyyy-MM-dd HH:00:00)|
+-----------------------------------------+
|2017-01-09 09:00:00 |
+-----------------------------------------+
Therefore the first window will start at 2017-01-09 09:00:03, which is 2017-01-09 09:00:00 + startTime (3 seconds), and end at 2017-01-09 09:00:08 (2017-01-09 09:00:00 + startTime + windowDuration of 5 seconds).
This window is empty (there is no data in the range [09:00:03, 09:00:08)).
The first (and the second) data point will fall into the next window, [09:00:07.0, 09:00:12.0), which starts at 2017-01-09 09:00:00 + startTime + 1 * slideDuration.
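The exact query behind win is not repeated here, but judging from the output below it is presumably equivalent to something like this (5-second windows, sliding every 4 seconds, offset by 3 seconds):

from pyspark.sql.functions import window, sum as sum_

# Parameters inferred from the window bounds in the output:
# windowDuration = "5 seconds", slideDuration = "4 seconds", startTime = "3 seconds"
win = (df
    .groupBy(window("dt", "5 seconds", "4 seconds", "3 seconds"))
    .agg(sum_("val").alias("sum")))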
win.orderBy("window.start").show(3, False)
+---------------------------------------------+---+
|window |sum|
+---------------------------------------------+---+
|[2017-01-09 09:00:07.0,2017-01-09 09:00:12.0]|2 |
|[2017-01-09 09:00:11.0,2017-01-09 09:00:16.0]|5 |
|[2017-01-09 09:00:15.0,2017-01-09 09:00:20.0]|5 |
+---------------------------------------------+---+
Subsequent windows start at 2017-01-09 09:00:00 + startTime + n * slideDuration for n in 1, 2, ...
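To make the arithmetic explicit, the same boundaries can be enumerated in plain Python (n = 0 being the empty first window):

from datetime import datetime, timedelta

origin = datetime(2017, 1, 9, 9, 0, 0)   # first full hour
start_time = timedelta(seconds=3)        # startTime
slide = timedelta(seconds=4)             # slideDuration
width = timedelta(seconds=5)             # windowDuration

for n in range(4):
    start = origin + start_time + n * slide
    print(start, "-", start + width)

# 2017-01-09 09:00:03 - 2017-01-09 09:00:08
# 2017-01-09 09:00:07 - 2017-01-09 09:00:12
# 2017-01-09 09:00:11 - 2017-01-09 09:00:16
# 2017-01-09 09:00:15 - 2017-01-09 09:00:20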