In the official docs there is just a simple example:
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals.
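As an illustration only (a sketch, not the docs' own snippet), hourly tumbling windows that start 15 minutes past the hour could be expressed roughly like this, reusing the dt / val columns from the data below:

from pyspark.sql.functions import window, sum as sum_

# Sketch: 1-hour tumbling windows shifted by startTime = 15 minutes,
# i.e. 12:15-13:15, 13:15-14:15, ...
hourly = (df
    .groupBy(window("dt", "1 hour", startTime="15 minutes"))
    .agg(sum_("val").alias("sum")))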
Let's go step by step.
Your data starts at 2017-01-09 09:00:10:
df.orderBy("dt").show(3, False)
+---------------------+---+
|dt |val|
+---------------------+---+
|2017-01-09 09:00:10.0|1 |
|2017-01-09 09:00:11.0|1 |
|2017-01-09 09:00:12.0|1 |
+---------------------+---+
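(If you want to reproduce this, a DataFrame like the one above can be sketched as below. Only the three rows shown here are included; the actual data clearly contains more rows, given the sums further down.)

from pyspark.sql.functions import col

# Hypothetical reconstruction of the sample data shown above.
df = spark.createDataFrame(
    [("2017-01-09 09:00:10", 1),
     ("2017-01-09 09:00:11", 1),
     ("2017-01-09 09:00:12", 1)],
    ("dt", "val")
).withColumn("dt", col("dt").cast("timestamp"))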
The first full hour is 2017-01-09 09:00:00.0:
from pyspark.sql.functions import min as min_, date_format
(df
.groupBy()
.agg(date_format(min_("dt"), "yyyy-MM-dd HH:00:00"))
.show(1, False))
+-----------------------------------------+
|date_format(min(dt), yyyy-MM-dd HH:00:00)|
+-----------------------------------------+
|2017-01-09 09:00:00 |
+-----------------------------------------+
Therefore the first window will start at 2017-01-09 09:00:03, which is 2017-01-09 09:00:00 + startTime (3 seconds), and end at 2017-01-09 09:00:08 (2017-01-09 09:00:00 + startTime + windowDuration of 5 seconds).
This window is empty (there is no data in the range [09:00:03, 09:00:08)).
The first (and the second) data point will fall into the next window, [09:00:07.0, 09:00:12.0), which starts at 2017-01-09 09:00:00 + startTime + 1 * slideDuration.
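The exact query behind win is not repeated here, but judging from the output below it is presumably equivalent to something like this (5-second windows, sliding every 4 seconds, offset by 3 seconds):

from pyspark.sql.functions import window, sum as sum_

# Parameters inferred from the window bounds in the output:
# windowDuration = "5 seconds", slideDuration = "4 seconds", startTime = "3 seconds"
win = (df
    .groupBy(window("dt", "5 seconds", "4 seconds", "3 seconds"))
    .agg(sum_("val").alias("sum")))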
win.orderBy("window.start").show(3, False)
+---------------------------------------------+---+
|window |sum|
+---------------------------------------------+---+
|[2017-01-09 09:00:07.0,2017-01-09 09:00:12.0]|2 |
|[2017-01-09 09:00:11.0,2017-01-09 09:00:16.0]|5 |
|[2017-01-09 09:00:15.0,2017-01-09 09:00:20.0]|5 |
+---------------------------------------------+---+
Subsequent windows start at 2017-01-09 09:00:00 + startTime + n * slideDuration for n in 1, 2, ...
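To make the arithmetic explicit, the same boundaries can be enumerated in plain Python (n = 0 being the empty first window):

from datetime import datetime, timedelta

origin = datetime(2017, 1, 9, 9, 0, 0)   # first full hour
start_time = timedelta(seconds=3)        # startTime
slide = timedelta(seconds=4)             # slideDuration
width = timedelta(seconds=5)             # windowDuration

for n in range(4):
    start = origin + start_time + n * slide
    print(start, "-", start + width)

# 2017-01-09 09:00:03 - 2017-01-09 09:00:08
# 2017-01-09 09:00:07 - 2017-01-09 09:00:12
# 2017-01-09 09:00:11 - 2017-01-09 09:00:16
# 2017-01-09 09:00:15 - 2017-01-09 09:00:20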