What does the 'pyspark.sql.functions.window' function's 'startTime' argument do?

庸人自扰 2021-01-02 18:00

In the official doc there is just this brief description:

The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals.

2 Answers
  •  感情败类
    2021-01-02 18:16

    It has nothing to do with when your data starts. Of course, the first window will only appear once there is some data inside it, but the startTime itself is independent of your data. As the documentation says, the startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. If you create a window like this:
    w = F.window("date_field", "7 days", startTime='6 days')

    Spark will generate 7-day windows aligned on 1970-01-07 00:00:00 UTC (the epoch plus the 6-day offset), which is 1970-01-06 19:00:00 in my timezone:

    1970-01-06 19:00:00, 1970-01-13 19:00:00
    1970-01-13 19:00:00, 1970-01-20 19:00:00
    1970-01-20 19:00:00, 1970-01-27 19:00:00
    ...
    2017-05-16 19:00:00, 2017-05-23 19:00:00
    (if you continue calculating you get to this date) ...
    But you will only see the windows that overlap the dates in your dataframe. The 19:00:00 appears because of my timezone, which is UTC-05.
    If you create a window like this:

    w = F.window("date_field", "7 days", startTime='2 days')

    Spark will generate 7-day windows aligned on 1970-01-03 00:00:00 UTC (the epoch plus the 2-day offset), which is 1970-01-02 19:00:00 in my timezone:

    1970-01-02 19:00:00, 1970-01-09 19:00:00
    1970-01-09 19:00:00, 1970-01-16 19:00:00
    ...
    2017-05-19 19:00:00, 2017-05-26 19:00:00
    (if you continue calculating you get to this date)
    ...

    Again, you will only see the windows that overlap the dates in your dataframe.
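    To make the rule concrete, here is a small plain-Python sketch (not Spark code, just an illustration; the helper name assign_window is mine) of how a timestamp falls into a window given a window length and a startTime offset:

    from datetime import datetime, timedelta, timezone

    EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

    def assign_window(ts, window_days, start_time_days):
        # Windows are aligned on epoch + startTime; snap the timestamp down
        # to the nearest earlier multiple of the window length.
        window = timedelta(days=window_days)
        offset = timedelta(days=start_time_days)
        n = (ts - EPOCH - offset) // window   # whole windows elapsed since the alignment point
        start = EPOCH + offset + n * window
        return start, start + window

    # With "7 days" and startTime='6 days', an event on 2017-05-23 lands in
    # [2017-05-17 00:00 UTC, 2017-05-24 00:00 UTC), i.e. the
    # [2017-05-16 19:00:00, 2017-05-23 19:00:00] window shown above at UTC-05.
    print(assign_window(datetime(2017, 5, 23, tzinfo=timezone.utc), 7, 6))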

    So, how do you calculate the startTime for the windows of your data?
    You just need to take the number of days from 1970-01-01 to your desired window start, divide it by the length of your window, and keep the remainder. That remainder, in days, is the startTime offset.
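    If you prefer not to depend on the machine's local timezone (time.mktime does), here is a small timezone-independent sketch of the same remainder rule; it assumes you express the desired first window boundary directly in UTC (for the example below that boundary is 2017-05-22 00:00:00 UTC, which displays as 2017-05-21 19:00:00 at UTC-05):

    from datetime import datetime, timezone

    EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

    # Desired first window boundary, expressed in UTC.
    desired_start = datetime(2017, 5, 22, tzinfo=timezone.utc)

    days_since_epoch = (desired_start - EPOCH).days   # 17308
    offset_days = days_since_epoch % 7                # 4, so startTime='4 days'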


    I will explain it with an example: assuming that you need your windows to start at 2017-05-21 and that the length of the windows is 7 days, I will create a dummy dataframe for the example.

    from datetime import datetime
    import time

    from pyspark.sql import Row
    import pyspark.sql.functions as F

    row = Row("id", "date_field", "value")
    df = sc.parallelize([
        row(1, "2017-05-23", 5.0),
        row(1, "2017-05-26", 10.0),
        row(1, "2017-05-29", 4.0),
        row(1, "2017-06-10", 3.0),
    ]).toDF()

    # 19:00:00 because my timezone is UTC-05
    start_date = datetime(2017, 5, 21, 19, 0, 0)
    days_since_1970_to_start_date = int(time.mktime(start_date.timetuple()) / 86400)
    offset_days = days_since_1970_to_start_date % 7   # remainder of whole days modulo the 7-day window

    w = F.window("date_field", "7 days", startTime='{} days'.format(offset_days))

    df.groupby("id", w).agg(F.sum("value")).orderBy("window.start").show(10, False)

    You will get:

    +---+------------------------------------------+----------+
    |id |window                                    |sum(value)|
    +---+------------------------------------------+----------+
    |1  |[2017-05-21 19:00:00, 2017-05-28 19:00:00]|15.0      |
    |1  |[2017-05-28 19:00:00, 2017-06-04 19:00:00]|4.0       |
    |1  |[2017-06-04 19:00:00, 2017-06-11 19:00:00]|3.0       |
    +---+------------------------------------------+----------+
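    Because window is a struct column, you can also flatten its start and end fields into plain columns afterwards; a small follow-up sketch reusing df and w from above (the alias names are my own):

    result = df.groupby("id", w).agg(F.sum("value").alias("total"))
    result.select(
        "id",
        F.col("window.start").alias("window_start"),
        F.col("window.end").alias("window_end"),
        "total",
    ).orderBy("window_start").show(truncate=False)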
    
