pyspark dataframe aggregate a column by sliding time window


Question


I would like to transform a column to multiple columns in a pyspark dataframe.

the original dataframe:

 client_id   value1    name1   a_date
 dhd         589       ecdu     2020-1-1
 dhd         575       tygp     2020-1-1  
 dhd         821       rdsr     2020-1-1
 dhd         872       rgvd     2019-12-10
 dhd         619       bhnd     2019-12-10
 dhd         781       prti     2019-12-10

UPDATE: the gap between the dates of two consecutive months may be less than 30 days. The gap between two dates in consecutive months is not fixed; it can be anywhere between 15 and 30 days, e.g. 2020-1-1 and 2019-12-18. There are millions of "client_id"s, and each "client_id" can have a different range of dates over the past 2 years.

Also, some "client_id"s have no values for "name1" in some months.

What I need :

 id   value1   last_0_month            last_1_month                      last_2_month  ....
 dhd  589      [ecdu, tygp, rdsr]   [ecdu, tygp, rdsr, rgvd, bhnd, prti]  ...      

In the result table, I need to aggregate "name1" into an array by year and month. Different "client_id"s may have different "a_date" ranges: one client may have "name1" values from 2019-12-1 back to 2015-5-5, while another may have them from 2017-10-7 back to 2012-7-9.

Each "client_id" only get "name1"s once per month so date is not important. e.g, client "dhd" got [ecdu, tygp, rdsr] in 2015-12.

I need the result table to go from the current year and month back over the past 13 months, e.g. from 2020-1 to 2019-7. If a client has no values of "name1" during the time window, just fill in "null".

So, it will have some columns like "last_0_month", "last_1_month", "last_2_month", "last_3_month", "last_6_month", "last_9_month" and "last_12_month".

I think pivot can be used for this, but I do not know how to apply the time window to the data.
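
For illustration, a minimal pivot sketch (the month_offset helper column and the fixed reference month are my own assumptions); it only buckets name1 per month, and building the cumulative last_N_month columns from these buckets is the part I am not sure about:

from pyspark.sql import functions as F

# Minimal pivot sketch (assumptions: df has client_id, name1, a_date; the
# reference month is simply the current month). This only collects name1 per
# month offset; the cumulative last_N_month arrays would still have to be
# built by combining these columns.
ref_month = F.trunc(F.current_date(), "month")
df_buckets = df.withColumn("a_date", F.to_date("a_date"))\
    .withColumn("month_offset", F.months_between(ref_month, F.trunc("a_date", "month")).cast("int"))\
    .filter(F.col("month_offset").between(0, 12))\
    .groupBy("client_id")\
    .pivot("month_offset", list(range(13)))\
    .agg(F.collect_list("name1"))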

thanks

UPDATE

from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.window import Window

CORES = spark.sparkContext.defaultParallelism
df1 = df.repartition(CORES * 4).persist(StorageLevel.MEMORY_ONLY)

# Rank months per client so rangeBetween can step back by whole months
w = Window().partitionBy("client_id", "value1").orderBy("Month_range")
df1 = df1.withColumn("Year", F.year("a_date"))\
    .withColumn("Month", F.month("a_date"))\
    .withColumn("Month_range", F.trunc("a_date", "month").cast('timestamp'))\
    .withColumn("Rank", F.dense_rank().over(w))\
    .orderBy(F.col("Month_range").desc())

# windows[0] covers the current month; windows[i + 1] covers the current month
# plus the previous month_offsets[i] months
month_offsets = [1, 2, 3, 6, 9, 12]
windows = [Window().partitionBy(F.col("client_id"), F.col("value1"), F.col("Year"), F.col("Month"))]
for offset in month_offsets:
    windows.append(Window().partitionBy(F.col("client_id"), F.col("value1")).orderBy(F.col("Rank")).rangeBetween(-offset, 0))

df1 = df1.withColumn("last_0_month", F.collect_list(F.col("name1")).over(windows[0]))
for i, offset in enumerate(month_offsets):
    df1 = df1.withColumn("last_" + str(offset) + "_month", F.collect_list(F.col("name1")).over(windows[i + 1]))

Answer 1:


The approach should be to use collect_list with sliding windows.

(This will only work if each month has at least one entry.)

I have added more entries so that more months can be checked to confirm they work properly.

from pyspark.sql import functions as F
from pyspark.sql.window import Window



data=  [['dhd',589,'ecdu','2020-1-5'],
        ['dhd',575,'tygp','2020-1-5'],  
        ['dhd',821,'rdsr','2020-1-5'],
        ['dhd',872,'rgvd','2019-12-1'],
        ['dhd',619,'bhnd','2019-12-15'],
        ['dhd',781,'prti','2019-12-18'],
        ['dhd',781,'prti1','2019-12-18'],
        ['dhd',781,'prti2','2019-11-18'],
        ['dhd',781,'prti3','2019-10-31'],
        ['dhd',781,'prti4','2019-09-30'],
        ['dhd',781,'prt1','2019-07-31'],
        ['dhd',781,'pr4','2019-06-30'],
        ['dhd',781,'pr2','2019-08-31'],
        ['dhd',781,'prt4','2019-01-31'],
        ['dhd',781,'prti6','2019-02-28'],
        ['dhd',781,'prti7','2019-02-02'],
        ['dhd',781,'prti8','2019-03-29'],
        ['dhd',781,'prti9','2019-04-29'],
        ['dhd',781,'prti10','2019-05-04'],
        ['dhd',781,'prti11','2019-03-01']]
columns= ['client_id','value1','name1','a_date']
df= spark.createDataFrame(data,columns)

Change a_date from string to a date/timestamp for the sliding window:

-> I had to truncate the date so that the window only keys on year and month (with the day set to 01), so that all dates within a month fall into the same bucket.

-> I used a window with dense_rank to create a Rank column, which I could then use in the rangeBetween function to go back by exact months (instead of a varying number of days: 28, 29, 30 or 31).

-> NOTE: This code (the dense_rank window) will only work if each and every month has at least one entry. (A sketch of an alternative that avoids this requirement follows the output below.)

w=Window().partitionBy("client_id").orderBy("Month_range")
df1=df.withColumn("a_date", F.to_date("a_date")).withColumn("Year", F.year("a_date")).withColumn("Month",F.month("a_date")).withColumn("Month_range",F.trunc("a_date", "month").cast('timestamp'))\
.withColumn("Rank",F.dense_rank().over(w))
df1.show()

+---------+------+------+----------+----+-----+-------------------+----+
|client_id|value1| name1|    a_date|Year|Month|        Month_range|Rank|
+---------+------+------+----------+----+-----+-------------------+----+
|      dhd|   821|  rdsr|2020-01-05|2020|    1|2020-01-01 00:00:00|  13|
|      dhd|   575|  tygp|2020-01-05|2020|    1|2020-01-01 00:00:00|  13|
|      dhd|   589|  ecdu|2020-01-05|2020|    1|2020-01-01 00:00:00|  13|
|      dhd|   781|  prti|2019-12-18|2019|   12|2019-12-01 00:00:00|  12|
|      dhd|   781| prti1|2019-12-18|2019|   12|2019-12-01 00:00:00|  12|
|      dhd|   872|  rgvd|2019-12-01|2019|   12|2019-12-01 00:00:00|  12|
|      dhd|   619|  bhnd|2019-12-15|2019|   12|2019-12-01 00:00:00|  12|
|      dhd|   781| prti2|2019-11-18|2019|   11|2019-11-01 00:00:00|  11|
|      dhd|   781| prti3|2019-10-31|2019|   10|2019-10-01 00:00:00|  10|
|      dhd|   781| prti4|2019-09-30|2019|    9|2019-09-01 00:00:00|   9|
|      dhd|   781|   pr2|2019-08-31|2019|    8|2019-08-01 00:00:00|   8|
|      dhd|   781|  prt1|2019-07-31|2019|    7|2019-07-01 00:00:00|   7|
|      dhd|   781|   pr4|2019-06-30|2019|    6|2019-06-01 00:00:00|   6|
|      dhd|   781|prti10|2019-05-04|2019|    5|2019-05-01 00:00:00|   5|
|      dhd|   781| prti9|2019-04-29|2019|    4|2019-04-01 00:00:00|   4|
|      dhd|   781| prti8|2019-03-29|2019|    3|2019-03-01 00:00:00|   3|
|      dhd|   781|prti11|2019-03-01|2019|    3|2019-03-01 00:00:00|   3|
|      dhd|   781| prti6|2019-02-28|2019|    2|2019-02-01 00:00:00|   2|
|      dhd|   781| prti7|2019-02-02|2019|    2|2019-02-01 00:00:00|   2|
|      dhd|   781|  prt4|2019-01-31|2019|    1|2019-01-01 00:00:00|   1|
+---------+------+------+----------+----+-----+-------------------+----+
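
As a hedged alternative for the case where some months have no entries at all (which would break the dense_rank numbering above), the rank can be replaced by an absolute month index computed directly from Year and Month; the MonthIndex column name is only an illustration:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hedged sketch: Year * 12 + Month gives an absolute month index, so the
# rangeBetween windows keep their calendar meaning even if a month has no rows.
df1_alt = df.withColumn("a_date", F.to_date("a_date"))\
    .withColumn("Year", F.year("a_date"))\
    .withColumn("Month", F.month("a_date"))\
    .withColumn("MonthIndex", F.col("Year") * 12 + F.col("Month"))

# Same idea as the windows below, but ordered by MonthIndex instead of Rank
w_last_1 = Window().partitionBy("client_id").orderBy("MonthIndex").rangeBetween(-1, 0)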

Set up the sliding windows for each month, and create new columns with collect_list:

-> The first window is only for the current month and will take all names in that month.

-> The range windows take all entries in that particular month irrespective of the day. I have included last_0_month to last_3_month.

-> rangeBetween uses -1 for last_1_month, -2 for last_2_month, and -3 for last_3_month.

w1= Window().partitionBy(F.col("client_id"),F.col("Year"),F.col("Month"))
w2= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-1, 0)
w3= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-2, 0)
w4= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-3, 0)
df2=df1.withColumn("last_0_month", F.collect_list(F.col("name1")).over(w1))\
   .withColumn("last_1_month", F.collect_list(F.col("name1")).over(w2))\
   .withColumn("last_2_month", F.collect_list(F.col("name1")).over(w3))\
   .withColumn("last_3_month", F.collect_list(F.col("name1")).over(w4))\
   .orderBy(df1.a_date.desc())\
   .drop("a_date1","Month_range","Month","Year","Rank")
df2.show()

+---------+------+------+----------+--------------------+--------------------+--------------------+--------------------+
|client_id|value1| name1|    a_date|        last_0_month|        last_1_month|        last_2_month|        last_3_month|
+---------+------+------+----------+--------------------+--------------------+--------------------+--------------------+
|      dhd|   821|  rdsr|2020-01-05|  [ecdu, tygp, rdsr]|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|
|      dhd|   575|  tygp|2020-01-05|  [ecdu, tygp, rdsr]|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|
|      dhd|   589|  ecdu|2020-01-05|  [ecdu, tygp, rdsr]|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|
|      dhd|   781|  prti|2019-12-18|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
|      dhd|   781| prti1|2019-12-18|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
|      dhd|   619|  bhnd|2019-12-15|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
|      dhd|   872|  rgvd|2019-12-01|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
|      dhd|   781| prti2|2019-11-18|             [prti2]|      [prti3, prti2]|[prti4, prti3, pr...|[pr2, prti4, prti...|
|      dhd|   781| prti3|2019-10-31|             [prti3]|      [prti4, prti3]| [pr2, prti4, prti3]|[prt1, pr2, prti4...|
|      dhd|   781| prti4|2019-09-30|             [prti4]|        [pr2, prti4]|  [prt1, pr2, prti4]|[pr4, prt1, pr2, ...|
|      dhd|   781|   pr2|2019-08-31|               [pr2]|         [prt1, pr2]|    [pr4, prt1, pr2]|[prti10, pr4, prt...|
|      dhd|   781|  prt1|2019-07-31|              [prt1]|         [pr4, prt1]| [prti10, pr4, prt1]|[prti9, prti10, p...|
|      dhd|   781|   pr4|2019-06-30|               [pr4]|       [prti10, pr4]|[prti9, prti10, pr4]|[prti8, prti11, p...|
|      dhd|   781|prti10|2019-05-04|            [prti10]|     [prti9, prti10]|[prti8, prti11, p...|[prti6, prti7, pr...|
|      dhd|   781| prti9|2019-04-29|             [prti9]|[prti8, prti11, p...|[prti6, prti7, pr...|[prt4, prti6, prt...|
|      dhd|   781| prti8|2019-03-29|     [prti8, prti11]|[prti6, prti7, pr...|[prt4, prti6, prt...|[prt4, prti6, prt...|
|      dhd|   781|prti11|2019-03-01|     [prti8, prti11]|[prti6, prti7, pr...|[prt4, prti6, prt...|[prt4, prti6, prt...|
|      dhd|   781| prti6|2019-02-28|      [prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|
|      dhd|   781| prti7|2019-02-02|      [prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|
|      dhd|   781|  prt4|2019-01-31|              [prt4]|              [prt4]|              [prt4]|              [prt4]|
+---------+------+------+----------+--------------------+--------------------+--------------------+--------------------+

Link to the full df2 as a CSV:

https://github.com/murtihash/TimeTravel-Engineering/blob/master/final.csv
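
Note that df2 keeps one row per input row. If you want a single row per client_id, as in the desired table in the question, one hedged way (the rn helper column is just an illustration) is to keep only each client's most recent row; value1 and name1 are then those of that latest row:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hedged sketch: keep only the latest row per client so the result has one row
# per client_id.
latest = Window().partitionBy("client_id").orderBy(F.col("a_date").desc())
result = df2.withColumn("rn", F.row_number().over(latest))\
    .filter(F.col("rn") == 1)\
    .drop("rn")
result.show(truncate=False)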




Answer 2:


This should be as optimized as it can get.

# df is the sample dataframe created in Answer 1 above
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import *
w1= Window().partitionBy(F.col("client_id"),F.col("Year"),F.col("Month"))
w2= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-1, 0)
w3= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-2, 0)
w4= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-3, 0)
w5= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-6, 0)
w6= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-9, 0)
w7= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-12, 0)

w=Window().partitionBy("client_id").orderBy("Month_range")
df1=df.repartition(3200).persist(StorageLevel.MEMORY_AND_DISK).withColumn("a_date1", F.to_date("a_date")).withColumn("Year", F.year("a_date")).withColumn("Month",F.month("a_date")).withColumn("Month_range",F.trunc("a_date", "month").cast('timestamp'))\
.withColumn("Rank",F.dense_rank().over(w))\
.withColumn("last_0_month", F.collect_list(F.col("name1")).over(w1))\
.withColumn("last_1_month", F.collect_list(F.col("name1")).over(w2))\
.withColumn("last_2_month", F.collect_list(F.col("name1")).over(w3))\
.withColumn("last_3_month", F.collect_list(F.col("name1")).over(w4))\
.withColumn("last_6_month", F.collect_list(F.col("name1")).over(w5))\
.withColumn("last_9_month", F.collect_list(F.col("name1")).over(w6))\
.withColumn("last_12_month", F.collect_list(F.col("name1")).over(w7))\
.drop("a_date1","Month_range","Month","Year","Rank")\
.orderBy(df.a_date.desc())
#df1 is final output
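
As a usage sketch (assuming the df1 built above): persist() is lazy, so running one action first materializes the cache, and unpersist() frees it once you are done.

# Usage sketch: materialize the persisted result once, reuse it, then release it.
df1.count()                     # triggers the computation and fills the cache
df1.show(5, truncate=False)     # subsequent actions read from the cached data
df1.unpersist()                 # free memory/disk after the results are written out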


Source: https://stackoverflow.com/questions/60085374/pyspark-dataframe-aggregate-a-column-by-sliding-time-window
