Question
I would like to transform a column into multiple columns in a PySpark dataframe.
The original dataframe:
client_id value1 name1 a_date
dhd 589 ecdu 2020-1-1
dhd 575 tygp 2020-1-1
dhd 821 rdsr 2020-1-1
dhd 872 rgvd 2019-12-10
dhd 619 bhnd 2019-12-10
dhd 781 prti 2019-12-10
UPDATE: the gap between the dates of two consecutive months is not fixed and may be less than 30 days; it can be anywhere between 15 and 30 days, e.g. 2020-1-1 and 2019-12-18. There are millions of "client_id"s, and each "client_id" can have a different range of dates within the past 2 years.
Also, some "client_id"s have no values for "name1" in some months.
What I need:
id value1 last_0_month last_1_month last_2_month ....
dhd 589 [ecdu, tygp, rdsr] [ecdu, tygp, rdsr, rgvd, bhnd, prti] ...
In the result table, I need to aggregate "name1" into an array by year and month. Different "client_id"s may have different "a_date" ranges, which means one client may have "name1" values from 2019-12-1 to 2015-5-5 while another has "name1" values from 2017-10-7 to 2012-7-9.
Each "client_id" only gets "name1"s once per month, so the exact day is not important, e.g. client "dhd" got [ecdu, tygp, rdsr] in 2015-12.
I need the result table to cover the current year and month back through the past 13 months, e.g. from 2020-1 to 2019-7. If a client has no values of "name1" during the time window, just fill in with "null".
So it will have columns like "last_0_month", "last_1_month", "last_2_month", "last_3_month", "last_6_month", "last_9_month" and "last_12_month".
I think pivot can be used for this, but I do not know how to apply the time window to the data.
Thanks.
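A minimal sketch of that pivot idea, assuming each row is bucketed by how many whole months it lies before the current month (the months_ago and pivoted names are only illustrative, and this is not taken from the answers below):
from pyspark.sql import functions as F
# how many whole months before the current month each row falls
months_ago = F.months_between(
    F.trunc(F.current_date(), "month"),   # first day of the current month
    F.trunc("a_date", "month")            # first day of the row's month
).cast("int")
pivoted = (df.withColumn("a_date", F.to_date("a_date"))
             .withColumn("months_ago", months_ago)
             .filter(F.col("months_ago").between(0, 12))
             .groupBy("client_id")
             .pivot("months_ago", list(range(13)))
             .agg(F.collect_list("name1")))
# months with no data come out as null; the per-month columns ("0" .. "12") can
# then be renamed and combined into cumulative last_<n>_month lists if needed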
UPDATE
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.window import Window

CORES = spark.sparkContext.defaultParallelism
df1 = df.repartition(CORES * 4).persist(StorageLevel.MEMORY_ONLY)
w = Window().partitionBy("client_id", "value1").orderBy("Month_range")
df1 = df1.withColumn("Year", F.year("a_date")).withColumn("Month", F.month("a_date"))\
    .withColumn("Month_range", F.trunc("a_date", "month").cast("timestamp"))\
    .withColumn("Rank", F.dense_rank().over(w)).orderBy(F.col("Month_range").desc())
# windows[0] covers only the current month; the others reach back "offset" months via Rank
windows, month_offsets = [], [1, 2, 3, 6, 9, 12]
windows.append(Window().partitionBy(F.col("client_id"), F.col("value1"), F.col("Year"), F.col("Month")))
for offset in month_offsets:
    windows.append(Window().partitionBy(F.col("client_id"), F.col("value1"))
                           .orderBy(F.col("Rank")).rangeBetween(-offset, 0))
df1 = df1.withColumn("last_0_month", F.collect_list(F.col("name1")).over(windows[0]))
for i in range(len(month_offsets)):
    df1 = df1.withColumn("last_" + str(month_offsets[i]) + "_month",
                         F.collect_list(F.col("name1")).over(windows[i + 1]))
Answer 1:
The approach is to use collect_list with sliding windows.
(This will only work if each month has at least one entry.)
I have added more entries so that the behavior across more months can be checked.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
data= [['dhd',589,'ecdu','2020-1-5'],
['dhd',575,'tygp','2020-1-5'],
['dhd',821,'rdsr','2020-1-5'],
['dhd',872,'rgvd','2019-12-1'],
['dhd',619,'bhnd','2019-12-15'],
['dhd',781,'prti','2019-12-18'],
['dhd',781,'prti1','2019-12-18'],
['dhd',781,'prti2','2019-11-18'],
['dhd',781,'prti3','2019-10-31'],
['dhd',781,'prti4','2019-09-30'],
['dhd',781,'prt1','2019-07-31'],
['dhd',781,'pr4','2019-06-30'],
['dhd',781,'pr2','2019-08-31'],
['dhd',781,'prt4','2019-01-31'],
['dhd',781,'prti6','2019-02-28'],
['dhd',781,'prti7','2019-02-02'],
['dhd',781,'prti8','2019-03-29'],
['dhd',781,'prti9','2019-04-29'],
['dhd',781,'prti10','2019-05-04'],
['dhd',781,'prti11','2019-03-01']]
columns= ['client_id','value1','name1','a_date']
df= spark.createDataFrame(data,columns)
Change a_date from string to a date type for the sliding window:
-> I had to truncate the date to the first of the month (year, month, day = 01) so that the window includes all dates within that month.
-> I used a window with dense_rank to create a Rank column which I can use in rangeBetween to go back by exact months (instead of by varying day counts of 28, 29, 30 or 31).
-> NOTE: This code (the dense_rank window) will only work if each and every month has at least one entry; a gap-tolerant alternative is sketched after the output below.
w=Window().partitionBy("client_id").orderBy("Month_range")
df1=df.withColumn("a_date", F.to_date("a_date")).withColumn("Year", F.year("a_date")).withColumn("Month",F.month("a_date")).withColumn("Month_range",F.trunc("a_date", "month").cast('timestamp'))\
.withColumn("Rank",F.dense_rank().over(w))
df1.show()
+---------+------+------+----------+----+-----+-------------------+----+
|client_id|value1| name1| a_date|Year|Month| Month_range|Rank|
+---------+------+------+----------+----+-----+-------------------+----+
| dhd| 821| rdsr|2020-01-05|2020| 1|2020-01-01 00:00:00| 13|
| dhd| 575| tygp|2020-01-05|2020| 1|2020-01-01 00:00:00| 13|
| dhd| 589| ecdu|2020-01-05|2020| 1|2020-01-01 00:00:00| 13|
| dhd| 781| prti|2019-12-18|2019| 12|2019-12-01 00:00:00| 12|
| dhd| 781| prti1|2019-12-18|2019| 12|2019-12-01 00:00:00| 12|
| dhd| 872| rgvd|2019-12-01|2019| 12|2019-12-01 00:00:00| 12|
| dhd| 619| bhnd|2019-12-15|2019| 12|2019-12-01 00:00:00| 12|
| dhd| 781| prti2|2019-11-18|2019| 11|2019-11-01 00:00:00| 11|
| dhd| 781| prti3|2019-10-31|2019| 10|2019-10-01 00:00:00| 10|
| dhd| 781| prti4|2019-09-30|2019| 9|2019-09-01 00:00:00| 9|
| dhd| 781| pr2|2019-08-31|2019| 8|2019-08-01 00:00:00| 8|
| dhd| 781| prt1|2019-07-31|2019| 7|2019-07-01 00:00:00| 7|
| dhd| 781| pr4|2019-06-30|2019| 6|2019-06-01 00:00:00| 6|
| dhd| 781|prti10|2019-05-04|2019| 5|2019-05-01 00:00:00| 5|
| dhd| 781| prti9|2019-04-29|2019| 4|2019-04-01 00:00:00| 4|
| dhd| 781| prti8|2019-03-29|2019| 3|2019-03-01 00:00:00| 3|
| dhd| 781|prti11|2019-03-01|2019| 3|2019-03-01 00:00:00| 3|
| dhd| 781| prti6|2019-02-28|2019| 2|2019-02-01 00:00:00| 2|
| dhd| 781| prti7|2019-02-02|2019| 2|2019-02-01 00:00:00| 2|
| dhd| 781| prt4|2019-01-31|2019| 1|2019-01-01 00:00:00| 1|
+---------+------+------+----------+----+-----+-------------------+----+
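As an aside (an assumption on my part, not part of this answer): if some months can have no rows at all, the consecutive ranks from dense_rank no longer line up with calendar months. A gap-tolerant variant derives Rank from months_between against a fixed anchor date instead; the 2015-01-01 anchor below is an arbitrary choice:
from pyspark.sql import functions as F
# one integer per calendar month, so rangeBetween(-n, 0) always spans n calendar
# months even when some months have no rows
df1_alt = (df.withColumn("a_date", F.to_date("a_date"))
             .withColumn("Year", F.year("a_date"))
             .withColumn("Month", F.month("a_date"))
             .withColumn("Month_range", F.trunc("a_date", "month").cast("timestamp"))
             .withColumn("Rank", F.months_between(
                 F.trunc("a_date", "month"), F.to_date(F.lit("2015-01-01"))).cast("int")))
df1_alt can then be used with the same range windows defined below.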
Set up your sliding windows for each month, and create the new columns with collect_list:
-> The first window covers only the current month and takes all names in that month.
-> The range windows operate on the month Rank, so they take all entries of a month irrespective of the day. I have included last_0_month through last_3_month.
-> rangeBetween uses -1 for last_1_month, -2 for last_2_month and -3 for last_3_month.
w1= Window().partitionBy(F.col("client_id"),F.col("Year"),F.col("Month"))
w2= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-1, 0)
w3= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-2, 0)
w4= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-3, 0)
df2=df1.withColumn("last_0_month", F.collect_list(F.col("name1")).over(w1))\
.withColumn("last_1_month", F.collect_list(F.col("name1")).over(w2))\
.withColumn("last_2_month", F.collect_list(F.col("name1")).over(w3))\
.withColumn("last_3_month", F.collect_list(F.col("name1")).over(w4))\
.orderBy(df1.a_date.desc())\
.drop("a_date1","Month_range","Month","Year","Rank")
df2.show()
+---------+------+------+----------+--------------------+--------------------+--------------------+--------------------+
|client_id|value1| name1| a_date| last_0_month| last_1_month| last_2_month| last_3_month|
+---------+------+------+----------+--------------------+--------------------+--------------------+--------------------+
| dhd| 821| rdsr|2020-01-05| [ecdu, tygp, rdsr]|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|
| dhd| 575| tygp|2020-01-05| [ecdu, tygp, rdsr]|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|
| dhd| 589| ecdu|2020-01-05| [ecdu, tygp, rdsr]|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|
| dhd| 781| prti|2019-12-18|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
| dhd| 781| prti1|2019-12-18|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
| dhd| 619| bhnd|2019-12-15|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
| dhd| 872| rgvd|2019-12-01|[rgvd, bhnd, prti...|[prti2, rgvd, bhn...|[prti3, prti2, rg...|[prti4, prti3, pr...|
| dhd| 781| prti2|2019-11-18| [prti2]| [prti3, prti2]|[prti4, prti3, pr...|[pr2, prti4, prti...|
| dhd| 781| prti3|2019-10-31| [prti3]| [prti4, prti3]| [pr2, prti4, prti3]|[prt1, pr2, prti4...|
| dhd| 781| prti4|2019-09-30| [prti4]| [pr2, prti4]| [prt1, pr2, prti4]|[pr4, prt1, pr2, ...|
| dhd| 781| pr2|2019-08-31| [pr2]| [prt1, pr2]| [pr4, prt1, pr2]|[prti10, pr4, prt...|
| dhd| 781| prt1|2019-07-31| [prt1]| [pr4, prt1]| [prti10, pr4, prt1]|[prti9, prti10, p...|
| dhd| 781| pr4|2019-06-30| [pr4]| [prti10, pr4]|[prti9, prti10, pr4]|[prti8, prti11, p...|
| dhd| 781|prti10|2019-05-04| [prti10]| [prti9, prti10]|[prti8, prti11, p...|[prti6, prti7, pr...|
| dhd| 781| prti9|2019-04-29| [prti9]|[prti8, prti11, p...|[prti6, prti7, pr...|[prt4, prti6, prt...|
| dhd| 781| prti8|2019-03-29| [prti8, prti11]|[prti6, prti7, pr...|[prt4, prti6, prt...|[prt4, prti6, prt...|
| dhd| 781|prti11|2019-03-01| [prti8, prti11]|[prti6, prti7, pr...|[prt4, prti6, prt...|[prt4, prti6, prt...|
| dhd| 781| prti6|2019-02-28| [prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|
| dhd| 781| prti7|2019-02-02| [prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|[prt4, prti6, prti7]|
| dhd| 781| prt4|2019-01-31| [prt4]| [prt4]| [prt4]| [prt4]|
+---------+------+------+----------+--------------------+--------------------+--------------------+--------------------+
Link to the full df2 as CSV:
https://github.com/murtihash/TimeTravel-Engineering/blob/master/final.csv
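If a single summary row per client_id is wanted, as in the expected output in the question, a hedged follow-up sketch (the summary name is only illustrative): keep the rows of each client's most recent month and deduplicate, since all rows of that month carry the same last_<n>_month arrays.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w_latest = Window.partitionBy("client_id")
summary = (df2.withColumn("max_date", F.max("a_date").over(w_latest))
              .filter(F.col("a_date") == F.col("max_date"))
              .select("client_id", "last_0_month", "last_1_month",
                      "last_2_month", "last_3_month")
              .dropDuplicates(["client_id"]))
summary.show(truncate=False)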
Answer 2:
This should be as optimized as it can get.
# df is created from the sample dataframe shown in Answer 1 above
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import *
# w1 covers only the current month; w2-w7 reach back 1, 2, 3, 6, 9 and 12 months via Rank
w1= Window().partitionBy(F.col("client_id"),F.col("Year"),F.col("Month"))
w2= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-1, 0)
w3= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-2, 0)
w4= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-3, 0)
w5= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-6, 0)
w6= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-9, 0)
w7= Window().partitionBy(F.col("client_id")).orderBy(F.col("Rank")).rangeBetween(-12, 0)
# w orders the month buckets so dense_rank can assign one Rank per calendar month
w=Window().partitionBy("client_id").orderBy("Month_range")
df1=df.repartition(3200).persist(StorageLevel.MEMORY_AND_DISK).withColumn("a_date1", F.to_date("a_date")).withColumn("Year", F.year("a_date")).withColumn("Month",F.month("a_date")).withColumn("Month_range",F.trunc("a_date", "month").cast('timestamp'))\
.withColumn("Rank",F.dense_rank().over(w))\
.withColumn("last_0_month", F.collect_list(F.col("name1")).over(w1))\
.withColumn("last_1_month", F.collect_list(F.col("name1")).over(w2))\
.withColumn("last_2_month", F.collect_list(F.col("name1")).over(w3))\
.withColumn("last_3_month", F.collect_list(F.col("name1")).over(w4))\
.withColumn("last_6_month", F.collect_list(F.col("name1")).over(w5))\
.withColumn("last_9_month", F.collect_list(F.col("name1")).over(w6))\
.withColumn("last_12_month", F.collect_list(F.col("name1")).over(w7))\
.drop("a_date1","Month_range","Month","Year","Rank")\
.orderBy(df.a_date.desc())
#df1 is final output
Source: https://stackoverflow.com/questions/60085374/pyspark-dataframe-aggregate-a-column-by-sliding-time-window