How to insert a custom function within a for loop in PySpark?

Submitted by ﹥>﹥吖頭↗ on 2021-02-18 19:41:53

Question


I am facing a challenge with Spark on Azure Databricks. I have a dataset like this:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
+------------------+----------+-------------------+---------------+

Now I need to append rows to this DataFrame in a loop. I want to replicate the logic below (pseudocode) in PySpark.

Result = ()
for i in (1:12)
{
   Result = Result.Append(
       select a.OpptyHeaderID
              ,a.OpptyID
              ,dateadd(MONTH, i, a.Date) as Date
              ,a.BaseAmountMonth
       from FinalOut a
   )
   print(i)
}

Each appended row must carry the date of the following month, so that every original row expands into a rolling 12-month series. The result should look like this:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0067000000i6ONPAA2|OP-0164615|2014-08-27 00:00:00|    4375.800000|
|0067000000i6ONPAA2|OP-0164615|2014-09-27 00:00:00|    4375.800000|
                              .
                              .
                              .
|0067000000i6ONPAA2|OP-0164615|2015-06-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
|0065w0000215k5kAAA|OP-0218055|2021-01-23 00:00:00|    4975.000000|    
|0065w0000215k5kAAA|OP-0218055|2021-02-23 00:00:00|    4975.000000|    
                               .
                               .
                               .    
|0065w0000215k5kAAA|OP-0218055|2021-11-23 00:00:00|    4975.000000|    
+------------------+----------+-------------------+---------------+

[EDIT 1]

How can I make the number of months dynamic, based on another column?

+------------------+----------+-------------------+---------------+--------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|Interval|
+------------------+----------+-------------------+---------------+--------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|      12|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|       7|
+------------------+----------+-------------------+---------------+--------+

Answer 1:


You do not need a loop. You can build an array of monthly timestamps with sequence and turn each element into its own row with explode:

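import pyspark.sql.functions as F

# sequence() builds an array of timestamps from Date up to
# Date + (Interval - 1) months, one month apart; explode() then
# emits one row per array element, copying the other columns.
# The backticks keep the column name Interval from being parsed
# as the SQL keyword "interval".
df2 = df.withColumn(
    'Date',
    F.expr("""
        explode(
            sequence(
                timestamp(Date),
                add_months(timestamp(Date), `Interval` - 1),
                interval 1 month
            )
        )
    """)
)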

df2.show(99)
+------------------+----------+-------------------+---------------+--------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|Interval|
+------------------+----------+-------------------+---------------+--------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2014-08-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2014-09-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2014-10-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2014-11-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2014-12-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2015-01-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2015-02-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2015-03-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2015-04-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2015-05-27 00:00:00|    4375.800000|      12|
|0067000000i6ONPAA2|OP-0164615|2015-06-27 00:00:00|    4375.800000|      12|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|       7|
|0065w0000215k5kAAA|OP-0218055|2021-01-23 00:00:00|    4975.000000|       7|
|0065w0000215k5kAAA|OP-0218055|2021-02-23 00:00:00|    4975.000000|       7|
|0065w0000215k5kAAA|OP-0218055|2021-03-23 00:00:00|    4975.000000|       7|
|0065w0000215k5kAAA|OP-0218055|2021-04-23 00:00:00|    4975.000000|       7|
|0065w0000215k5kAAA|OP-0218055|2021-05-23 00:00:00|    4975.000000|       7|
|0065w0000215k5kAAA|OP-0218055|2021-06-23 00:00:00|    4975.000000|       7|
+------------------+----------+-------------------+---------------+--------+


Source: https://stackoverflow.com/questions/66170127/how-to-insert-a-custom-function-within-for-loop-in-pyspark
