Get Last Monday in Spark

Submitted by Anonymous (unverified) on 2019-12-03 01:20:02

Question:

I am using Spark 2.0 with the Python API.

I have a dataframe with a column of type DateType(). I would like to add a column to the dataframe containing the most recent Monday.

I can do it like this:

import pyspark.sql.functions
import pyspark.sql.types

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])

reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)

reg = reg.withColumn('monday',
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Mon',
        reg.AccountCreationDate).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Tue',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 1)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Wed',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 2)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Thu',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 3)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Fri',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 4)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Sat',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 5)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Sun',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 6))
    )))))))

However, this seems like a lot of code for something that should be rather simple. Is there a more concise way of doing this?

Answer 1:

You can find the next Monday with next_day and then subtract a week. The required functions can be imported as follows:

from pyspark.sql.functions import next_day, date_sub 

Then define a helper. Because next_day returns the first occurrence of the given day strictly after the input date, subtracting seven days yields the most recent occurrence on or before the input, so a date that is already a Monday maps to itself:

def previous_day(date, dayOfWeek):
    return date_sub(next_day(date, dayOfWeek), 7)

Finally, an example:

from pyspark.sql.functions import to_date

df = sc.parallelize([
    ("2016-10-26", )
]).toDF(["date"]).withColumn("date", to_date("date"))

df.withColumn("last_monday", previous_day("date", "monday")).show()

With result:

+----------+-----------+
|      date|last_monday|
+----------+-----------+
|2016-10-26| 2016-10-24|
+----------+-----------+
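
As a sanity check, a date that already falls on a Monday maps to itself, since next_day is strictly "after" and the 7-day subtraction steps back exactly one week. On Spark 2.3 and later, date_trunc with the 'week' format gives the same Monday directly (Spark weeks start on Monday), though it returns a timestamp that must be cast back to a date. Below is a minimal sketch, assuming a SparkSession named spark; the DataFrame and column names are only illustrative:

from pyspark.sql.functions import col, date_sub, date_trunc, next_day, to_date

# Illustrative data: a Wednesday and a date that is already a Monday.
df = spark.createDataFrame([("2016-10-26",), ("2016-10-24",)], ["date"]) \
    .withColumn("date", to_date("date"))

df = df.withColumn(
    # next_day is strictly "after", so a Monday input rolls to the next Monday
    # and the 7-day subtraction lands back on the input date itself.
    "last_monday", date_sub(next_day(col("date"), "monday"), 7)
).withColumn(
    # Spark 2.3+ alternative: truncate to the start of the week (Monday)
    # and cast the resulting timestamp back to a date.
    "last_monday_trunc", date_trunc("week", col("date")).cast("date")
)

df.show()
# Both columns should read 2016-10-24 for each input row.

Note that the date_trunc variant is not available on the Spark 2.0 mentioned in the question; there, the next_day/date_sub helper above remains the concise option.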

