I am using Spark 2.0 with the Python API.
I have a dataframe with a column of type DateType(). I would like to add a column to the dataframe containing the most recent Monday on or before each date (the date itself if it already falls on a Monday).
I can do it like this:
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    reg_schema = T.StructType([
        T.StructField('AccountCreationDate', T.DateType(), True),
        T.StructField('UserId', T.LongType(), True),
    ])
    reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)

    # Abbreviated day name ('Mon', 'Tue', ...) of each creation date
    day = F.date_format(reg.AccountCreationDate, 'E')

    # Subtract one day per weekday elapsed since Monday
    reg = reg.withColumn(
        'monday',
        F.when(day == 'Mon', reg.AccountCreationDate)
         .when(day == 'Tue', F.date_sub(reg.AccountCreationDate, 1))
         .when(day == 'Wed', F.date_sub(reg.AccountCreationDate, 2))
         .when(day == 'Thu', F.date_sub(reg.AccountCreationDate, 3))
         .when(day == 'Fri', F.date_sub(reg.AccountCreationDate, 4))
         .when(day == 'Sat', F.date_sub(reg.AccountCreationDate, 5))
         .when(day == 'Sun', F.date_sub(reg.AccountCreationDate, 6))
    )
However, this seems like a lot of code for something that should be rather simple. Is there a more concise way of doing this?
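For reference, the arithmetic I am after can be sketched in plain Python, without Spark. The helper names `most_recent_monday` and `next_day_minus_week` are hypothetical and only illustrate the logic. The second helper mimics what I assume a Spark expression along the lines of `date_sub(next_day(col, 'Mon'), 7)` would compute (the first Monday strictly after the date, minus seven days); I have not run that expression against my data, so treat it as an idea rather than a confirmed solution.

```python
import datetime


def most_recent_monday(d):
    """Return the Monday on or before d (d itself if it is a Monday)."""
    # date.weekday() is 0 for Monday through 6 for Sunday
    return d - datetime.timedelta(days=d.weekday())


def next_day_minus_week(d):
    """Hypothetical mirror of 'next Monday strictly after d, minus 7 days'."""
    # Days until the next Monday; a Monday maps to 7, not 0
    days_ahead = (7 - d.weekday()) % 7 or 7
    next_monday = d + datetime.timedelta(days=days_ahead)
    return next_monday - datetime.timedelta(days=7)


if __name__ == "__main__":
    start = datetime.date(2016, 10, 3)  # a Monday
    for offset in range(7):
        d = start + datetime.timedelta(days=offset)
        print(d, most_recent_monday(d), next_day_minus_week(d))
```

Both helpers should agree for every day of the week, which is the behaviour the seven `when` branches above spell out one case at a time.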