Question
I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I want to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0, and the code above now throws the following error. Is there a new method in 3.0 for converting my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.
- You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0.
- You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Answer 1:
If you want to use the legacy parser in a newer version of Spark (3.0+), you need to set
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
which will resolve the issue.
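For example, a minimal sketch (assuming an active SparkSession named spark, as in the question) that applies the LEGACY policy and then re-runs the original conversion:

from pyspark.sql.functions import to_date, from_unixtime, unix_timestamp

# Restore the pre-3.0 parser behaviour so the
# 'EEE MMM dd HH:mm:ss zzz yyyy' pattern is accepted again.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df.select(
    'date_str',
    to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date')
).show(1, False)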
Answer 2:
Thanks for the responses, excellent advice. For the moment I'll be going with the LEGACY setting. I do have a Spark 3.0 workaround of substringing out the EEE element, but I've noticed a bug in how the BST timezone then converts: it is incorrectly offset by 10 hours, while under LEGACY it correctly remains unchanged, as I'm currently in the BST zone. I can do something with this, but will wait until the clocks change in the autumn to confirm.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
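A note on the BST discrepancy above: 'BST' is an ambiguous three-letter abbreviation (it can denote British Summer Time, Bangladesh Standard Time, or Bougainville Standard Time), and the CORRECTED parser may resolve it differently from the legacy one. If you want to stay on the CORRECTED policy, a minimal sketch of one way around this, assuming the source only ever emits 'BST' meaning UTC+01:00, is to normalise the abbreviation to an explicit offset before parsing:

from pyspark.sql.functions import regexp_replace, to_timestamp

df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df.select(
    'mydate',
    # Swap the ambiguous abbreviation for a fixed offset, then parse
    # with the zone-offset pattern letter 'x' from the Spark 3.0 guide.
    to_timestamp(
        regexp_replace(df.mydate.substr(5, 28), 'BST', '+01:00'),
        'MMM dd HH:mm:ss xxx yyyy'
    ).alias('datetime')
).show(1, False)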
Source: https://stackoverflow.com/questions/62602720/string-to-date-migration-from-spark-2-0-to-3-0-gives-fail-to-recognize-eee-mmm