Question
When converting ISO8601 strings with time zone information into a TimestampType using cast(TimestampType), only strings using the time zone format +01:00 are accepted. If the time zone is written in the ISO8601-legal way +0100 (without the colon), the parse fails and returns null. I need to convert the string to a TimestampType while preserving the ms part.
2019-02-05T14:06:31.556+0100 Returns null
2019-02-05T14:06:31.556+01:00 Returns a correctly parsed TimestampType
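A minimal repro sketch, assuming a Spark 2.4 shell or notebook where spark is the active SparkSession:

import org.apache.spark.sql.types.TimestampType
import spark.implicits._

// Cast both offset variants; only the colon form parses.
Seq("2019-02-05T14:06:31.556+0100", "2019-02-05T14:06:31.556+01:00")
  .toDF("ts")
  .withColumn("parsed", $"ts".cast(TimestampType))
  .show(false)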
I have tried the to_timestamp() and unix_timestamp(...).cast(TimestampType) functions. Unfortunately, they truncate the ms part of the timestamp, which I need to preserve. Also, they have to be applied to a new column; you can't replace an attribute in place inside a complex type (which would be possible if I made the ApiReceived property a TimestampType in the schema passed to from_json).
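For context, a hypothetical schema matching the Payload.Metadata.ApiReceived path selected below (only that path is shown; ApiReceived is kept as a string so the conversion variants can be compared):

import org.apache.spark.sql.types._

// Hypothetical, trimmed-down schema; the real payload has more fields.
val schema = StructType(Seq(
  StructField("Metadata", StructType(Seq(
    StructField("ApiReceived", StringType)
  )))
))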
import org.apache.spark.sql.functions.{from_json, to_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

df
  .select($"body".cast(StringType))
  .select(from_json($"body", schema).as("Payload"))
  .select($"Payload.Metadata.ApiReceived".as("Time"))
  .withColumn("NewTime", to_timestamp($"Time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
  .withColumn("NewTime2", unix_timestamp($"Time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ").cast(TimestampType))
  .withColumn("NewTime3", $"Time".cast(TimestampType))
The output types of the above DataFrame:
df:org.apache.spark.sql.DataFrame
Time:string
NewTime:timestamp
NewTime2:timestamp
NewTime3:timestamp
And the output values:
Time 2019-02-05T14:06:31.556+0100
NewTime 2019-02-05 13:06:31
NewTime2 2019-02-05 13:06:31
NewTime3 null
Is there a way to make Spark handle the conversion without resorting to UDFs?
Update
After a more thorough investigation, I found that Spark's datetime parsing is somewhat inconsistent. :)
val df = Seq(
  // Extended format
  "2019-02-05T14:06:31.556+01:00",
  "2019-02-05T14:06:31.556+01",
  "2019-02-05T14:06:31.556",
  // Basic format
  "20190205T140631556+0100",
  "20190205T140631556+01",
  "20190205T140631556",
  // Extended format mixed with basic
  "2019-02-05T14:06:31.556+0100",
  "20190205T140631556+01:00"
).toDF
val formatStrings = Seq(
  "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "yyyy-MM-dd'T'HH:mm:ss.SSSX",
  "yyyyMMdd'T'HHmmssSSSZ",
  "yyyyMMdd'T'HHmmssSSSX"
)
val format = formatStrings(0)

val df2 = df
  .select($"value".as("Time"))
  .withColumn("NewTime3", $"Time".cast(TimestampType))
  .withColumn("NewTime", to_timestamp($"Time", format))
  .withColumn("NewTime2", unix_timestamp($"Time", format).cast(TimestampType))
  .withColumn("NewTime4", date_format($"Time", format))

display(df2)
If you run these DataFrames and compare the output, it's somewhat disheartening. The most permissive format string is the second one, yyyy-MM-dd'T'HH:mm:ss.SSSX.
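To compare all four patterns in one pass instead of editing format by hand, a sketch like this works (assuming the df and formatStrings defined above are in scope):

// Add one to_timestamp column per candidate pattern, then inspect the nulls.
val compared = formatStrings.zipWithIndex.foldLeft(df.select($"value".as("Time"))) {
  case (acc, (fmt, i)) => acc.withColumn(s"parsed_$i", to_timestamp($"Time", fmt))
}
display(compared)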
The only reasonable way to handle this is a UDF that makes sure all ISO8601 strings adhere to a form that the function you plan to use understands.
Still, I haven't found a way to preserve the millisecond part for both formats:
2019-02-05T14:06:31.556+01:00 and
2019-02-05T14:06:31.556+0100
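For what it's worth, a sketch of that normalizing approach: a hypothetical UDF that inserts the colon into a trailing +0100-style offset, after which a plain cast keeps the milliseconds (assuming a missing colon is the only deviation):

import org.apache.spark.sql.functions.udf

// Hypothetical normalizer: rewrite a trailing +0100/-0100 offset to +01:00/-01:00.
// Strings already in the colon form, or without an offset, pass through unchanged.
val normalizeOffset = udf { s: String =>
  Option(s).map(_.replaceAll("([+-]\\d{2})(\\d{2})$", "$1:$2")).orNull
}

df2.withColumn("NewTime5", normalizeOffset($"Time").cast(TimestampType))

Since cast(TimestampType) handles the +01:00 form with the milliseconds intact, normalizing first sidesteps the truncation of to_timestamp and unix_timestamp.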
Update 2
https://issues.apache.org/jira/browse/SPARK-17545?jql=project%20%3D%20SPARK%20AND%20text%20~%20iso8601
Apparently it is NOT in accordance with the ISO8601 standard to mix the basic and extended forms. The string "2019-02-05T14:06:31.556+0100" is therefore not in standard format. It does seem to be valid according to RFC 822, though.
If I understand the JIRA ticket correctly, the standard parsing (i.e. cast() on a string column) only handles correctly formatted ISO8601 strings, not RFC 822 or other edge cases (i.e. mixing extended and basic forms). If you have an edge case, you have to supply the format string and use another parsing method.
I don't have access to the ISO8601:2004 standard so I can't check, but if the comment in the JIRA is correct, the internet needs an update. A lot of web pages conflate RFC 822 and ISO8601 and list "2019-02-05T14:06:31.556+0100" as a legal ISO8601 string.
Source: https://stackoverflow.com/questions/54601917/spark-2-4-0-unable-to-parse-iso8601-string-into-timestamptype-preserving-ms