Question
When converting ISO8601 strings with time zone information into a TimestampType using cast(TimestampType), only strings using the time zone format +01:00 are accepted. If the time zone is written in the ISO8601-legal way +0100 (without the colon), the parse fails and returns null. I need to convert the string to a TimestampType while preserving the ms part.
2019-02-05T14:06:31.556+0100 Returns null
2019-02-05T14:06:31.556+01:00 Returns a correctly parsed TimestampType
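A minimal repro sketch, assuming a Spark 2.4 shell or notebook where spark is the active SparkSession:

import org.apache.spark.sql.types.TimestampType
import spark.implicits._

// Cast both offset variants; only the colon form parses.
Seq("2019-02-05T14:06:31.556+0100", "2019-02-05T14:06:31.556+01:00")
  .toDF("ts")
  .withColumn("parsed", $"ts".cast(TimestampType))
  .show(false)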
I have tried the to_timestamp() and unix_timestamp(...).cast(TimestampType) functions. Unfortunately, they truncate the ms part of the timestamp, which I need to preserve. Also, they have to be applied to a new column; you can't replace an attribute in place inside a complex type (which would be possible if I made the ApiReceived property a TimestampType in the schema passed to from_json).
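For context, a hypothetical schema matching the Payload.Metadata.ApiReceived path selected below (only that path is shown; ApiReceived is kept as a string so the conversion variants can be compared):

import org.apache.spark.sql.types._

// Hypothetical, trimmed-down schema; the real payload has more fields.
val schema = StructType(Seq(
  StructField("Metadata", StructType(Seq(
    StructField("ApiReceived", StringType)
  )))
))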
import org.apache.spark.sql.functions.{from_json, to_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

df
  .select($"body".cast(StringType))
  .select(from_json($"body", schema).as("Payload"))
  .select($"Payload.Metadata.ApiReceived".as("Time"))
  .withColumn("NewTime", to_timestamp($"Time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
  .withColumn("NewTime2", unix_timestamp($"Time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ").cast(TimestampType))
  .withColumn("NewTime3", $"Time".cast(TimestampType))
The output types of the above DataFrame:
df:org.apache.spark.sql.DataFrame
Time:string
NewTime:timestamp
NewTime2:timestamp
NewTime3:timestamp
And the output values:
Time 2019-02-05T14:06:31.556+0100
NewTime 2019-02-05 13:06:31
NewTime2 2019-02-05 13:06:31
NewTime3 null
Is there a way to make Spark handle the conversion without resorting to UDFs?
Update
After a more thorough investigation, I found that Spark's datetime parsing is somewhat inconsistent. :)
val df = Seq(
  // Extended format
  "2019-02-05T14:06:31.556+01:00",
  "2019-02-05T14:06:31.556+01",
  "2019-02-05T14:06:31.556",
  // Basic format
  "20190205T140631556+0100",
  "20190205T140631556+01",
  "20190205T140631556",
  // Extended format mixed with basic
  "2019-02-05T14:06:31.556+0100",
  "20190205T140631556+01:00"
).toDF
val formatStrings = Seq(
  "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "yyyy-MM-dd'T'HH:mm:ss.SSSX",
  "yyyyMMdd'T'HHmmssSSSZ",
  "yyyyMMdd'T'HHmmssSSSX"
)
val format = formatStrings(0)

val df2 = df
  .select($"value".as("Time"))
  .withColumn("NewTime3", $"Time".cast(TimestampType))
  .withColumn("NewTime", to_timestamp($"Time", format))
  .withColumn("NewTime2", unix_timestamp($"Time", format).cast(TimestampType))
  .withColumn("NewTime4", date_format($"Time", format))

display(df2)
If you run these DataFrames and compare the output, it's somewhat disheartening. The most permissive format string is the second one, yyyy-MM-dd'T'HH:mm:ss.SSSX.
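To compare all four patterns in one pass instead of editing format by hand, a sketch like this works (assuming the df and formatStrings defined above are in scope):

// Add one to_timestamp column per candidate pattern, then inspect the nulls.
val compared = formatStrings.zipWithIndex.foldLeft(df.select($"value".as("Time"))) {
  case (acc, (fmt, i)) => acc.withColumn(s"parsed_$i", to_timestamp($"Time", fmt))
}
display(compared)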
The only reasonable way to handle this is a UDF that makes sure all ISO8601 strings adhere to a form that the function you plan to use understands.
Still, I haven't found a way to preserve the millisecond part for both formats:
2019-02-05T14:06:31.556+01:00 and
2019-02-05T14:06:31.556+0100
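For what it's worth, a sketch of that normalizing approach: a hypothetical UDF that inserts the colon into a trailing +0100-style offset, after which a plain cast keeps the milliseconds (assuming a missing colon is the only deviation):

import org.apache.spark.sql.functions.udf

// Hypothetical normalizer: rewrite a trailing +0100/-0100 offset to +01:00/-01:00.
// Strings already in the colon form, or without an offset, pass through unchanged.
val normalizeOffset = udf { s: String =>
  Option(s).map(_.replaceAll("([+-]\\d{2})(\\d{2})$", "$1:$2")).orNull
}

df2.withColumn("NewTime5", normalizeOffset($"Time").cast(TimestampType))

Since cast(TimestampType) handles the +01:00 form with the milliseconds intact, normalizing first sidesteps the truncation of to_timestamp and unix_timestamp.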
Update 2
https://issues.apache.org/jira/browse/SPARK-17545?jql=project%20%3D%20SPARK%20AND%20text%20~%20iso8601
Apparently it is NOT in accordance with the ISO8601 standard to mix the basic and extended forms. The string "2019-02-05T14:06:31.556+0100" is therefore not in standard format. It does seem to be valid according to RFC 822, though.
If I understand the JIRA ticket correctly, the standard parsing (i.e. cast() on a string column) only handles correctly formatted ISO8601 strings, not RFC 822 or other edge cases (i.e. mixing extended and basic forms). If you have an edge case, you have to supply the format string and use another parsing method.
I don't have access to the ISO8601:2004 standard so I can't check, but if the comment in the JIRA is correct, the internet needs an update. A lot of web pages conflate RFC 822 and ISO8601 and list "2019-02-05T14:06:31.556+0100" as a legal ISO8601 string.
Source: https://stackoverflow.com/questions/54601917/spark-2-4-0-unable-to-parse-iso8601-string-into-timestamptype-preserving-ms