How to force inferSchema for CSV to consider integers as dates (with “dateFormat” option)?

陌路散爱 posted on 2019-11-29 07:36:16

If my understanding is correct, the code implies the following order of type inference (types listed first are tried first):

  • NullType
  • IntegerType
  • LongType
  • DecimalType
  • DoubleType
  • TimestampType
  • BooleanType
  • StringType

With that, I think the issue is that 20171001 matches IntegerType before TimestampType is even considered (and TimestampType uses the timestampFormat option, not dateFormat).
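The precedence described above can be illustrated with a small pure-Python model. This is only a sketch of the ordering, not Spark's actual code: the function name infer_field, the simplified DecimalType rule, and the use of Python strptime patterns (rather than Spark's Java-style patterns) are all assumptions made for illustration.

```python
from datetime import datetime

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def _parses(parser, value):
    """Return True if parser(value) succeeds without a ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_field(value, timestamp_format="%Y%m%d"):
    """Hypothetical model of the inference order; earlier checks win."""
    if value == "":
        return "NullType"
    if _parses(int, value):
        n = int(value)
        if INT32_MIN <= n <= INT32_MAX:
            return "IntegerType"
        if INT64_MIN <= n <= INT64_MAX:
            return "LongType"
        # Simplification: whole numbers too large for a long become decimals.
        return "DecimalType"
    if _parses(float, value):
        return "DoubleType"
    # The timestamp check is only reached when every numeric check failed.
    if _parses(lambda s: datetime.strptime(s, timestamp_format), value):
        return "TimestampType"
    if value.lower() in ("true", "false"):
        return "BooleanType"
    return "StringType"


# 20171001 is a valid date under the given format, but the integer check wins.
print(infer_field("20171001"))
# A dashed value fails every numeric check, so the timestamp check is reached.
print(infer_field("2017-10-01", "%Y-%m-%d"))
```

Even though the format passed to the first call would accept 20171001 as a date, the value never reaches that check, which mirrors the behaviour described in the answer.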

One solution would be to define the schema explicitly and pass it to the schema operator (of DataFrameReader). Another would be to let Spark SQL infer the schema and then convert the column with the cast operator.

I'd choose the former if the number of fields is not high.
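Both options might look roughly like this in PySpark. This is a minimal sketch, not the asker's actual code: the column names id and event_date, the header option, and the helper function names are hypothetical. The reader is passed in as an existing SparkSession, and the pyspark import is kept inside the second function so the first one has no import of its own.

```python
# Hypothetical two-column layout; adjust names and types to the real file.
SCHEMA_DDL = "id INT, event_date DATE"


def read_with_explicit_schema(spark, path):
    """Option 1: skip inference and state that event_date is a date."""
    return (
        spark.read
        .option("header", "true")          # assumption: file has a header row
        .option("dateFormat", "yyyyMMdd")  # pattern used to parse DATE columns
        .schema(SCHEMA_DDL)                # DataFrameReader.schema accepts a DDL string
        .csv(path)
    )


def read_then_convert(spark, path):
    """Option 2: let Spark infer an integer column, then convert it."""
    from pyspark.sql.functions import col, to_date  # requires pyspark

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
    )
    # The inferred integer is turned back into text and parsed as a date.
    return df.withColumn(
        "event_date", to_date(col("event_date").cast("string"), "yyyyMMdd")
    )
```

With only a handful of columns the first function is less work and avoids the extra pass over the data that inferSchema needs.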

user6910411

In this case you simply cannot depend on the schema inference due to format ambiguity.

The input can be parsed both as IntegerType (or any higher-precision numeric type) and as TimestampType, and the former has higher precedence: internally Spark tries IntegerType -> LongType -> DecimalType -> DoubleType -> TimestampType. As a result, the inference mechanism will never reach the TimestampType case.

To be specific, with schema inference enabled, Spark will call tryParseInteger, which will successfully parse the input and stop. For subsequent values the type inferred so far is already IntegerType, so the matching case is entered directly and inference again finishes at the same tryParseInteger call.
