Impala timestamps don't match Hive - a timezone issue?

Backend · unresolved · 4 answers · 2061 views
时光取名叫无心 asked 2020-12-29 00:29

I have some eventlog data in HDFS that, in its raw format, looks like this:

2015-11-05 19:36:25.764 INFO    [...etc...]

An external table p

4 Answers
  • 2020-12-29 01:01

Be VERY careful with the other answers here due to https://issues.apache.org/jira/browse/IMPALA-2716

For now, the best workaround is to avoid the TIMESTAMP data type and store timestamps as strings.
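Storing timestamps as strings sidesteps the problem because no engine applies an implicit local/UTC conversion to a STRING column. A minimal Python sketch of the idea (the helper names and the format string are illustrative assumptions, not part of any Impala API; match the format to whatever your pipeline emits):

```python
from datetime import datetime, timezone

# Store timestamps as fixed-format UTC strings rather than TIMESTAMP columns,
# so neither Hive nor Impala applies an implicit local/UTC conversion.
FMT = "%Y-%m-%d %H:%M:%S.%f"

def to_storage(dt: datetime) -> str:
    """Normalize an aware datetime to UTC and render it for the STRING column."""
    return dt.astimezone(timezone.utc).strftime(FMT)[:-3]  # trim to milliseconds

def from_storage(s: str) -> datetime:
    """Parse the stored string back, attaching UTC explicitly."""
    return datetime.strptime(s, FMT).replace(tzinfo=timezone.utc)

raw = "2015-11-05 19:36:25.764"
dt = from_storage(raw)
print(to_storage(dt) == raw)  # round-trips without any timezone drift
```

A side benefit: because the format is fixed-width, such strings sort lexicographically in chronological order, so comparisons and ORDER BY still behave as expected.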

  • 2020-12-29 01:07

    As mentioned in https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html

You can use --use_local_tz_for_unix_timestamp_conversions=true and --convert_legacy_hive_parquet_utc_timestamps=true to match Hive results.

The first flag ensures Impala converts to the local timezone whenever you use a datetime function. Both can be set as Impala Daemon startup options, as described in this document:

    https://docs.cloudera.com/documentation/enterprise/5-6-x/topics/impala_config_options.html
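For example, on a package-based install the daemon flags typically live in a defaults file; the path below is an assumption for illustration (it varies by distribution, and Cloudera Manager-managed clusters set this through the configuration UI instead):

```shell
# /etc/default/impala -- path is distribution-dependent (assumption)
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
  --use_local_tz_for_unix_timestamp_conversions=true \
  --convert_legacy_hive_parquet_utc_timestamps=true"
```

Restart the impalad service afterwards for the flags to take effect.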

  • 2020-12-29 01:09

Hive writes timestamps to Parquet differently. You can use the impalad flag --convert_legacy_hive_parquet_utc_timestamps to tell Impala to do the conversion on read. See the TIMESTAMP documentation for more details.

    This blog post has a brief description of the issue:

When Hive stores a timestamp value in Parquet format, it converts local time to UTC time, and when it reads the data back out, it converts back to local time. Impala, on the other hand, does no conversion when it reads the timestamp field out; hence, UTC time is returned instead of local time.

The impalad flag tells Impala to do the conversion when reading timestamps in Parquet files produced by Hive. It incurs a small (likely minimal) cost, so if that is an issue for you, consider writing your timestamps with Impala instead.
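The mismatch can be sketched numerically: Hive subtracts the writer's UTC offset on write, and Impala (without the flag) returns the stored instant unchanged, so a reader in that zone sees the value shifted by the offset. A toy model in Python (the five-hour offset is an arbitrary example, not derived from the question):

```python
from datetime import datetime, timedelta

UTC_OFFSET = timedelta(hours=-5)  # example writer timezone, e.g. US Eastern (EST)

local_value = datetime(2015, 11, 5, 19, 36, 25)

# Hive on write: local time -> UTC
stored = local_value - UTC_OFFSET

# Hive on read: UTC -> local, so the value round-trips cleanly
hive_reads = stored + UTC_OFFSET

# Impala on read (without --convert_legacy_hive_parquet_utc_timestamps):
# the stored UTC instant is returned as-is
impala_reads = stored

print(hive_reads == local_value)                 # True
print(impala_reads - local_value == -UTC_OFFSET) # True: off by 5 hours here
```

This is exactly why the same Parquet file yields different timestamps in the two engines until the flag (or the string workaround) is applied.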

  • 2020-12-29 01:10

    On a related note, as of Hive v1.2, you can also disable the timezone conversion behaviour with this flag:

    hive.parquet.timestamp.skip.conversion
    

    "Current Hive implementation of parquet stores timestamps to UTC, this flag allows skipping of the conversion on reading parquet files from other tools."

This was added as part of https://issues.apache.org/jira/browse/HIVE-9482

Lastly, not strictly a timezone setting, but for compatibility between Spark (v1.3 and up) and Impala on Parquet files, there's this flag:

    spark.sql.parquet.int96AsTimestamp
    

    https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#configuration

See also: https://issues.apache.org/jira/browse/SPARK-12297
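If you submit Spark jobs from the command line, the flag can also be passed per-job rather than set globally (a sketch; the script name is a placeholder):

```shell
# Have Spark interpret Parquet INT96 values as timestamps,
# matching what Impala and Hive write.
spark-submit \
  --conf spark.sql.parquet.int96AsTimestamp=true \
  your_job.py   # placeholder script name
```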
