How to handle TIMESTAMP_MICROS parquet fields in Presto/Athena

帅比萌擦擦* 提交于 2019-12-08 03:56:17

问题


Presently, we have a DMS task that will take the contents of a MySQL DB and dump files to S3 in parquet format.

The format for the timestamps in parquet ends up being TIMESTAMP_MICROS.

This is a problem as Presto (the underlying implementation of Athena) does not support timestamps in microsecond precision and makes the assumption that all timestamps are in millisecond precision.

This does not cause any errors directly but it makes the times display as some extreme future date as it is interpreting the number of microseconds as number of milliseconds.

We are currently working around this by creating Athena views on top of the Athena tables that take these these timestamps, convert them to a unix timestamp (i.e. number), divide by 1000 to convert to millisecond resolution and then convert back to a timestamp.

This was fine for a proof of concept and identifying the problem but this won't work for us as we have lots of tables with lots of timestamp columns and don't want to introduce this fragile view layer on top.

Support tickets were created for both the Athena and DMS teams and both confirmed the problem and created feature requests on their end to create a native solution for this incompatible but neither support engineer could offer a built-in workaround. DMS does not seem to support changing the column type to be millisecond precision or a varchar string. For Athena, this is simply a documented Presto limitation.

Has anyone run into this before and found a solution?

Changing the precision in the MySQL database is not really an option as it is used by many clients and we don't want to risk breaking the interface.

I am considering creating a job that will run anytime a new parquet is created to look for TIMESTAMP_MICRO fields and column them to TIMESTAMP_MILLIS. Not exactly sure the best way to go about this (Glue/Spark job?)

DMS was great and has made the dump to parquet super simple, there may be other ways to extract the data from MySQL. Any other tools that can do this without much custom development?


回答1:


With the target endpoint, you can set "Extra Connection Attributes". If you set parquetTimestampInMillisecond to true, the timestamps will be in milliseconds.

Source



来源:https://stackoverflow.com/questions/56430840/how-to-handle-timestamp-micros-parquet-fields-in-presto-athena

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!