Tables missing on filesystem in AWS Athena

走远了吗. Submitted on 2020-08-08 06:39:05

Question


I've created a partitioned table in Athena with this statement:

CREATE EXTERNAL TABLE IF NOT EXISTS matchdata.stattable (
  `matchResult` string,
  ...
) PARTITIONED BY (
  year int ,
  month int,
  day int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://******/data/year=2019/month=8/day=2/'
TBLPROPERTIES ('has_encrypted_data'='false');

I then ran MSCK REPAIR TABLE stattable, but it reported "Tables missing on filesystem", and queries against the table return zero records. Using the qualified name matchdata.stattable gives the same result.

Queries against another table without partitioning work fine, but as the service keeps running and the dataset grows, I need partitioning.

The example data path is data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz. How can I settle this issue?


Answer 1:


As you've discovered (but with some more context for the people having the same issue) MSCK REPAIR TABLE … only understands Hive style partitioning, e.g. /data/year=2019/month=08/day=10/file.json. What the command really does is scan through the prefix on S3 corresponding to the table's LOCATION directive and look for path components that look like that.
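To make "Hive style" concrete, here is a rough Python sketch of the kind of check involved. It is only an illustration of the naming convention, not Athena's actual implementation, and the function name hive_partition_values is made up for this example.

```python
import re

# One Hive-style path component looks like "name=value", e.g. "year=2019".
_HIVE_COMPONENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.+)$")


def hive_partition_values(key, partition_columns):
    """Return {column: value} if `key` carries a name=value path component
    for every partition column, otherwise None."""
    found = {}
    for component in key.split("/")[:-1]:  # the last component is the file name
        match = _HIVE_COMPONENT.match(component)
        if match:
            found[match.group(1)] = match.group(2)
    if all(column in found for column in partition_columns):
        return {column: found[column] for column in partition_columns}
    return None


columns = ["year", "month", "day"]
# Hive-style key: the partition values can be read from the path.
print(hive_partition_values("data/year=2019/month=8/day=2/file.gz", columns))
# Key laid out like the one in the question: nothing to discover.
print(hive_partition_values("data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz", columns))
```

The second call returns None, which mirrors why the repair command finds no partitions for paths like data/2019/8/2/.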

This is just a limitation of MSCK REPAIR TABLE …; you can manually add partitions with other path styles like this:

ALTER TABLE the_table ADD PARTITION (year = '2019', month = '08', day = '10') LOCATION 's3://some-bucket/data/2019/08/10/'

Also see https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
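If you add a partition every day, generating the statement is easy to script. The following Python sketch builds one for the unpadded year/month/day layout from the question. The helper name add_partition_statement and the bucket some-bucket are placeholders, and the values are left unquoted on the assumption that the partition columns are declared as int, as in the question's table.

```python
from datetime import date


def add_partition_statement(table, base_location, day):
    """Build an ALTER TABLE statement for one day of data stored under
    <base_location>/<year>/<month>/<day>/ (no zero padding)."""
    location = f"{base_location.rstrip('/')}/{day.year}/{day.month}/{day.day}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year = {day.year}, month = {day.month}, day = {day.day}) "
        f"LOCATION '{location}'"
    )


# "some-bucket" is a placeholder; substitute the real bucket name.
print(add_partition_statement("matchdata.stattable", "s3://some-bucket/data", date(2019, 8, 2)))
```

The resulting string can be pasted into the Athena console or submitted by whatever job uploads the data. IF NOT EXISTS makes it safe to run the same statement twice.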

I would go so far as to say that you should avoid using MSCK REPAIR TABLE … altogether. It's slow, and only gets slower the more partitions you have. It's much more efficient to run ALTER TABLE … ADD PARTITION … when you add new data on S3, because you know what you just added and where it is, so telling Athena to scan through your whole prefix is unnecessary. Even faster is using the Glue API directly, but that's more code, unfortunately.
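For reference, a minimal sketch of the Glue route in Python could look like the following. It assumes a boto3 Glue client is passed in and that new partitions should reuse the table's own storage descriptor (format and SerDe settings) with only the location changed. The helper names partition_input and add_partitions, and the bucket in the usage comment, are made up for this example.

```python
import copy


def partition_input(table_storage_descriptor, values, location):
    """Build one Glue PartitionInput that reuses the table's format and
    SerDe settings but points at the partition's own location."""
    descriptor = copy.deepcopy(table_storage_descriptor)
    descriptor["Location"] = location
    return {"Values": [str(v) for v in values], "StorageDescriptor": descriptor}


def add_partitions(glue, database, table, partitions):
    """Register (values, location) pairs through the Glue API and return
    the list of errors Glue reports (empty if everything was created)."""
    descriptor = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
    inputs = [partition_input(descriptor, values, location) for values, location in partitions]
    errors = []
    # BatchCreatePartition accepts at most 100 partitions per call.
    for start in range(0, len(inputs), 100):
        response = glue.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=inputs[start:start + 100],
        )
        errors.extend(response.get("Errors", []))
    return errors


# Usage (needs boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   add_partitions(boto3.client("glue"), "matchdata", "stattable",
#                  [((2019, 8, 2), "s3://some-bucket/data/2019/8/2/")])
```

Partitions that already exist come back in the returned error list rather than raising an exception, so the caller can decide whether to ignore them.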




Answer 2:


I solved this problem by renaming the prefix of the files on S3.

S3 has no true rename or move operation: the mv command copies each object to a new key and then deletes the original.

Running this command in a terminal puts the files under a Hive-style prefix, so that Hive can recognize the partition location:

aws s3 mv s3://***/data/2019/8/7/ s3://***/data/year=2019/month=8/day=7/ --recursive
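With many days of data, running one mv per directory gets tedious. As a rough sketch, the key mapping itself can be expressed in Python as below. The function name hive_style_key is made up for this example, and the pattern assumes keys of the form prefix/year/month/day/file as in the question. The function only computes the new key; the copy and delete would still have to be done separately.

```python
import re

# Matches keys such as "data/2019/8/7/file.gz": prefix, year, month, day, rest.
_DATE_KEY = re.compile(
    r"^(?P<prefix>.*?)(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<rest>.+)$"
)


def hive_style_key(key):
    """Map prefix/<year>/<month>/<day>/file to
    prefix/year=<year>/month=<month>/day=<day>/file.
    Returns None when the key does not follow that layout."""
    match = _DATE_KEY.match(key)
    if match is None:
        return None
    return (
        f"{match['prefix']}year={match['year']}/"
        f"month={match['month']}/day={match['day']}/{match['rest']}"
    )


print(hive_style_key("data/2019/8/7/1SxFHaUeHfesLtPs._BjDk.gz"))
```

Keys that are already in Hive style do not match the pattern and return None, so running the mapping twice does not rewrite them again.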


Source: https://stackoverflow.com/questions/57395856/tables-missing-on-filesystem-in-aws-athena
