Tables missing on filesystem in AWS Athena

问题

I've created a table with auto partitioning with this code on Athena.

CREATE EXTERNAL TABLE IF NOT EXISTS matchdata.stattable (
  `matchResult` string,
  ...
) PARTITIONED BY (
  year int ,
  month int,
  day int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://******/data/year=2019/month=8/day=2/'
TBLPROPERTIES ('has_encrypted_data'='false');

and I ran MSCK REPAIR TABLE stattable, but got Tables missing on filesystem and query result is zero records returned. matchdata.stattable gets same result.

Another table without partitioning, the query works fine. But as the service continues and dataset gets grow, I must go with partitioning.

The example data path is data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz. How can I settle this issue?

回答1:

As you've discovered (but with some more context for the people having the same issue) MSCK REPAIR TABLE … only understands Hive style partitioning, e.g. /data/year=2019/month=08/day=10/file.json. What the command really does is scan through the prefix on S3 corresponding to the table's LOCATION directive and look for path components that look like that.

This is just a limitation with MSCK REPAIR TABLE …, you can manually add partitions with other path styles like this:

ALTER TABLE the_table ADD PARTITION (year = '2019', month = '08', day = '10') LOCATION 's3://some-bucket/data/2019/08/10/'

Also see https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html

I would go so far as to say that you should avoid using MSCK REPAIR TABLE … altogether. It's slow, and only gets slower the more partitions you have. It's much more efficient to run ALTER TABLE … ADD PARTITION … when you add new data on S3, because you know what you just added and where it is, so telling Athena to scan through your whole prefix is unnecessary. Even faster is using the Glue API directly, but that's more code, unfortunately.

回答2:

I solved this problem by renaming s3 files' prefix.

You can't actually rename or move file in s3 directly. By mv command, you should create another key and delete existing one.

By run this code on console, you can make Hive can understand location of partition.

aws s3 --recursive mv s3://***/data/2019/8/7/ s3://***/data/year=2019/month=8/day=7/

来源：https://stackoverflow.com/questions/57395856/tables-missing-on-filesystem-in-aws-athena

标签

sql

Hive

amazon-athena