Question
I'm storing daily reports per client for query with Athena.
At first I thought I'd use a client=c_1/month=12/day=01/ or client=c2/date=2020-12-01/ folder structure, and run MSCK REPAIR TABLE daily to make the new day's partition available for query.
Then I realized there's the $path special column, so if I store files as 2020-12-01.csv I could run a query with WHERE "$path" LIKE '%12-01%', thus saving a partition and the need to detect/add it daily.
I can see this having an impact on performance if there were a lot of daily data, but in my case the day partition will include one file at most, so the partition is mostly there to give me a field to query on, not to reduce the query dataset.
Any other downside?
Answer 1:
When you filter on the $path column, the whole table (or partition) location has to be listed in full.
If you have a large number of objects in S3, this listing can become a bottleneck.
Partitions avoid this problem.
Of course, having a large number of partitions is also a problem.
I don't know the cardinality of the client column, so it's hard to tell how many partitions to expect with this approach.
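To put rough numbers on this trade-off, here is a back-of-the-envelope sketch in Python. It assumes S3 returns at most 1,000 keys per listing page and that the daily layout creates one partition per client per day. The function names and the example figures (50 clients, 3 years of data) are made up for illustration and are not taken from the question.

```python
import math

S3_LIST_PAGE_SIZE = 1000  # assumed maximum number of keys per S3 listing page


def listing_pages(num_objects: int, page_size: int = S3_LIST_PAGE_SIZE) -> int:
    """Number of listing pages needed to enumerate all objects under a location."""
    return math.ceil(num_objects / page_size)


def daily_partition_count(num_clients: int, num_days: int) -> int:
    """Partitions created by a client=.../date=... layout (one per client per day)."""
    return num_clients * num_days


# Hypothetical figures: 50 clients, one file per client per day, 3 years of data.
clients, days = 50, 3 * 365
files = clients * days

print("files:", files)
print("listing pages for an unpartitioned location:", listing_pages(files))
print("partitions with a daily layout:", daily_partition_count(clients, days))
```

Plugging in your own client count and retention period shows which of the two numbers grows uncomfortably large first.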
Answer 2:
Currently Athena does not apply any optimisations for $path, which means that there is no meaningful difference between WHERE "$path" LIKE '%12-01%' and WHERE "date" = '2020-12-01' (assuming you have a column date which contains the same date as the file name). Your data probably already has a date or datetime column, and your queries will be more readable using it rather than $path.
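Readability aside, a pattern such as '%12-01%' is a plain substring match, so it can also match paths that are not the day you meant. The Python sketch below mimics that behaviour with the `in` operator. The bucket name and object keys are hypothetical, and the helper only models patterns of the form '%fragment%' where the fragment contains no wildcards.

```python
# Hypothetical object paths, as they might appear in the "$path" column.
paths = [
    "s3://example-bucket/client=c_1/2020-12-01.csv",
    "s3://example-bucket/client=c_1/2021-12-01.csv",  # same day, different year
    "s3://example-bucket/client=c_1/2012-01-15.csv",  # "2012-01" contains "12-01"
    "s3://example-bucket/client=c_1/2020-12-02.csv",
]


def like_contains(value: str, fragment: str) -> bool:
    """Mimic SQL `value LIKE '%fragment%'` for a fragment without wildcards."""
    return fragment in value


# Loose pattern, as in the question: matches more than the intended file.
loose = [p for p in paths if like_contains(p, "12-01")]

# Anchoring on the full file name removes the ambiguity.
strict = [p for p in paths if like_contains(p, "/2020-12-01.csv")]

print(loose)
print(strict)
```

An equality filter on a real date column sidesteps this kind of accidental match entirely.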
You are definitely on the right track questioning whether or not you need the date part of your current partitioning scheme. There are lots of different considerations when partitioning data sets, and it's not always easy to say what is right without analysing the situation in detail.
I would recommend having some kind of time-based partition key. Otherwise you will have no way to limit the amount of data read by queries, and they will get slower and more expensive as time goes on. Partitioning on date is probably too fine-grained for your use case, but perhaps year or month would work.
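As a minimal sketch of a coarser, month-level layout, the Python helper below derives an object key from a client ID and a report date. The key format (client=.../month=YYYY-MM/...) and the function name are assumptions for illustration only, so use whatever scheme matches your table definition.

```python
from datetime import date


def report_key(client: str, day: date) -> str:
    """Build a key partitioned by client and month, keeping the day in the file name."""
    return f"client={client}/month={day:%Y-%m}/{day.isoformat()}.csv"


print(report_key("c_1", date(2020, 12, 1)))
```

With a layout like this, each month adds one partition per client instead of roughly thirty, and the exact day is still recoverable from the file name or from a date column in the data.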
However, if there will only be data for a client for a short time (less than one thousand files in total, the size of one S3 listing page), or queries always read all the data for a client, you don't need a time-based partition key.
To do a deeper analysis of how to partition your data I would need to know more about the types of queries you will be running, how the data is updated, how much data the files are expected to contain, and how much difference there will be from client to client.
Source: https://stackoverflow.com/questions/62290584/athena-path-vs-partition