Hive external table optimal partition size

Submitted by 纵饮孤独 on 2020-12-26 03:20:34

Question


What is the optimal size for an external table partition? I am planning to partition the table by year/month/day, and we receive about 2 GB of data daily.


Answer 1:


The optimal partitioning is the one that matches how your table is used. Choose it based on:

  1. how the data is queried (if you mostly work with daily data, partition by date).
  2. how the data is loaded (parallel threads should each load their own partitions, with no overlap).

2 GB is not too much even for a single file, though this again depends on your usage scenario. Avoid unnecessarily complex and redundant partitions such as (year, month, date): in this case date alone is enough for partition pruning.
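To make the difference concrete, here is a minimal sketch that builds the directory name for one day under each layout, plus a HiveQL statement that registers a daily partition on an external table. The table name `events`, the root path `/data/events`, and the key names `dt`, `year`, `month`, `day` are all hypothetical, chosen only for illustration.

```python
from datetime import date


def single_key_path(day: date) -> str:
    """One partition key: a single directory level per day."""
    return f"dt={day.isoformat()}"


def nested_key_path(day: date) -> str:
    """Three partition keys: three directory levels for the same day."""
    return f"year={day.year}/month={day.month:02d}/day={day.day:02d}"


def add_partition_ddl(table: str, root: str, day: date) -> str:
    """Build a HiveQL statement registering one daily partition.

    The table name and root path are supplied by the caller; nothing
    here talks to Hive, it only formats the statement text.
    """
    path = f"{root}/{single_key_path(day)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{day.isoformat()}') LOCATION '{path}'"
    )


day = date(2020, 12, 26)
print(single_key_path(day))
print(nested_key_path(day))
print(add_partition_ddl("events", "/data/events", day))
```

Both layouts produce exactly one leaf directory per day. The nested one only adds two more directory levels and two more keys to mention in every filter.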




Answer 2:


Hive partition definitions are stored in the metastore, so too many partitions will take up a lot of space there.

Partitions are stored as directories in HDFS, so many partition keys produce hierarchical directories, which makes scanning them slower.

Your query will be executed as a MapReduce job, so there is no benefit in making partitions too small.

It depends on the case: think about how your data will be queried. For your case I would prefer a single key defined as 'yyyymmdd'. That gives 365 partitions per year, only one level in the table directory, and 2 GB of data per partition, which is a good size for a MapReduce job.
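The arithmetic behind those figures can be checked with a small sketch. It assumes the 2 GB/day volume from the question stays constant; the monthly row is included only for comparison, and the function names are made up for this example.

```python
import calendar
from typing import Tuple

DAILY_GB = 2.0  # daily volume stated in the question


def days_in_year(year: int) -> int:
    """Number of calendar days, i.e. daily partitions, in a year."""
    return 366 if calendar.isleap(year) else 365


def partition_stats(year: int, partitions: int) -> Tuple[int, float]:
    """Return (partition count, average GB per partition) for one year."""
    total_gb = days_in_year(year) * DAILY_GB
    return partitions, total_gb / partitions


year = 2021
print("daily  :", partition_stats(year, days_in_year(year)))
print("monthly:", partition_stats(year, 12))
```

For a non-leap year the daily layout gives 365 partitions of 2 GB each, while a monthly layout gives 12 partitions of roughly 61 GB each.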

For completeness: if you use Hive < 0.12, make your partition key string-typed, see here.

Useful blog here.




Answer 3:


Hive partitioning is most effective when the data is sparse. By sparse I mean that the data has visible internal divisions, such as by year, month or day.

In your case, partitioning by date doesn't make much sense, because each day holds only 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense: it keeps query time reasonable and avoids creating too many small partition files.
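If you go with a coarser granularity, the partition value has to be derived from each record's date at load time. A minimal sketch of that mapping, assuming ISO week numbering and the key formats `YYYY-MM` and `YYYY-Www` (both are illustrative choices, not something Hive requires):

```python
from datetime import date


def month_key(day: date) -> str:
    """Partition value for monthly partitions, e.g. '2020-12'."""
    return day.strftime("%Y-%m")


def week_key(day: date) -> str:
    """Partition value for weekly partitions, using the ISO week number.

    The ISO year can differ from the calendar year around New Year,
    so it is taken from isocalendar() rather than from day.year.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


day = date(2020, 12, 26)
print(month_key(day))
print(week_key(day))
```

Note that a query for a single day then has to filter on a regular date column as well, since pruning only narrows the scan down to the enclosing week or month.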



Source: https://stackoverflow.com/questions/37575615/hive-external-table-optimal-partition-size
