Get latest data from Hive table with multiple partition columns

风流意气都作罢 submitted on 2021-02-19 05:36:06

Question


I have a Hive table with the following structure:

ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int

This table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Some sample partitions:

year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45

I want to select only the data in the latest partition. I tried using max() on the partition columns, but that is not efficient because the data size is huge. How can I get this data conveniently using Hive SQL?


Answer 1:


If the latest partition is always on the current date, you can filter on the current date's partition and use rank() to find the records with the latest hour and minute:

select * --list columns here
from
(
select s.*, rank() over(order by s.hour desc, s.minute desc) rnk
  from your_table s
 where s.year=year(current_date)   --filter current day (better pass variables calculated if possible)
   and s.month=month(current_date) --partition columns are int, so compare with int values directly
   and s.day=day(current_date)
   -- and s.hour=hour(current_timestamp) --consider also adding this
) s
where rnk=1 --latest hour, minute

If the latest partition is not necessarily on current_date, you can use rank() over (order by s.year desc, s.month desc, s.day desc, s.hour desc, s.minute desc) instead. Without a filter on the date, however, this scans the whole table and is not efficient.

It will perform best if you can calculate the partition filters in the shell and pass them as parameters. See the comments in the code.
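
Calculating the filter in the shell could look something like the sketch below. It is not part of the original answer: it assumes the hive CLI is on the PATH, that the partitions follow the year=/month=/day=/hour=/minute= layout shown in the question, and the function name latest_partition_filter is made up for illustration. The idea is to read the partition list from the metastore (which is cheap compared with scanning the data), pick the newest one, and turn it into a WHERE clause.

```shell
#!/usr/bin/env bash

# latest_partition_filter (hypothetical helper): reads partition specs such as
#   year=2019/month=12/day=30/hour=09/minute=45
# from stdin, one per line, and prints a WHERE-clause filter for the newest one.
latest_partition_filter() {
  awk -F'[/=]' '
    NF >= 10 {
      # Build a fixed-width, zero-padded key so string comparison orders by time.
      key = sprintf("%04d%02d%02d%02d%02d", $2, $4, $6, $8, $10)
      if (key > best) {
        best = key
        y = $2 + 0; m = $4 + 0; d = $6 + 0; h = $8 + 0; mi = $10 + 0
      }
    }
    END {
      if (best != "")
        printf "year=%d and month=%d and day=%d and hour=%d and minute=%d\n", y, m, d, h, mi
    }'
}

# Assumed usage (needs a working hive CLI and a table called your_table):
#   filter=$(hive -S -e "show partitions your_table" | latest_partition_filter)
#   hive -e "select * from your_table where ${filter}"
```

Because the resulting filter contains only constants on the partition columns, Hive should be able to prune down to the single latest partition instead of ranking rows across the whole table.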



Source: https://stackoverflow.com/questions/59674580/get-latest-data-from-hive-table-with-multiple-partition-columns
