I have the following table in Hive:
user-id, user-name, user-address,clicks,impressions,page-id,page-name
I need to find out top 5 users[user-id,user-name,user-ad
You can use the each_top_k function of Hivemall for efficient top-k computation on Apache Hive.
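-- Hyphenated column names must be backtick-quoted in Hive;
-- otherwise page-id is parsed as "page minus id".
select
  `page-id`,
  `user-id`,
  clicks
from (
  select
    -- arguments: k, group key, score to rank by, then the columns to emit
    each_top_k(5, `page-id`, clicks, `page-id`, `user-id`)
      as (rank, clicks, `page-id`, `user-id`)
  from (
    select
      `page-id`, `user-id`, clicks
    from
      mytable
    -- each_top_k expects all rows of a group to arrive together
    DISTRIBUTE BY `page-id` SORT BY `page-id`
  ) t1
) t2
order by `page-id` ASC, clicks DESC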
The each_top_k UDTF is much faster than other ways of running top-k queries in Hive (e.g., DISTRIBUTE BY combined with rank()) because it does not hold the whole ranking as an intermediate result; it only tracks the current top k rows of each group as the rows stream through.
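
For comparison, the rank-based approach mentioned above might look like the following. This is only a sketch using Hive's built-in rank() windowing function. It assumes the same table name (mytable) and columns as the query above, the alias rnk is made up for illustration, and the backtick quoting is there because hyphenated column names would otherwise not parse.

-- Sketch of the conventional window-function approach, for comparison only.
-- Assumes the table mytable with columns `page-id`, `user-id`, clicks.
select
  `page-id`,
  `user-id`,
  clicks
from (
  select
    `page-id`,
    `user-id`,
    clicks,
    -- ranks every row within its page before any filtering happens
    rank() over (partition by `page-id` order by clicks desc) as rnk
  from
    mytable
) t
where rnk <= 5
order by `page-id` asc, clicks desc

Note that rank() assigns tied rows the same rank, so a page can return more than 5 rows when click counts tie; row_number() could be swapped in if exactly 5 rows per page are required.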