Hive getting top n records in group by query

终归单人心 2020-12-07 17:47

I have the following table in Hive:

user-id, user-name, user-address,clicks,impressions,page-id,page-name

I need to find out top 5 users[user-id,user-name,user-ad

6 Answers
  •  青春惊慌失措
    2020-12-07 18:43

    You can use the each_top_k function of Hivemall for efficient top-k computation on Apache Hive.

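    -- Column names containing hyphens must be quoted with backticks in Hive.
    select
      `page-id`,
      `user-id`,
      clicks
    from (
      select
        -- arguments: k, group key, value to rank by, then the columns to carry through
        each_top_k(5, `page-id`, clicks, `page-id`, `user-id`)
          as (rank, clicks, `page-id`, `user-id`)
      from (
        select
          `page-id`, `user-id`, clicks
        from
          mytable
        -- each_top_k expects the rows of each group to arrive together
        DISTRIBUTE BY `page-id` SORT BY `page-id`
      ) t1
    ) t2
    order by `page-id` ASC, clicks DESC;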
    

    The each_top_k UDTF is very fast compared to other ways of running top-k queries in Hive (e.g., DISTRIBUTE BY with rank) because it does not hold the whole ranking in the intermediate result.
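
    For comparison, the conventional window-function approach mentioned above might look roughly like the sketch below. It is only an illustration, not part of Hivemall: it assumes the same mytable and column names as the query above, that the hyphenated names are the real column names (hence the backticks), and a Hive version with windowing support. The alias rn and the subquery name ranked are made up for this example.

    ```sql
    -- Sketch of the window-function alternative: number the rows inside
    -- each page by clicks, then keep the first 5 per page.
    select
      `page-id`,
      `user-id`,
      clicks
    from (
      select
        `page-id`,
        `user-id`,
        clicks,
        -- rn is a hypothetical alias: 1 = most clicks within the page
        row_number() over (partition by `page-id` order by clicks desc) as rn
      from
        mytable
    ) ranked
    where rn <= 5
    order by `page-id` asc, clicks desc;
    ```

    Here every row of a page is numbered before the filter is applied, which is the full intermediate ranking that each_top_k avoids. Swapping row_number() for rank() would keep ties, at the cost of possibly returning more than 5 rows per page.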
