Hive - multiple (average) count distincts over layered groups

Submitted by 穿精又带淫゛_ on 2019-12-24 10:45:59

Question


Given the following source data (say the table name is user_activity):

+---------+-----------+------------+
| user_id | user_type | some_date  |
+---------+-----------+------------+
| 1       | a         | 2018-01-01 |
| 1       | a         | 2018-01-02 |
| 2       | a         | 2018-01-01 |
| 3       | a         | 2018-01-01 |
| 4       | b         | 2018-01-01 |
| 4       | b         | 2018-01-02 |
| 5       | b         | 2018-01-02 |
+---------+-----------+------------+

I'd like to get the following result:

+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a         | 3          | 2                   |
| b         | 2          | 1.5                 |
+-----------+------------+---------------------+

using a single query without multiple subqueries on the same table.


Using multiple queries, I can get:

  • user_count:

    select
      user_type,
      count(distinct user_id) as user_count
    from user_activity
    group by user_type
    
  • For average_daily_users:

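    select
      user_type,
      avg(distinct_users) as average_daily_users
    from (
      select
        user_type,  -- needed so the outer query can group by it
        count(distinct user_id) as distinct_users
      from user_activity
      group by user_type, some_date
    ) daily_counts  -- Hive requires an alias on a subquery in FROM
    group by user_type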
    

But I can't seem to write a single query that produces both columns in one go. I am concerned about the performance cost of running multiple subqueries against the same table (each one would have to scan the table again, right?). The data source is rather large, and I would like to minimize running time.

NOTE: The question is titled Hive because that is what I'm working with, but I think it is a generic enough SQL problem, so I'm not ruling out answers written for other SQL dialects.

NOTE2: This question shares details with my other question on partition by columns in window functions (for computing the average daily users column).


Answer 1:


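This should do what you want. Within each user_type, the number of distinct (some_date, user_id) pairs equals the sum of the daily distinct user counts, so dividing it by the number of distinct dates gives the daily average: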

select ua.user_type,
       count(distinct ua.user_id) as user_count,
       -- distinct (date, user) pairs divided by distinct dates
       count(distinct ua.some_date || ':' || ua.user_id) / count(distinct ua.some_date) as average_daily_users
from user_activity ua
group by ua.user_type;
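
As a rough sanity check, the single-pass query can be compared against the nested-subquery version from the question. The sketch below is an assumption-laden stand-in, not a Hive run: it uses Python's built-in sqlite3 module instead of Hive, the helper names (load, single_pass, nested_average, random_rows) are made up for illustration, and the `1.0 *` factor is there only because SQLite truncates integer division, whereas Hive's `/` returns a double.

```python
import random
import sqlite3

# Single-pass query from the answer, adapted for SQLite (assumption: SQLite
# stands in for Hive here; "1.0 *" avoids SQLite's integer division).
SINGLE_PASS = """
select user_type,
       count(distinct user_id) as user_count,
       1.0 * count(distinct some_date || ':' || user_id)
           / count(distinct some_date) as average_daily_users
from user_activity
group by user_type
order by user_type
"""

# Nested-subquery version from the question, with user_type selected in the
# inner query so the outer query can group by it.
NESTED = """
select user_type, avg(distinct_users) as average_daily_users
from (
  select user_type, count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
) daily_counts
group by user_type
order by user_type
"""

# Sample rows copied from the question's user_activity table.
SAMPLE_ROWS = [
    (1, "a", "2018-01-01"), (1, "a", "2018-01-02"), (2, "a", "2018-01-01"),
    (3, "a", "2018-01-01"), (4, "b", "2018-01-01"), (4, "b", "2018-01-02"),
    (5, "b", "2018-01-02"),
]


def load(rows):
    """Create an in-memory user_activity table holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_activity "
        "(user_id integer, user_type text, some_date text)"
    )
    conn.executemany("insert into user_activity values (?, ?, ?)", rows)
    return conn


def single_pass(rows):
    """Return (user_type, user_count, average_daily_users) per user type."""
    return load(rows).execute(SINGLE_PASS).fetchall()


def nested_average(rows):
    """Return (user_type, average_daily_users) via the nested subquery."""
    return load(rows).execute(NESTED).fetchall()


def random_rows(seed, n=200):
    """Generate n pseudo-random activity rows; duplicate rows are possible."""
    rng = random.Random(seed)
    return [
        (rng.randint(1, 20), rng.choice("abc"), f"2018-01-{rng.randint(1, 9):02d}")
        for _ in range(n)
    ]


if __name__ == "__main__":
    print(single_pass(SAMPLE_ROWS))     # [('a', 3, 2.0), ('b', 2, 1.5)]
    print(nested_average(SAMPLE_ROWS))  # [('a', 2.0), ('b', 1.5)]
```

On the sample rows both queries agree with the expected result table in the question. Feeding random_rows(seed) to both functions is a cheap way to gain confidence that the two forms agree on less tidy data, including repeated rows.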


Source: https://stackoverflow.com/questions/51959175/hive-multiple-average-count-distincts-over-layered-groups
