Hive - multiple (average) count distincts over layered groups

Question

Given the following source data (say the table name is user_activity):

+---------+-----------+------------+
| user_id | user_type | some_date  |
+---------+-----------+------------+
| 1       | a         | 2018-01-01 |
| 1       | a         | 2018-01-02 |
| 2       | a         | 2018-01-01 |
| 3       | a         | 2018-01-01 |
| 4       | b         | 2018-01-01 |
| 4       | b         | 2018-01-02 |
| 5       | b         | 2018-01-02 |
+---------+-----------+------------+

I'd like to get the following result:

+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a         | 3          | 2                   |
| b         | 2          | 1.5                 |
+-----------+------------+---------------------+

using a single query without multiple subqueries on the same table.

Using multiple queries, I can get:

user_count:

select
  user_type,
  count(distinct user_id) as user_count
from user_activity
group by user_type

For average_daily_users:

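select
  user_type,
  avg(distinct_users) as average_daily_users
from (
  select
    user_type,  -- must be selected here so the outer query can group by it
    count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
) daily  -- Hive requires an alias on a subquery in the from clause
group by user_type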

But I can't seem to write a query that does what I want in one go. I am concerned about the performance impact of multiple subqueries on the same table (it will have to scan through the table twice, right?). I have a rather large data source and would like to minimize running time.

NOTE: The question is titled Hive because that is what I'm working with, but I think it is a generic enough SQL problem so I'm not ruling out answers in other languages.

NOTE2: This question shares details with my other question on partition by columns in window functions (for computing the average daily users column).

Answer 1:

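This should do what you want in a single pass. It counts the distinct (date, user) pairs per user_type and divides by the number of distinct dates for that type, which is the same as averaging the daily distinct-user counts. The || concatenation operator is available as of Hive 2.2.0; on older versions, concat(ua.some_date, ':', ua.user_id) should work instead.

select ua.user_type,
       count(distinct ua.user_id) as user_count,
       count(distinct ua.some_date || ':' || ua.user_id) / count(distinct ua.some_date) as average_daily_users
from user_activity ua
group by ua.user_type;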
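
As a sanity check, the single-pass query can be run against the sample data from the question and compared with the two-level version. The sketch below is not Hive: it uses Python's built-in sqlite3 module as a stand-in, on the assumption that both engines treat count(distinct ...) and || the same way for this data. The one deliberate difference is the `1.0 *` factor, which is there only because SQLite divides integers as integers, whereas Hive's `/` returns a double.

```python
import sqlite3

# Sample rows from the question: (user_id, user_type, some_date)
ROWS = [
    (1, "a", "2018-01-01"),
    (1, "a", "2018-01-02"),
    (2, "a", "2018-01-01"),
    (3, "a", "2018-01-01"),
    (4, "b", "2018-01-01"),
    (4, "b", "2018-01-02"),
    (5, "b", "2018-01-02"),
]

# Single-pass query. "1.0 *" is a SQLite-only adjustment (integer division);
# it is an assumption of this sketch, not part of the Hive query.
SINGLE_PASS = """
select ua.user_type,
       count(distinct ua.user_id) as user_count,
       1.0 * count(distinct ua.some_date || ':' || ua.user_id)
           / count(distinct ua.some_date) as average_daily_users
from user_activity ua
group by ua.user_type
order by ua.user_type
"""

# Two-level version (daily distinct counts, then their average) for comparison.
NESTED = """
select user_type, avg(distinct_users) as average_daily_users
from (
  select user_type, count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
) daily
group by user_type
order by user_type
"""


def run_checks():
    """Load the sample rows into an in-memory table and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_activity (user_id integer, user_type text, some_date text)"
    )
    conn.executemany("insert into user_activity values (?, ?, ?)", ROWS)
    single = conn.execute(SINGLE_PASS).fetchall()
    nested = conn.execute(NESTED).fetchall()
    conn.close()
    return single, nested


single, nested = run_checks()
print(single)  # [('a', 3, 2.0), ('b', 2, 1.5)]
print(nested)  # [('a', 2.0), ('b', 1.5)]
```

Both queries give the same averages on this data. Note that the average is taken over the dates on which a given user_type has at least one row, which is also what the two-level query does.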

Source: https://stackoverflow.com/questions/51959175/hive-multiple-average-count-distincts-over-layered-groups

Tags

sql, Hive, group-by, hiveql