Hive - over (partition by …) with a column not in group by

Question

Is it possible to do something like:

select
  avg(count(distinct user_id))
    over (partition by some_date) as average_users_per_day
from user_activity
group by user_type

(Note that the partition by column, some_date, is not among the group by columns.)

What I'm after is the average number of distinct users per day, for each user type.

I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.

Notes:

From reading this answer, my understanding (correct me if I'm wrong) is that the following query:

select
  avg(count(distinct a)) over (partition by b)
from foo
group by b

can be expanded equivalently to:

select
  avg(count_distinct_a)
from (
  select
    b,
    count(distinct a) as count_distinct_a
  from foo
  group by b
) t
group by b
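
A minimal sketch of that reading, in plain Python rather than Hive, assuming the window function is evaluated over the rows that group by has already produced (the usual SQL evaluation order). The contents of foo are hypothetical and only serve to show that each partition by b ends up holding exactly one grouped row, so the average is just that row's count.

```python
from collections import defaultdict

# Hypothetical rows of foo as (a, b) pairs; not taken from the question.
foo = [(1, "x"), (2, "x"), (2, "x"), (3, "y")]

# Step 1: group by b with count(distinct a) -> one row per value of b.
distinct_a = defaultdict(set)
for a, b in foo:
    distinct_a[b].add(a)
grouped = [(b, len(values)) for b, values in sorted(distinct_a.items())]

# Step 2: the window runs over the grouped rows, partitioned by b.
partitions = defaultdict(list)
for b, count_distinct_a in grouped:
    partitions[b].append(count_distinct_a)

# Each partition holds a single count, so avg(...) over it is that count.
windowed = [(b, sum(partitions[b]) / len(partitions[b])) for b, _ in grouped]

print(grouped)
print(windowed)
```

Under that assumption, partitioning by the same column that was grouped on adds nothing, which is why the subquery form with the same group by on both levels gives the same numbers.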

And from that, I can tweak it a bit to achieve what I want:

select
  user_type,
  avg(count_distinct_user_id) as average_users_per_day
from (
  select
    user_type,
    count(distinct user_id) as count_distinct_user_id
  from user_activity
  group by user_type, some_date
) t
group by user_type

(Note that the inner group by user_type, some_date differs from the outer group by user_type.)

I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.

EDIT: Some sample data and desired output.

Source table:

+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1       | a         | 1         |
| 1       | a         | 2         |
| 2       | a         | 1         |
| 3       | a         | 2         |
| 3       | a         | 2         |
| 4       | b         | 2         |
| 5       | b         | 1         |
| 5       | b         | 3         |
| 5       | b         | 3         |
| 6       | c         | 1         |
| 7       | c         | 1         |
| 8       | c         | 4         |
| 9       | c         | 2         |
| 9       | c         | 3         |
| 9       | c         | 4         |
+---------+-----------+-----------+

Sample intermediate table (for reasoning with):

+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a         | 1         | 2                   |
| a         | 2         | 2                   |
| b         | 1         | 1                   |
| b         | 2         | 1                   |
| b         | 3         | 1                   |
| c         | 1         | 2                   |
| c         | 2         | 1                   |
| c         | 3         | 1                   |
| c         | 4         | 2                   |
+-----------+-----------+---------------------+

The SQL for this table is: select user_type, some_date, count(distinct user_id) as distinct_user_count from user_activity group by user_type, some_date.

Desired result:

+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a         | 2                   |
| b         | 1                   |
| c         | 1.5                 |
+-----------+---------------------+
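
As a sanity check, the subquery approach can be run against the sample data above. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive. That is an assumption made purely for convenience: SQLite is not Hive, but the queries use only plain group by and count(distinct ...). The subquery alias daily is a name introduced here for illustration.

```python
import sqlite3

# Sample rows from the source table: (user_id, user_type, some_date).
rows = [
    (1, "a", 1), (1, "a", 2), (2, "a", 1), (3, "a", 2), (3, "a", 2),
    (4, "b", 2), (5, "b", 1), (5, "b", 3), (5, "b", 3),
    (6, "c", 1), (7, "c", 1), (8, "c", 4), (9, "c", 2), (9, "c", 3), (9, "c", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table user_activity (user_id integer, user_type text, some_date integer)"
)
conn.executemany("insert into user_activity values (?, ?, ?)", rows)

# Intermediate table: distinct users per (user_type, some_date).
intermediate = conn.execute("""
    select user_type, some_date, count(distinct user_id) as distinct_user_count
    from user_activity
    group by user_type, some_date
    order by user_type, some_date
""").fetchall()

# Final result: average of the daily distinct counts per user_type.
averages = conn.execute("""
    select user_type, avg(distinct_user_count) as average_daily_users
    from (
      select user_type, some_date, count(distinct user_id) as distinct_user_count
      from user_activity
      group by user_type, some_date
    ) as daily
    group by user_type
    order by user_type
""").fetchall()

for user_type, average_daily_users in averages:
    print(user_type, average_daily_users)
```

The intermediate rows match the sample intermediate table, and the averages come out as 2, 1 and 1.5 for user types a, b and c.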

Source: https://stackoverflow.com/questions/51956608/hive-over-partition-by-with-a-column-not-in-group-by

Tags

hive

aggregate-functions

hiveql

window-functions