Question
I'm trying to get a cumulative count of distinct objects in Redshift over a time series. The straightforward thing would be to use COUNT(DISTINCT myfield) OVER (ORDER BY timefield DESC ROWS UNBOUNDED PRECEDING), but Redshift gives a "Window definition is not supported" error.
For example, the query below tries to find the cumulative number of distinct users for every week, from the first week to the present, but it fails with the error above.
SELECT user_time.weeks_ago,
COUNT(distinct user_time.user_id) OVER
(ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as count
FROM (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago,
ev.user_id as user_id
FROM events as ev
WHERE ev.action='some_user_action') as user_time
The goal is to build a cumulative time series of unique users who have performed an action. Any ideas on how to do this?
Answer 1:
Figured out the answer. The trick is a set of nested subqueries: the inner one finds the week of each user's first action, the middle one counts first-time users per week, and the outer query computes the running sum over the time series:
(SELECT engaged_per_week.week as week,
SUM(engaged_per_week.total) over (order by engaged_per_week.week DESC ROWS UNBOUNDED PRECEDING) as total
FROM
-- COUNT OF FIRST TIME ENGAGEMENTS PER WEEK
(SELECT engaged.first_week AS week,
count(engaged.first_week) AS total
FROM
-- WEEK OF FIRST ENGAGEMENT FOR EACH USER
(SELECT MAX(FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7)) as first_week
FROM events ev
WHERE ev.name='some_user_action'
GROUP BY ev.user_id) AS engaged
GROUP BY week) as engaged_per_week
ORDER BY week DESC) as cumulative_engaged
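The three-level structure can be checked on a small data set. The following is a minimal sketch, not the Redshift query itself: it uses Python's built-in sqlite3 module as a stand-in (assuming SQLite 3.25 or later for window functions), a hypothetical events table with a precomputed week column that counts up from the first week, and made-up sample rows.

```python
import sqlite3

# Hypothetical sample data: (user_id, week), where week 0 is the first week.
rows = [(1, 0), (2, 0), (1, 1), (3, 1), (2, 2), (4, 3), (4, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, week INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

query = """
SELECT week,
       SUM(total) OVER (ORDER BY week ROWS UNBOUNDED PRECEDING) AS cumulative_users
FROM (
    -- number of users whose first action falls in each week
    SELECT first_week AS week, COUNT(*) AS total
    FROM (
        -- week of the first action for each user
        SELECT user_id, MIN(week) AS first_week
        FROM events
        GROUP BY user_id
    ) AS engaged
    GROUP BY first_week
) AS engaged_per_week
ORDER BY week
"""

cumulative = conn.execute(query).fetchall()
for week, users in cumulative:
    print(week, users)
```

With this sample data the rows are (0, 2), (1, 3) and (3, 4). Week 2 is missing because no user acted for the first time in that week, so a gap-free series would need a join against a list of all weeks.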
Answer 2:
Here's how to apply DENSE_RANK to an example cited here. I've added another 'table' row for '2015-01-01' to demonstrate how this counts distinct values.
The author of the example is wrong about the solution, but I'm just using his example.
create table public.test
(
"date" date,
item varchar(8),
measure int
);
insert into public.test
values
('2015-01-01', 'table', 12),
('2015-01-01', 'table', 120),
('2015-01-01', 'chair', 51),
('2015-01-01', 'lamp', 8),
('2015-01-02', 'table', 17),
('2015-01-02', 'chair', 72),
('2015-01-02', 'lamp', 23),
('2015-01-02', 'bed', 1),
('2015-01-02', 'dresser', 2),
('2015-01-03', 'bed', 1);
WITH x AS (
SELECT
*,
DENSE_RANK()
OVER (PARTITION BY "date"
ORDER BY item) AS dense_rank
FROM public.test
)
SELECT
"date",
item,
measure,
max(dense_rank)
OVER (PARTITION BY "date")
FROM x
ORDER BY 1;
The CTE gets you the dense rank of each item per date, then the main query gets you the max of that dense rank per date, i.e., the distinct count of items per date.
You need DENSE_RANK rather than plain RANK to count distinct values, because RANK leaves gaps after ties, so its maximum can exceed the distinct count.
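To check that the maximum dense rank per date matches a plain distinct count, here is a minimal sketch that loads the same ten rows into an in-memory SQLite database through Python's sqlite3 module (assuming SQLite 3.25 or later). SQLite is only a stand-in for Redshift here, and the alias item_rank and the DISTINCT in the outer query are choices made for this sketch.

```python
import sqlite3

rows = [
    ("2015-01-01", "table", 12), ("2015-01-01", "table", 120),
    ("2015-01-01", "chair", 51), ("2015-01-01", "lamp", 8),
    ("2015-01-02", "table", 17), ("2015-01-02", "chair", 72),
    ("2015-01-02", "lamp", 23), ("2015-01-02", "bed", 1),
    ("2015-01-02", "dresser", 2), ("2015-01-03", "bed", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test ("date" TEXT, item TEXT, measure INTEGER)')
conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

# Distinct count per date via the maximum dense rank.
windowed = conn.execute("""
    WITH x AS (
        SELECT "date", item,
               DENSE_RANK() OVER (PARTITION BY "date" ORDER BY item) AS item_rank
        FROM test
    )
    SELECT DISTINCT "date", MAX(item_rank) OVER (PARTITION BY "date") AS distinct_items
    FROM x
    ORDER BY "date"
""").fetchall()

# Reference result using an ordinary grouped COUNT(DISTINCT ...).
grouped = conn.execute("""
    SELECT "date", COUNT(DISTINCT item)
    FROM test
    GROUP BY "date"
    ORDER BY "date"
""").fetchall()

print(windowed)
print(grouped)
```

Both queries return 3, 5 and 1 distinct items for the three dates, so the duplicated 'table' row is counted only once.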
Answer 3:
You should use DENSE_RANK instead of COUNT(DISTINCT ...):
DENSE_RANK() OVER(PARTITION BY weeks_ago ORDER BY user_time.user_id)
Answer 4:
It seems to work when you wrap COUNT(DISTINCT ...) inside SUM, like this:
SELECT user_time.weeks_ago,
SUM(COUNT(distinct user_time.user_id)) OVER
(ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as test
FROM (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago
,ev.user_id as user_id
FROM events as ev
WHERE ev.action='some_user_action'
) user_time
GROUP BY user_time.weeks_ago
Answer 5:
I faced the same issue and solved it with DENSE_RANK() and MAX() OVER (PARTITION BY ...), as in the code below. I hope it helps anyone still struggling with this:
-- In Netezza (NZ)
select
id,NAME,count(distinct name) OVER (
PARTITION BY id)
from
edw.admin.test;
/*
create table edw.admin.test
as
(
select 1 as id,'Anne' as name,500.0 as amt,'iv' as IID
union ALL
select 1,'Jeni',550.0,'is'
union ALL
select 1,'Arna',250.0,'is'
union ALL
select 2,'Raj',290.0,'is'
union ALL
select 1,'Anne',350.0,'ir'
union ALL
select 1,NULL,350.0,'ir'
union ALL
select 3,NULL,350.0,'ir'
union ALL
select 3,NULL,350.0,'ir');
Output in NZ:
-------------------------
ID NAME COUNT
1 NULL 3
1 Anne 3
1 Anne 3
1 Arna 3
1 Jeni 3
2 Raj 1
3 NULL 0
3 NULL 0
*/
-- In AWS Redshift (RS)
select id, name, max(DENSE_COUNT) over(partition by id)
from(
select
id,name,CASE WHEN name IS NULL THEN 0 ELSE DENSE_RANK() OVER (
PARTITION BY id
order by name) END AS DENSE_COUNT
from
(
select 1 as id,'Anne' as name,500.0 as amt,'iv' as IID
union ALL
select 1,'Jeni',550.0,'is'
union ALL
select 1,'Arna',250.0,'is'
union ALL
select 2,'Raj',290.0,'is'
union ALL
select 1,'Anne',350.0,'ir'
union ALL
select 1,NULL,350.0,'ir'
union ALL
select 3,NULL,350.0,'ir'
union ALL
select 3,NULL,350.0,'ir') AS src) AS ranked;
/*
Output in RS:
-------------------------
id name max
1 Anne 3
1 Anne 3
1 Arna 3
1 Jeni 3
1 NULL 3
2 Raj 1
3 NULL 0
3 NULL 0
*/
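The CASE guard relies on NULL names not disturbing the ranks of real names. Redshift sorts NULLs last in ascending order by default, so they receive the highest rank, which the CASE then replaces with 0. The sketch below reproduces the idea with Python's sqlite3 module (assuming SQLite 3.25 or later). SQLite sorts NULLs first, so the window orders by `name IS NULL, name` to mimic Redshift's ordering. That ORDER BY, the table name people, and the DISTINCT in the outer query are assumptions made for this sketch.

```python
import sqlite3

# Same (id, name) pairs as in the answer; None stands for NULL.
rows = [
    (1, "Anne"), (1, "Jeni"), (1, "Arna"), (2, "Raj"),
    (1, "Anne"), (1, None), (3, None), (3, None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Max dense rank per id, with NULL names forced to 0.
windowed = conn.execute("""
    SELECT DISTINCT id, MAX(dense_count) OVER (PARTITION BY id) AS distinct_names
    FROM (
        SELECT id,
               CASE WHEN name IS NULL THEN 0
                    ELSE DENSE_RANK() OVER (PARTITION BY id
                                            ORDER BY name IS NULL, name)
               END AS dense_count
        FROM people
    ) AS ranked
    ORDER BY id
""").fetchall()

# Reference result: COUNT(DISTINCT ...) ignores NULLs.
grouped = conn.execute("""
    SELECT id, COUNT(DISTINCT name)
    FROM people
    GROUP BY id
    ORDER BY id
""").fetchall()

print(windowed)
print(grouped)
```

Both queries give 3 for id 1, 1 for id 2 and 0 for id 3, matching the output listed above.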
Source: https://stackoverflow.com/questions/20210902/trying-to-count-cumulative-distinct-entities-using-redshift-sql