How to calculate median in AWS Redshift?

Anonymous (unverified), submitted 2019-12-03 08:41:19

Question:

Most databases have a built-in function for calculating the median, but I don't see anything for median in Amazon Redshift.

You could calculate the median using a combination of the nth_value() and count() analytic functions, but that seems janky. I would be very surprised if an analytics db didn't have a built-in method for computing the median, so I'm assuming I'm missing something.

http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_NTH_WF.html
http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html

Answer 1:

As of 2014-10-17, Redshift supports the MEDIAN window function:

    # select min(median) from (select median(num) over () from temp);
     min
    -----
     4.0


Answer 2:

Try the NTILE function.

Rank the data in descending order, split it into two groups, and pick the minimum value from the first group. With an odd number of values, the first ntile holds one more value than the second, so its minimum is the exact median. With an even number of values, it is the upper of the two middle values. This approximation should work very well for large datasets.

    create table temp (num smallint);
    insert into temp values (1),(5),(10),(2),(4);

    select num, ntile(2) over(order by num desc) from temp;

     num | ntile
    -----+-------
      10 |     1
       5 |     1
       4 |     1
       2 |     2
       1 |     2

    select min(num) as median
    from (select num, ntile(2) over(order by num desc) from temp)
    where ntile = 1;

     median
    --------
          4
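To see where this approximation is exact and where it is not, here is a minimal Python sketch of the same idea outside the database. The helper `ntile_median` is hypothetical and only imitates `ntile(2) over (order by num desc)`, assuming the first bucket receives the extra row when the row count is odd.

```python
def ntile_median(values):
    """Imitate: min(num) where ntile(2) over (order by num desc) = 1."""
    ordered = sorted(values, reverse=True)   # order by num desc
    first_size = (len(ordered) + 1) // 2     # assumption: first bucket takes the extra row
    first_bucket = ordered[:first_size]
    return min(first_bucket)


odd = [1, 5, 10, 2, 4]
even = odd + [9]

print(ntile_median(odd))    # 4, the exact median
print(ntile_median(even))   # 5, the upper middle value (the true median is 4.5)
```

For an even row count the result is off by at most the gap between the two middle values, which is why the approach is usually close enough on large datasets.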


Answer 3:

I had difficulty with this also, but got some help from Amazon. Since the 2014-06-30 version of Redshift, you can do this with the PERCENTILE_CONT or PERCENTILE_DISC window functions.

They're slightly weird to use, as they will tack the median (or whatever percentile you choose) onto every row. You put that in a subquery and then take the MIN (or whatever) of the median column.

    # select count(num), min(median) as median
      from (select num,
                   percentile_cont (0.5) within group (order by num) over () as median
            from temp);
     count | median
    -------+--------
         5 |    4.0

(The reason it's complicated is that window functions can also do their own mini-group-by and ordering to give you the median of many groups all at once, and other tricks.)

In the case of an even number of values, CONT(inuous) will interpolate between the two middle values, whereas DISC(rete) will pick one of them.
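The difference is easiest to see on a small even-sized sample. The following is a rough Python sketch, not Redshift's implementation: `percentile_cont` and `percentile_disc` here are hypothetical helpers written after the usual SQL definitions (linear interpolation for CONT, first value whose cumulative share of rows reaches the percentile for DISC). Which of the two middle values DISC returns in this sketch is an assumption based on that definition.

```python
import math


def percentile_cont(values, p):
    """Interpolate linearly between the two nearest ordered values."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)              # zero-based fractional position
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_disc(values, p):
    """Return the first ordered value whose cumulative share of rows reaches p."""
    ordered = sorted(values)
    n = len(ordered)
    for i, value in enumerate(ordered, start=1):
        if i / n >= p:
            return value
    return ordered[-1]


even = [1, 5, 10, 2, 4, 9]
print(percentile_cont(even, 0.5))   # 4.5, halfway between 4 and 5
print(percentile_disc(even, 0.5))   # 4, an actual value from the data
```

With an odd number of values, such as the five rows in `temp`, both helpers return the same middle value.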



Answer 4:

I typically use the NTILE function to split the data into two groups if I’m looking for an answer that’s close enough. However, if I want the exact median (e.g. the midpoint of an even set of rows), I use a technique suggested on the AWS Redshift Discussion Forum.

This technique numbers the rows in both ascending and descending order. If there is an odd number of rows, it returns the average of the single middle row (the one where row_num_asc = row_num_desc), which is simply that row's value.

    CREATE TABLE temp (num SMALLINT);
    INSERT INTO temp VALUES (1),(5),(10),(2),(4);

    SELECT
      AVG(num) AS median
    FROM
      (SELECT
         num,
         SUM(1) OVER (ORDER BY num ASC) AS row_num_asc,
         SUM(1) OVER (ORDER BY num DESC) AS row_num_desc
       FROM
         temp) AS ordered
    WHERE
      row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);

     median
    --------
          4

If there is an even number of rows, it returns the average of the two middle rows.

    INSERT INTO temp VALUES (9);

    SELECT
      AVG(num) AS median
    FROM
      (SELECT
         num,
         SUM(1) OVER (ORDER BY num ASC) AS row_num_asc,
         SUM(1) OVER (ORDER BY num DESC) AS row_num_desc
       FROM
         temp) AS ordered
    WHERE
      row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);

     median
    --------
        4.5
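The row-matching logic can be checked by hand with a short Python sketch. The function `median_by_row_numbers` is a hypothetical stand-in for the query: it assigns each sorted value a position from the front and from the back, keeps the rows whose two positions differ by at most one, and averages them. It assumes plain positional row numbers, so it does not reproduce how the SQL running sums treat duplicate values.

```python
def median_by_row_numbers(values):
    """Average the rows whose ascending and descending row numbers meet in the middle."""
    ordered = sorted(values)
    n = len(ordered)
    middle = []
    for i, value in enumerate(ordered):
        row_num_asc = i + 1      # position counted from the smallest value
        row_num_desc = n - i     # position counted from the largest value
        # Same filter as: row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1)
        if abs(row_num_asc - row_num_desc) <= 1:
            middle.append(value)
    return sum(middle) / len(middle)


print(median_by_row_numbers([1, 5, 10, 2, 4]))      # 4.0, one middle row
print(median_by_row_numbers([1, 5, 10, 2, 4, 9]))   # 4.5, average of 4 and 5
```

An odd count leaves exactly one row in `middle` and an even count leaves two, which matches the two cases described above.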

