How can I return the numerical boxplot data of all results using 1 mySQL query?

纵饮孤独 posted on 2019-12-07 01:04:25

Question


[tbl_votes]
- id <!-- unique id of the vote -->
- item_id <!-- vote belongs to item <id> -->
- vote <!-- number 1-10 -->

Of course we can fix this by getting:

  • the smallest observation (so)
  • the lower quartile (lq)
  • the median (me)
  • the upper quartile (uq)
  • and the largest observation (lo)

...one by one using multiple queries, but I am wondering whether it can be done with a single query.

In Oracle I can use COUNT OVER and RATIO_TO_REPORT, but these are not supported in mySQL.

For those who don't know what a boxplot is: http://en.wikipedia.org/wiki/Box_plot

Any help would be appreciated.


Answer 1:


Here is an example that calculates the quartiles of the e256 values within each e32 group. An index on (e32, e256) is a must in this case:

SELECT
  @group:=IF(e32=@group, e32, GREATEST(@index:=-1, e32)) as e32_,
  MIN(e256) as so,
  MAX(IF(lq_i=(@index:=@index+1), e256, NULL)) as lq,
  MAX(IF(me_i=@index, e256, NULL)) as me,
  MAX(IF(uq_i=@index, e256, NULL)) as uq,
  MAX(e256) as lo
FROM (SELECT @index:=NULL, @group:=NULL) as init, test t
JOIN (
  SELECT e32,
    COUNT(*) as cnt,
    (COUNT(*) div 4) as lq_i,    -- lq value index within the group
    (COUNT(*) div 2) as me_i,    -- me value index within the group
    (COUNT(*) * 3 div 4) as uq_i -- uq value index within the group
  FROM test
  GROUP BY e32
) as cnts
USING (e32)
GROUP BY e32;

If grouping is not needed, the query becomes slightly simpler.

P.S. test is my playground table of random values where e32 is the result of Python's int(random.expovariate(1.0) * 32), etc.
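
For anyone who finds the user-variable trick hard to follow, here is a rough Python sketch of what the query above appears to do: scan the rows in (e32, e256) order, reset a counter at each new group, and keep the values whose position matches the precomputed indices. The function name `boxplot_by_group` and the sample rows are made up for illustration. This is not part of the original answer, just a way to cross-check the SQL's output on small data.

```python
from collections import Counter


def boxplot_by_group(rows):
    """rows: iterable of (group, value) pairs -> {group: (so, lq, me, uq, lo)}."""
    # Sorting stands in for reading the table through an index on (e32, e256).
    rows = sorted(rows)
    # Equivalent of the `cnts` subquery: number of rows per group.
    counts = Counter(group for group, _ in rows)

    result = {}
    current_group, index = None, None
    for group, value in rows:
        if group != current_group:
            # New group: reset the running position, like @index := -1.
            current_group, index = group, -1
            result[group] = {"so": value, "lq": None, "me": None, "uq": None}
        index += 1  # like @index := @index + 1
        cnt = counts[group]
        stats = result[group]
        if index == cnt // 4:        # lq_i
            stats["lq"] = value
        if index == cnt // 2:        # me_i
            stats["me"] = value
        if index == cnt * 3 // 4:    # uq_i
            stats["uq"] = value
        stats["lo"] = value          # last value seen in sorted order is the max

    return {
        g: (s["so"], s["lq"], s["me"], s["uq"], s["lo"]) for g, s in result.items()
    }


# Hypothetical sample: group 0 holds the values 1..8, group 1 a single value.
sample = [(0, v) for v in [5, 3, 1, 4, 2, 7, 6, 8]] + [(1, 10)]
print(boxplot_by_group(sample))
```

Note that this uses the same simple positional rule as the query (integer division of the count), not an interpolated quartile definition.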




Answer 2:


I've found a solution in PostgreSQL using PL/Python.

However, I leave the question open in case someone else comes up with a solution in mySQL.

CREATE TYPE boxplot_values AS (
  min       numeric,
  q1        numeric,
  median    numeric,
  q3        numeric,
  max       numeric
);

CREATE OR REPLACE FUNCTION _final_boxplot(strarr numeric[])
   RETURNS boxplot_values AS
$$
    x = strarr.replace("{","[").replace("}","]")
    a = eval(str(x))

    a.sort()
    i = len(a)
    return ( a[0], a[i/4], a[i/2], a[i*3/4], a[-1] )
$$
LANGUAGE 'plpythonu' IMMUTABLE;

CREATE AGGREGATE boxplot(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_boxplot,
  INITCOND='{}'
);
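
One caveat about the function body above: it assumes the array arrives as a string like "{1,2,3}" and relies on Python 2 integer division. As far as I know, more recent PostgreSQL versions pass SQL arrays into PL/Python as Python lists, so it is worth checking the documentation for your version. Below is a minimal sketch of the same selection logic in plain Python, assuming a list input. The name `final_boxplot` and the handling of NULLs and empty input are my own assumptions, not part of the original function.

```python
def final_boxplot(values):
    """Return (min, q1, median, q3, max) using the same positional rule.

    `values` is assumed to be a list of numbers; None entries (SQL NULLs)
    are skipped. Returns None for an empty input instead of raising.
    """
    a = sorted(v for v in values if v is not None)
    if not a:
        return None
    n = len(a)
    # `//` is floor division in both Python 2 and Python 3.
    return (a[0], a[n // 4], a[n // 2], a[n * 3 // 4], a[-1])


print(final_boxplot([5, 1, None, 3, 2, 4]))
```

This avoids eval() on the array text altogether, which seems safer if the input type is not guaranteed.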

Example:

SELECT customer_id as cid, (boxplot(price)).*
FROM orders
GROUP BY customer_id;

   cid |   min   |   q1    | median  |   q3    |   max
-------+---------+---------+---------+---------+---------
  1001 | 7.40209 | 7.80031 |  7.9551 | 7.99059 | 7.99903
  1002 | 3.44229 | 4.38172 | 4.72498 | 5.25214 | 5.98736

Source: http://www.christian-rossow.de/articles/PostgreSQL_boxplot_median_quartiles_aggregate_function.php




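Answer 3:


Well, I can do it in two queries. Run the first query to get the positions of the quartiles, then use those positions as the LIMIT offsets in the second query to get the values. The offsets 0, 5 and 8 shown in the second query are example positions; substitute whatever the first query returns for your data.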

mysql> select (select floor(count(*)/4) from customer_data) as first_q, (select floor(count(*)/2) from customer_data) as mid_pos, (select floor(count(*)/4*3) from customer_data) as third_q from customer_data order by measure limit 1;

mysql> select min(measure),(select measure from customer_data order by measure limit 0,1) as firstq, (select measure from customer_data order by measure limit 5,1) as median, (select measure from customer_data order by measure limit 8,1) as last_q, max(measure) from customer_data;
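
To show how the two steps fit together, here is a small sketch that runs them from Python. It uses the standard-library sqlite3 module with an in-memory database purely as a stand-in for MySQL, so the SQL dialect differs slightly: SQLite's integer `/` already truncates, whereas MySQL needs FLOOR or DIV. The function name `boxplot_two_queries` and the sample values 1 to 11 are assumptions for illustration only.

```python
import sqlite3


def boxplot_two_queries(conn):
    """Two-step approach: fetch the positions, then use them as LIMIT offsets."""
    # Step 1: positions of the quartiles (integer division in SQLite).
    first_q, mid_pos, third_q = conn.execute(
        "SELECT COUNT(*) / 4, COUNT(*) / 2, COUNT(*) * 3 / 4 FROM customer_data"
    ).fetchone()

    # Step 2: plug the positions in as offsets via bound parameters.
    return conn.execute(
        """
        SELECT MIN(measure),
               (SELECT measure FROM customer_data ORDER BY measure LIMIT 1 OFFSET ?),
               (SELECT measure FROM customer_data ORDER BY measure LIMIT 1 OFFSET ?),
               (SELECT measure FROM customer_data ORDER BY measure LIMIT 1 OFFSET ?),
               MAX(measure)
        FROM customer_data
        """,
        (first_q, mid_pos, third_q),
    ).fetchone()


# Hypothetical data: the integers 1..11, inserted in reverse order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_data (measure INTEGER)")
conn.executemany(
    "INSERT INTO customer_data VALUES (?)", [(v,) for v in range(11, 0, -1)]
)
summary = boxplot_two_queries(conn)
print(summary)  # (1, 3, 6, 9, 11)
```

With 11 rows the positions come out as 2, 5 and 8, so the quartile values are the 3rd, 6th and 9th smallest measures.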



Source: https://stackoverflow.com/questions/8639073/how-can-i-return-the-numerical-boxplot-data-of-all-results-using-1-mysql-query
