How can I return the numerical boxplot data of all results using a single MySQL query?

╄→гoц情女王★ submitted on 2019-12-05 06:12:57

Here is an example that calculates the quartiles of the e256 values within each e32 group. An index on (e32, e256) is a must in this case, because the query relies on reading the rows in that order:

SELECT
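  -- reset the row counter @index whenever a new e32 group starts
  @group:=IF(e32=@group, e32, GREATEST(@index:=-1, e32)) as e32_,
  MIN(e256) as so,                                     -- minimum (smallest observation)
  MAX(IF(lq_i=(@index:=@index+1), e256, NULL)) as lq,  -- lower quartile; also advances @index per row
  MAX(IF(me_i=@index, e256, NULL)) as me,              -- median
  MAX(IF(uq_i=@index, e256, NULL)) as uq,              -- upper quartile
  MAX(e256) as lo                                      -- maximum (largest observation)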
FROM (SELECT @index:=NULL, @group:=NULL) as init, test t
JOIN (
  SELECT e32,
    COUNT(*) as cnt,
    (COUNT(*) div 4) as lq_i,    -- lq value index within the group
    (COUNT(*) div 2) as me_i,    -- me value index within the group
    (COUNT(*) * 3 div 4) as uq_i -- uq value index within the group
  FROM test
  GROUP BY e32
) as cnts
USING (e32)
GROUP BY e32;

If no grouping is needed, the query becomes slightly simpler.

P.S. test is my playground table of random values where e32 is the result of Python's int(random.expovariate(1.0) * 32), etc.
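
As a cross-check outside the database, the index rule the query relies on (0-based positions cnt div 4, cnt div 2 and cnt * 3 div 4 within each sorted group) can be reproduced in a few lines of Python. This is only a sketch: the helper names `five_numbers`, `make_rows` and `grouped_quartiles`, the fixed seed and the row count are made up for illustration, and e256 is assumed to be generated the same way as e32 with a multiplier of 256.

```python
import random
from collections import defaultdict


def five_numbers(values):
    """Return (min, lq, me, uq, max) using the same 0-based index rule as the query."""
    a = sorted(values)
    n = len(a)
    return (a[0], a[n // 4], a[n // 2], a[n * 3 // 4], a[-1])


def make_rows(count, seed=0):
    """Generate (e32, e256) pairs; the e256 formula is an assumption mirroring e32."""
    rng = random.Random(seed)
    return [
        (int(rng.expovariate(1.0) * 32), int(rng.expovariate(1.0) * 256))
        for _ in range(count)
    ]


def grouped_quartiles(rows):
    """Group the e256 values by e32 and summarize each group."""
    groups = defaultdict(list)
    for e32, e256 in rows:
        groups[e32].append(e256)
    return {e32: five_numbers(vals) for e32, vals in sorted(groups.items())}


if __name__ == "__main__":
    for e32, summary in grouped_quartiles(make_rows(1000)).items():
        print(e32, summary)
```

Comparing a few groups from this output against the SQL result for the same rows is a quick way to confirm that the index is being used in the expected order.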

I've found a solution in PostgreSQL using PL/Python.

However, I'm leaving the question open in case someone else comes up with a solution in MySQL.

CREATE TYPE boxplot_values AS (
  min       numeric,
  q1        numeric,
  median    numeric,
  q3        numeric,
  max       numeric
);

CREATE OR REPLACE FUNCTION _final_boxplot(strarr numeric[])
   RETURNS boxplot_values AS
$$
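    # Depending on the PostgreSQL version, the array arrives either as a
    # string such as "{1,2,3}" or as a Python list; handle both.
    if isinstance(strarr, str):
        a = eval(strarr.replace("{","[").replace("}","]"))
    else:
        a = list(strarr)

    a.sort()
    i = len(a)
    # // keeps the indexes integral on both Python 2 and Python 3
    return ( a[0], a[i//4], a[i//2], a[i*3//4], a[-1] )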
$$
LANGUAGE 'plpythonu' IMMUTABLE;

CREATE AGGREGATE boxplot(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_boxplot,
  INITCOND='{}'
);

Example:

SELECT customer_id as cid, (boxplot(price)).*
FROM orders
GROUP BY customer_id;

   cid |   min   |   q1    | median  |   q3    |   max
-------+---------+---------+---------+---------+---------
  1001 | 7.40209 | 7.80031 |  7.9551 | 7.99059 | 7.99903
  1002 | 3.44229 | 4.38172 | 4.72498 | 5.25214 | 5.98736

Source: http://www.christian-rossow.de/articles/PostgreSQL_boxplot_median_quartiles_aggregate_function.php
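
The logic of the final function can also be tried outside the database. The sketch below assumes the array arrives either as a PostgreSQL text literal such as `{3,1,2}` or as a Python list. The function names `parse_pg_array` and `boxplot_values` are hypothetical, and the literal is parsed by splitting on commas instead of calling `eval`. It only handles flat arrays of plain numbers (no NULLs, no nesting).

```python
from decimal import Decimal


def parse_pg_array(literal):
    """Parse a flat PostgreSQL array literal such as '{3,1,2}' into Decimals."""
    inner = literal.strip()[1:-1]  # drop the surrounding braces
    if not inner:
        return []
    return [Decimal(item) for item in inner.split(",")]


def boxplot_values(arr):
    """Return (min, q1, median, q3, max) with the same index rule as _final_boxplot."""
    values = parse_pg_array(arr) if isinstance(arr, str) else list(arr)
    if not values:
        raise ValueError("boxplot needs at least one value")
    values.sort()
    n = len(values)
    return (values[0], values[n // 4], values[n // 2], values[n * 3 // 4], values[-1])


if __name__ == "__main__":
    print(boxplot_values("{7.5,3.25,9,1,4}"))
    print(boxplot_values([10, 20, 30, 40]))
```

Note that this index rule picks existing elements and does not interpolate between neighbours, so for small groups the quartiles can coincide with the minimum or maximum.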

Well, I can do it in two queries. Run the first query to get the positions of the quartiles, then plug those positions into LIMIT offsets in the second query to read the values.

mysql> select (select floor(count(*)/4) from customer_data) as first_q, (select floor(count(*)/2) from customer_data) as mid_pos, (select floor(count(*)/4*3) from customer_data) as third_q from customer_data order by measure limit 1;

mysql> select min(measure),(select measure from customer_data order by measure limit 2,1) as firstq, (select measure from customer_data order by measure limit 5,1) as median, (select measure from customer_data order by measure limit 8,1) as last_q, max(measure) from customer_data;
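
The two-step idea can be sketched end to end in Python. To stay self-contained, the example below uses the standard-library `sqlite3` module as a stand-in for MySQL (an assumption, not the setup from the answer) with a made-up `customer_data` table of 11 rows. The helper names `boxplot_two_step` and `demo_connection` are hypothetical. The first query yields the row count, from which the positions follow, and the second step reads the values at those offsets.

```python
import sqlite3


def boxplot_two_step(conn, table="customer_data", column="measure"):
    """Return (min, q1, median, q3, max) using a count query, then offset lookups.

    The table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    cur = conn.cursor()
    # Step 1: row count, from which the 0-based quartile positions follow.
    (n,) = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if n == 0:
        return None
    positions = (n // 4, n // 2, n * 3 // 4)

    # Step 2: read the value at each position from the sorted column.
    def value_at(offset):
        row = cur.execute(
            f"SELECT {column} FROM {table} ORDER BY {column} LIMIT 1 OFFSET ?",
            (offset,),
        ).fetchone()
        return row[0]

    lo, hi = cur.execute(f"SELECT MIN({column}), MAX({column}) FROM {table}").fetchone()
    q1, median, q3 = (value_at(p) for p in positions)
    return (lo, q1, median, q3, hi)


def demo_connection():
    """Build an in-memory table with 11 illustrative rows (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_data (measure REAL)")
    data = [12, 3, 7, 9, 1, 15, 4, 8, 10, 6, 2]
    conn.executemany("INSERT INTO customer_data VALUES (?)", [(v,) for v in data])
    return conn


if __name__ == "__main__":
    print(boxplot_two_step(demo_connection()))
```

Computing the offsets in application code avoids hard-coding them into the second statement, at the cost of a few extra round trips to the server.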
