Simple way to calculate median with MySQL

北荒 2020-11-22 04:20

What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time fi

30 answers
  •  深忆病人
    2020-11-22 04:41

    I have a database of about 1 billion rows for which we need to determine the median age. Sorting a billion rows is hard, but if you aggregate by the distinct values present (ages range from 0 to 100), you can sort that much smaller list instead and use a running total of the counts to find any percentile you want, as follows:

    -- Raw values to analyse, one row per person (YEAR_OF_BIRTH in this schema).
    with rawData (count_value) as
    (
      select p.YEAR_OF_BIRTH
      from dbo.PERSON p
    ),
    -- Summary statistics over all rows.
    overallStats (avg_value, stdev_value, min_value, max_value, total) as
    (
      select avg(1.0 * count_value) as avg_value,
        stdev(count_value) as stdev_value,
        min(count_value) as min_value,
        max(count_value) as max_value,
        count(*) as total
      from rawData
    ),
    -- One row per distinct value: its row count and the running total of
    -- counts up to and including that value.
    aggData (count_value, total, accumulated) as
    (
      select count_value,
        count(*) as total,
        sum(count(*)) over (order by count_value rows unbounded preceding) as accumulated
      from rawData
      group by count_value
    )
    -- Each percentile is the smallest value whose running total reaches the
    -- requested fraction of all rows.
    select o.total as count_value,
      o.min_value,
      o.max_value,
      o.avg_value,
      o.stdev_value,
      min(case when d.accumulated >= .50 * o.total then d.count_value else o.max_value end) as median_value,
      min(case when d.accumulated >= .10 * o.total then d.count_value else o.max_value end) as p10_value,
      min(case when d.accumulated >= .25 * o.total then d.count_value else o.max_value end) as p25_value,
      min(case when d.accumulated >= .75 * o.total then d.count_value else o.max_value end) as p75_value,
      min(case when d.accumulated >= .90 * o.total then d.count_value else o.max_value end) as p90_value
    from aggData d
    cross apply overallStats o
    group by o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
    ;
    

    This query depends on your database supporting window functions (including ROWS UNBOUNDED PRECEDING). If yours does not, it is a simple matter to join the aggData CTE with itself and sum all prior totals into the 'accumulated' column, which is used to determine which value contains the specified percentile. The sample above calculates p10, p25, p50 (median), p75, and p90. Note that it is written in SQL Server syntax; on MySQL 8.0 or later, CROSS JOIN and STDDEV_SAMP should be able to stand in for CROSS APPLY and STDEV.
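The self-join fallback mentioned above can be sketched roughly as follows. This is not part of the original query: it is a minimal illustration in Python using the standard-library sqlite3 module so that it can be run anywhere. The `person` table, the `age` column, the `accumulatedData` CTE and the function name are all hypothetical names chosen for the example.

```python
import sqlite3


def percentiles_by_self_join(ages, fractions=(0.10, 0.25, 0.50, 0.75, 0.90)):
    """Return {fraction: value}, building the running total with a self-join
    instead of a window function."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column, standing in for the real data.
    conn.execute("CREATE TABLE person (age INTEGER)")
    conn.executemany("INSERT INTO person (age) VALUES (?)", [(a,) for a in ages])

    sql = """
    WITH aggData AS (
        -- One row per distinct value with its row count.
        SELECT age AS count_value, COUNT(*) AS total
        FROM person
        GROUP BY age
    ),
    accumulatedData AS (
        -- Self-join: for each value, sum the counts of all values <= it.
        SELECT a.count_value, SUM(b.total) AS accumulated
        FROM aggData a
        JOIN aggData b ON b.count_value <= a.count_value
        GROUP BY a.count_value
    )
    -- Smallest value whose running total reaches the requested fraction.
    SELECT MIN(count_value)
    FROM accumulatedData
    WHERE accumulated >= ? * (SELECT COUNT(*) FROM person)
    """
    result = {f: conn.execute(sql, (f,)).fetchone()[0] for f in fractions}
    conn.close()
    return result


if __name__ == "__main__":
    sample = [30, 30, 31, 35, 35, 35, 40, 41, 41, 60]
    print(percentiles_by_self_join(sample))
```

For the ten sample ages the running totals are 2, 3, 6, 7, 9 and 10, so the median comes out as 35, the first value whose running total reaches 5. The self-join touches only the distinct values (about a hundred rows for ages), so its quadratic cost is negligible next to the initial GROUP BY over the full table.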

    -Chris
