Fast algorithm for repeated calculation of percentile?

余生分开走 2020-12-12 17:21

In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

  1. Get value x
  2. Inse
6 Answers
  • 2020-12-12 17:51

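    If your values are non-negative integers with a known upper bound, the following will be very fast:

    Create a large array of integers with one element per possible value, that is, the maximum value of your data plus one. (Bytes will also work if no value occurs more often than a byte can count.) For example, if the maximum value of t is 100,000, create an array

    int[] index = new int[100001]; // about 400 KB; valid indices are 0..100000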
    

    Now iterate over the entire set of values, as

    for (int t : set_of_values) {
      index[t]++; // count how often each value occurs
    }
    
    // You can catch ArrayIndexOutOfBoundsException here in case a value falls outside the expected range.
    

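    Now calculate the percentile as

    int sum = 0, i = 0;
    // 0.9 selects the 90th percentile; use 0.75 for the 75th
    while (sum < 0.9 * set_of_values.length) {
      sum += index[i++];
    }
    
    return i - 1; // i has already been advanced past the bin that reached the threshold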
    

    You can also consider using a TreeMap instead of an array if the values don't conform to these restrictions.
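
As a rough illustration of the TreeMap variant mentioned above, here is a minimal Java sketch that counts occurrences per distinct value and walks the keys in ascending order. The class name `TreeMapPercentile`, the method names, and the nearest-rank convention (stop at the first value whose cumulative count reaches q * n) are assumptions of this sketch, not part of the original answer.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counting approach for values that do not fit a dense array.
class TreeMapPercentile {
    // value -> number of occurrences, kept in ascending key order
    private final TreeMap<Integer, Integer> counts = new TreeMap<>();
    private int total = 0;

    // O(log d) per insert, where d is the number of distinct values.
    void add(int x) {
        counts.merge(x, 1, Integer::sum);
        total++;
    }

    // Walks distinct values in ascending order until the cumulative
    // count reaches q * total; O(d) per query.
    int quantile(double q) {
        if (total == 0) throw new IllegalStateException("no values yet");
        int seen = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            seen += e.getValue();
            if (seen >= q * total) return e.getKey();
        }
        return counts.lastKey(); // only reached if q > 1
    }
}
```

Unlike the array, this also handles negative or very large values, at the cost of a linear walk over the distinct values on each query.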

  • 2020-12-12 18:00

    You can use binary search to find the correct insert position in O(log n). However, shifting the array elements up to make room is still O(n).
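
To make the trade-off concrete, here is a minimal Java sketch that keeps an `ArrayList` sorted using `Collections.binarySearch` and reads the percentile by index. The class name `SortedPercentile` and the index convention `size * 3 / 4` are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: keeps values sorted so the percentile is an index lookup.
class SortedPercentile {
    private final List<Integer> sorted = new ArrayList<>();

    // O(log n) to locate the slot, O(n) to shift elements during the insert.
    void add(int x) {
        int pos = Collections.binarySearch(sorted, x);
        if (pos < 0) {
            pos = -(pos + 1); // binarySearch returns -(insertionPoint) - 1 when x is absent
        }
        sorted.add(pos, x);
    }

    // Assumed convention: element at index floor(0.75 * n) of the sorted list; O(1).
    int percentile75() {
        if (sorted.isEmpty()) throw new IllegalStateException("no values yet");
        return sorted.get(sorted.size() * 3 / 4);
    }
}
```

The shift is a bulk memory move, so in practice this can be fast enough for moderately sized lists even though it is O(n) per insert.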

  • 2020-12-12 18:01

    A simple Order Statistics Tree is enough for this.

    A balanced version of this tree supports O(log n) time insert/delete and access by rank. So you not only get the 75th percentile, but also the 66th or 50th or whatever you need without having to change your code.

    If you access the 75th percentile frequently but insert less often, you can always cache the 75th percentile element during an insert/delete operation.

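    Note that most standard-library trees (like Java's TreeMap) are balanced search trees but not order statistic trees: they do not offer access by rank, so you need a tree whose nodes are augmented with subtree sizes.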

  • 2020-12-12 18:10

    You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(logn) time complexity and heaps are also included in standard libraries of most programming languages.

    The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the biggest 25%). Heap A is a max-heap (biggest element on top) and heap B is a min-heap (smallest element on top).

    1. Adding an element.

    See if the new element x is <= max(A). If it is, add it to heap A; otherwise, add it to heap B.
    Now, if we added x to heap A and it became too big (holds more than 75% of the elements), we need to remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
    The same applies, mirrored, if heap B became too big.

    2. Finding the "0.75 median".

    Just take the largest element from A (or the smallest from B). This requires O(log n) or O(1) time, depending on the heap implementation.

    edit
    As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B), where array is the sorted data and the division is integer division.
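
The two-heap scheme above can be sketched in Java with `java.util.PriorityQueue`. This is only an illustration under the sizing rule from the edit (size(A) = floor(n * 0.75)); the class name `TwoHeapPercentile` and the field names `lower` and `upper` are made up for the example.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical helper: running 75th percentile with two heaps.
class TwoHeapPercentile {
    // Heap A: max-heap holding the smallest floor(0.75 * n) values.
    private final PriorityQueue<Integer> lower = new PriorityQueue<>(Collections.reverseOrder());
    // Heap B: min-heap holding the remaining (largest) values; its top is the answer.
    private final PriorityQueue<Integer> upper = new PriorityQueue<>();

    // O(log n): place x on the correct side, then restore the size rule.
    void add(int x) {
        if (!lower.isEmpty() && x <= lower.peek()) {
            lower.add(x);
        } else {
            upper.add(x);
        }
        int n = lower.size() + upper.size();
        int target = n * 3 / 4; // desired size of heap A: floor(0.75 * n)
        // Move elements across the boundary until heap A has exactly target elements.
        while (lower.size() > target) upper.add(lower.poll());
        while (lower.size() < target) lower.add(upper.poll());
    }

    // O(1): equals sorted[n * 3 / 4] under the sizing rule above.
    int percentile75() {
        if (upper.isEmpty()) throw new IllegalStateException("no values yet");
        return upper.peek();
    }
}
```

Because floor(0.75 * n) is always smaller than n for n > 0, heap B is never empty after an insert, so `percentile75()` can always read its top.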

  • 2020-12-12 18:10

    Here is a JavaScript solution. Copy and paste it into the browser console and it works. $scores contains the list of scores, and $percentile gives the n-th percentile of the list. So the 75th percentile is 76.8 and the 90th percentile is 87.9.

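    function get_percentile($percentile, $array) {
        // Sort a copy numerically; the default sort() compares values as strings.
        const $sorted = $array.slice().sort((a, b) => a - b);
        const $index = ($percentile / 100) * $sorted.length;
        let $result;
        if (Math.floor($index) === $index) {
            // Whole-number index: average the two neighbouring elements.
            $result = ($sorted[$index - 1] + $sorted[$index]) / 2;
        }
        else {
            $result = $sorted[Math.floor($index)];
        }
        return $result;
    }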
    
    $scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];
    
    get_percentile(75, $scores);
    get_percentile(90, $scores);
    
  • 2020-12-12 18:11

    If an approximate answer is acceptable, you can use a histogram instead of keeping all the values in memory.

    For each new value, add it to the appropriate bin. Calculate the 75th percentile by traversing the bins and summing their counts until 75% of the population size is reached. The percentile value then lies between the low and high bounds of the bin you stopped at.

    This gives O(B) complexity, where B is the number of bins, which is range_size/bin_size (use a bin_size appropriate to your use case).

    I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.
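
For readers who just want to see the idea, here is a minimal fixed-width histogram sketch in Java. It is written independently of the HBPE library and does not reflect its API; the class name `HistogramPercentile`, the clamping of out-of-range values, and the assumption that the value range [min, max) is known up front are all choices made for this example.

```java
// Hypothetical fixed-width histogram; values are assumed to lie in [min, max).
class HistogramPercentile {
    private final double min;
    private final double binSize;
    private final long[] counts;
    private long total = 0;

    HistogramPercentile(double min, double max, int bins) {
        this.min = min;
        this.binSize = (max - min) / bins;
        this.counts = new long[bins];
    }

    // O(1): increment the bin that x falls into.
    void add(double x) {
        int bin = (int) ((x - min) / binSize);
        // Clamp out-of-range values into the first or last bin (an assumption of this sketch).
        bin = Math.max(0, Math.min(counts.length - 1, bin));
        counts[bin]++;
        total++;
    }

    // O(B): returns {low, high} bounds of the bin where the cumulative
    // count first reaches q * total.
    double[] quantileBounds(double q) {
        if (total == 0) throw new IllegalStateException("no values yet");
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= q * total) {
                return new double[] { min + i * binSize, min + (i + 1) * binSize };
            }
        }
        // Only reached if q > 1; fall back to the last bin.
        int last = counts.length - 1;
        return new double[] { min + last * binSize, min + (last + 1) * binSize };
    }
}
```

The answer is only as precise as the bin width: a smaller bin_size narrows the returned interval but increases B and therefore the cost of each query.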
