C++ Efficiently Calculating a Running Median [duplicate]

This question already has an answer here:

Find running median from a stream of integers 8 answers

Those of you that have read my previous questions know about my work at understanding and implementing quicksort and quickselect, as well as some other basic algorithms.

Quickselect is used to calculate the kth smallest element in an unsorted list, and this concept can also be used to find the median in an unsorted list.

This time, I need aid in devising an efficient technique to calculate the running median, because quickselect isn't a good choice as it needs to re-calculate every time the list changes. Because quickselect has to restart everytime, it can't take advantage of previous calculations done, so I'm looking for a different algorithm that's similar (possibly) but is more efficient in the area of running medians.

The streaming median is computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

When the process begins the two heaps are initially empty. The first number in the input sequence is added to one of the heaps, it doesn’t matter which, and returned as the first streaming median. The second number in the input sequence is then added to the other heap, if the root of the right heap is less than the root of the left heap the two heaps are swapped, and the average of the two numbers is returned as the second streaming median.

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted in the smaller heap. Then the current median is computed as the root of the larger heap, if they differ in count, or the average of the roots of the two heaps, if they are the same size.

Code in Scheme and Python is available at my blog.

The Jeff McClintock running median estimate. Requires keeping only two values. This example iterates over an array of sampled values (CPU consumption). Seems to converge relatively quickly (about 100 samples) to an estimate of the median. The idea is at each iteration the median inches toward the input signal at a constant rate. The rate depends on what magnitude you estimate the median to be. I use the average as an estimate of the magnitude of the median, to determines the size of each increment of the median. If you need your median accurate to about 1%, use a step-size of 0.01 * the average.

float median = 0.0f;
float average = 0.0f;

// for each sample
{
    average += ( sample - average ) * 0.1f; // rough running average.
    median += _copysign( average * 0.01, sample - median );
}

One solution would be to maintain an order statistic tree, inserting each element of the sequence in turn, then compute the median of the elements in the tree.

This would take O(lg n) time per insertion and O(lg n) time per median, for a total of O(n lg n) time, plus O(n) space.

Here is a C++ balanced tree structure that provides the ability to query by index in the sorted list. Since it maintains all values in sorted order, this is not be quite as efficient as the two-heaps approach, but it offers some additional flexibility. For example, it could also give you a running quartile.

template <typename T>
class Node
{
public:
    T key;
    Node* left;
    Node* right;
    size_t size;

    Node(T k) : key(k)
    {
        isolate();
    }

    ~Node()
    {
        delete(left);
        delete(right);
    }

    void isolate()
    {
        left = NULL;
        right = NULL;
        size = 1;
    }

    void recount()
    {
        size = 1 + (left ? left->size : 0) + (right ? right->size : 0);
    }

    Node<T>* rotateLeft()
    {
        Node<T>* c = right;
        Node<T>* gc = right->left;
        right = gc;
        c->left = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* rotateRight()
    {
        Node<T>* c = left;
        Node<T>* gc = left->right;
        left = gc;
        c->right = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* balance()
    {
        size_t lcount = left ? left->size : 0;
        size_t rcount = right ? right->size : 0;
        if((lcount + 1) * 2 < (rcount + 1))
        {
            size_t lcount2 = right->left ? right->left->size : 0;
            size_t rcount2 = right->right ? right->right->size : 0;
            if(lcount2 > rcount2)
                right = right->rotateRight();
            return rotateLeft();
        }
        else if((rcount + 1) * 2 <= (lcount + 1))
        {
            size_t lcount2 = left->left ? left->left->size : 0;
            size_t rcount2 = left->right ? left->right->size : 0;
            if(lcount2 < rcount2)
                left = left->rotateLeft();
            return rotateRight();
        }
        else
        {
            recount();
            return this;
        }
    }

    Node<T>* insert(Node<T>* newNode)
    {
        if(newNode->key < key)
        {
            if(left)
                left = left->insert(newNode);
            else
                left = newNode;
        }
        else
        {
            if(right)
                right = right->insert(newNode);
            else
                right = newNode;
        }
        return balance();
    }

    Node<T>* get(size_t index)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            return left->get(index);
        else if(index > lcount)
            return right ? right->get(index - lcount - 1) : NULL;
        else
            return this;
    }

    Node<T>* find(T k, size_t start, size_t* outIndex)
    {
        if(k < key)
            return left ? left->find(k, start, outIndex) : NULL;
        else if(k > key)
            return right ? right->find(k, left ? start + left->size + 1 : start + 1, outIndex) : NULL;
        else
        {
            if(outIndex)
                *outIndex = start + (left ? left->size : 0);
            return this;
        }
    }

    Node<T>* remove_by_index(size_t index, Node<T>** outNode)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            left = left->remove_by_index(index, outNode);
        else if(index > lcount)
            right = right->remove_by_index(index - lcount - 1, outNode);
        else
        {
            *outNode = this;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }

    Node<T>* remove_by_value(T k, Node<T>** outNode)
    {
        if(k < key)
        {
            if(!left)
                throw "not found";
            left = left->remove_by_value(k, outNode);
        }
        else if(k > key)
        {
            if(!right)
                throw "not found";
            right = right->remove_by_value(k, outNode);
        }
        else
        {
            *outNode = this;
            size_t lcount = left ? left->size : 0;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }
};

template <typename T>
class MyReasonablyEfficientRunningSortedIndexedCollection
{
private:
    Node<T>* root;
    Node<T>* spare;

public:
    MyReasonablyEfficientRunningSortedIndexedCollection() : root(NULL), spare(NULL)
    {
    }

    ~MyReasonablyEfficientRunningSortedIndexedCollection()
    {
        delete(root);
        delete(spare);
    }

    void insert(T key)
    {
        if(spare)
            spare->key = key;
        else
            spare = new Node<T>(key);
        if(root)
            root = root->insert(spare);
        else
            root = spare;
        spare = NULL;
    }

    void drop_by_index(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        delete(spare);
        root = root->remove_by_index(index, &spare);
        spare->isolate();
    }

    void drop_by_value(T key)
    {
        if(!root)
            throw "out of range";
        delete(spare);
        root = root->remove_by_value(key, &spare);
        spare->isolate();
    }

    T get(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        return root->get(index)->key;
    }

    size_t find(T key)
    {
        size_t outIndex;
        Node<T>* node = root ? root->find(key, 0, &outIndex) : NULL;
        if(node)
            return outIndex;
        else
            throw "not found";
    }

    size_t size()
    {
        return root ? root->size : 0;
    }
};

Algorithm of rolling window median:

median is a sorted array where you take from it the middle value.

simple rolling implementation is with a queue(dqueue) and a sorted_array (any implementation, binary tree, skiparray).

d_queue is an array where you can push to tail and shift (pop) from the front of the array.

sorted_array is an array where you insert by order at position found using binary search.

I used a queue (first-in-first-out array) to track order of added values to know which items to remove from the median array, when they the queue is longer than the wanted size. to fall off elements by date time or some running index, it is possible to add another queue and check the first element is too old, and decide if to remove first value from both queues.

To calculate a median efficiently I use a sorted array technique. it is when you insert new items into its sorted place, so the array is always sorted.

The insert:
- Insert at ordered place in the sorted_array,
- and push a value into a queue.
The remove:
- If d_queue first element is off the window, or if in an another queue you can have with indexes, the index is too old, then:
  - remove the first item from d_queue(s),
  - and binary search for it in the sorted array and remove it.
To have the median:
- Use the value(s) in the middle of the sorted_array.
- If sorted_array length is even use the item in the middle.
- If sorted_array length is odd use an average of two items in the middle.

Abdel Hechavarria Diaz

#include<cstdio>
#include<iostream>
#include<queue>
#include <vector>         
#include <functional>

typedef priority_queue<unsigned int> type_H_low;
typedef priority_queue<unsigned int, std::vector<unsigned int>, std::greater<unsigned int> > type_H_high;

size_t signum(int left, int right) {
    if (left == right){
        return 0;
    }
    return (left < right)?-1:1;
}

void get_median( unsigned int x_i, unsigned int &m, type_H_low *l, type_H_high *r) {

    switch (signum( l->size(), r->size() )) {
    case 1: // There are more elements in left (max) heap
        if (x_i < m) {
            r->push(l->top());
            l->pop();
            l->push(x_i);
        } else {
            r->push(x_i);
        }
        break;

    case 0: // The left and right heaps contain same number of elements
        if (x_i < m){
            l->push(x_i);
        } else {
            r->push(x_i);
        }
        break;

    case -1: // There are more elements in right (min) heap
        if (x_i < m){
            l->push(x_i);
        } else {
            l->push(r->top());
            r->pop();
            r->push(x_i);
        }
        break;
    }

    if (l->size() == r->size()){
        m = l->top();
    } else if (l->size() > r->size()){
        m = l->top();
    } else {
        m = r->top();
    }

    return;
}

void print_median(vector<unsigned int> v) {
    unsigned int median = 0;
    long int sum = 0;
    type_H_low  H_low;
    type_H_high H_high;

    for (vector<unsigned int>::iterator x_i = v.begin(); x_i != v.end(); x_i++) {
        get_median(*x_i, median, &H_low, &H_high);
        std::cout << median << std::endl;
    }
}

来源：https://stackoverflow.com/questions/10930732/c-efficiently-calculating-a-running-median

标签

c++

algorithm

median