Interview Question: Find Median From Mega Number Of Integers

半腔热情 submitted on 2019-12-02 14:10:24
Rex Kerr

Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.

Now you count up in that histogram until you reach the bin that covers the midpoint of the values.

Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.

Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.

Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).

Here's some sample Scala code that does this:

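def medianFinder(numbers: Iterable[Int]) = {
  // Returns (cumulative count, index) of the first bin whose running total reaches `mid`.
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  // XOR with 0x8000 flips the sign bit so that negative numbers land in the low bins;
  // a plain >>> would order negatives above positives.
  def topBits(number: Int) = (number >>> 16) ^ 0x8000
  // Pass 1: count the values and histogram their top 16 bits.
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(topBits(number)) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  // Pass 2: histogram the bottom 16 bits of the values that share the median's top bits.
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if (topBits(number) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  // Rank inside the bin = overall rank minus the number of values in lower bins.
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
  ((topIndex ^ 0x8000)<<16) + botIndex
}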

and here it is working on a small set of input data:

scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345

If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
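
The four-pass variant for 64-bit values might look like the following Java sketch. It is only an illustration under stated assumptions: the class name `RadixMedian64` and the `Supplier<LongStream>` parameter are hypothetical, the supplier stands in for re-reading the file on every pass, and the lower median is returned when the count is even.

```java
import java.util.function.Supplier;
import java.util.stream.LongStream;

class RadixMedian64 {
    /**
     * Lower median of signed 64-bit values in four 16-bit counting passes.
     * `source` is a hypothetical stand-in for re-reading the file: every
     * get() must return a fresh stream over the same values.
     */
    static long lowerMedian(Supplier<LongStream> source) {
        // 1-based rank of the lower median (one extra pass if the count is unknown).
        long rank = (source.get().count() + 1) / 2;
        if (rank == 0) throw new IllegalArgumentException("empty input");
        long prefix = 0;      // high bits of the median found so far (sign bit flipped)
        long prefixMask = 0;  // which bits of `prefix` are already fixed
        for (int shift = 48; shift >= 0; shift -= 16) {
            long[] hist = new long[1 << 16];
            final long p = prefix, m = prefixMask;
            final int s = shift;
            source.get().forEach(v -> {
                // Flip the sign bit so that unsigned bin order matches signed order.
                long u = v ^ Long.MIN_VALUE;
                // Only values that agree with the bits fixed so far are counted.
                if ((u & m) == p) hist[(int) ((u >>> s) & 0xFFFF)]++;
            });
            // Walk the histogram until the bin containing the target rank is reached.
            int bin = 0;
            while (hist[bin] < rank) rank -= hist[bin++];
            prefix |= ((long) bin) << shift;
            prefixMask |= 0xFFFFL << shift;
        }
        return prefix ^ Long.MIN_VALUE;   // undo the sign-bit flip
    }

    public static void main(String[] args) {
        // prints 12345
        System.out.println(lowerMedian(() -> LongStream.of(1, 123, 12345, 1234567, 123456789)));
    }
}
```

Each pass needs only one 65536-entry histogram, so memory use stays the same as in the 32-bit case; only the number of passes over the data grows.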

starblue

You can use the median of medians algorithm.
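
For reference, a minimal in-memory sketch of selection with a median-of-medians pivot in Java. The class name `MedianOfMedians` and the method `select` are hypothetical, and the sketch assumes the data fits in an `int[]`, which the question itself may rule out.

```java
import java.util.Arrays;

class MedianOfMedians {
    /** Returns the k-th smallest element (0-based) of a; reorders a in place. */
    static int select(int[] a, int k) {
        if (k < 0 || k >= a.length) throw new IllegalArgumentException("k out of range");
        int lo = 0, hi = a.length - 1;                 // inclusive search range
        while (true) {
            if (hi - lo < 5) {                         // small range: sort it directly
                Arrays.sort(a, lo, hi + 1);
                return a[k];
            }
            // Sort each group of five and move its median to the front of the range.
            int m = lo;
            for (int g = lo; g <= hi; g += 5) {
                int end = Math.min(g + 4, hi);
                Arrays.sort(a, g, end + 1);
                swap(a, m++, g + (end - g) / 2);
            }
            // The pivot is the median of the group medians, found recursively on a copy.
            int pivot = select(Arrays.copyOfRange(a, lo, m), (m - lo) / 2);
            // Three-way partition: [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot.
            int lt = lo, i = lo, gt = hi;
            while (i <= gt) {
                if (a[i] < pivot) swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, gt--);
                else i++;
            }
            if (k < lt) hi = lt - 1;
            else if (k > gt) lo = gt + 1;
            else return pivot;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 6, 4, 5, 0, 11, 10};
        // Index (n - 1) / 2 is the lower median.
        System.out.println(select(data, (data.length - 1) / 2));
    }
}
```

The pivot choice guarantees that a constant fraction of the range is discarded on every round, which is what gives the worst-case linear running time.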

If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.

If you can't read them into memory, this is what I came up with:

  1. Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.

  2. Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.

  3. Do another pass through, finding the next x largest integers less than x1, the least of which is x2.

  4. I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
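
The passes described above could be sketched in Java roughly as follows. The names `TopChunkMedian`, `lowerMedian` and `capacity` (the x from the steps) are hypothetical, the `Supplier<IntStream>` stands in for re-reading the file, and the sketch returns the lower median instead of averaging for even S. It also keeps a small counter so that duplicates of a cutoff value are not lost between passes, a detail the steps do not spell out.

```java
import java.util.PrimitiveIterator;
import java.util.PriorityQueue;
import java.util.function.Supplier;
import java.util.stream.IntStream;

class TopChunkMedian {
    /** `capacity` is the x from the text: how many values fit in memory at once. */
    static int lowerMedian(Supplier<IntStream> source, int capacity) {
        long total = source.get().count();          // step 1: S
        if (total == 0 || capacity < 1) throw new IllegalArgumentException("bad input");
        long need = total - (total + 1) / 2 + 1;    // lower median = need-th largest value
        long bound = Long.MAX_VALUE;                // values above `bound` are already consumed
        long skipEqual = 0;                         // copies of `bound` already consumed
        while (true) {
            PriorityQueue<Integer> heap = new PriorityQueue<>();   // min-heap of the chunk
            long skipped = 0;
            PrimitiveIterator.OfInt it = source.get().iterator();
            while (it.hasNext()) {
                int v = it.nextInt();
                if (v > bound) continue;
                if (v == bound && skipped < skipEqual) { skipped++; continue; }
                heap.add(v);
                if (heap.size() > capacity) heap.poll();           // discard the smallest
            }
            if (heap.size() >= need) {
                // The median is in this chunk: keep only the `need` largest values.
                while (heap.size() > need) heap.poll();
                return heap.peek();
            }
            // Otherwise consume the whole chunk and remember where it ended.
            need -= heap.size();
            int low = heap.peek();
            long copies = heap.stream().filter(x -> x == low).count();
            if (low == bound) {
                skipEqual += copies;
            } else {
                bound = low;
                skipEqual = copies;
            }
        }
    }

    public static void main(String[] args) {
        // prints 12345
        System.out.println(lowerMedian(() -> IntStream.of(1, 123, 12345, 1234567, 123456789), 2));
    }
}
```

With S values and room for x of them, this takes about S / (2x) passes, so it is only attractive when a large fraction of the data fits in memory.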

Make a pass through the file and find the count of integers and the minimum and maximum values.

Take the midpoint of min and max, and get the count, min and max for the values on either side of the midpoint by reading through the file again.

At the first step, the partition holding more than half of the values contains the median; in general, it is the partition that contains the median's rank.

Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.

I'm sure this would work for an arbitrary number of partitions as well.
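
A minimal Java sketch of this repeated halving, assuming two partitions per pass. The class name `BisectionMedian` is hypothetical, the `Supplier<IntStream>` stands in for re-reading the file, and the lower median is returned for an even count.

```java
import java.util.PrimitiveIterator;
import java.util.function.Supplier;
import java.util.stream.IntStream;

class BisectionMedian {
    static int lowerMedian(Supplier<IntStream> source) {
        // First pass: count, minimum and maximum (held as long to avoid overflow in hi - lo).
        long n = 0, lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
        PrimitiveIterator.OfInt it = source.get().iterator();
        while (it.hasNext()) {
            int v = it.nextInt();
            n++;
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        if (n == 0) throw new IllegalArgumentException("empty input");
        long rank = (n + 1) / 2;   // 1-based rank of the lower median
        long below = 0;            // number of values smaller than lo ("partitions to the left")
        while (lo < hi) {          // stop when min = max
            long mid = lo + (hi - lo) / 2;
            long leftCount = 0;
            long leftMin = Long.MAX_VALUE, leftMax = Long.MIN_VALUE;
            long rightMin = Long.MAX_VALUE, rightMax = Long.MIN_VALUE;
            it = source.get().iterator();
            while (it.hasNext()) {
                int v = it.nextInt();
                if (v < lo || v > hi) continue;        // outside the current partition
                if (v <= mid) {
                    leftCount++;
                    leftMin = Math.min(leftMin, v);
                    leftMax = Math.max(leftMax, v);
                } else {
                    rightMin = Math.min(rightMin, v);
                    rightMax = Math.max(rightMax, v);
                }
            }
            if (below + leftCount >= rank) {           // the median's rank falls in the left part
                lo = leftMin;
                hi = leftMax;
            } else {                                   // otherwise it is in the right part
                below += leftCount;
                lo = rightMin;
                hi = rightMax;
            }
        }
        return (int) lo;
    }

    public static void main(String[] args) {
        // prints 12345
        System.out.println(lowerMedian(() -> IntStream.of(1, 123, 12345, 1234567, 123456789)));
    }
}
```

Because the value range at least halves on every pass, 32-bit inputs need at most about 32 passes after the initial one, and tightening to the observed min and max usually cuts that further.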

Chris Schmich
  1. Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
  2. Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.

The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.

Given n = number of integers in the original file:

  • Running time: O(n log n)
  • Memory: O(1), adjustable
  • Disk: O(n)
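
As a rough illustration, the two steps can be sketched in Java as below. Everything named here is an assumption: the class `ExternalSortMedian`, the `runSize` parameter (how many integers fit in memory) and the use of temporary run files. The sketch also takes a shortcut: instead of writing a fully merged file and seeking to its middle, it stops the merge once the middle rank is reached, and it returns the lower median for an even count.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortMedian {
    /** Lower median of `input`, holding at most `runSize` values in memory at a time. */
    static int lowerMedian(Iterator<Integer> input, int runSize) {
        if (runSize < 1) throw new IllegalArgumentException("runSize must be positive");
        List<Path> runs = new ArrayList<>();
        List<DataInputStream> readers = new ArrayList<>();
        try {
            // Phase 1: sort fixed-size chunks and spill each one to a temporary run file.
            long total = 0;
            int[] buf = new int[runSize];
            while (input.hasNext()) {
                int n = 0;
                while (n < runSize && input.hasNext()) buf[n++] = input.next();
                Arrays.sort(buf, 0, n);
                Path run = Files.createTempFile("median-run", ".bin");
                runs.add(run);
                try (DataOutputStream out = new DataOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(run)))) {
                    for (int i = 0; i < n; i++) out.writeInt(buf[i]);
                }
                total += n;
            }
            if (total == 0) throw new IllegalArgumentException("empty input");

            // Phase 2: k-way merge of the runs through a min-heap of {value, runIndex}.
            PriorityQueue<int[]> heap =
                    new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
            for (int r = 0; r < runs.size(); r++) {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(runs.get(r))));
                readers.add(in);
                heap.add(new int[] {in.readInt(), r});   // every run holds at least one value
            }
            long rank = (total + 1) / 2;                 // 1-based rank of the lower median
            while (true) {
                int[] head = heap.poll();
                if (--rank == 0) return head[0];
                try {
                    heap.add(new int[] {readers.get(head[1]).readInt(), head[1]});
                } catch (EOFException runExhausted) {
                    // Nothing left in this run; the other runs continue.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (DataInputStream in : readers) {
                try { in.close(); } catch (IOException ignored) { }
            }
            for (Path run : runs) {
                try { Files.deleteIfExists(run); } catch (IOException ignored) { }
            }
        }
    }
}
```

A call such as `ExternalSortMedian.lowerMedian(Arrays.asList(9, 3, 7, 1, 8, 2).iterator(), 4)` sorts two runs of at most four values each and merges them until the third smallest value is reached.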

Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.

My best guess is that a probabilistic median of medians would be the fastest approach. Recipe:

  1. Take the next set of N integers (N should be large enough, say 1000 or 10000 elements).
  2. Calculate the median of these integers and assign it to the variable X_new.
  3. If this is not the first iteration, average the two medians:

    X_global = (X_global + X_new) / 2

  4. When X_global stops fluctuating much, you have found an approximate median of the data.

But there are some caveats:

  • The question is whether an error in the median is acceptable or not.
  • The integers must be randomly ordered and uniformly distributed for this solution to work.

EDIT: I've played with this algorithm a bit and changed the idea slightly: in each iteration, X_new should be added with a decreasing weight, such as:

X_global = k*X_global + (1.-k)*X_new

where k lies in [0.5 .. 1.] and increases with each iteration.

The point is to make the calculation converge quickly to some number in a very small number of iterations, so that a very approximate median (with a large error) of 100000000 array elements is found in only 252 iterations! Check this C experiment:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));

    while (i<ARRAY_SIZE && k!=1.) {
        X_new=0;
        for (int j=i; j<i+RANGE_SIZE; j++) {
            X_new+=rand()%10000 + 1;
        }
        X_new/=RANGE_SIZE;

        if (iter>0) {
            k += dk;
            k = (k>1.)? 1.:k;
            X_global = k*X_global+(1.-k)*X_new;

        }
        else {
            X_global = X_new;
        }

        i+=RANGE_SIZE+1;
        iter++;
        printf("iter %d, median = %d \n",iter,X_global);
    }

    return 0;

}

Oops, it seems I'm talking about the mean, not the median. If so, and you need exactly the median rather than the mean, ignore my post. In any case, mean and median are closely related concepts.

Good luck.

Here is the algorithm described by @Rex Kerr implemented in Java.

/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (the element of rank ceil((n+1)/2), where n = arr.length) of the array as a string
 */
static String computeMedian(String[] arr) {

    // rank of the median element
    int m = (int) Math.ceil((arr.length+1)/2.0);

    String bitMask = "";
    int zeroBin = 0;

    while (bitMask.length() < arr[0].length()) {

        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }

        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }

        zeroBin = 0;
    }

    return bitMask;
}

Some test cases and updates to the algorithm can be found here.

I was asked the same question and couldn't give an exact answer, so after the interview I went through some books on interviews. Here is what I found in the book Cracking the Coding Interview.

Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?

Our data structure brainstorm might look like the following:

• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.

• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.

• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.

• Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.

Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
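
The two-heap idea from the brainstorm can be sketched in Java as follows. The class name `RunningMedian` and its methods are hypothetical, and the sketch assumes all values fit in memory, which is the setting of the book's example and not of the original file-based question.

```java
import java.util.Collections;
import java.util.PriorityQueue;

class RunningMedian {
    // Max-heap holding the smaller half; its root is the largest of the small values.
    private final PriorityQueue<Integer> lower = new PriorityQueue<>(Collections.reverseOrder());
    // Min-heap holding the bigger half; its root is the smallest of the large values.
    private final PriorityQueue<Integer> upper = new PriorityQueue<>();

    void add(int value) {
        if (lower.isEmpty() || value <= lower.peek()) lower.add(value);
        else upper.add(value);
        // Rebalance so that lower has either the same size as upper or one more element.
        if (lower.size() > upper.size() + 1) upper.add(lower.poll());
        else if (upper.size() > lower.size()) lower.add(upper.poll());
    }

    double median() {
        if (lower.isEmpty()) throw new IllegalStateException("no values yet");
        if (lower.size() > upper.size()) return lower.peek();   // odd count: middle element
        return (lower.peek() + (double) upper.peek()) / 2.0;    // even count: average the roots
    }

    public static void main(String[] args) {
        RunningMedian rm = new RunningMedian();
        rm.add(5);
        rm.add(2);
        rm.add(8);
        System.out.println(rm.median());   // prints 5.0
        rm.add(1);
        System.out.println(rm.median());   // prints 3.5
    }
}
```

Each insertion costs O(log n) and reading the median is O(1), at the price of keeping every element in memory.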
