One of my friends was asked this question:
Retrieve the 100 largest numbers from one hundred million numbers
in a rece
By TOP 100, do you mean the 100 largest? If so:
SELECT TOP 100 Number FROM RidiculouslyLargeTable ORDER BY Number DESC
Make sure you tell the interviewer that you assume the table is indexed properly.
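#include <string.h>

#define N   100000000   /* one hundred million */
#define TOP 100

static int numbers[N];  /* assumed to be filled elsewhere */
int result[TOP] = {0};  /* kept in descending order; assumes non-negative input */

for (int i = 0; i < N; i++)
{
    for (int j = 0; j < TOP; j++)
    {
        if (numbers[i] > result[j])
        {
            if (j < TOP - 1)
            {
                /* shift result[j..TOP-2] down one slot; the regions overlap, so use memmove */
                memmove(result + j + 1, result + j, (TOP - 1 - j) * sizeof(int));
            }
            result[j] = numbers[i];
            break;
        }
    }
}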
Mergesort in batches of 100, then only keep the top 100.
Incidentally, you can scale this in all sorts of directions, including concurrently.
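A minimal sketch of the batch idea in C, under some assumptions: the numbers are already in an int array, the standard qsort stands in for a hand-written mergesort, and the names fold_batch and top_by_batches are made up for illustration. Each batch of 100 is folded into the running top 100 and everything below the cut is dropped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOP 100  /* how many of the largest values to keep */

/* qsort comparator: descending order, written to avoid integer overflow */
static int cmp_desc(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x < y) - (x > y);
}

/*
 * Folds one batch (at most TOP values) into best[], which holds *count
 * values in descending order. Afterwards best[] holds the largest
 * min(TOP, *count + n) values seen so far, still in descending order.
 */
static void fold_batch(int *best, size_t *count, const int *batch, size_t n)
{
    int merged[2 * TOP];
    size_t total;

    assert(n <= TOP && *count <= TOP);
    memcpy(merged, best, *count * sizeof(int));
    memcpy(merged + *count, batch, n * sizeof(int));
    total = *count + n;
    qsort(merged, total, sizeof(int), cmp_desc);
    *count = total < TOP ? total : TOP;
    memcpy(best, merged, *count * sizeof(int));
}

/* Scans data[0..n-1] in batches of TOP; returns how many values are in best[]. */
size_t top_by_batches(const int *data, size_t n, int *best)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i += TOP) {
        size_t len = (n - i < TOP) ? n - i : TOP;
        fold_batch(best, &count, data + i, len);
    }
    return count;
}
```

Because each call only touches its own best[] buffer, separate slices of the input could be handled by separate workers and their partial results folded together the same way at the end.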
@darius's answer can actually be improved
by "pruning", i.e. deferring the heap-replace operation as required.
Suppose we have a=1000 at the top of the heap.
It has children b and c.
We know that b,c>1000
        a=1000
   +-----|-----+
  b>a         c>a
We now read the next number x=1035
Since x>a we should discard a.
Instead we store (x=1035, a=1000) at the root
We do not (yet) bubble down the new value of 1035
Note that we still know that b,c>a but possibly b,c<x
Now, we get the next number y
when y<a<x then obviously we can discard it
when y>x>a then we replace x with y (the root now has (y, a=1000))
=> we saved log(m) steps here, since x will never have to bubble down
when x>y>a then we need to bubble down y recursively as required
Worst-case run time is still O(n log m)
But I think the average run time might be O(n log log m) or something similar
In any case, it is obviously a faster implementation
I store the first 100 numbers in a max-heap of size 100.
At the last level, I keep track of the minimum number. Each new number is compared with that minimum to check whether it is a candidate for the top 100.
If it is, I insert it and call reheapify again, so I always have a max-heap of the top 100.
Each reheapify works on only 100 elements, so it costs O(log 100); the complexity is therefore O(n log m) with m = 100, not O(n log n).
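The heap idea can be sketched in C roughly as follows. Note that this is a swapped-in variant: it uses a min-heap, so the smallest of the kept numbers sits at the root and no separate minimum has to be tracked, rather than the max-heap described above. The names top_k_heap and sift_down are hypothetical, and the input is assumed to be an int array already in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Restores the min-heap property below index i in heap[0..size-1]. */
static void sift_down(int *heap, size_t size, size_t i)
{
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < size && heap[left] < heap[smallest]) smallest = left;
        if (right < size && heap[right] < heap[smallest]) smallest = right;
        if (smallest == i) return;
        int tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/*
 * Keeps the k largest values of data[0..n-1] in heap[0..k-1], in heap
 * order with the smallest of them at heap[0]. Returns the number of
 * values stored, which is min(n, k).
 */
size_t top_k_heap(const int *data, size_t n, int *heap, size_t k)
{
    assert(k > 0);
    size_t size = n < k ? n : k;

    /* Seed the heap with the first min(n, k) values, then heapify. */
    for (size_t i = 0; i < size; i++) heap[i] = data[i];
    for (size_t i = size / 2; i-- > 0; ) sift_down(heap, size, i);

    /* Every later value only matters if it beats the current minimum. */
    for (size_t i = size; i < n; i++) {
        if (data[i] > heap[0]) {
            heap[0] = data[i];          /* replace the minimum ... */
            sift_down(heap, size, 0);   /* ... and repair the heap: O(log k) */
        }
    }
    return size;
}
```

Most inputs fail the data[i] > heap[0] check and cost a single comparison, so the O(log k) repair runs only for numbers that actually enter the top k.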
Ok, here is a really stupid answer, but it is a valid one:
Reasoning:
It is an OK solution for some sort of one-time operation. It would suck running it x times per second or something. But then, we need more context, as mclientk also had with his simple SQL statement: assuming the 100 million numbers do not exist in memory is a feasible reading of the question, because they may come from a database, and most of the time will when talking about business-relevant numbers.
As such, the question is really hard to answer - efficiency first has to be defined.