Write a program to find 100 largest numbers out of an array of 1 billion numbers

深忆病人 2020-11-29 14:04

I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."

I was only able to give a brute-force solution.

30 Answers
  • 2020-11-29 14:52

    The simplest solution is to scan the billion-number array and hold the 100 largest values found so far in a small array buffer, without any sorting, while remembering the smallest value in that buffer. At first I thought this method was proposed by fordprefect, but in a comment he said he assumed the 100-number data structure to be implemented as a heap. Whenever a new number is found that is larger than the minimum of the buffer, that minimum is overwritten by the new value and the buffer is searched for its new minimum. If the numbers in the billion-number array are randomly distributed, then most of the time a value from the large array is compared to the minimum of the small array and discarded. Only for a very, very small fraction of the numbers must the value be inserted into the small array, so the cost of manipulating the data structure holding the small numbers can be neglected. For such a small number of elements it is hard to determine whether a priority queue is actually faster than my naive approach.

    I want to estimate the number of insertions into the small 100-element buffer while the 10^9-element array is scanned. The program scans the first 1000 elements of the large array and inserts at most 1000 of them into the buffer. Afterwards the buffer contains 100 of the 1000 elements scanned, i.e. a fraction of 0.1 of the elements scanned so far. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1; such an element has to be inserted into the buffer.

    Now the program scans the next 10^4 elements of the large array. The minimum of the buffer increases every time a new element is inserted, so the fraction of elements larger than our current minimum stays at about 0.1 or below, and at most 0.1 * 10^4 = 1000 elements are inserted (the expected number is actually smaller). After scanning these 10^4 elements, the buffer holds about 0.01 of the elements scanned so far, so while scanning the next 10^5 numbers we assume that no more than 0.01 * 10^5 = 1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000 + 10^4 + 10^5 + ... + 10^9 ≈ 10^9 elements of the large array. So when scanning an array of 10^9 randomly distributed elements, we expect no more than 10^4 insertions (7000 rounded up) into the buffer.

    After each insertion into the buffer the new minimum must be found. If the buffer is a simple array, we need about 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap), we need at least 1 comparison to find the minimum. Comparing against the elements of the large array requires 10^9 comparisons. So all in all we need about 10^9 + 100 * 10^4 = 1.001 * 10^9 comparisons when using an array as the buffer, and at least 1.000 * 10^9 comparisons when using another data structure (like a heap). So using a heap brings only a gain of about 0.1% if performance is determined by the number of comparisons. But what is the difference in execution time between inserting an element into a 100-element heap and replacing an element in a 100-element array and then finding its new minimum?

    • At the theoretical level: How many comparisons are needed for inserting into a heap? I know it is O(log(n)), but how large is the constant factor?

    • At the machine level: What is the impact of caching and branch prediction on the execution time of a heap insert versus a linear search in an array?

    • At the implementation level: What additional costs are hidden in a heap data structure supplied by a library or a compiler?

    I think these are some of the questions that have to be answered before one can estimate the real difference between the performance of a 100-element heap and a 100-element array. So it would make sense to run an experiment and measure the real performance.
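    To make the approach concrete, here is a minimal Python sketch of the array-buffer method described above, with a counter added so the insertion estimate can be checked empirically (the function name and the counter are my own illustration, not part of the original argument):

    import random

    def top_k_buffer(numbers, k=100):
        '''Keep the k largest values in an unsorted buffer; track its minimum.'''
        it = iter(numbers)
        buf = [next(it) for _ in range(k)]   # assumes at least k numbers
        cur_min = min(buf)
        inserts = 0
        for x in it:
            if x > cur_min:                  # rare for randomly distributed input
                buf[buf.index(cur_min)] = x  # overwrite the old minimum
                cur_min = min(buf)           # ~100 comparisons to re-find it
                inserts += 1
        return buf, inserts

    # For 10^6 random values this typically reports on the order of a
    # thousand insertions, consistent with the estimate above:
    # top_k_buffer(random.random() for _ in range(10**6))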

  • 2020-11-29 14:53

    I have written up a simple solution in Python in case anyone is interested. It uses the bisect module and a temporary return list which it keeps sorted. This is similar to a priority queue implementation.

    import bisect
    
    def kLargest(A, k):
        '''returns list of k largest integers in A'''
        ret = []
        for i, a in enumerate(A):
            # For first k elements, simply construct sorted temp list
            # It is treated similarly to a priority queue
            if i < k:
                bisect.insort(ret, a) # properly inserts a into sorted list ret
            # Iterate over rest of array
            # Replace and update return array when more optimal element is found
            else:
                if a > ret[0]:
                    del ret[0] # pop min element off queue
                    bisect.insort(ret, a) # properly inserts a into sorted list ret
        return ret
    

    Usage with 100,000,000 elements and worst-case input, which is a sorted list:

    >>> from so import kLargest
    >>> kLargest(range(100000000), 100)
    [99999900, 99999901, 99999902, 99999903, 99999904, 99999905, 99999906, 99999907,
     99999908, 99999909, 99999910, 99999911, 99999912, 99999913, 99999914, 99999915,
     99999916, 99999917, 99999918, 99999919, 99999920, 99999921, 99999922, 99999923,
     99999924, 99999925, 99999926, 99999927, 99999928, 99999929, 99999930, 99999931,
     99999932, 99999933, 99999934, 99999935, 99999936, 99999937, 99999938, 99999939,
     99999940, 99999941, 99999942, 99999943, 99999944, 99999945, 99999946, 99999947,
     99999948, 99999949, 99999950, 99999951, 99999952, 99999953, 99999954, 99999955,
     99999956, 99999957, 99999958, 99999959, 99999960, 99999961, 99999962, 99999963,
     99999964, 99999965, 99999966, 99999967, 99999968, 99999969, 99999970, 99999971,
     99999972, 99999973, 99999974, 99999975, 99999976, 99999977, 99999978, 99999979,
     99999980, 99999981, 99999982, 99999983, 99999984, 99999985, 99999986, 99999987,
     99999988, 99999989, 99999990, 99999991, 99999992, 99999993, 99999994, 99999995,
     99999996, 99999997, 99999998, 99999999]
    

    It took about 40 seconds to calculate this for 100,000,000 elements, so I'm scared to do it for 1 billion. To be fair though, I was feeding it the worst-case input (ironically an array that is already sorted).
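    As an aside (my addition, not from the original answer): Python's standard library offers the same bounded-buffer idea via heapq.nlargest, which maintains a 100-element min-heap during a single scan and can serve as a baseline when timing the bisect version:

    import heapq

    # Single pass; keeps a 100-element min-heap of the largest values seen.
    top100 = heapq.nlargest(100, range(100000000))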

  • 2020-11-29 14:53

    I see a lot of O(N) discussions, so I propose something different just for the thought exercise.

    Is there any known information about the nature of these numbers? If it's random in nature, then go no further and look at the other answers. You won't get any better results than they do.

    However! See whether the mechanism that populated the list did so in a particular order. Is there a well-defined pattern from which you can know with certainty that the numbers of largest magnitude will be found in a certain region of the list or on a certain interval? If so (for example, if the numbers are guaranteed to follow some sort of normal distribution with the characteristic hump in the middle, to always have repeating upward trends among defined subsets, to have a prolonged spike at some time T in the middle of the data set, perhaps an incidence of insider trading or equipment failure, or maybe just to have a "spike" every Nth number, as in analysis of forces after a catastrophe), you can significantly reduce the number of records you have to check.

    There's some food for thought anyway. Maybe this will help you give future interviewers a thoughtful answer. I know I would be impressed if someone asked me such a question in response to a problem like this - it would tell me that they are thinking of optimization. Just recognize that there may not always be a possibility to optimize.

  • 2020-11-29 14:54

    I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least with a linked list you could insert a number into the middle without moving half a billion others to make room. Even better, a B-tree allows for a binary search, where each comparison eliminates half of your total. A hash algorithm would let you populate the data structure like a checkerboard, but it is not so good for sparse data. As it is, your best bet is to keep a solution array of 100 integers and track the lowest number in it, so you can replace it when you come across a higher number in the original array. You would have to look at every element of the original array, assuming it is not sorted to begin with.

  • 2020-11-29 14:55

    You can use the Quickselect algorithm to find the number at (order-statistic) index [billion - 101] and then iterate over the numbers to find those that are bigger than that number.

    array = {...the billion numbers...}
    result[100];
    
    pivot = QuickSelect(array, billion - 101); // O(N): the 101st-largest value
    
    for (i = 0; i < billion; i++)   // O(N)
       if (array[i] > pivot)        // strictly greater, so that exactly the 100
          result.add(array[i]);     // largest are collected (assuming distinct values)
    

    This algorithm's time is: 2 × O(N) = O(N) (average-case performance).
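    A runnable Python sketch of this idea (my own illustration; it uses a random pivot and assumes the values are distinct):

    import random

    def quickselect(a, k):
        '''Return the k-th smallest element (0-based) of list a; average O(N).'''
        while True:
            pivot = random.choice(a)
            lows   = [x for x in a if x < pivot]
            pivots = [x for x in a if x == pivot]
            if k < len(lows):
                a = lows
            elif k < len(lows) + len(pivots):
                return pivot
            else:
                k -= len(lows) + len(pivots)
                a = [x for x in a if x > pivot]

    def largest100(a):
        pivot = quickselect(a, len(a) - 101)  # the 101st-largest value
        return [x for x in a if x > pivot]    # exactly the 100 largest if distinct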

    The second option, as Thomas Jungblut suggests, is:

    Use a heap: building the max-heap will take O(N); then the 100 largest numbers will be at the top of the heap, and all you need is to extract them from the heap (100 × O(log(N))).

    This algorithm's time is: O(N) + 100 × O(log(N)) = O(N).
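    A minimal sketch of the heap option in Python (my addition; heapq is a min-heap, so values are negated to simulate a max-heap):

    import heapq

    def top100_maxheap(numbers):
        heap = [-x for x in numbers]   # negate: the min-heap becomes a max-heap
        heapq.heapify(heap)            # O(N)
        return [-heapq.heappop(heap) for _ in range(100)]  # 100 * O(log N)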

  • 2020-11-29 14:55
    1. Use nth-element to get the 100th-largest element, which is O(n).
    2. Iterate over the array a second time, in a single pass, and output every element that is greater than this specific element.

    Please note especially that the second step might be easy to compute in parallel! And it also remains efficient when you need the million biggest elements.
