Retrieving the top 100 numbers from one hundred million numbers

北荒 2020-11-30 19:36

One of my friends was asked the question

Retrieving the top 100 (largest) numbers from one hundred million numbers

in a rece

12 answers
  • 2020-11-30 19:51

    Run them all through a min-heap of size 100: for each input number k, replace the current min m with max(k, m). Afterwards the heap holds the 100 largest inputs.

    A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.

    Edit: I would have failed the interview: I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest():

    import heapq
    
    def funnel(n, numbers):
        if n == 0: return []
        heap = numbers[:n]
        heapq.heapify(heap)
        for k in numbers[n:]:
            if heap[0] < k:
                heapq.heapreplace(heap, k)
        return heap
    
    >>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8])
    [5, 8, 6, 9]
    
  • 2020-11-30 19:51

    First iteration:

    Quicksort, take top 100. O(n log n). Simple, easy to code. Very obvious.

    Better? We are working with numbers, so do a radix sort (linear time) and take the top 100. I would expect this is what the interviewer is looking for.

    Any other considerations? Well, a million numbers isn't a lot of memory, but if you want to minimize memory, keep only the 100 largest numbers encountered so far and scan through the input once. What would be the best way?

    Some have mentioned a heap, but a slightly better solution might be a doubly-linked list kept in sorted order, with a pointer to the minimum of the top 100 found so far. If you encounter a number that is bigger than the current smallest in the list, compare it with the next element, and keep moving the number from the next node into the current one until you find the place for the new number. (This is basically just a specialized heap for the situation.) With some tuning (if the number is greater than the current minimum, also compare it with the current maximum to decide from which end to walk the list to the insertion point) this would be reasonably efficient, and would only take about 1.5 KB of memory.
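
    A minimal sketch of the "keep the best 100 in sorted order" idea, under one loudly stated assumption: it swaps the doubly-linked list for a plain Python list searched with bisect, which is not what the answer describes but keeps the same invariant (the smallest retained value is always at the front). The name top_n_sorted_buffer is made up for illustration.

```python
import bisect

def top_n_sorted_buffer(n, numbers):
    """Return the n largest values from an iterable, in ascending order."""
    if n <= 0:
        return []
    buf = []  # kept sorted ascending; buf[0] is the smallest value retained
    for k in numbers:
        if len(buf) < n:
            bisect.insort(buf, k)
        elif k > buf[0]:
            # Drop the current minimum, then insert k at its sorted position.
            del buf[0]
            bisect.insort(buf, k)
    return buf

print(top_n_sorted_buffer(4, [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]))  # [5, 6, 8, 9]
```

    Each accepted number costs up to O(n) element moves in the buffer, which is acceptable for n = 100 but is the reason a heap scales better for larger n.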

  • 2020-11-30 19:57

    If the data is already in an array that you can modify, you could use a variant of Hoare's Select algorithm, which is (in turn) a variant of Quicksort.

    The basic idea is pretty simple. In Quicksort, you partition the array into two pieces, one of items larger than the pivot, and the other of items smaller than the pivot. Then you recursively sort each partition.

    In the Select algorithm, you do the partitioning step exactly as before -- but instead of recursively sorting both partitions, you look at which partition contains the elements you want, and recursively select ONLY in that partition. E.g., assuming your 100 million items partition nearly in half, the first several iterations you're going to look only at the upper partition.

    Eventually, you're likely to reach a point where the portion you want "bridges" two partitions -- e.g., you have a partition of ~150 numbers, and when you partition that you end up with two pieces of ~75 apiece. At that point, only one minor detail changes: instead of rejecting one partition and continuing to work only on the other, you accept the upper partition of 75 items, and then continue looking for the top 25 in the lower partition.

    If you were doing this in C++, you could do this with std::nth_element (which will normally be implemented approximately as described above). On average, this has linear complexity, which I believe is about as good as you can hope for (absent some preexisting order, I don't see any way to find the top N elements without looking at all the elements).
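
    For readers without C++ at hand, here is a rough Python sketch of the selection idea. It is an illustration under assumptions, not the std::nth_element implementation: it works on a copy, picks a random pivot, and phrases the "bridging" case as finding the single boundary index target so that everything from that index onward is the top n. The function name top_n_select is hypothetical.

```python
import random

def top_n_select(n, items):
    """Return the n largest values of a sequence, in no particular order."""
    if n <= 0:
        return []
    data = list(items)  # work on a copy so the caller's data is untouched
    if n >= len(data):
        return data
    lo, hi = 0, len(data) - 1
    target = len(data) - n  # once selected, data[target:] holds the n largest
    while lo < hi:
        pivot = data[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition: smaller values move left, larger move right.
        while i <= j:
            while data[i] < pivot:
                i += 1
            while data[j] > pivot:
                j -= 1
            if i <= j:
                data[i], data[j] = data[j], data[i]
                i += 1
                j -= 1
        # Here data[lo..j] <= pivot <= data[i..hi]; recurse only where target lies.
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break  # target sits between j and i, where values equal the pivot
    return data[target:]

print(sorted(top_n_select(4, [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8])))  # [5, 6, 8, 9]
```

    The expected running time is linear in the input size, but the whole input has to fit in memory, which is the trade-off against the heap approach described next.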

    If the data's not already in an array, and you're (for example) reading the data from a file, you usually want to use a heap. You basically read an item, insert it into the heap, and if the heap is larger than your target (100 items, in this case) you remove the smallest one and re-heapify.

    What's probably not so obvious (but is actually true) is that you don't normally want to use a max-heap for this task. At first glance, it seems pretty obvious: if you want to get the maximum items you should use a max heap.

    It's simpler, however, to think in terms of the items you're "removing" from the heap. A max heap lets you find the one largest item in the heap quickly. It is not, however, optimized for finding the smallest item in the heap.

    In this case, we're interested primarily in the smallest item in the heap. In particular, when we read each item in from the file, we want to compare it to the smallest item in the heap. If (and only if) it's larger than the smallest item in the heap, we want to replace that smallest item currently in the heap with the new item. Since that's (by definition) larger than the existing item, we'll then need to sift that into the correct position in the heap.

    But note: if the items in the file are randomly ordered, as we read through the file, we fairly quickly reach a point at which most items we read from the file will be smaller than the smallest item in our heap. Since we have easy access to the smallest item in the heap, it's fairly quick and easy to do that comparison, and for smaller items never insert in the heap at all.
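
    The streaming case can be sketched in Python roughly as follows. This is an assumption-laden illustration: top_n_stream is a made-up name, and the input is taken to be any iterable of comparable numbers (for a file, something like a generator yielding int(line) for each line).

```python
import heapq
from itertools import islice

def top_n_stream(n, stream):
    """Return the n largest values from any iterable, largest first."""
    if n <= 0:
        return []
    it = iter(stream)
    heap = list(islice(it, n))  # the first n items seed the heap
    heapq.heapify(heap)         # min-heap: heap[0] is the smallest kept value
    for k in it:
        if k > heap[0]:         # cheap test; most items fail it on random input
            heapq.heapreplace(heap, k)  # pop the minimum and push k in one step
    return sorted(heap, reverse=True)

# A generator stands in for a file here; nothing is held in memory except the heap.
print(top_n_stream(4, (x for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8])))  # [9, 8, 6, 5]
```

    Only n values are ever stored, so memory use does not depend on the length of the input.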

  • 2020-11-30 20:00

    Suppose mylist is a list of one hundred million numbers. We can sort the list and take the last hundred items from mylist.

    mylist.sort()

    mylist[-100:]

    Second way:

    import heapq

    heapq.nlargest(100, mylist)

  • 2020-11-30 20:04

    Heapify the array into a max-heap in O(n). Then pop the top 100 elements.
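
    One possible way to try this in Python, assuming numeric input: heapq only offers a min-heap, so the sketch below stores negated values to get max-heap behaviour. The function name top_n_heapify is hypothetical.

```python
import heapq

def top_n_heapify(n, numbers):
    """Heapify every value, then pop the n largest (returned largest first)."""
    # Negate values so that the min-heap's root is the largest original number.
    heap = [-x for x in numbers]
    heapq.heapify(heap)                # linear in the number of values
    count = min(n, len(heap))
    # Each pop costs O(log m) for a heap of m values.
    return [-heapq.heappop(heap) for _ in range(count)]

print(top_n_heapify(4, [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]))  # [9, 8, 6, 5]
```

    For m inputs the total cost is O(m + n log m), but unlike the size-100 min-heap this keeps all m values in memory at once.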

  • 2020-11-30 20:08

    There's no reason to sort the whole list. This should be doable in O(n) time. In pseudocode:

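    List top = new List
    
    for each num in entireList
        inserted = false
        for i = 0 to top.Length - 1
            // top is kept in descending order; insert before the first smaller item
            if num > top[i] then
                top.InsertBefore(num, i)
                // never keep more than 100 items: drop the smallest (last) one
                if top.Length > 100 then
                    top.Remove(top.Length - 1)
                end if
                inserted = true
                exit for
            end if
        next
        // num is not larger than anything kept so far; keep it only if there is room
        if not inserted and top.Length < 100 then
            top.Add(num)
        end if
    next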
    