Question:
A colleague of mine was asked the following question in an interview.
Given a huge array of unsigned ints, of length 100,000,000. Find an efficient way to count the occurrences of each distinct element in the array.
E.g. arr = {2,34,5,6,7,2,2,5,1,34,5}; output: count of 2 is 3, count of 34 is 2, and so on.
What is an efficient algorithm for this? My first thought was a dictionary/hash, but since the array is very large that seemed inefficient. Is there a better way?
Thanks, chota
Answer 1:
Heap sort is O(n log n) and in-place. In-place matters when dealing with large data sets. Once the array is sorted, you can make one pass through it, tallying occurrences of each value: because the array is sorted, once the value changes you know you've seen all occurrences of the previous value.
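A minimal Python sketch of this sort-then-scan idea, using `heapq` (heapify plus repeated pops) as a stand-in for an in-place heapsort; the function name is ours:

```python
import heapq

def count_occurrences_heap(arr):
    """Heapify the array in place, then pop values in sorted order,
    tallying runs of equal values as they come out."""
    heapq.heapify(arr)          # O(n), in place
    counts = []                 # list of (value, count) pairs
    prev, run = None, 0
    while arr:
        x = heapq.heappop(arr)  # O(log n) per pop
        if x == prev:
            run += 1
        else:
            if prev is not None:
                counts.append((prev, run))
            prev, run = x, 1
    if prev is not None:
        counts.append((prev, run))
    return counts

print(count_occurrences_heap([2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5]))
# → [(1, 1), (2, 3), (5, 3), (6, 1), (7, 1), (34, 2)]
```

Note that `heappop` consumes the input list; a classic heapsort would instead sift elements to the back of the same array.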
Answer 2:
Many other posters have suggested sorting the data and then counting runs of adjacent equal values, but no one has yet mentioned radix sort, which brings the runtime to O(n lg U) (where U is the maximum value in the array) instead of O(n lg n). Since lg U = O(lg n), assuming that integers take up one machine word, this approach is asymptotically at least as fast as heapsort.
Non-comparison sorts are always fun in interviews. :-)
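A sketch of the idea in Python: an LSD radix sort over the four bytes of a 32-bit unsigned int (four stable bucket passes), followed by a linear counting pass. The function name is ours, and a serious implementation would reuse preallocated buffers rather than rebuild bucket lists each pass:

```python
from itertools import groupby

def radix_sort_u32(arr):
    """LSD radix sort on unsigned 32-bit ints, one byte per pass:
    four stable distribution passes, O(4 * (n + 256)) total."""
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for x in arr:
            buckets[(x >> shift) & 0xFF].append(x)
        arr = [x for bucket in buckets for x in bucket]
    return arr

sorted_arr = radix_sort_u32([2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5])
counts = [(v, sum(1 for _ in g)) for v, g in groupby(sorted_arr)]
# counts → [(1, 1), (2, 3), (5, 3), (6, 1), (7, 1), (34, 2)]
```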
Answer 3:
Sort it, then scan it from the beginning to determine the count of each item.
This approach requires no additional storage and can be done in O(n log n) time (for the sort).
Answer 4:
If the range of the int values is limited, you can allocate an array that counts the occurrences of each possible value. Then you just iterate through the huge array and increment the counters.
foreach x in huge_array { counter[x]++; }
Thus you find the solution in linear time (O(n)), but at the expense of memory: if your ints span the whole range allowed by 32-bit ints, you would need to allocate an array of 4G counters, which is impractical...
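The loop above, as a runnable Python sketch; `max_value` is an assumed parameter bounding the range, which the caller must know:

```python
def count_with_table(arr, max_value):
    """Direct-address counting: one counter slot per possible value.
    Only sensible when max_value is small enough to fit in memory."""
    counter = [0] * (max_value + 1)
    for x in arr:            # the foreach loop from the answer
        counter[x] += 1
    return counter

counter = count_with_table([2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5], max_value=34)
# counter[2] == 3, counter[34] == 2, counter[5] == 3
```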
Answer 5:
How about using a Bloom filter implementation such as http://code.google.com/p/java-bloomfilter/: first check bloom.contains(element); if true, continue; if false, call bloom.add(element).
At the end, count the number of elements added. At 10 bits per element, a Bloom filter needs roughly 125 MB of memory to track 100,000,000 elements.
The problem is that Bloom filters can return false positives, which can only be reduced by increasing the number of bits per element. This could be mitigated by using two Bloom filters with different hash functions that would both have to agree.
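A toy illustration of the contains-then-add counting idea. This is our own minimal sketch, not the java-bloomfilter API; bit positions are derived from salted SHA-256 digests, and the sizes are chosen for the tiny example, not for 100,000,000 elements:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived
    from SHA-256 with integer salts. No false negatives; false
    positives possible."""
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.sha256(b"%d:%d" % (salt, item)).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter(m_bits=1024, k_hashes=4)
distinct = 0
for x in [2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5]:
    if not bloom.contains(x):   # may skip a value on a false positive
        bloom.add(x)
        distinct += 1
# distinct approximates the number of unique values
# (6 here, unless a false positive undercounts)
```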
Answer 6:
Hashing in this case is not inefficient. The cost is approximately O(N): O(N) for iterating over the array and ~O(N) for iterating over the hashtable. Since you need O(N) just to examine each element, the complexity is fine.
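The hash-table approach in Python, here using `collections.Counter`; one pass over the array, expected O(N) overall:

```python
from collections import Counter

def count_with_hash(arr):
    """Single expected-O(n) pass building a hash table of
    value -> occurrence count."""
    counts = Counter()
    for x in arr:
        counts[x] += 1
    return counts

counts = count_with_hash([2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5])
# counts[2] == 3, counts[34] == 2
```

The memory cost is proportional to the number of distinct values, not to the full 32-bit range, which is what makes this practical where the counting-array approach is not.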
Answer 7:
Sorting is a good idea; however, the type of sort depends on the range of possible values. For a small range, counting sort would be good. When dealing with such a big array, it would also be efficient to use multiple cores; radix sort might be a good fit.