Counting unique elements in a large array

Submitted anonymously (unverified) on 2019-12-03 02:47:02

Question:

One of my colleagues was asked the following question in an interview.

Given a huge array which stores unsigned ints. The length of the array is 100,000,000. Find an efficient way to count the occurrences of each unique element present in the array.
E.g. arr = {2,34,5,6,7,2,2,5,1,34,5}; output: count of 2 is 3, count of 34 is 2, and so on.

What is an efficient algorithm for this? At first I thought a dictionary/hash would be one option, but since the array is very large that seems inefficient. Is there a better way to do this?

Thanks, chota

Answer 1:

Heap sort is O(n log n) and in-place. In-place is necessary when dealing with large data sets. Once sorted you can make one pass through the array tallying occurrences of each value. Because the array is sorted, once a value changes you know you've seen all occurrences of the previous value.
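A minimal Java sketch of this sort-then-scan idea. (`Arrays.sort` on primitives is a dual-pivot quicksort rather than a heap sort, but it is likewise in-place and typically O(n log n); the tallying pass is the same either way.)

```java
import java.util.Arrays;

public class SortAndTally {
    // Sort in place, then count runs of equal values in one linear pass.
    static int countDistinct(int[] arr) {
        Arrays.sort(arr);
        int distinct = 0;
        for (int i = 0; i < arr.length; i++) {
            if (i == 0 || arr[i] != arr[i - 1]) {
                distinct++;                    // a new value starts here
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        int[] arr = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};
        Arrays.sort(arr);
        int run = 1;
        // Walk the sorted runs and report the count of each value.
        for (int i = 1; i <= arr.length; i++) {
            if (i < arr.length && arr[i] == arr[i - 1]) {
                run++;
            } else {
                System.out.println("Count of " + arr[i - 1] + " is " + run);
                run = 1;
            }
        }
    }
}
```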



Answer 2:

Many other posters have suggested sorting the data and then counting adjacent equal values, but no one has yet mentioned radix sort, which gets the runtime to O(n lg U) (where U is the maximum value in the array) instead of O(n lg n). Since lg U = O(lg n) when integers fit in one machine word, this approach is asymptotically at least as fast as heapsort, and often faster in practice.

Non-comparison sorts are always fun in interviews. :-)
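One possible LSD radix sort sketch (my own formulation, not from the answer): four counting passes over 8-bit digits, treating the values as unsigned 32-bit ints via `>>>`.

```java
public class RadixSort {
    // LSD radix sort on 8-bit digits: four O(n) counting passes instead of
    // O(n log n) comparisons. Values are treated as unsigned 32-bit ints.
    static void sort(int[] arr) {
        int n = arr.length;
        int[] buf = new int[n];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : arr) {
                count[((v >>> shift) & 0xFF) + 1]++;   // histogram of this digit
            }
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i];              // prefix sums -> start positions
            }
            for (int v : arr) {
                buf[count[(v >>> shift) & 0xFF]++] = v; // stable scatter into buffer
            }
            System.arraycopy(buf, 0, arr, 0, n);
        }
    }
}
```

Note this trades the heapsort answer's in-place property for an O(n) scratch buffer.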



Answer 3:

Sort it, then scan it from the beginning to determine the counts for each item.

This approach requires no additional storage, and can be done in O(n log n) time (for the sort).



Answer 4:

If the range of the int values is limited, you can allocate an array that counts the occurrences of each possible value. Then you just iterate through your huge array and increment the counters.

for (int x : hugeArray) {
    counter[x]++;   // counter must be sized to cover the full value range
}

Thus you find the solution in linear time (O(n)), but at the expense of memory consumption. That is, if your ints span the whole range allowed by 32-bit ints, you would need to allocate an array of 2^32 counters (16 GB at 4 bytes each), which is impractical...
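As a complete sketch, assuming the values are known to lie in [0, maxValue] (the `maxValue` parameter is my addition; the answer just assumes a limited range):

```java
public class RangeTally {
    // Direct tally, assuming values lie in [0, maxValue].
    // Memory is O(maxValue), so this only pays off for a limited range.
    static int countDistinct(int[] arr, int maxValue) {
        int[] counter = new int[maxValue + 1];
        for (int x : arr) {
            counter[x]++;
        }
        int distinct = 0;
        for (int c : counter) {
            if (c > 0) distinct++;   // each nonzero slot is one unique value
        }
        return distinct;
    }
}
```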



Answer 5:

How about using a Bloom filter implementation such as http://code.google.com/p/java-bloomfilter/: for each element, check bloom.contains(element); if true, continue; if false, call bloom.add(element).

At the end, count the number of elements added. A Bloom filter needs approximately 125 MB of memory to store 100,000,000 elements at 10 bits per element.

The problem is that false positives are possible in Bloom filters, and they can only be reduced, not eliminated, by increasing the number of bits per element. This could be mitigated by using two Bloom filters with different hash functions that must both agree.
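A toy version of the idea (not the linked java-bloomfilter library): a two-hash filter over a `BitSet`. The hash choices here are arbitrary illustrations, and because of false positives the result is an approximate count that can only undercount.

```java
import java.util.BitSet;

public class BloomCounter {
    // Minimal Bloom-filter sketch: k = 2 hash functions over an mBits-bit set.
    // False positives make the returned count approximate (never too high... of
    // the true distinct count it can only fall short).
    static int approxDistinct(int[] arr, int mBits) {
        BitSet bits = new BitSet(mBits);
        int distinct = 0;
        for (int x : arr) {
            // Hypothetical hash functions chosen for illustration only.
            int h1 = Math.floorMod(x * 0x9E3779B1, mBits);
            int h2 = Math.floorMod(Integer.reverse(x) ^ x, mBits);
            if (!(bits.get(h1) && bits.get(h2))) {
                distinct++;          // definitely not seen before
                bits.set(h1);
                bits.set(h2);
            }
        }
        return distinct;
    }
}
```

With a generously sized filter relative to the input, the approximation is usually exact on small inputs.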



Answer 6:

Hashing in this case is not inefficient. The cost will be approximately O(N): O(N) for iterating over the array, and roughly O(N) total for the hash-table operations. Since you need at least O(N) just to examine each element, the complexity is good.
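For instance, a one-pass tally with `HashMap` (expected O(n) time, O(distinct) extra space):

```java
import java.util.HashMap;
import java.util.Map;

public class HashTally {
    // Single pass: merge increments the count, inserting 1 on first sight.
    static Map<Integer, Integer> counts(int[] arr) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : arr) {
            counts.merge(x, 1, Integer::sum);
        }
        return counts;
    }
}
```

The map's size is the number of unique elements, and each entry is a per-value count. The memory concern from the question remains: boxed `Integer` keys and values make this far heavier per element than the counting-array approach.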



Answer 7:

Sorting is a good idea; however, the right kind of sort depends on the range of possible values. For a small range, counting sort would be good. When dealing with such a big array, it would also pay to use multiple cores — radix sort parallelizes well.
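As a quick multi-core illustration (a generic parallel distinct count via streams, not the parallel radix sort the answer suggests):

```java
import java.util.Arrays;

public class ParallelDistinct {
    // Parallel stream pipeline: splits the array across cores,
    // deduplicates, and counts the surviving values.
    static long countDistinct(int[] arr) {
        return Arrays.stream(arr).parallel().distinct().count();
    }
}
```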


