Intersection algorithm for two small, unsorted arrays

Question

I'm looking for an algorithm to intersect two small, unsorted arrays under very specific conditions:

The array items are integers or an integer-like type.
A significant fraction of the time (roughly 30-40%), one or both arrays are empty.
The arrays are usually very small: typically 1-3 items, and I don't expect more than 10.
The intersection function will be called very frequently.
Platform-dependent solutions are fine; I'm working on x86/Windows/C++.

Neither the brute-force nor the sort-and-intersect solution is bad, but I don't think either is fast enough. Is there a more efficient solution?

Answer 1:

Since the arrays hold primitive types and are short enough to fit in a cache line or two, a fast implementation should focus on the low-level mechanics of the comparisons rather than on big-O complexity. For example, avoid hash tables: they generally involve hashing and indirection, and they always carry a lot of management overhead.
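To make the point about comparison mechanics concrete, here is a minimal sketch of the plain nested-loop approach with an early exit for empty inputs. The function name `intersect_small` and its pointer-plus-length interface are illustrative assumptions, not something from the question, and the sketch assumes neither array contains duplicates.

```cpp
#include <cstddef>

// Hypothetical helper: copies every element of a that also occurs in b into
// out and returns how many were written. out must have room for na elements.
// Assumes neither input contains duplicate values.
std::size_t intersect_small(const int* a, std::size_t na,
                            const int* b, std::size_t nb,
                            int* out) {
    if (na == 0 || nb == 0) return 0;    // cheap exit for the common empty case
    std::size_t count = 0;
    for (std::size_t i = 0; i < na; ++i) {
        for (std::size_t j = 0; j < nb; ++j) {
            if (a[i] == b[j]) {
                out[count++] = a[i];
                break;                   // a[i] matched; move on to the next element
            }
        }
    }
    return count;
}
```

With at most about 10 items per array this is at most around 100 comparisons over contiguous memory, with no allocation and no setup cost. Whether it beats the alternatives on real data is something only a benchmark can settle.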

If the two arrays are sorted, the intersection is O(n + m). You group sort-then-intersect with brute force, but for sorted input you can't do better than that linear merge.

If the arrays are kept sorted in storage, you gain even more, since you say the intersection is called often.
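If the arrays are already stored in ascending order, the linear merge is available directly as `std::set_intersection` from the standard library. The wrapper below is a sketch; the name `intersect_sorted` and the raw-pointer interface are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical wrapper: both inputs must already be sorted in ascending order.
// Writes the common elements to out (room for min(na, nb) items is enough)
// and returns how many were written.
std::size_t intersect_sorted(const int* a, std::size_t na,
                             const int* b, std::size_t nb,
                             int* out) {
    int* end = std::set_intersection(a, a + na, b, b + nb, out);
    return static_cast<std::size_t>(end - out);
}
```

Writing into a caller-provided buffer avoids a heap allocation per call, which probably matters more than the algorithm itself at these sizes.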

The intersection itself can be done with SSE.

Answer 2:

Here's a potential optimization: check whether every element of both arrays lies in the range [0, 32) (or [0, 64), or maybe even [0, 16)). If so, fill two bitmaps of that width (of type uint32_t, etc.) and intersect them with a bitwise AND, &. If not, fall back to sorting.
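A minimal sketch of the bitmap idea for the 32-bit case follows. The helper names `to_bitmap32` and `intersect_bitmap32` are assumptions for illustration; the result is returned as a mask in which bit v is set when v occurs in both arrays.

```cpp
#include <cstddef>
#include <cstdint>

// Builds a mask with bit v set for each value v in the array.
// Returns false if any value falls outside [0, 32); the caller should then
// fall back to another method.
bool to_bitmap32(const int* a, std::size_t n, std::uint32_t& mask) {
    mask = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] < 0 || a[i] >= 32) return false;
        mask |= std::uint32_t{1} << a[i];
    }
    return true;
}

// Sets common to the intersection mask. Returns false when the bitmap
// representation does not apply to one of the inputs.
bool intersect_bitmap32(const int* a, std::size_t na,
                        const int* b, std::size_t nb,
                        std::uint32_t& common) {
    std::uint32_t ma = 0, mb = 0;
    if (!to_bitmap32(a, na, ma) || !to_bitmap32(b, nb, mb)) return false;
    common = ma & mb;   // one AND intersects the two sets
    return true;
}
```

If the arrays can be stored as masks in the first place, the conversion disappears and each call reduces to the single AND.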

Or, instead of sorting, use the highly efficient integer set representation due to Briggs and Torczon, which allows O(m + n) construction and an O(min(m, n)) intersection. That should be much faster than a hash table, with better bounds than sorting. Note that it needs the values to come from a bounded range, because it indexes an array by value.
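The Briggs and Torczon representation pairs a packed `dense` array of members with a `sparse` array indexed by value. The sketch below is one possible rendering under stated assumptions: the class name `SparseSet`, the `intersect` helper, and the fixed universe size are all illustrative. The original technique leaves the arrays uninitialized; this version zero-initializes them once and reuses the object, since reading uninitialized memory is undefined behavior in C++.

```cpp
#include <cstddef>
#include <vector>

// Sparse set over the value range [0, universe).
class SparseSet {
public:
    explicit SparseSet(std::size_t universe)
        : dense_(universe), sparse_(universe), size_(0) {}

    void clear() { size_ = 0; }   // O(1): stale entries are rejected by contains()

    bool contains(std::size_t v) const {
        return v < sparse_.size() && sparse_[v] < size_ && dense_[sparse_[v]] == v;
    }

    // Values outside the universe are ignored in this sketch.
    void insert(std::size_t v) {
        if (v >= sparse_.size() || contains(v)) return;
        dense_[size_] = v;
        sparse_[v] = size_;
        ++size_;
    }

    std::size_t size() const { return size_; }
    std::size_t operator[](std::size_t i) const { return dense_[i]; }

private:
    std::vector<std::size_t> dense_;    // members, packed in insertion order
    std::vector<std::size_t> sparse_;   // sparse_[v] = position of v in dense_
    std::size_t size_;
};

// Walks the set with fewer members and probes the other: O(min(m, n)).
// out must be a different object from a and b.
void intersect(const SparseSet& a, const SparseSet& b, SparseSet& out) {
    const SparseSet& fewer = a.size() <= b.size() ? a : b;
    const SparseSet& more = a.size() <= b.size() ? b : a;
    out.clear();
    for (std::size_t i = 0; i < fewer.size(); ++i)
        if (more.contains(fewer[i])) out.insert(fewer[i]);
}
```

The memory cost is proportional to the universe size, so this only pays off when the integers are known to fall in a modest range and the set objects are reused across calls.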

Answer 3:

To determine the intersection of two sets you have to inspect every element at least once, so the best achievable class of solutions is O(n + m), where n is the number of elements in one set and m the number in the other.

You can achieve that by using a hash table. Given that your items are of type integer, you can count on finding a fast hash function. A simple algorithm would be:

Iterate over the first set and add all of its elements to a hash table.
Iterate over the second set and, for each element, check whether it exists in the hash table; if so, add it to the intersection set or just print it.

This would be O(n + m) assuming your hashing and your hash lookup are O(1).

Given that the sets are frequently empty, you can optimize this by first checking whether either set is empty and, if so, returning an empty set immediately. That assumes, of course, that you know the counts up front and don't have to iterate the sets to compute them. If that is the case, you can optimize further by always reading and hashing the smaller set first, so that the hash table's memory usage is bounded by the smaller of the two.
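The steps above, together with the empty check and the hash-the-smaller-set refinement, can be sketched with `std::unordered_set`. The function name `intersect_hashed` and the use of `std::vector` for input and output are assumptions for illustration.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the values present in both a and b, in the
// order they appear in the longer input.
std::vector<int> intersect_hashed(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::vector<int> result;
    if (a.empty() || b.empty()) return result;   // early exit, no hashing at all

    // Hash the shorter input so the table stays as small as possible.
    const std::vector<int>& fewer = a.size() <= b.size() ? a : b;
    const std::vector<int>& more = a.size() <= b.size() ? b : a;
    std::unordered_set<int> seen(fewer.begin(), fewer.end());

    for (int v : more) {
        // erase() returns the number of elements removed, so each common
        // value is reported once even if it repeats in the longer input.
        if (seen.erase(v) > 0) result.push_back(v);
    }
    return result;
}
```

The table and the result vector both allocate on the heap, so for arrays of 1 to 3 items the constant factors may well outweigh the O(n + m) bound.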

Answer 4:

Since your arrays are quite small, insertion sort will likely be the fastest way to sort them; common implementations of std::sort (libstdc++, for example) also switch to insertion sort for ranges of about 16 elements or fewer. You can then walk both sorted arrays with a pair of iterators, comparing as you go, to produce the intersection.
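A minimal sketch of that approach, assuming it is acceptable to sort the inputs in place (the names `insertion_sort` and `sort_and_intersect` are illustrative):

```cpp
#include <cstddef>

// Sorts a[0..n) in place in ascending order.
void insertion_sort(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {   // shift larger items one slot right
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

// Sorts both inputs in place, then merges them with two cursors.
// out needs room for min(na, nb) items; returns the number written.
std::size_t sort_and_intersect(int* a, std::size_t na,
                               int* b, std::size_t nb,
                               int* out) {
    if (na == 0 || nb == 0) return 0;       // skip the sorts for empty inputs
    insertion_sort(a, na);
    insertion_sort(b, nb);
    std::size_t i = 0, j = 0, count = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {                             // equal: part of the intersection
            out[count++] = a[i];
            ++i;
            ++j;
        }
    }
    return count;
}
```

If the caller must keep the original order, the arrays would need to be copied into small stack buffers first.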

There may be other algorithms that would run faster in principle, but their overhead will probably be too large for 3-4 items per array.

Source: https://stackoverflow.com/questions/14702155/intersection-algorithm-for-two-unsorted-small-array

Tags

arrays

algorithm

set

intersection