Binary Search and Hashtable Search

Question


I wanted to find the trade-off point between a Dictionary lookup and a binary search over a sorted array. I was expecting constant-time lookups for the Dictionary and logarithmic-time lookups for the binary search, with the binary search performing better for smaller collections.

However, I was surprised when I saw the following results:

I was surprised at:

1. Binary search grows logarithmically at first, and then grows much faster.
2. The hash lookup is pretty consistent at first, but then also starts growing slowly.
3. Binary search is never better than the hash lookup.

Below is my code. What did I do wrong?

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        var r = new Random();
        var targets = Enumerable.Range(0, 1000 * 1000).Select(_ => r.Next(int.MaxValue)).ToList();

        for (int totalCount = 1; totalCount < 1000*1000*10; totalCount*=2)
        {
            var a = Enumerable.Range(0, totalCount).Select(_ => r.Next(int.MaxValue)).Distinct().Select(v => new thing(v)).OrderBy(t => t.value).ToArray();
            var d = a.ToDictionary(t => t.value);

            var watch = new System.Diagnostics.Stopwatch();

            {
                watch.Start();
                var found = targets.Select(t => BinarySearch(t, a)).Where(t => t != null).Count();
                watch.Stop();
                Console.WriteLine(string.Format("found {0} things out of {2} in {1} ms with binary search", found, watch.ElapsedMilliseconds, a.Length));
            }
            {
                watch.Restart();
                var found =  targets.Select(t => HashSearch(t, d)).Where(t => t != null).Count();
                watch.Stop();
                Console.WriteLine(string.Format("found {0} things out of {2} in {1} ms with hash search", found, watch.ElapsedMilliseconds, d.Keys.Count));
            }
        }
        Console.ReadLine();
    }

    static thing HashSearch(int needle, Dictionary<int, thing> hash)
    {
        if (!hash.ContainsKey(needle))
            return null;
        return hash[needle];
    }

    static thing BinarySearch(int needle, thing[] sortedHaystack)
    {
        return BinarySearch(needle, sortedHaystack, 0, sortedHaystack.Length - 1);
    }
    static thing BinarySearch(int needle, thing[] sortedHaystack, int minimum, int maximum)
    {
        if (minimum > maximum)
            return null;
        var middle = (minimum + maximum) / 2;
        if (needle == sortedHaystack[middle].value)
            return sortedHaystack[middle];
        if (needle < sortedHaystack[middle].value)
            return BinarySearch(needle, sortedHaystack, minimum, middle - 1);
        return BinarySearch(needle, sortedHaystack, middle + 1, maximum);
    }

    class thing
    {
        public int value;
        public thing(int v)
        {
            value = v;
        }
    }
}

Answer 1:


(Pretty much as noted in comments.)

I suspect you're mostly seeing the effects of cache misses. When the collection is large, you'll get a lot of cache misses - particularly with the binary search, which potentially needs to touch a lot of points in the collection to find an element.

At the low sizes, I suspect you're seeing cache misses too, but this time on your targets list - and also the overhead of LINQ itself. LINQ is fast, but its overhead can still be significant when all you're doing in the middle of it is a single search of a tiny collection.

I'd suggest rewriting your loops to something like:

{
    // Use the same seed each time for consistency. Doesn't have to be 0.
    Random random = new Random(0);
    watch.Start();
    int found = 0;
    for (int i = 0; i < 1000 * 1000; i++)
    {
        if (BinarySearch(random.Next(int.MaxValue), a) != null)
        {
            found++;
        }
    }
    watch.Stop();
    Console.WriteLine(string.Format(
         "found {0} things out of {2} in {1} ms with binary search",
         found, watch.ElapsedMilliseconds, a.Length));
}

Of course you've then got the problem of including random number generation in the loop instead... you might want to look at using a random number generator which is faster than System.Random if you can find one. Or use some other way of determining which elements to look up.
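As a minimal sketch of that second option, you could pre-generate the lookup values outside the timed region so the loop only measures the search itself (the lookups array name is just for illustration; a and watch are the sorted array and stopwatch from the code above):

// Sketch: generate the values to search for before starting the stopwatch,
// so random number generation isn't included in the measured time.
int[] lookups = new int[1000 * 1000];
var random = new Random(0); // fixed seed for repeatable runs
for (int i = 0; i < lookups.Length; i++)
{
    lookups[i] = random.Next(int.MaxValue);
}

watch.Restart();
int found = 0;
for (int i = 0; i < lookups.Length; i++)
{
    if (BinarySearch(lookups[i], a) != null)
    {
        found++;
    }
}
watch.Stop();

Reading sequentially from a plain int[] is cheap and cache-friendly, so it shouldn't distort the measurement the way generating random numbers inside the loop would.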

Oh, and I'd personally rewrite the binary search to use iteration rather than recursion, but that's a different matter. I wouldn't expect it to have a significant effect.
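For what it's worth, an iterative version might look roughly like this (a sketch only; it assumes the same thing type and sorted array as the recursive version above, and BinarySearchIterative is just an illustrative name):

// Sketch of an iterative equivalent of the recursive BinarySearch above.
static thing BinarySearchIterative(int needle, thing[] sortedHaystack)
{
    int minimum = 0;
    int maximum = sortedHaystack.Length - 1;
    while (minimum <= maximum)
    {
        // Written this way to avoid overflow for very large arrays.
        int middle = minimum + (maximum - minimum) / 2;
        if (needle == sortedHaystack[middle].value)
            return sortedHaystack[middle];
        if (needle < sortedHaystack[middle].value)
            maximum = middle - 1;
        else
            minimum = middle + 1;
    }
    return null; // not found
}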



Source: https://stackoverflow.com/questions/19374092/binary-search-and-hashtable-search
