What's the best hashing algorithm to use on a stl string when using hash_map?


I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high-performance lookups. What are some good examples of fast and efficient hashing algorithms?

11 Answers
  • 2020-12-04 11:18

    I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.

    unsigned int
    hash(
        const char* s,
        unsigned int seed = 0)
    {
        unsigned int hash = seed;
        while (*s)
        {
            hash = hash * 101  +  *s++;
        }
        return hash;
    }
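
    For context, here is a minimal sketch (not from the original answer) of how a function like this could be plugged into a map keyed by std::string; the functor name LarsonHash and the use of std::unordered_map in place of the old hash_map are assumptions for illustration.

    #include <string>
    #include <unordered_map>

    // Hypothetical functor wrapping the multiply-by-101-and-add loop above
    // so it can be used as the Hash parameter of an associative container.
    struct LarsonHash
    {
        size_t operator()(const std::string& s) const
        {
            unsigned int h = 0;
            for (unsigned char c : s)
                h = h * 101 + c;
            return h;
        }
    };

    // std::unordered_map plays the role hash_map did on VS2005.
    std::unordered_map<std::string, int, LarsonHash> counts;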
    
  • 2020-12-04 11:22

    I did a little searching and, funny thing, Paul Larson's little algorithm shows up at http://www.strchr.com/hash_functions as having the fewest collisions of anything tested under a number of conditions, and it's very fast for a hash that isn't unrolled or table-driven.

    Larson's is the simple multiply-by-101-and-add loop shown above.

  • 2020-12-04 11:25

    One classic suggestion for a string hash is to step through the characters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (letting the hash value overflow):

      #include <string>
      using std::string;

      // Primary template, specialized below for std::string.
      // (hash_map itself is non-standard; its header and namespace vary by compiler.)
      template <typename T> struct myhash {};

      template <> struct myhash<string>
      {
          size_t operator()(const string &to_hash) const
          {
              const char *in = to_hash.c_str();
              size_t out = 0;
              while ('\0' != *in)   // stop at the terminating NUL
              {
                  out *= 53;        // just a prime number
                  out += *in;
                  ++in;
              }
              return out;
          }
      };

      hash_map<string, int, myhash<string> > my_hash_map;
    

    It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can do better by hashing just those characters, as in the sketch below.
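
    A minimal sketch of that idea (not from the answer itself): hash only a short prefix and suffix of the string; the 8-character bound and the name partial_hash are arbitrary assumptions.

    #include <cstddef>
    #include <string>

    // Hypothetical sketch: hash only the first and last few characters,
    // assuming those are enough to tell the keys apart.
    inline size_t partial_hash(const std::string &s)
    {
        const size_t n = s.size() < 8 ? s.size() : 8;  // arbitrary bound
        size_t out = n;                                // mix in the length too
        for (size_t i = 0; i < n; ++i)
        {
            out = out * 53 + static_cast<unsigned char>(s[i]);                  // prefix
            out = out * 53 + static_cast<unsigned char>(s[s.size() - 1 - i]);   // suffix
        }
        return out;
    }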

    If the hash gets computed too often, you might also cache it by creating a subclass of basic_string (or a small wrapper) that remembers its own hash value. hash_map should be doing that internally, though.
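
    A rough sketch of that caching idea (the HashedString wrapper is an assumption, and it wraps std::string rather than subclassing basic_string for brevity):

    #include <cstddef>
    #include <string>

    // Hypothetical wrapper that computes its hash once and remembers it.
    struct HashedString
    {
        std::string value;
        size_t cached_hash;

        explicit HashedString(const std::string &s)
            : value(s), cached_hash(compute(s)) {}

        static size_t compute(const std::string &s)
        {
            size_t out = 0;
            for (unsigned char c : s)
                out = out * 53 + c;   // same multiply-by-prime loop as above
            return out;
        }
    };

    // The hash functor for the wrapper just returns the stored value.
    struct HashedStringHash
    {
        size_t operator()(const HashedString &h) const { return h.cached_hash; }
    };

    An equality functor comparing the stored strings would also be needed to use this as a map key.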

  • 2020-12-04 11:29

    That always depends on your data set.

    I, for one, have had surprisingly good results using the CRC32 of the string. It works very well across a wide range of input sets.

    Lots of good CRC32 implementations are easy to find on the net.
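
    Purely as an illustration (not any particular implementation the answer had in mind), a bitwise CRC-32 over the reflected polynomial 0xEDB88320 looks roughly like this; a table-driven version would be faster:

    #include <cstdint>
    #include <string>

    // Minimal bitwise CRC-32 (reflected polynomial 0xEDB88320), shown as a
    // sketch only; production code usually uses a precomputed lookup table.
    inline uint32_t crc32_hash(const std::string &s)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (unsigned char c : s)
        {
            crc ^= c;
            for (int bit = 0; bit < 8; ++bit)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }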

    Edit: Almost forgot: this page has a nice hash-function shootout with performance numbers and test data:

    http://smallcode.weblogs.us/ (further down the page)

  • 2020-12-04 11:33

    Python 3.4 includes a new hash algorithm based on SipHash. PEP 456 is very informative.

  • 2020-12-04 11:35

    I've used the Jenkins hash to write a Bloom filter library; it has great performance.

    Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c

    This is what Perl uses for its hashing operation, fwiw.
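
    If you compile lookup3.c into your project, a thin functor around it might look like the sketch below; the hashlittle declaration mirrors that file, and the seed value here is an arbitrary assumption.

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // Declaration matching hashlittle() in lookup3.c (compile and link that file).
    extern "C" std::uint32_t hashlittle(const void *key, std::size_t length, std::uint32_t initval);

    // Thin wrapper so the Jenkins hash can be used as a string hash functor.
    struct JenkinsHash
    {
        std::size_t operator()(const std::string &s) const
        {
            return hashlittle(s.data(), s.size(), 0xDEADBEEF);  // arbitrary seed
        }
    };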
