I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high-performance lookups. What are some good examples of fast and efficient has
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
unsigned int
hash(
    const char* s,
    unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101 + *s++;
    }
    return hash;
}
I did a little searching, and funny thing, Paul Larson's little algorithm showed up here http://www.strchr.com/hash_functions as having the fewest collisions of any function tested in a number of conditions, and it's very fast for one that isn't unrolled or table-driven.
Larson's being the simple multiply by 101 and add loop above.
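For anyone who wants to try it, here is a minimal sketch of that loop plus a way to turn the result into a bucket index. The names larson_hash and bucket_for are made up for illustration, and the cast to unsigned char is my own addition (so bytes above 127 hash the same whether or not plain char is signed); it is not part of the code as posted.

```cpp
#include <cstddef>

// Multiply-by-101-and-add loop. The unsigned char cast is an assumption added
// here so that high-bit characters do not depend on the signedness of char.
unsigned int larson_hash(const char* s, unsigned int seed = 0)
{
    unsigned int h = seed;
    while (*s)
    {
        h = h * 101 + static_cast<unsigned char>(*s++);
    }
    return h;
}

// Hypothetical helper: reduce the hash to a bucket index.
// bucket_count must be non-zero.
std::size_t bucket_for(const char* key, std::size_t bucket_count)
{
    return larson_hash(key) % bucket_count;
}
```

With a power-of-two bucket count the modulo only looks at the low bits of the hash, so it may be worth checking the distribution on your own keys before settling on a table size.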
One classic suggestion for a string hash is to step through the letters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (allowing the hash value to overflow).
template <class T> struct myhash {};
template <> struct myhash<string>
{
    size_t operator()(const string &to_hash) const
    {
        const char * in = to_hash.c_str();
        size_t out = 0;
        while ('\0' != *in)
        {
            out *= 53; // just a prime number
            out += *in;
            ++in;
        }
        return out;
    }
};
hash_map<string, int, myhash<string> > my_hash_map;
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster.
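As a rough sketch of the "only a few characters" idea, assuming your keys really do differ within their first few characters: hash at most max_chars leading characters and seed with the length. The name prefix_hash, the multiplier 101, and the length seeding are all my own illustrative choices.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: hash at most max_chars leading characters, seeded
// with the length so strings of different lengths still tend to differ.
std::size_t prefix_hash(const std::string& s, std::size_t max_chars)
{
    std::size_t n = s.size() < max_chars ? s.size() : max_chars;
    std::size_t h = s.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        h = h * 101 + static_cast<unsigned char>(s[i]);
    }
    return h;
}
```

The trade-off is deliberate: keys of the same length that share the first max_chars characters always collide, so this only pays off when the data set makes that rare.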
If the hash value gets calculated too often, you might try caching it by creating a new subclass of basic_string that remembers its hash value. hash_map should be doing that internally, though.
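A minimal sketch of the caching idea, with one swap: it wraps a std::string by composition instead of subclassing basic_string, since the standard string has no virtual destructor. HashedString and its members are hypothetical names, and the hash is the multiply-by-53 loop from above with an unsigned char cast added.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper: stores a string together with its hash, computed once
// at construction time.
class HashedString
{
public:
    explicit HashedString(const std::string& s)
        : text_(s), hash_(compute(s)) {}

    const std::string& str() const { return text_; }
    std::size_t hash() const { return hash_; }

    // Compare the cached hashes first; only equal hashes need a full compare.
    bool operator==(const HashedString& other) const
    {
        return hash_ == other.hash_ && text_ == other.text_;
    }

private:
    static std::size_t compute(const std::string& s)
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
        {
            h = h * 53 + static_cast<unsigned char>(s[i]);
        }
        return h;
    }

    std::string text_;
    std::size_t hash_;
};
```

The string is immutable after construction here, which keeps the cached value from going stale; a mutable version would have to recompute or invalidate the hash on every modification.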
That always depends on your data-set.
I for one had surprisingly good results using the CRC32 of the string. It works very well across a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
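For reference, here is a minimal bitwise sketch of the common CRC-32 variant (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). The name crc32_bitwise is made up, and this is the slow one-bit-at-a-time form for clarity; the implementations you find on the net are usually table-driven and much faster.

```cpp
#include <cstddef>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), one bit at a time.
// Assumes unsigned int is at least 32 bits wide.
unsigned int crc32_bitwise(const char* data, std::size_t length)
{
    unsigned int crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < length; ++i)
    {
        crc ^= static_cast<unsigned char>(data[i]);
        for (int bit = 0; bit < 8; ++bit)
        {
            // If the low bit is set, shift and fold in the polynomial.
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The usual check value for this variant is 0xCBF43926 for the nine-character input "123456789", which is a quick way to confirm that any implementation you pick up matches.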
Edit: Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:
http://smallcode.weblogs.us/ <-- further down the page.
Python 3.4 includes a new hash algorithm based on SipHash. PEP 456 is very informative.
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
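lookup3.c is too long to reproduce here. To give a feel for the style, below is a sketch of Bob Jenkins's much smaller one-at-a-time hash. Note that this is a different, simpler function than lookup3, not a substitute for the linked code; the function name and signature are my own, and the code assumes a 32-bit unsigned int.

```cpp
#include <cstddef>

// Jenkins one-at-a-time hash (NOT lookup3). Assumes unsigned int is 32 bits,
// so the additions and shifts wrap modulo 2^32.
unsigned int one_at_a_time(const char* key, std::size_t length)
{
    unsigned int h = 0;
    for (std::size_t i = 0; i < length; ++i)
    {
        // Mix in one byte per iteration.
        h += static_cast<unsigned char>(key[i]);
        h += h << 10;
        h ^= h >> 6;
    }
    // Final avalanche so the last bytes affect all output bits.
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}
```

For a Bloom filter you would typically need several independent hash values per key; how to derive them (different seeds, or splitting a wider hash) is a separate design choice not shown here.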