Can I use GetHashCode() for all string compares?

Question

I want to cache some search results based on the object to search and some search settings.

However, this creates quite a long cache key, so I thought I'd create a shortcut for it using GetHashCode().

So I was wondering: does GetHashCode() always generate a different number for different strings, even when the strings are very long or differ only by something like 'ä' instead of 'a'?

I tried some strings and the answer seemed to be yes, but without understanding how GetHashCode() behaves I can't be confident that I'm right.

And because this is one of those things that will pop up when you're not prepared (the client is looking at cached results for the wrong search), I want to be sure...

EDIT: if MD5 would work, I can of course change my code not to use GetHashCode(); the goal is to get a shorter string than the original (> 1000 chars).

Answer 1:

You CANNOT count on `GetHashCode()` being unique.

There is an excellent article investigating the likelihood of collisions at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . Its findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations."
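To get a feel for how quickly collisions show up, here is a minimal sketch of a brute-force search. It is not the method used in the linked article; `CollisionFinder` and `FindCollision` are hypothetical names, and the candidate strings ("0", "1", "2", ...) are an arbitrary choice. On .NET Core and later, string hash codes are randomized per process, so the pair it finds will differ from run to run.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class CollisionFinder
{
    // Hashes the strings "0", "1", "2", ... until two different strings
    // share a hash code, then returns that pair.
    public static (string First, string Second) FindCollision()
    {
        // Maps each hash code seen so far to the string that produced it.
        var seen = new Dictionary<int, string>();
        for (int i = 0; ; i++)
        {
            string candidate = i.ToString();
            int hash = candidate.GetHashCode();
            if (seen.TryGetValue(hash, out string earlier))
            {
                // Same hash code, different content: a collision.
                return (earlier, candidate);
            }
            seen[hash] = candidate;
        }
    }

    public static void Main()
    {
        var (first, second) = FindCollision();
        Console.WriteLine(
            $"\"{first}\" and \"{second}\" share hash code {first.GetHashCode()}");
    }
}
```

With a 32-bit hash, the birthday paradox suggests a collision should typically appear after some tens of thousands of strings, which is the same order of magnitude as the figures quoted above.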

So that you can understand the contract for GetHashCode implementations, the following is an excerpt from MSDN documentation for Object.GetHashCode():

A hash function must have the following properties:

If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
For the best performance, a hash function must generate a random distribution for all input.

Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .

Answer 2:

Logically, GetHashCode cannot be unique, since there are only 2^32 ints and an infinite number of strings (see the pigeonhole principle).

As @Henk pointed out in the comments, even though there are an infinite number of strings, there are only a finite number of System.Strings. However, the pigeonhole principle still applies, as the latter number is much larger than int.MaxValue.

Answer 3:

If one were to store the hash code of each string along with the string itself, one could compare the hash codes of strings as a "first step" to comparing them for equality. If two strings have different hash codes, they're not equal, and one needn't bother doing anything else. If one expects to be comparing many pairs of strings which are of the same length, and which are "almost" but not quite equal, checking the hash codes before checking the content may be a useful performance optimization. Note that this "optimization" would not be worthwhile if one did not have cached hash codes, since computing the hash codes of two strings would almost certainly be slower than comparing them. If, however, one has had to compute and cache the hash codes for some other purpose, checking hash codes as a first step to comparing strings may be useful.
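A minimal sketch of that idea, assuming a hypothetical wrapper type named `HashedString` (not a framework type) that computes the hash code once and stores it next to the string:

```csharp
using System;

// Hypothetical wrapper: stores a string together with its hash code,
// which is computed once at construction.
public readonly struct HashedString : IEquatable<HashedString>
{
    public string Value { get; }
    private readonly int _hash;

    public HashedString(string value)
    {
        Value = value ?? throw new ArgumentNullException(nameof(value));
        _hash = value.GetHashCode();
    }

    public bool Equals(HashedString other)
    {
        // Different hash codes prove the strings differ,
        // so the content comparison can be skipped.
        if (_hash != other._hash) return false;

        // Equal hash codes prove nothing; the full comparison decides.
        return string.Equals(Value, other.Value, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) =>
        obj is HashedString other && Equals(other);

    public override int GetHashCode() => _hash;
}

public static class HashedStringDemo
{
    public static void Main()
    {
        var a = new HashedString("search:abc");
        var b = new HashedString("search:abc");
        var c = new HashedString("search:äbc");
        Console.WriteLine(a.Equals(b)); // same content
        Console.WriteLine(a.Equals(c)); // different content
    }
}
```

The hash check is only a shortcut for the "not equal" case. The final verdict always comes from comparing the actual characters.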

Answer 4:

You always risk collisions when using GetHashCode(), because you are operating within the limited number space of an Int32, and this is exacerbated by the fact that hashing algorithms do not distribute perfectly within that space.

If you look at the implementation of Hashtable or Dictionary, you will see that GetHashCode is used to assign the keys to buckets in order to cut down the number of comparisons required. The equality comparisons are still necessary, however, if there are multiple items in the same bucket.
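One way to see that the equality comparison is what actually decides a lookup is to force every key to collide. The sketch below uses a deliberately bad comparer, `AlwaysCollidingComparer` (a hypothetical name), that returns the same hash code for every string. The dictionary becomes slow, but it still returns the right value for each key.

```csharp
using System;
using System.Collections.Generic;

// Deliberately terrible comparer, for illustration only: every string
// gets hash code 0, so all keys collide.
public sealed class AlwaysCollidingComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string obj) => 0;
}

public static class BucketDemo
{
    // Builds a dictionary whose keys all share one hash code.
    public static Dictionary<string, string> Build()
    {
        var results = new Dictionary<string, string>(new AlwaysCollidingComparer());
        results["a"] = "plain a";
        results["ä"] = "a with umlaut";
        return results;
    }

    public static void Main()
    {
        var results = Build();
        // Both keys are kept apart by Equals, even though their hashes match.
        Console.WriteLine(results.Count);
        Console.WriteLine(results["a"]);
        Console.WriteLine(results["ä"]);
    }
}
```

A good hash function only affects how fast the lookup is. Correctness comes from `Equals`.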

Answer 5:

No. GetHashCode just provides a hash code. There will be collisions. Having different hashes means the strings are different, but having the same hash does not mean the strings are the same.

Read these guidelines by Eric Lippert for the correct use of GetHashCode; they are quite instructive.

If you want to compare strings, just do so! stringA == stringB works fine. If you want to ensure a string is unique in a large set, using the power of hash code to do so, use a HashSet<string>.
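Applied to the cache in the question, the same advice means keying the cache on the full string and letting the collection do the hashing internally. The following is a minimal sketch under that assumption. `SearchCache`, `GetOrAdd`, and the `runSearch` delegate are hypothetical names, results are represented as plain strings, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache, for illustration only. The dictionary hashes the
// key internally and falls back to a full comparison on collisions.
public sealed class SearchCache
{
    private readonly Dictionary<string, string> _results =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public int Count => _results.Count;

    // Returns the cached result for fullKey, running the search only
    // the first time that exact key is seen.
    public string GetOrAdd(string fullKey, Func<string, string> runSearch)
    {
        if (!_results.TryGetValue(fullKey, out string cached))
        {
            cached = runSearch(fullKey);
            _results[fullKey] = cached;
        }
        return cached;
    }
}

public static class SearchCacheDemo
{
    public static void Main()
    {
        var cache = new SearchCache();
        string prefix = new string('x', 1000); // stands in for a long key

        cache.GetOrAdd(prefix + "a", key => "results for a");
        cache.GetOrAdd(prefix + "ä", key => "results for ä");

        // The two keys differ only in the last character,
        // yet they are stored as separate entries.
        Console.WriteLine(cache.Count);
    }
}
```

A key of 1000+ characters costs some memory per entry, but two different searches can never be served from the same cache slot.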

Source: https://stackoverflow.com/questions/12366828/can-i-use-gethashcode-for-all-string-compares

Tags

hashcode