Generate a Hashcode for a string that is platform independent

Question

We have an application that:

Generates a hash code on a string
Saves that hash code into a DB along with associated data
Later, it queries the DB using the string hash code for retrieving the data

This is obviously a bug because the value returned from string.GetHashCode() varies across .NET versions and architectures (32/64 bit). To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead. What we'd like to do is come up with a quick and dirty fix for now, and refactor the code later to do it the right way.

The quick and dirty fix seems to be creating a static GetInvariantHashCode(string s) helper method that is consistent across architectures.

Can anyone suggest an algorithm for generating a hash code on a string that gives the same result on 32-bit and 64-bit architectures?

A few more notes:

I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar.

Answer 1:

"I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key."

"I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar"

Then just let the database index the strings for you!

Look, I have no idea how large your domain is, but you're going to get collisions very rapidly with very high likelihood if it's of any decent size at all. It's the birthday problem with a lot of people relative to the number of birthdays. You're going to have collisions, and lose any gain in speed you might think you're gaining by not just indexing the strings in the first place.
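To put a rough number on the birthday problem mentioned above, here is a minimal sketch that estimates the chance of at least one collision among n strings. It assumes the hash values are uniformly distributed over 2^bits possibilities and uses the standard approximation 1 - exp(-n(n-1) / (2 * 2^bits)). The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate probability that at least two of n items share a hash value,
    // assuming hashes are uniformly distributed over 2^bits possible values.
    public static double CollisionProbability(long n, int bits)
    {
        double buckets = Math.Pow(2.0, bits);
        double pairs = (double)n * (n - 1) / 2.0;   // number of distinct pairs
        return 1.0 - Math.Exp(-pairs / buckets);
    }
}
```

Under that assumption, CollisionProbability(77163, 32) comes out at about 0.5: with a 32-bit hash, roughly 77,000 distinct strings are enough for an even chance of at least one collision.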

Anyway, you don't need us if you're stuck a few days away from release and you really need a hash code that is invariant across platforms. There are really dumb, really fast implementations of hash code out there that you can use. Hell, you could come up with one yourself in the blink of an eye:

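string s = "Hello, world!";
int hash = 17;
foreach (char c in s) {
    // Add the UTF-16 code unit itself; char.GetHashCode() is an implementation
    // detail and is not documented to be stable across runtimes.
    unchecked { hash = hash * 23 + c; }
}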

Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you're looking for? I don't know, they weren't meant to be used for this purpose. They were meant to be used for balancing hash tables. You're not balancing a hash table. You're using the wrong concept.
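For reference, a minimal sketch of a Bernstein-style hash (the variant often called djb2: start at 5381, multiply by 33, add each byte) might look like the following. Hashing the UTF-8 bytes rather than the UTF-16 chars is an assumption made here so that the input encoding is explicit. The method name GetInvariantHashCode is taken from the question, and the class name is hypothetical.

```csharp
using System.Text;

public static class InvariantHash
{
    // Bernstein-style (djb2) hash computed over the UTF-8 bytes of the string.
    // C# int arithmetic is 32-bit on every platform, so the result does not
    // depend on the process architecture or runtime version.
    public static int GetInvariantHashCode(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        int hash = 5381;
        foreach (byte b in bytes)
        {
            unchecked { hash = hash * 33 + b; }   // overflow wraps around by design
        }
        return hash;
    }
}
```

With this sketch, GetInvariantHashCode("") returns 5381 and GetInvariantHashCode("a") returns 177670 (5381 * 33 + 97). Collisions are still expected, so the exact-match post-processing step described in the question remains necessary.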

Edit (the below was written before the question was edited with new salient information):

You can't do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.

There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can't avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes, and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can't use hash codes as unique keys for strings. It doesn't work. Period.
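Collisions do not require huge inputs, either. As a small illustration, the sketch below uses a simple multiply-and-add hash (seed 17, multiplier 23, adding each UTF-16 code unit) and two short strings that land in the same pigeonhole. The class and method names are hypothetical.

```csharp
public static class CollisionDemo
{
    // Simple multiply-and-add hash over the UTF-16 code units of the string.
    public static int SimpleHash(string s)
    {
        int hash = 17;
        foreach (char c in s)
        {
            unchecked { hash = hash * 23 + c; }
        }
        return hash;
    }
}
```

Here SimpleHash("Ax") and SimpleHash("Ba") both return 10608: raising the first character by one adds 23 to the total, and lowering the second character by 23 cancels it out.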

The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to something that has a size less than or equal to the number of Int32s. Even then, you'll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.

Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that's not what SHA-512 is for anyway, it's not to be used for unique identification of messages. It's just to reduce the likelihood of message forgery.

"To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead."

Well, then you have a tremendous amount of work ahead of you. I'm sorry you discovered this so late in the game.

I note the documentation for String.GetHashCode:

The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.

And from Object.GetHashCode:

The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table.

Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.

Answer 2:

You should just use SHA512.

Note that hashes are not (and cannot be) unique.
If you need it to be unique, just use the identity function as your hash.

Answer 3:

You can use one of the managed cryptography classes (such as SHA512Managed) to compute a platform-independent hash via ComputeHash. This requires converting the string to a byte array first (e.g., with Encoding.GetBytes), and it will be slow, but consistent.
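A minimal sketch of that approach follows. Two choices here are assumptions rather than part of the answer: the string is encoded as UTF-8, and the 64-byte digest is truncated to its first 8 bytes so that it fits the long column mentioned in the question. The bytes are combined with explicit shifts (big-endian) so the result does not depend on the machine's byte order. The class and method names are hypothetical.

```csharp
using System.Security.Cryptography;
using System.Text;

public static class StableHash
{
    // Computes SHA-512 over the UTF-8 bytes of the string and folds the first
    // 8 bytes of the digest into a long, most significant byte first.
    public static long GetStableHash64(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        using (SHA512 sha = SHA512.Create())
        {
            byte[] digest = sha.ComputeHash(bytes);
            long result = 0;
            for (int i = 0; i < 8; i++)
            {
                result = (result << 8) | digest[i];
            }
            return result;
        }
    }
}
```

Truncating the digest gives up most of its collision resistance, so the value should still be treated as a lookup hint and followed by an exact string comparison.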

That being said, a hash is not guaranteed unique, and is really not a proper mechanism for uniqueness in a database. Using a hash to store data is likely to cause data to get lost, as the first hash collision will overwrite old data (or throw away new data).

Source: https://stackoverflow.com/questions/9198831/generate-a-hashcode-for-a-string-that-is-platform-independent

Tags: .net, hash, hashcode