Consequences of hashcode overflow on Java String

I've been reading a bit about the Java String class's hashCode recently, and I haven't been able to find this information: what happens when a string's length is higher than 32 (I know an overflow then happens, but as a hash key, what happens)? For example, I need to hash strings that are between 20 and 120 characters long to use them as hash keys. Do I need to implement my own algorithm using BigInteger?

Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode collision-free enough?

(I know an overflow then happens, but as a hash key, what happens)?

In Java, overflow and underflow in integer arithmetic (int, long, and the smaller integral types) do not raise runtime errors or exceptions. The result wraps around: the high-order bits that do not fit are simply discarded.

While this can result in logic errors or other difficulties if the programmer is not aware of this property, it is the specified behavior of the JVM.

You do not need to worry about overflow or underflow of int types when calculating hashcodes. The overflowed bits are simply lost.

This does not affect the correctness of the computed hash value or how well it spreads keys across hash buckets.
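A minimal sketch of this wrap-around, using only the standard library. The class and method names (HashOverflowDemo, manualHash, longKey) are made up for illustration; the loop follows the formula given in the String.hashCode() Javadoc, s[0]*31^(n-1) + ... + s[n-1], evaluated in int arithmetic.

```java
class HashOverflowDemo {
    // Hypothetical helper: recomputes the documented String hash formula
    // s[0]*31^(n-1) + ... + s[n-1] using plain int arithmetic.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // silently wraps around on overflow
        }
        return h;
    }

    // Builds a key of the given length, e.g. the 120-character case from the question.
    static String longKey(int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + i % 26));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // int overflow wraps around; no exception is thrown.
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true

        String key = longKey(120);
        // The wrapped result is exactly what String.hashCode() returns.
        System.out.println(manualHash(key) == key.hashCode()); // true
    }
}
```

The intermediate value overflows many times for a 120-character key, yet the result is still a perfectly usable hash key, so no BigInteger is needed.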

Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode collision-free enough?

A few things that are handy to keep in mind:

Java Strings are immutable, so the hash value of a String instance only needs to be calculated once. The result is then cached in the instance, and subsequent invocations of hashCode() do not repeat the computation.
The hash code really should be computed from all the meaningful information in an instance. This means that if your String contains 20k characters, the hash code is computed from all 20k of them (but see above: only once per instance). Of course, there are performance implications, so you should design your program accordingly.
Collision-freeness has much more to do with the quality of the hashCode() implementation than with the size of your Strings. A hash algorithm should produce a good distribution of values. What exactly makes a hash function "good" is a subject for mathematical theorists, but fortunately it is not hard to define one that is "good enough" even if it is not state of the art (see Effective Java, 2nd ed., J. Bloch).
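If you want to check the distribution on your own data rather than take it on faith, a rough sketch like the following counts how many distinct strings share a hash code with another one. The class name CollisionCount, the method countCollisions, and the generated "customer-record-N" keys are placeholders for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollisionCount {
    // Returns (number of distinct keys) - (number of distinct hash codes),
    // i.e. how many keys landed on a hash code already used by another key.
    static int countCollisions(Collection<String> keys) {
        Set<String> distinctKeys = new HashSet<>(keys);
        Set<Integer> distinctHashes = new HashSet<>();
        for (String key : distinctKeys) {
            distinctHashes.add(key.hashCode());
        }
        return distinctKeys.size() - distinctHashes.size();
    }

    public static void main(String[] args) {
        // Placeholder data: substitute your real 30k-80k strings here.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 80_000; i++) {
            keys.add("customer-record-" + i);
        }
        System.out.println("collisions: " + countCollisions(keys));
    }
}
```

A small non-zero count is not a problem for HashMap or HashSet, which fall back to equals() when two keys share a hash code.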

You are misunderstanding what hashCode() does. It calculates a 32-bit number that should be different for different values, but is not guaranteed to be so. How could it be, when there may be more than 2^32 different values to hash?

For a String, the hashCode has nothing to do with the string length. Any hashCode is a valid hashCode for any string, as long as you always get the same hashCode for the same String, i.e. calling hashCode() multiple times for the same sequence of characters must return the same value.
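To make the "not guaranteed to be different" point concrete, here is a small sketch using the well-known pair "Aa" and "BB", which have the same hash code. The class name CollidingKeys and the helper buildMap are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeys {
    // Stores two different keys that happen to share a hash code.
    static Map<String, Integer> buildMap() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        return map;
    }

    public static void main(String[] args) {
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true

        // HashMap compares keys with equals() inside a bucket,
        // so both entries are kept and retrieved correctly.
        Map<String, Integer> map = buildMap();
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
        System.out.println(map.size());    // 2
    }
}
```

A collision costs an extra equals() comparison during lookup; it does not lose or mix up data.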

As an example, here are some hash codes for strings.

0x00000000 = "".hashCode()
0x00000061 = "a".hashCode()
0x00000041 = "A".hashCode()
0x042628b2 = "Hello".hashCode()
0x6f8f80f1 = "Goodbye".hashCode()
0xdbacdd53 = "The quick brown fox jumps over the lazy dog".hashCode()
0x99eecd2e = "The quick brown fox jumps over the lazy dog!".hashCode()

Notice that the last two strings are long (more than 32 characters), yet their hash codes are still ordinary 32-bit values.
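A listing like the one above can be regenerated with a few lines of Java. This is only a sketch; the class name HashCodeTable and the helper hex are made up for the example.

```java
class HashCodeTable {
    // Formats a string's hash code as an 8-digit hexadecimal value.
    static String hex(String s) {
        return String.format("0x%08x", s.hashCode());
    }

    public static void main(String[] args) {
        String[] samples = {
            "", "a", "A", "Hello", "Goodbye",
            "The quick brown fox jumps over the lazy dog",
            "The quick brown fox jumps over the lazy dog!"
        };
        for (String s : samples) {
            // Print the length too, to show it has no bearing on the hash width.
            System.out.println(hex(s) + " = \"" + s + "\".hashCode()  (length " + s.length() + ")");
        }
    }
}
```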

There is no overflow on Strings themselves. A String can be as long as your process's memory can hold (its length is an int, so that is the upper bound). The hashCode of any String is a 32-bit integer, whatever the length. The collision frequency should not have a correlation with the String's length. You don't need to reimplement it.
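For a feel of how many collisions to expect at 30k to 80k strings, a back-of-the-envelope birthday estimate can help. The sketch below assumes hash codes behave like uniformly random 32-bit values, which String.hashCode() does not promise, so treat the numbers as a rough guide only. The class and method names are hypothetical.

```java
class BirthdayEstimate {
    static final double HASH_SPACE = 4294967296.0; // 2^32 possible int values

    // Expected number of colliding pairs among n keys: n(n-1)/2 pairs,
    // each colliding with probability 1/2^32 under the uniformity assumption.
    static double expectedCollidingPairs(long n) {
        return n * (n - 1) / 2.0 / HASH_SPACE;
    }

    // Approximate probability of at least one collision (Poisson approximation).
    static double probabilityOfAnyCollision(long n) {
        return 1.0 - Math.exp(-expectedCollidingPairs(n));
    }

    public static void main(String[] args) {
        for (long n : new long[] {30_000, 80_000}) {
            System.out.printf("n=%d expected pairs=%.3f P(any collision)=%.3f%n",
                    n, expectedCollidingPairs(n), probabilityOfAnyCollision(n));
        }
    }
}
```

Under that assumption, 80,000 keys give roughly 0.75 expected colliding pairs, so one or two collisions are plausible at this scale. That is fine for a HashMap or HashSet, but it means the hash code alone should not be used as a unique identifier.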

Source: https://stackoverflow.com/questions/32106460/consequences-of-hashcode-overflow-on-java-string
