Why is XOR the default way to combine hashes?

Submitted by 感情迁移 on 2019-11-26 11:04:01

Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

This can be seen by writing out truth tables:

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

 a | b | a OR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    1

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0
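
The percentages above can be checked by brute force. The following is only a minimal sketch: the helper names (ones_out_of_four, ones_and, ones_or, ones_xor) are made up for illustration, and it simply counts how many of the four equally likely input pairs produce a 1.

```cpp
#include <cassert>

// Count how often a two-input, one-bit function outputs 1 over the four
// equally likely input pairs (a, b).
template <typename F>
int ones_out_of_four(F f) {
    int ones = 0;
    for (int a = 0; a <= 1; ++a)
        for (int b = 0; b <= 1; ++b)
            ones += f(a, b) & 1;
    return ones;
}

int ones_and() { return ones_out_of_four([](int a, int b) { return a & b; }); }
int ones_or()  { return ones_out_of_four([](int a, int b) { return a | b; }); }
int ones_xor() { return ones_out_of_four([](int a, int b) { return a ^ b; }); }

// Illustrative self-check matching the truth tables.
void check_distribution() {
    assert(ones_and() == 1);  // 1 of 4 outputs is 1 (25%)
    assert(ones_or()  == 3);  // 3 of 4 outputs are 1 (75%)
    assert(ones_xor() == 2);  // 2 of 4 outputs are 1 (50%)
}
```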

Exercise: How many logical functions of two 1-bit inputs a and b have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?

xor is a dangerous default function to use when hashing. It is better than and and or, but that doesn't say much.

xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".

xor maps identical values to zero, and you should avoid mapping "common" values to zero:

So (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are more common than randomness might imply, you end up with far more collisions at zero than you should.
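
Both problems are easy to reproduce. This is a small sketch, assuming a naive combiner called xor_combine and a hypothetical helper xor_hash_chars that folds the std::hash of each character; neither name comes from a real library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// The naive combiner under discussion: XOR of the two hashes.
std::size_t xor_combine(std::size_t lhs, std::size_t rhs) { return lhs ^ rhs; }

// Hypothetical helper: fold the per-character hashes of a string with XOR.
std::size_t xor_hash_chars(const std::string& s) {
    std::size_t h = 0;
    for (char c : s) h = xor_combine(h, std::hash<char>{}(c));
    return h;
}

// Illustrative self-check of the two weaknesses.
void check_xor_weaknesses() {
    // Order is lost: the same characters in any order collide.
    assert(xor_hash_chars("bad") == xor_hash_chars("dab"));
    // Identical inputs cancel out, whatever their value.
    assert(xor_combine(12345, 12345) == 0);
    assert(xor_combine(987, 987) == 0);
}
```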

With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.

On modern hardware, adding is usually about as fast as xor (it probably uses more power to pull this off, admittedly). Adding's truth table is similar to xor on the bit in question, but it also sends a carry to the next bit over when both values are 1. This erases less information.

So hash(a) + hash(b) is better in that if a==b, the result is hash(a)<<1 instead of 0.

This remains symmetric. We can break this symmetry for a modest cost:

(hash(a)<<1) + hash(a) + hash(b)

aka hash(a)*3 + hash(b) (calculating hash(a) once and storing it is advised if you use the shift solution). Any odd constant instead of 3 will bijectively map a size_t (or any k-bit unsigned integer) to itself, as arithmetic on unsigned integers is math modulo 2^k for some k, and any odd constant is relatively prime to 2^k.
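
Here is a rough sketch of the two combiners just described, plus a brute-force check of the bijection claim on 8-bit values (where modulo 2^8 is small enough to enumerate). The names add_combine, asym_combine and distinct_products_mod_256 are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Symmetric: equal inputs give hash << 1 rather than 0, but order is still lost.
std::size_t add_combine(std::size_t a, std::size_t b) { return a + b; }

// Asymmetric: same as (a << 1) + a + b.
std::size_t asym_combine(std::size_t a, std::size_t b) { return a * 3 + b; }

// Count the distinct values of x * m over all 8-bit x (arithmetic modulo 2^8).
int distinct_products_mod_256(unsigned m) {
    std::set<std::uint8_t> seen;
    for (unsigned x = 0; x < 256; ++x)
        seen.insert(static_cast<std::uint8_t>(x * m));
    return static_cast<int>(seen.size());
}

// Illustrative self-check.
void check_combiners() {
    assert(add_combine(5, 5) == (5u << 1));
    assert(add_combine(2, 7) == add_combine(7, 2));
    assert(asym_combine(2, 7) != asym_combine(7, 2));
    assert(distinct_products_mod_256(3) == 256);  // odd: a bijection
    assert(distinct_products_mod_256(2) == 128);  // even: values collide
}
```

An even multiplier throws away the top bit of its input, which is why only odd constants keep every value distinct.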

For an even fancier version, we can examine boost::hash_combine, which is effectively:

size_t hash_combine( size_t lhs, size_t rhs ) {
  lhs^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

Here we add together rhs, a constant, and some shifted versions of lhs (the value Boost calls seed), then xor the sum into lhs. The constant is basically random 0s and 1s; in particular it is the inverse of the golden ratio as a 32 bit fixed point fraction. This breaks symmetry, and introduces some "noise" if the incoming hashed values are poor (i.e., imagine every component hashes to 0: the above handles it well, generating a smear of 1s and 0s after each combine, whereas my hash(a)*3 + hash(b) simply outputs a 0).

For those not familiar with C/C++, size_t is an unsigned integer type which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.
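
To see the "smear" on all-zero inputs, here is a small sketch that repeats the combiner quoted above and folds a list of element hashes through it. combine_all is a hypothetical helper written for this example, not part of Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combiner in the style quoted above (modelled on boost::hash_combine).
std::size_t hash_combine(std::size_t lhs, std::size_t rhs) {
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
    return lhs;
}

// Hypothetical helper: fold a sequence of element hashes, starting from 0.
std::size_t combine_all(const std::vector<std::size_t>& hashes) {
    std::size_t seed = 0;
    for (std::size_t h : hashes) seed = hash_combine(seed, h);
    return seed;
}

// Illustrative self-check.
void check_hash_combine() {
    // Order matters, unlike xor or plain addition.
    assert(hash_combine(1, 2) != hash_combine(2, 1));
    // Even when every element hashes to 0, the result is not stuck at 0.
    assert(combine_all({0}) == 0x9e3779b9);
    assert(combine_all({0, 0}) != 0);
    assert(combine_all({0, 0}) != combine_all({0}));
}
```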

In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples.

A much better choice is m * H(A) + H(B), where m is a large odd number.

Credit: The above combiner was a tip from Bob Jenkins.
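
The permutation problem can be reproduced on a smaller scale. The sketch below uses 4-tuples instead of 10-tuples to keep it quick, treats each int as its own hash (an assumption made purely for illustration), and counts distinct hash values over all permutations. All function names are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <set>
#include <vector>

// XOR fold: every ordering of the same elements gives the same value.
std::uint64_t xor_fold(const std::vector<int>& t) {
    std::uint64_t h = 0;
    for (int x : t) h ^= static_cast<std::uint64_t>(x);
    return h;
}

// m * H(A) + H(B), applied left to right (arithmetic modulo 2^64).
std::uint64_t mul_fold(const std::vector<int>& t, std::uint64_t m) {
    std::uint64_t h = 0;
    for (int x : t) h = m * h + static_cast<std::uint64_t>(x);
    return h;
}

// Count distinct hash values over all permutations of {1, ..., n}.
template <typename Hash>
std::size_t distinct_over_permutations(int n, Hash hash) {
    std::vector<int> items(n);
    std::iota(items.begin(), items.end(), 1);
    std::set<std::uint64_t> seen;
    do {
        seen.insert(hash(items));
    } while (std::next_permutation(items.begin(), items.end()));
    return seen.size();
}

std::size_t distinct_xor_hashes(int n) {
    return distinct_over_permutations(n, xor_fold);
}

std::size_t distinct_mul_hashes(int n, std::uint64_t m) {
    return distinct_over_permutations(
        n, [m](const std::vector<int>& t) { return mul_fold(t, m); });
}
```

With XOR all 24 permutations of {1, 2, 3, 4} land in the same bucket. With m = 31 they are all distinct here, because the elements are smaller than m and nothing overflows; for arbitrary hashes and a large odd m, collisions become unlikely rather than impossible.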

Xor may be the "default" way to combine hashes but Greg Hewgill's answer also shows why it has its pitfalls: The xor of two identical hash values is zero. In real life, identical hashes are more common than one might expect. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.

In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!

There's something I want to explicitly point out for others who find this page. AND and OR restrict output like BlueRaja - Danny Pflughoe is trying to point out, but can be better defined:

First I want to define two simple functions I'll use to explain this: Min() and Max().

Min(A, B) will return the value that is smaller between A and B, for example: Min(1, 5) returns 1.

Max(A, B) will return the value that is larger between A and B, for example: Max(1, 5) returns 5.

If you are given: C = A AND B

Then you can find that C <= Min(A, B). We know this because there is nothing you can AND with the 0 bits of A or B to make them 1s. So every zero bit stays a zero bit and every one bit has a chance to become a zero bit (and thus a smaller value).

With: C = A OR B

The opposite is true: C >= Max(A, B). With this, we see the corollary to the AND function. Any bit that is already a one cannot be ORed into being a zero, so it stays a one, but every zero bit has a chance to become a one, and thus a larger number.

This implies that the state of the input applies restrictions on the output. If you AND anything with 90, you know the output will be equal to or less than 90 regardless of what the other value is.

For XOR, there is no implied restriction based on the inputs. There are special cases, such as XORing a byte with 255, which gives you the inverse, but any possible byte can still be output from that. Every bit has a chance to change state depending on the same bit in the other operand.
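
These bounds can be confirmed exhaustively for byte-sized values. The following is just a sketch with made-up helper names (and_or_bounds_hold, max_and_with), using std::min and std::max in place of Min() and Max().

```cpp
#include <algorithm>
#include <cassert>

// Check C <= Min(A, B) for AND and C >= Max(A, B) for OR over all byte pairs.
bool and_or_bounds_hold() {
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            if ((a & b) > std::min(a, b)) return false;
            if ((a | b) < std::max(a, b)) return false;
        }
    }
    return true;
}

// Largest value obtainable by ANDing `fixed` with any byte.
unsigned max_and_with(unsigned fixed) {
    unsigned best = 0;
    for (unsigned b = 0; b < 256; ++b) best = std::max(best, fixed & b);
    return best;
}

// Illustrative self-check.
void check_bounds() {
    assert(and_or_bounds_hold());
    assert(max_and_with(90) == 90);  // ANDing with 90 never exceeds 90
}
```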

If you XOR a random input with a biased input, the output is random. The same is not true for AND or OR. Example:

00101001 XOR 00000000 = 00101001
00101001 AND 00000000 = 00000000
00101001 OR  11111111 = 11111111

As @Greg Hewgill mentions, even if both inputs are random, using AND or OR will result in biased output.
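
One way to make this concrete: fix a biased operand k, run every possible byte through each operator, and count how many distinct outputs survive. This is only a sketch; Op and distinct_outputs are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

enum class Op { And, Or, Xor };

// Count the distinct results of (x op k) as x ranges over all 256 bytes.
int distinct_outputs(Op op, std::uint8_t k) {
    std::set<std::uint8_t> seen;
    for (unsigned x = 0; x < 256; ++x) {
        std::uint8_t v = static_cast<std::uint8_t>(x);
        switch (op) {
            case Op::And: seen.insert(v & k); break;
            case Op::Or:  seen.insert(v | k); break;
            case Op::Xor: seen.insert(v ^ k); break;
        }
    }
    return static_cast<int>(seen.size());
}

// Illustrative self-check.
void check_distinct_outputs() {
    assert(distinct_outputs(Op::Xor, 0x00) == 256);  // nothing lost
    assert(distinct_outputs(Op::Xor, 0x0F) == 256);  // still nothing lost
    assert(distinct_outputs(Op::And, 0x00) == 1);    // everything becomes 0
    assert(distinct_outputs(Op::Or,  0xFF) == 1);    // everything becomes 255
    assert(distinct_outputs(Op::And, 0x0F) == 16);   // high nibble erased
}
```

XOR with a fixed value is a bijection on bytes, so a uniformly random x stays uniformly random; AND and OR collapse many inputs onto the same output.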

The reason we use XOR over something more complex is that, well, there's no need: XOR works perfectly, and it's blazingly stupid-fast.

Cover the left 2 columns and try to work out what the inputs are using just the output.

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

When you saw a 1-bit you should have worked out that both inputs were 1.

Now do the same for XOR

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

XOR gives away nothing about its inputs.
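
The same guessing game can be played on whole bytes. The sketch below (helper names are hypothetical) counts, for a given output c, how many values of the first input a are still possible for some second input b.

```cpp
#include <cassert>
#include <cstdint>

// How many bytes a admit some byte b with (a & b) == c?
int and_candidates(std::uint8_t c) {
    int count = 0;
    for (unsigned a = 0; a < 256; ++a) {
        bool possible = false;
        for (unsigned b = 0; b < 256 && !possible; ++b)
            possible = ((a & b) == c);
        count += possible ? 1 : 0;
    }
    return count;
}

// How many bytes a admit some byte b with (a ^ b) == c?
int xor_candidates(std::uint8_t c) {
    int count = 0;
    for (unsigned a = 0; a < 256; ++a) {
        bool possible = false;
        for (unsigned b = 0; b < 256 && !possible; ++b)
            possible = ((a ^ b) == c);
        count += possible ? 1 : 0;
    }
    return count;
}

// Illustrative self-check.
void check_candidates() {
    assert(and_candidates(0xFF) == 1);    // a must be 0xFF
    assert(and_candidates(0xF0) == 16);   // top four bits of a are forced
    assert(xor_candidates(0xFF) == 256);  // a could be anything
    assert(xor_candidates(0xF0) == 256);
}
```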

The source code for various versions of hashCode() in java.util.Arrays is a great reference for solid, general use hashing algorithms. They are easily understood and translated into other programming languages.

Roughly speaking, most multi-attribute hashCode() implementations follow this pattern:

public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}

You can search other StackOverflow Q&As for more information about the magic behind 31, and why Java code uses it so frequently. It is imperfect, but has very good general performance characteristics.
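
As an example of translating the pattern, here is a rough C++ sketch. It assumes the element hashes have already been computed (a null element would contribute 0, as in the Java code), and it uses std::uint32_t so the wraparound matches Java's 32-bit int at the bit level. The name poly31_hash is made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same recurrence as the Java loop: result = 31 * result + elementHash.
std::uint32_t poly31_hash(const std::vector<std::uint32_t>& element_hashes) {
    std::uint32_t result = 1;
    for (std::uint32_t h : element_hashes)
        result = 31u * result + h;  // unsigned overflow wraps modulo 2^32
    return result;
}

// Illustrative self-check.
void check_poly31() {
    assert(poly31_hash({}) == 1u);
    assert(poly31_hash({1, 2}) == 994u);   // 31 * (31 * 1 + 1) + 2
    assert(poly31_hash({2, 1}) == 1024u);  // order changes the result
}
```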

XOR does not ignore some of the inputs sometimes like OR and AND.

If you take AND(X, Y) for example, and feed input X with false, then the input Y does not matter...and one probably would want the input to matter when combining hashes.

If you take XOR(X, Y) then BOTH inputs ALWAYS matter. There would be no value of X where Y does not matter. If either X or Y is changed then the output will reflect that.
