Question
BigQuery conveniently includes the FARM_FINGERPRINT
function. Here's an excerpt of the documentation for this function:
Description
Computes the fingerprint of the STRING or BYTES input using the Fingerprint64 function from the open-source FarmHash library. The output of this function for a particular input will never change.
Return type
INT64
Note that the return type is INT64, which in BigQuery is a 64-bit signed integer.
However, if we look at the actual implementation of Fingerprint64, we can see right in the header file that it returns an unsigned 64-bit integer.
The problem
A 64-bit unsigned integer has roughly twice the maximum value of a 64-bit signed integer. So half the time, FARM_FINGERPRINT will generate an output that is outside the representable range of a BigQuery INT64. In such cases, what does BigQuery do? Somehow it transforms the output of Fingerprint64
to fit into the range of a signed integer, but the documentation doesn't say how.
One way to do this would be to simply let the value overflow, so that it wraps around into the negative range of the signed integer. However, as Fingerprint64
is meant to be a portable function, that seems like a poor design, because its output in BigQuery would then differ from the standard output in other systems. If this discrepancy exists, it should at least be documented with a big fat warning!
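To make the range mismatch concrete, here is a quick Python illustration (mine, not from the original post) of why half of all possible 64-bit unsigned values cannot fit in an INT64:

```python
# Illustration of the range mismatch: exactly half of all possible 64-bit
# unsigned values (those with the top bit set) exceed INT64_MAX, so they
# cannot be stored in a BigQuery INT64 without some reinterpretation.
UINT64_MAX = 2**64 - 1   # 18446744073709551615
INT64_MAX = 2**63 - 1    # 9223372036854775807

values_above_int64_max = UINT64_MAX - INT64_MAX   # 2**63 of the 2**64 values
print(values_above_int64_max / 2**64)             # 0.5, i.e. half of all outputs
```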
Answer 1:
The documentation says it uses the "Fingerprint64 function from the open-source FarmHash library," but it doesn't claim that the output is exactly the same. And since INT64 in BigQuery is signed, it can't hold the same range of values as an unsigned uint64, so two's complement is applied to make the values fit, with the first bit taken as the sign bit (just as @ElliottBrossard and Conrad Lee found).
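Here is a minimal Python sketch of that reinterpretation, under the assumption that BigQuery simply reinterprets the unsigned 64-bit bit pattern as a signed value; the helper names and the example fingerprint are hypothetical, not from BigQuery or FarmHash:

```python
# Sketch of the assumed two's-complement reinterpretation: the same 64 bits,
# read as signed instead of unsigned. Values with the top bit set come out
# negative as a BigQuery INT64.

def uint64_to_int64(u: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed 64-bit integer."""
    assert 0 <= u < 2**64
    return u - 2**64 if u >= 2**63 else u

def int64_to_uint64(s: int) -> int:
    """Recover the unsigned fingerprint from the signed value BigQuery reports."""
    assert -2**63 <= s < 2**63
    return s + 2**64 if s < 0 else s

# Hypothetical Fingerprint64 output in the upper half of the uint64 range:
u = 0xF00DFACE12345678
s = uint64_to_int64(u)
assert s < 0                     # appears negative in BigQuery
assert int64_to_uint64(s) == u   # round-trips back to the unsigned value
```

If that assumption holds, a negative FARM_FINGERPRINT result is just the unsigned FarmHash fingerprint with its high bit set, and the unsigned value can be recovered by adding 2^64.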
Source: https://stackoverflow.com/questions/51892989/how-does-bigquerys-farm-fingerprint-represent-a-64-bit-unsigned-int