What kind of hash algorithm is used for Hive's built-in HASH() function?

梦如初夏 2020-12-16 02:45

What kind of hashing algorithm is used in the built-in HASH() function?

I'm ideally looking for a SHA512/SHA256 hash, similar to what the SHA() function offers with

2 Answers
  • 2020-12-16 03:13

    The HASH function (as of Hive 0.11) uses an algorithm similar to java.util.List#hashCode.

    Its code looks like this:

    int hashCode = 0; // Hive HASH uses 0 as the seed, List#hashCode uses 1. I don't know why.
    for (Object item: items) {
       hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
    }
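
    For anyone who wants to try the loop above, here is a minimal runnable sketch. The class and method names are made up for illustration, and plain Java hashCode() stands in for whatever per-type hashing Hive applies internally. That is an assumption, so the numbers need not match what HASH() returns in Hive.

    ```java
    import java.util.Arrays;
    import java.util.List;

    final class HiveStyleHash {
        // Hypothetical helper: combines item hash codes with seed 0 and multiplier 31,
        // mirroring the loop quoted above. Item hashing uses Java's hashCode(), which is
        // an assumption and not necessarily what Hive does for each column type.
        static int hiveStyleHash(List<?> items) {
            int hashCode = 0;
            for (Object item : items) {
                hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
            }
            return hashCode;
        }

        public static void main(String[] args) {
            List<Object> row = Arrays.asList("a", 1, null);
            // Seed 0 (the loop above) versus seed 1 (java.util.List#hashCode).
            System.out.println(hiveStyleHash(row));
            System.out.println(row.hashCode());
        }
    }
    ```

    Because only the seed differs, List#hashCode of an n-item list exceeds the seed-0 value by 31^n (with int overflow wrapping), so the two results are never equal for the same items.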
    

    Basically it's a classic hash algorithm as recommended in the book Effective Java. To quote a great man (and a great book):

    The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

    I digress. You can look at the HASH source here.

    If you want to use SHAxxx in Hive, you can combine the Apache Commons Codec DigestUtils class with Hive's built-in reflect function (I hope that'll work):

    SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'your_string')
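
    To check what such a query should return, the same digest can be computed outside Hive. The sketch below uses only the JDK's MessageDigest and assumes that sha256Hex hashes the UTF-8 bytes of the string and prints lowercase hex. The class and method names are made up for illustration.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    final class Sha256HexSketch {
        // Hypothetical helper: SHA-256 over the UTF-8 bytes, rendered as lowercase hex.
        static String sha256Hex(String input) {
            try {
                MessageDigest digest = MessageDigest.getInstance("SHA-256");
                byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder(hash.length * 2);
                for (byte b : hash) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                // SHA-256 is required on every Java platform, so this should not happen.
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            System.out.println(sha256Hex("your_string"));
        }
    }
    ```

    If the value printed here differs from what the reflect call returns, the first thing to check is the character encoding of the input.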
    
  • 2020-12-16 03:21

    As of Hive 2.1.0 there is a mask_hash function that will hash string values.

    In Hive 2.x it uses MD5 as the hashing algorithm. This was changed to SHA-256 in Hive 3.x.
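
    One practical consequence is that values masked under 2.x will not line up with values masked under 3.x. The sketch below illustrates that with plain JDK digests. It is not Hive's code, and it assumes the masked value is the lowercase hex digest of the string's UTF-8 bytes. The names are made up for illustration.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    final class MaskHashComparison {
        // Hypothetical helper: hex digest of a string for a given JDK algorithm name.
        static String hexDigest(String algorithm, String input) {
            try {
                byte[] hash = MessageDigest.getInstance(algorithm)
                        .digest(input.getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder(hash.length * 2);
                for (byte b : hash) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            String value = "abc";
            // MD5 yields 32 hex characters, SHA-256 yields 64.
            System.out.println(hexDigest("MD5", value));
            System.out.println(hexDigest("SHA-256", value));
        }
    }
    ```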
