Hash function that produces short hashes?

走了就别回头了 2020-12-07 23:58

Is there an encryption method that can take a string of any length and produce a sub-10-character hash? I want to produce reasonably unique IDs, but based on message content.

10 answers
  • 2020-12-08 00:21

    You can use the hashlib library for Python. The shake_128 and shake_256 algorithms provide variable length hashes. Here's some working code (Python3):

    >>> import hashlib
    >>> my_string = 'hello shake'
    >>> hashlib.shake_256(my_string.encode()).hexdigest(5)
    '34177f6a0a'
    

    Note that the length parameter x (5 in the example) is given in bytes, so hexdigest(x) returns a string of 2x hexadecimal characters.
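
    Hex output spends one character on every 4 bits. As a minimal sketch of how the same SHAKE-256 digest could be packed more densely, the example below encodes the raw bytes as URL-safe Base64 (6 bits per character). The names short_id and n_bytes are illustrative assumptions, not part of hashlib.

```python
import base64
import hashlib


def short_id(message: str, n_bytes: int = 6) -> str:
    """Return a URL-safe ID built from n_bytes of SHAKE-256 output.

    short_id and n_bytes are illustrative names, not part of hashlib.
    """
    digest = hashlib.shake_256(message.encode("utf-8")).digest(n_bytes)
    # Base64 packs 6 bits per character, so 6 bytes (48 bits) become 8 characters.
    # Padding characters ("=") carry no information and are stripped.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    print(short_id("hello shake"))          # 8 characters carrying 48 bits
    print(len(short_id("hello shake", 7)))  # 7 bytes need 10 characters
```

    With the default of 6 bytes the ID stays under 10 characters while carrying 48 bits, compared with 40 bits for a 10-character hex string.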

  • 2020-12-08 00:25

    You need to hash the contents to come up with a digest. Many hashes are available, but 10 characters is a pretty small result space. Way back, people used CRC-32, which produces a 32-bit hash (4 bytes, or 8 hexadecimal characters). There is also CRC-64, which produces a 64-bit hash. MD5, which produces a 128-bit hash (16 bytes), is considered broken for cryptographic purposes because two messages with the same hash can be constructed deliberately. It should go without saying that any time you create a 16-byte digest out of an arbitrary-length message you're going to end up with duplicates. The shorter the digest, the greater the risk of collisions.
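
    To put rough numbers on that risk, here is a small sketch based on the standard birthday approximation. It assumes the hash output is uniformly distributed, which is an idealization; collision_probability is an illustrative name.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance that at least two of n_items share a bits-wide hash.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes a uniformly distributed hash output.
    """
    exponent = -n_items * (n_items - 1) / (2.0 * 2.0 ** bits)
    # expm1 keeps precision when the probability is very close to zero.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (32, 40, 64, 128):
        p = collision_probability(100_000, bits)
        print(f"{bits:3d}-bit hash, 100,000 messages: p = {p:.3g}")
```

    Under this approximation, a 32-bit hash reaches about a 50% chance of at least one collision at roughly 77,000 messages.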

    However, your concern that the hashes of two consecutive messages (whether integers or not) should not look similar is addressed by any good hash: even a single-bit change in the original message should produce a vastly different digest.

    So, using something like CRC-64 and Base64-encoding the result should get you in the neighborhood you're looking for: the 8-byte checksum becomes 11 Base64 characters once the padding is dropped.
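
    The checksum-then-Base64 idea can be sketched as follows. Python's standard library does not ship a CRC-64, so this sketch substitutes CRC-32 from zlib, which shortens the ID to 6 characters but also raises the collision risk. crc32_id is an illustrative name.

```python
import base64
import zlib


def crc32_id(message: str) -> str:
    """Return the CRC-32 of message as 6 URL-safe Base64 characters.

    crc32_id is an illustrative name. CRC-32 stands in for CRC-64 here
    because the standard library has no CRC-64 implementation.
    """
    checksum = zlib.crc32(message.encode("utf-8"))  # unsigned 32-bit integer
    raw = checksum.to_bytes(4, "big")               # exactly 4 bytes
    # 4 bytes encode to 6 Base64 characters plus "==" padding, which is dropped.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


if __name__ == "__main__":
    print(crc32_id("my message"))
```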

  • 2020-12-08 00:28

    You can use any commonly available hash algorithm (e.g. SHA-1), which will give you a longer result than you need. Simply truncate the result to the desired length, which may be good enough.

    For example, in Python:

    >>> import hashlib
    >>> digest = hashlib.sha1("my message".encode("UTF-8")).hexdigest()
    >>> digest
    '104ab42f1193c336aa2cf08a2c946d5c6fd0fcdb'
    >>> digest[:10]
    '104ab42f11'
    
  • 2020-12-08 00:39

    If you need a "sub-10-character hash", you could use the Fletcher-32 algorithm, which produces a 32-bit checksum (8 hexadecimal characters), or CRC-32 or Adler-32, which are also 32 bits wide.

    CRC-32 is roughly 20% to 100% slower than Adler-32.

    Fletcher-32 is slightly more reliable than Adler-32. It has a lower computational cost than the Adler checksum: Fletcher vs Adler comparison.
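
    For readers following the Python examples elsewhere in this thread, here is a minimal, unoptimized sketch of Fletcher-32. It assumes the input is read as little-endian 16-bit words and that odd-length input is padded with a zero byte; under those assumptions it reproduces the test vectors quoted at the end of this answer. fletcher32 is an illustrative name.

```python
def fletcher32(data: bytes) -> int:
    """Fletcher-32 over little-endian 16-bit words.

    Assumption: odd-length input is padded with a single zero byte.
    """
    if len(data) % 2:
        data += b"\x00"
    c0 = c1 = 0
    for i in range(0, len(data), 2):
        word = int.from_bytes(data[i:i + 2], "little")
        # Reducing modulo 65535 at every step is slower than block-wise
        # reduction but gives the same sums and keeps the code simple.
        c0 = (c0 + word) % 65535
        c1 = (c1 + c0) % 65535
    return (c1 << 16) | c0


if __name__ == "__main__":
    print(f"{fletcher32(b'abcde'):08X}")   # F04FC729
    print(f"{fletcher32(b'abcdef'):08X}")  # 56502D2A
```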

    A sample program with a few Fletcher implementations is given below:

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h> // for uint32_t
    
        uint32_t fletcher32_1(const uint16_t *data, size_t len)
        {
                uint32_t c0, c1;
                unsigned int i;
    
                for (c0 = c1 = 0; len >= 360; len -= 360) {
                        for (i = 0; i < 360; ++i) {
                                c0 = c0 + *data++;
                                c1 = c1 + c0;
                        }
                        c0 = c0 % 65535;
                        c1 = c1 % 65535;
                }
                for (i = 0; i < len; ++i) {
                        c0 = c0 + *data++;
                        c1 = c1 + c0;
                }
                c0 = c0 % 65535;
                c1 = c1 % 65535;
                return (c1 << 16 | c0);
        }
    
        uint32_t fletcher32_2(const uint16_t *data, size_t l)
        {
            uint32_t sum1 = 0xffff, sum2 = 0xffff;
    
            while (l) {
                unsigned tlen = l > 359 ? 359 : l;
                l -= tlen;
                do {
                    sum2 += sum1 += *data++;
                } while (--tlen);
                sum1 = (sum1 & 0xffff) + (sum1 >> 16);
                sum2 = (sum2 & 0xffff) + (sum2 >> 16);
            }
            /* Second reduction step to reduce sums to 16 bits */
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
            return (sum2 << 16) | sum1;
        }
    
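        int main(void)
        {
            const char *str1 = "abcde";
            const char *str2 = "abcdef";

            /* Copy into zero-filled uint16_t buffers: this avoids casting char*
               to uint16_t* and supplies the zero padding for odd lengths.
               The word values (and the output below) assume a little-endian host. */
            uint16_t buf1[4] = {0}, buf2[4] = {0};
            memcpy(buf1, str1, strlen(str1));
            memcpy(buf2, str2, strlen(str2));

            size_t len1 = (strlen(str1)+1) / 2; // length in 16-bit words, rounded up
            size_t len2 = (strlen(str2)+1) / 2;

            uint32_t f1 = fletcher32_1(buf1, len1);
            uint32_t f2 = fletcher32_2(buf1, len1);

            printf("%u %X \n",   (unsigned)f1, (unsigned)f1);
            printf("%u %X \n\n", (unsigned)f2, (unsigned)f2);

            f1 = fletcher32_1(buf2, len2);
            f2 = fletcher32_2(buf2, len2);

            printf("%u %X \n", (unsigned)f1, (unsigned)f1);
            printf("%u %X \n", (unsigned)f2, (unsigned)f2);

            return 0;
        }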
    

    Output:

    4031760169 F04FC729
    4031760169 F04FC729

    1448095018 56502D2A
    1448095018 56502D2A
    

    This agrees with the test vectors:

    "abcde"  -> 4031760169 (0xF04FC729)
    "abcdef" -> 1448095018 (0x56502D2A)
    

    Adler-32 has a weakness for short messages of a few hundred bytes or less, because the checksums for these messages cover the 32 available bits poorly. Check this:

    The Adler32 algorithm is not complex enough to compete with comparable checksums.
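
    The poor bit coverage is easy to see by splitting an Adler-32 value into its two 16-bit sums. The sketch below uses zlib.adler32 from the Python standard library; adler32_halves is an illustrative name. The low half is 1 plus the sum of the bytes, so for a message of up to 128 bytes it cannot exceed 1 + 255 * 128 = 32641 and its top bit is always zero.

```python
import zlib
from typing import Tuple


def adler32_halves(data: bytes) -> Tuple[int, int]:
    """Split an Adler-32 checksum into its two 16-bit sums (a, b)."""
    checksum = zlib.adler32(data)
    return checksum & 0xFFFF, checksum >> 16


if __name__ == "__main__":
    a, b = adler32_halves(b"abc")
    # a = 1 + 97 + 98 + 99 = 295 and b = 98 + 196 + 295 = 589,
    # so both halves stay far below the modulus 65521.
    print(a, b)                           # 295 589
    print(f"{zlib.adler32(b'abc'):08x}")  # 024d0127
```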
