Converting a unique seed string into a random, yet deterministic, float value in Ruby

前端 未结 3 1454
悲哀的现实
悲哀的现实 2020-12-18 20:44

I\'m having a hard time with this, conceptually.

Basically, I need to accept some arbitrary unique string, and be able to convert that to a normalized float value.

相关标签:
3条回答
  • 2020-12-18 21:09

    If security is no issue, what you are describing is in my opinion not a hash function. A hash function is a one-way function, meaning computing the hash is easy but reverting it is "hard" or, ideally, impossible.

    Your requirements instead describe an injective function Given any x1, x2 in your domain X the following holds:

    For all x1, x2 element of X, x1 != x2  => f(x1) != f(x2)
    

    f(x) = x is such a function, f(x) = x² is not. In plain English: you want to have different results if your inputs are different, same results only if the inputs are the same. It is true that this also is true for secure hashes, but they additionally provide the one-way characteristics such as the property of not being able (easily) to find x if you are only given f(x), among others. As far as I understood, you don't need these security properties.

    Trivially, such an injective mapping from String to Float would be given by simply interpreting the "String bytes" as "Float bytes" from now on, i.e. you interpret the bytes differently (think C:

    unsigned char *bytes = "...";
    double d = (double)bytes; 
    

    ). But, there is as downside to this - the real trouble is that Float has a maximum precision, so you will run into an overflow situation if your strings are too long (Floats are internally represented as double values, that's 8 bytes on a 32 bit machine). So not enough space for almost any use case. Even MD5-ing your strings first doesn't solve the problem - MD5 output is already 16 bytes long.

    So this could be a real problem, depending on your exact requirements. Although MD5 (or any other hash) will mess sufficiently with the input to make it as random as possible, you still cut the range of possible values from 16 bytes to effectively 8 bytes. (Note: Truncating random 16 byte output at 8 bytes is generally considered "secure" in terms of preserving the randomness. Elliptic Curve Cryptography does something similar. But as far as I know, nobody can really prove it, but neither could someone prove the contrary so far). So a collision is much more likely with your restricted Float range. By the birthday paradox, finding a collision takes sqrt(number of values in a finite range) tries. For MD5 this is 2^64, but for your scheme it's only 2^32. That's still very, very unlikely to produce a collision. It's probably something in the order of winning the lottery while at the same time being hit by a lightning. If you could live with this minimal possibility, go for it:

    def string_to_float(str)
      Digest::MD5.new.digest(str).unpack('D')
    end
    

    If uniqueness is of absolute priority I would recommend to move from floats to integers. Ruby has built-in support for large integers that are not restricted by the internal constraints of a long value (that's what a Fixnum boils down to). So any arbitrary hash output could be represented as a large integer number.

    0 讨论(0)
  • 2020-12-18 21:13

    Yes, you are describing a hashing algorithm. You could use a MD5 or SHA1 digest (since they just produce random bits) to generate a floating point number simply by using the String#unpack method with an argument of "G" (double-precision float, network byte order) from a digest:

    require 'digest/sha1'
    
    def string_to_float(str)
      Digest::SHA1.digest(str).unpack("G")[0]
    end
    
    string_to_float("abc-123") # => -2.86011943713676e-154
    string_to_float("def-456") # => -1.13232994606094e+214
    string_to_float("abc-123") # => -2.86011943713676e-154 OK!
    string_to_float("def-456") # => -1.13232994606094e+214 OK!
    

    Note that if you want the resulting floats to be in a particular range then you'll need to do some massaging.

    Also note that the unpacked number doesn't use all of the bits from the digest so you might want to combine into the number of bytes for a double floating point number (although you'll have to be careful not to decrease the entropy of the hash function, if you care about that sort of thing), e.g.:

    def str2float(s)
      d = Digest::SHA1.digest(s)
      x, y = d[0..9], d[10..19]
       # XOR the 1st (x) and 2nd (y) halves to use all bits.
      (0..9).map {|i| x[i] ^ y[i]}.pack("c*").unpack("G")[0]
    end
    
    0 讨论(0)
  • 2020-12-18 21:32

    The key part that you want is a way of converting a SHA1 or MD5 hash output into a float that is both deterministic and 1-1. Here's a simple solution based on md5. This could be used as integers too.

    require 'digest/md5'
    
    class String
      def float_hash
        (Digest::MD5.hexdigest(self).to_i(16)).to_f
      end
    end
    
    puts "example_string".float_hash  # returns 1.3084281619666243e+38
    

    This generates a hexadecimal hash, then converts it to an integer, then converts that to a float. Each step is deterministic.

    Note: as pointed out by @emboss, this reduces collision resistance because a double is 8 bytes and the hash is 16 bytes. It shouldn't be a big deal though by the sounds of your application.

    0 讨论(0)
提交回复
热议问题