Generating an Instagram- or YouTube-like unguessable string ID in Ruby/ActiveRecord

被撕碎了的回忆 2020-12-09 12:12

Upon creating an instance of a given ActiveRecord model object, I need to generate a shortish (6-8 characters) unique string to use as an identifier in URLs, in the style of

3 Answers
  • 2020-12-09 12:17

    You could do something like this:

    random_attribute.rb

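    require 'securerandom'

    # Assigns random values to an attribute until one is not already taken.
    module RandomAttribute

      def generate_unique_random_base64(attribute, n)
        until random_is_unique?(attribute)
          self.send(:"#{attribute}=", random_base64(n))
        end
      end

      def generate_unique_random_hex(attribute, n)
        until random_is_unique?(attribute)
          # SecureRandom.hex(k) returns 2k hex characters, hence n/2.
          self.send(:"#{attribute}=", SecureRandom.hex(n/2))
        end
      end

      private

      # True once the attribute is set and no existing record has the same value.
      def random_is_unique?(attribute)
        val = self.send(:"#{attribute}")
        val && !self.class.send(:"find_by_#{attribute}", val)
      end

      # Returns the first n characters of a random lowercase alphanumeric string.
      def random_base64(n)
        val = base64_url
        val += base64_url while val.length < n
        val.slice(0..(n-1))
      end

      # Downcasing and stripping '+', '/' and '=' leaves only [a-z0-9],
      # so the result is effectively base 36 rather than base 64.
      def base64_url
        SecureRandom.base64(60).downcase.gsub(/\W/, '')
      end
    end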
    

    post.rb

    class Post < ActiveRecord::Base
    
      include RandomAttribute
      before_validation :generate_key, on: :create
    
      private
    
      def generate_key
        generate_unique_random_hex(:key, 32)
      end
    end
    
  • 2020-12-09 12:18

    Here's a good collision-free method, already implemented in plpgsql.

    First step: consider the pseudo_encrypt function from the PG wiki. This function takes a 32-bit integer as its argument and returns a 32-bit integer that looks random to the human eye but uniquely corresponds to its argument (so it is encryption, not hashing). Inside the function, you may replace the formula (((1366.0 * r1 + 150889) % 714025) / 714025.0) with another function known only to you that produces a result in the [0..1] range (just tweaking the constants will probably be good enough; see my attempt at doing just that below). Refer to the Wikipedia article on the Feistel cipher for a more theoretical explanation.
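
    The Feistel construction behind pseudo_encrypt can be sketched in Ruby to show why the output never collides. This is a rough illustration, not a port: the names feistel_round and feistel_permute are made up here, the input is assumed to be an unsigned 32-bit integer, and Ruby's rounding may differ from PostgreSQL's, so the values need not match what the SQL function returns.

    ```ruby
    # Round function: maps a 16-bit half to a value that fits in 15 bits.
    # Changing the constants yields a different permutation.
    def feistel_round(half)
      ((((1366 * half + 150_889) % 714_025) / 714_025.0) * 32_767).round
    end

    # Three-round Feistel network over an unsigned 32-bit integer.
    # Each round is reversible, so distinct inputs give distinct outputs.
    def feistel_permute(value)
      left  = (value >> 16) & 0xFFFF
      right = value & 0xFFFF
      3.times do
        left, right = right, left ^ feistel_round(right)
      end
      # Swapping the halves on output makes the function its own inverse.
      (right << 16) | left
    end

    ids = (1..5).map { |i| feistel_permute(i) }
    ```

    Because the halves are swapped on output, applying feistel_permute twice returns the original number, which is a quick way to check that nothing is lost.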

    Second step: encode the output number in the alphabet of your choice. Here's a function that does it in base 62 with all alphanumeric characters.

    CREATE OR REPLACE FUNCTION stringify_bigint(n bigint) RETURNS text
        LANGUAGE plpgsql IMMUTABLE STRICT AS $$
    DECLARE
     alphabet text:='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
     base int:=length(alphabet); 
     _n bigint:=abs(n);
     output text:='';
    BEGIN
     LOOP
       output := output || substr(alphabet, 1+(_n%base)::int, 1);
       _n := _n / base; 
       EXIT WHEN _n=0;
     END LOOP;
     RETURN output;
    END $$;
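
    For anyone who would rather do the encoding step in the application, the same loop might look like this in Ruby. It is a sketch only: BASE62_ALPHABET and stringify_integer are names invented for this example, and the digit order (least significant first) is kept the same as in the SQL function above.

    ```ruby
    # Same alphabet as the plpgsql function: a-z, A-Z, 0-9.
    BASE62_ALPHABET = [*'a'..'z', *'A'..'Z', *'0'..'9'].join.freeze

    # Encodes the absolute value of n in base 62, least significant digit first.
    def stringify_integer(n)
      n = n.abs
      base = BASE62_ALPHABET.length
      output = +''
      loop do
        output << BASE62_ALPHABET[n % base]
        n /= base
        break if n.zero?
      end
      output
    end

    short = stringify_integer(62)
    ```

    Like the SQL version, this discards the sign, so n and -n encode to the same string.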
    

    Now here's what we'd get for the first 10 URLs corresponding to a monotonic sequence:

    select stringify_bigint(pseudo_encrypt(i)) from generate_series(1,10) as i;
    
     stringify_bigint 
    ------------------
     tWJbwb
     eDUHNb
     0k3W4b
     w9dtmc
     wWoCi
     2hVQz
     PyOoR
     cjzW8
     bIGoqb
     A5tDHb
    

    The results look random and are guaranteed to be unique in the entire output space (2^32, or about 4 billion values, if you use the entire input space including negative integers). If 4 billion values is not enough, you may carefully combine two 32-bit results to get to 64 bits without losing uniqueness in the outputs. The tricky parts are dealing correctly with the sign bit and avoiding overflows.

    About modifying the function to generate your own unique results: let's change the constant from 1366.0 to 1367.0 in the function body, and retry the test above. See how the results are completely different:

     NprBxb
     sY38Ob
     urrF6b
     OjKVnc
     vdS7j
     uEfEB
     3zuaT
     0fjsab
     j7OYrb
     PYiwJb
    

    Update: For those who can compile a C extension, a good replacement for pseudo_encrypt() is range_encrypt_element() from the permuteseq extension, which has the following advantages:

    • works with any output space up to 64 bits, and it doesn't have to be a power of 2.

    • uses a secret 64-bit key for unguessable sequences.

    • is much faster, if that matters.

  • 2020-12-09 12:24

    You can hash the id:

    Digest::MD5.hexdigest('1')[0..9]
    => "c4ca4238a0"
    Digest::MD5.hexdigest('2')[0..9]
    => "c81e728d9d"
    

    But somebody can still guess what you're doing and iterate that way. It's probably better to hash on the content.
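
    Hashing the content could look something like the sketch below. The helper name content_key and the sample text are hypothetical, and SHA-256 is used here in place of MD5 purely as a choice for this example.

    ```ruby
    require 'digest'

    # Derives a short identifier from the record's content rather than its id.
    # A truncated digest can collide, so uniqueness still has to be checked.
    def content_key(content, length = 10)
      Digest::SHA256.hexdigest(content)[0, length]
    end

    key = content_key("My first post\nHello, world!")
    ```

    Two records with identical content get the same key, so this only works if the content is itself unique or a per-record value such as a timestamp is mixed in.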
