Map incrementing integer range to six-digit base 26 max, but unpredictably

时光说笑 2020-12-04 22:39

I want to design a URL shortener for a particular use case and type of end-user that I have targeted. I have decided that I want the URLs to be stored internally according

8 answers
  • 2020-12-04 23:15

    Why not just shuffle the bits around in a specific order before converting to the base 26 value? For example, bit 0 becomes bit 5, bit 1 becomes bit 2, etc. To decode, just do the reverse.

    Here's an example in Python. (Edited now to include converting the base too.)

    import random
    
    # generate a random bit order
    # you'll need to save this mapping permanently, perhaps just hardcode it
    # map however many bits you need to represent your integer space
    mapping = list(range(28))
    mapping.reverse()
    #random.shuffle(mapping)
    
    # alphabet for changing from base 10
    chars = 'abcdefghijklmnopqrstuvwxyz'
    
    # shuffle the bits: input bit i moves to output bit mapping[i]
    def encode(n):
        result = 0
        for i, j in enumerate(mapping):
            if n & (1 << i):
                result |= 1 << j
        return result
    
    # unshuffle the bits: output bit mapping[i] moves back to input bit i
    def decode(n):
        result = 0
        for i, j in enumerate(mapping):
            if n & (1 << j):
                result |= 1 << i
        return result
    
    # change the base (integer division keeps x an int in Python 3)
    def enbase(x):
        n = len(chars)
        if x < n:
            return chars[x]
        return enbase(x // n) + chars[x % n]
    
    # go back to base 10
    def debase(x):
        n = len(chars)
        result = 0
        for i, c in enumerate(reversed(x)):
            result += chars.index(c) * (n**i)
        return result
    
    # test it out
    for a in range(200):
        b = encode(a)
        c = enbase(b)
        d = debase(c)
        e = decode(d)
        # %7s right-aligns the base 26 string in a 7-character column
        print('%6d %6d %7s %6d %6d' % (a, b, c, d, e))
    

    The output of this script, showing the encoding and decoding process:

       0            0       a            0    0
       1    134217728  lhskyi    134217728    1
       2     67108864  fqwfme     67108864    2
       3    201326592  qyoqkm    201326592    3
       4     33554432  cvlctc     33554432    4
       5    167772160  oddnrk    167772160    5
       6    100663296  imhifg    100663296    6
       7    234881024  ttztdo    234881024    7
       8     16777216  bksojo     16777216    8
       9    150994944  mskzhw    150994944    9
      10     83886080  hbotvs     83886080   10
      11    218103808  sjheua    218103808   11
      12     50331648  egdrcq     50331648   12
      13    184549376  pnwcay    184549376   13
      14    117440512  jwzwou    117440512   14
      15    251658240  veshnc    251658240   15
      16      8388608   sjheu      8388608   16
      17    142606336  mabsdc    142606336   17
      18     75497472  gjfmqy     75497472   18
      19    209715200  rqxxpg    209715200   19
    

    Note that zero maps to zero, but you can just skip that number.

    This is simple, efficient and should be good enough for your purposes. If you really needed something secure I obviously would not recommend this. It's basically a naive block cipher. There won't be any collisions.

    Probably best to make sure that bit N doesn't ever map to bit N (no change) and probably best if some low bits in the input get mapped to higher bits in the output, in general. In other words, you may want to generate the mapping by hand. In fact, a decent mapping would be simply reversing the bit order. (That's what I did for the sample output above.)
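
    As a rough sketch of the advice above, a shuffled mapping in which no bit keeps its own position could be generated once and then hardcoded. The function name `make_mapping` and the seed value are hypothetical, chosen only for illustration.

```python
import random

def make_mapping(bits, seed):
    """Return a shuffled bit order in which no bit maps to itself."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    order = list(range(bits))
    while True:
        rng.shuffle(order)
        # reject any shuffle that leaves some bit i at position i
        if all(i != j for i, j in enumerate(order)):
            return list(order)

# 12345 is an arbitrary illustrative seed, not a recommended value
mapping = make_mapping(28, seed=12345)
print(mapping)  # print once, then paste the list into your code
```

    Hardcoding the printed list is safer than regenerating it from the seed on every start, since the mapping must never change once URLs have been issued.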

  • 2020-12-04 23:15

    Using a Hash function with a seed should make it unpredictable.
    Security is obviously not an issue (else you would use cryptography).

    Actually, you could straight-away use MD5 and select a fixed 6 characters for a simple solution that will work well. It is available in most languages and generates a 128-bit hash that is easily written as 32 hexadecimal digits. Note that this uses an alphabet of just 16 characters (base 16).

    Cooking up your own algorithm for unpredictable hashing is not advised.
    Here is a Coding Horror blog entry you should read too.


    I am blatantly double quoting from Jeff's Coding Horror reference to emphasize.

    Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.

    So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2^(n/2) items, or 2^64 for MD5.

    2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:

    2^64 = 18,446,744,073,709,551,616
    2^37 = 137,438,953,472 (just above 100,000,000,000)


    Update based on comments.

    • With a 6-character hexadecimal representation like I suggest above, there are only 16^6 = 2^24 possible values, so the first collision can be expected after about 2^12 items, which is just 4096! (Read the whole Coding Horror article for the nuances.)
    • If you do not want repeatability in your shortening (same shortened form for a URL every time), a random number should be fine.
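
    A minimal sketch of the truncated-hash idea, assuming the seed is simply prepended to the URL before hashing. The names `short_code` and `expected_first_collision` and the seed string are hypothetical.

```python
import hashlib
import math

def short_code(url, seed="my-secret-seed", length=6):
    """First `length` hex digits of MD5(seed + url)."""
    digest = hashlib.md5((seed + url).encode("utf-8")).hexdigest()
    return digest[:length]

def expected_first_collision(length, alphabet_size=16):
    """Birthday rule of thumb: about sqrt(keyspace) items."""
    return math.isqrt(alphabet_size ** length)

print(short_code("https://example.com/some/long/path"))
print(expected_first_collision(6))  # 4096
```

    The same URL always yields the same code, and the second function reproduces the 4096 figure from the first bullet.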
  • 2020-12-04 23:19

    You want to permute your initial autoincrementing ID number with a Feistel network. This message (which happens to be on the PostgreSQL lists but doesn't really have much to do with PostgreSQL) describes a simple Feistel network. There are, of course, plenty of variations, but in general this is the Right Approach.
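
    The linked message is not reproduced here, so the following is only a sketch of one possible Feistel construction, not the one it describes. It assumes a 30-bit block (two 15-bit halves), a SHA-256-based round function, and "cycle-walking" to stay inside the 26^6 range; `KEYS`, `round_fn` and the other names are hypothetical.

```python
import hashlib

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
SPACE = 26 ** 6                      # 308,915,776 six-letter codes
HALF_BITS = 15                       # 2 x 15 bits = 30-bit block, 2**30 > SPACE
HALF_MASK = (1 << HALF_BITS) - 1
KEYS = [b"k0", b"k1", b"k2", b"k3"]  # placeholder round keys; keep real ones secret

def round_fn(half, key):
    """Keyed round function; it does not need to be invertible."""
    digest = hashlib.sha256(key + half.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big") & HALF_MASK

def feistel(n, keys):
    left, right = n >> HALF_BITS, n & HALF_MASK
    for key in keys:
        left, right = right, left ^ round_fn(right, key)
    return (left << HALF_BITS) | right

def unfeistel(n, keys):
    left, right = n >> HALF_BITS, n & HALF_MASK
    for key in reversed(keys):
        left, right = right ^ round_fn(left, key), left
    return (left << HALF_BITS) | right

def encode(n):
    """Permute an ID in [0, SPACE); re-encrypt until the result is in range."""
    x = feistel(n, KEYS)
    while x >= SPACE:
        x = feistel(x, KEYS)
    return x

def decode(x):
    """Inverse of encode, walking the same cycle backwards."""
    n = unfeistel(x, KEYS)
    while n >= SPACE:
        n = unfeistel(n, KEYS)
    return n

def to_letters(x):
    """Fixed-width six-letter base 26 string."""
    letters = []
    for _ in range(6):
        x, r = divmod(x, 26)
        letters.append(ALPHABET[r])
    return "".join(reversed(letters))

for n in range(5):
    print(n, to_letters(encode(n)))
```

    Because the Feistel network is a permutation of the 30-bit block, walking until the value falls below 26^6 gives a permutation of the smaller range, so consecutive IDs never collide.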

  • 2020-12-04 23:19

    26^6 is around 309 million (308,915,776).

    Easiest just to use a random number generator, and if you have a collision (i.e. in case your randomly generated 6-letter identifier is already taken), increment until you have a free identifier.

    I mean, sure, you'll get collisions fairly early (at around 17 thousand entries), but incrementing until you have a free identifier will be plenty fast, at least until your keyspace starts to be saturated (around 12 million entries), and by then, you should be switching to 7-letter identifiers anyway.
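
    A small sketch of that approach, assuming an in-memory set stands in for whatever uniquely indexed storage holds the identifiers already in use. The names `allocate` and `taken` are hypothetical.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
SPACE = 26 ** 6

def to_letters(x):
    """Fixed-width six-letter base 26 string."""
    letters = []
    for _ in range(6):
        x, r = divmod(x, 26)
        letters.append(ALPHABET[r])
    return "".join(reversed(letters))

def allocate(taken, rng=random):
    """Pick a random identifier; on collision, increment until one is free."""
    x = rng.randrange(SPACE)
    while x in taken:
        x = (x + 1) % SPACE  # wrap around at the end of the keyspace
    taken.add(x)
    return to_letters(x)

taken = set()  # stand-in for a unique-indexed database column
codes = [allocate(taken) for _ in range(5)]
print(codes)
```

    If guessability matters more than speed, `secrets.randbelow(SPACE)` could replace `rng.randrange(SPACE)`, since the default `random` generator is not designed to be unpredictable.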

  • 2020-12-04 23:22

    You need a Block Cipher with "Block Space" of 26^6.

    Choose an arbitrary key for the cipher, and you now have a transformation that is reversible by you, yet unpredictable for everyone else.

    Your block size is a bit unusual, so you probably won't find a ready-made good block cipher for your size. But as suggested by kquinn you can design one on your own that mimics other ciphers.

  • 2020-12-04 23:29

    How about an LFSR? The linear feedback shift register is used to generate pseudo-random numbers in a range - the operation is deterministic given the seed value, but it can visit every value in a range with a long cycle.
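
    To illustrate the idea on a small scale, here is a sketch of a 16-bit Galois LFSR using the widely published maximal-length tap mask 0xB400. It visits all 65,535 nonzero 16-bit values before repeating. The function names and the seed are hypothetical, and the 26^6 range would need a wider register (29 bits) with a maximal-length tap set taken from a published table.

```python
def lfsr16_step(state):
    """One step of a 16-bit Galois LFSR (taps 16, 14, 13, 11: mask 0xB400)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

def lfsr16_sequence(seed, count):
    """Yield the next `count` states after `seed` (seed must be nonzero)."""
    state = seed
    for _ in range(count):
        state = lfsr16_step(state)
        yield state

# 0xACE1 is an arbitrary nonzero illustrative seed
print(list(lfsr16_sequence(seed=0xACE1, count=5)))
```

    Unlike a keyed permutation, an LFSR is naturally used as a sequence: you store the last state and step once per new URL, rather than mapping an arbitrary ID directly.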
