Map incrementing integer range to six-digit base 26 max, but unpredictably

时光说笑 2020-12-04 22:39

I want to design a URL shortener for a particular use case and type of end-user that I have targeted. I have decided that I want the URLs to be stored internally according

8 answers
  •  情书的邮戳
    2020-12-04 23:15

    Using a hash function with a seed should make the mapping unpredictable.
    I assume security is not a concern here (otherwise you would use proper cryptography).

    Actually, for a simple solution that works well, you could just use MD5 and keep a fixed 6 characters of its output. MD5 is available in most languages and produces a 128-bit hash that is usually written as 32 hexadecimal digits. Note that this gives an alphabet of only 16 distinct characters (base 16).
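
    A minimal sketch of that idea in Python, assuming the sequential integer ID is hashed together with a private seed. The names `SEED` and `short_code` are made up for illustration, and keeping the first 6 hex digits is just one possible choice of "a fixed 6 characters":

```python
import hashlib

SEED = "my-private-seed"  # hypothetical seed; any string kept private would do


def short_code(n: int, seed: str = SEED, length: int = 6) -> str:
    """Return the first `length` hex digits of MD5(seed + ":" + n)."""
    digest = hashlib.md5(f"{seed}:{n}".encode("utf-8")).hexdigest()
    return digest[:length]


if __name__ == "__main__":
    # Consecutive IDs give unrelated-looking codes; the same ID always
    # gives the same code as long as the seed does not change.
    for i in range(1, 4):
        print(i, short_code(i))
```

    Because the result is truncated, two different IDs can end up with the same code, so a real shortener would still need to check for an existing entry before storing a new one.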

    Cooking up your own algorithm for unpredictable hashing is not advised.
    Here is a Coding Horror blog entry you should read too.


    I am blatantly quoting from Jeff's Coding Horror article for emphasis.

    Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.

    So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2^(n/2) items, or 2^64 for MD5.

    2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:

    2^64 = 18,446,744,073,709,551,616
    2^37 = 137,438,953,472 (already more than 100,000,000,000)
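
    The rule of thumb in the quote can be checked numerically. This is a rough sketch, assuming the usual birthday approximation P ≈ 1 - exp(-k(k-1) / 2N) for k items and N = 2^bits possible hash values. The function names are illustrative only:

```python
import math


def birthday_bound(bits: int) -> int:
    """Rule-of-thumb item count before a first collision is expected: 2^(bits/2).

    Assumes an even number of bits (odd values are rounded down).
    """
    return 2 ** (bits // 2)


def collision_probability(items: int, bits: int) -> float:
    """Approximate chance of at least one collision among `items` hashes."""
    space = 2 ** bits
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-items * (items - 1) / (2 * space))


if __name__ == "__main__":
    print(birthday_bound(128))                     # full MD5
    print(collision_probability(10 ** 11, 128))    # 100 billion URLs
    print(birthday_bound(24))                      # 6 hex digits
    print(collision_probability(4096, 24))
```

    For the full 128-bit MD5 the chance of a collision among 100 billion URLs is vanishingly small, while for a 24-bit code (6 hex digits) it is already roughly 0.39 after 4096 items.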


    Update based on comments.

    • With a 6-character hexadecimal representation like I suggest above, only 24 bits of the hash remain, so the first collision can be expected after about 2^12 items -- which is just 4096! (Read the whole Coding Horror article for the nuances.)
    • If you do not want repeatability in your shortening (same shortened form for a URL every time), a random number should be fine.
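
    The random-number variant from the last point could look like the sketch below. It assumes the six-character, 26-letter alphabet from the question title and an in-memory set of codes already handed out. `random_code` and `used` are hypothetical names, and a real service would check its database instead of a set:

```python
import secrets
import string

ALPHABET = string.ascii_lowercase  # 26 symbols, i.e. base 26
CODE_LENGTH = 6                    # 26**6 = 308,915,776 possible codes


def random_code(used: set) -> str:
    """Draw random codes until one is not in `used`, then record and return it."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
        if code not in used:
            used.add(code)
            return code


if __name__ == "__main__":
    issued = set()
    print([random_code(issued) for _ in range(3)])
```

    The retry loop is what handles collisions here: with only about 309 million possible codes, duplicates become likely well before the space fills up, so every new code is checked against the ones already issued.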
