I want to design a URL shortener for a particular use case and type of end-user that I have targeted. I have decided that I want the URLs to be stored internally according
Why not just shuffle the bits around in a specific order before converting to the base 26 value? For example, bit 0 becomes bit 5, bit 1 becomes bit 2, etc. To decode, just do the reverse.
Here's an example in Python. (Edited now to include converting the base too.)
import random

# generate a random bit order
# you'll need to save this mapping permanently, perhaps just hardcode it
# map however many bits you need to represent your integer space
mapping = list(range(28))
mapping.reverse()
#random.shuffle(mapping)

# alphabet for changing from base 10
chars = 'abcdefghijklmnopqrstuvwxyz'

# shuffle the bits: input bit i moves to output bit mapping[i]
def encode(n):
    result = 0
    for i, b in enumerate(mapping):
        b1 = 1 << i
        b2 = 1 << b
        if n & b1:
            result |= b2
    return result

# unshuffle the bits: output bit mapping[i] moves back to bit i
def decode(n):
    result = 0
    for i, b in enumerate(mapping):
        b1 = 1 << i
        b2 = 1 << b
        if n & b2:
            result |= b1
    return result

# change the base
def enbase(x):
    n = len(chars)
    if x < n:
        return chars[x]
    # integer division, so this also works on Python 3
    return enbase(x // n) + chars[x % n]

# go back to base 10
def debase(x):
    n = len(chars)
    result = 0
    for i, c in enumerate(reversed(x)):
        result += chars.index(c) * (n ** i)
    return result

# test it out
for a in range(200):
    b = encode(a)
    c = enbase(b)
    d = debase(c)
    e = decode(d)
    # pad the letters to a fixed width for display only
    c = c.rjust(7)
    print('%6d %6d %s %6d %6d' % (a, b, c, d, e))
The first 20 lines of output from this script, showing the encoding and decoding process:
 0         0      a         0  0
 1 134217728 lhskyi 134217728  1
 2  67108864 fqwfme  67108864  2
 3 201326592 qyoqkm 201326592  3
 4  33554432 cvlctc  33554432  4
 5 167772160 oddnrk 167772160  5
 6 100663296 imhifg 100663296  6
 7 234881024 ttztdo 234881024  7
 8  16777216 bksojo  16777216  8
 9 150994944 mskzhw 150994944  9
10  83886080 hbotvs  83886080 10
11 218103808 sjheua 218103808 11
12  50331648 egdrcq  50331648 12
13 184549376 pnwcay 184549376 13
14 117440512 jwzwou 117440512 14
15 251658240 veshnc 251658240 15
16   8388608  sjheu   8388608 16
17 142606336 mabsdc 142606336 17
18  75497472 gjfmqy  75497472 18
19 209715200 rqxxpg 209715200 19
Note that zero maps to zero, but you can just skip that number.
This is simple, efficient and should be good enough for your purposes. If you really needed something secure I obviously would not recommend this. It's basically a naive block cipher. There won't be any collisions.
Probably best to make sure that bit N doesn't ever map to bit N (no change) and probably best if some low bits in the input get mapped to higher bits in the output, in general. In other words, you may want to generate the mapping by hand. In fact, a decent mapping would be simply reversing the bit order. (That's what I did for the sample output above.)
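If you do write the mapping by hand, the first of those properties is easy to check mechanically. A minimal sketch, assuming the mapping is a list of bit positions as in the script above; `check_mapping` is a hypothetical helper name, not part of the original code:

```python
def check_mapping(mapping):
    """Return True if mapping is a permutation of 0..n-1 in which no bit keeps its position."""
    n = len(mapping)
    is_permutation = sorted(mapping) == list(range(n))
    has_fixed_point = any(i == b for i, b in enumerate(mapping))
    return is_permutation and not has_fixed_point

reversed_28 = list(reversed(range(28)))
reversed_27 = list(reversed(range(27)))

print(check_mapping(reversed_28))  # True: with an even width, reversal moves every bit
print(check_mapping(reversed_27))  # False: with an odd width, the middle bit (13) stays put
```

So plain reversal satisfies the "no bit maps to itself" rule only when the number of bits is even, as it is for the 28-bit example.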
Using a Hash function with a seed should make it unpredictable.
Security is obviously not an issue (else you would use cryptography).
Actually, you could simply use MD5 and select a fixed 6 characters of its output for a simple solution that will work well. It is available in most languages and generates a 128-bit hash that is easily written as 32 hexadecimal digits.
That alphabet is actually just 16 characters (it reduces to base 16).
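A rough sketch of that idea using Python's standard `hashlib`. The seed value, the choice of the first 6 characters, and the name `short_code` are all assumptions made for illustration. Note that the result is hexadecimal rather than a-z, and that it is not reversible, so you would still need to store the code alongside the URL and handle collisions yourself:

```python
import hashlib

SEED = "my-secret-seed"  # hypothetical value; keep it private and never change it

def short_code(url, length=6):
    """Return a fixed slice of the seeded MD5 hex digest of the URL."""
    digest = hashlib.md5((SEED + url).encode("utf-8")).hexdigest()  # 32 hex characters
    return digest[:length]

code = short_code("http://example.com/some/long/path")
print(len(code))  # 6
```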
Cooking up your own algorithm for unpredictable hashing is not advised.
Here is a Coding Horror blog entry you should read too.
I am blatantly double quoting from Jeff's Coding Horror reference to emphasize.
Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.
So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2^(n/2) items, or 2^64 for MD5.
2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:
2^64 18,446,744,073,709,551,616
2^37 100,000,000,000
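To make the rule of thumb concrete, here is a small sketch that applies it to a few keyspace sizes. `birthday_bound` is a hypothetical helper name, and the square root is only the rough estimate described in the quote, not an exact probability (`math.isqrt` needs Python 3.8 or later):

```python
import math

def birthday_bound(keyspace):
    """Rough number of items hashed before a first collision is expected: sqrt(keyspace)."""
    return math.isqrt(keyspace)

print(birthday_bound(2 ** 128))  # 18446744073709551616, i.e. 2^64 for full MD5
print(birthday_bound(16 ** 6))   # 4096, for 6 hexadecimal characters (24 bits)
print(birthday_bound(26 ** 6))   # 17576, for 6 lowercase letters
```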
Update based on comments: if you keep only 6 hexadecimal characters of the hash, that is 24 bits, so by the same rule of thumb you can expect the first collision after about 2^12 items, which is just 4096! (Read the whole Coding Horror article for the nuances.)
You want to permute your initial autoincrementing ID number with a Feistel network. This message (which happens to be on the PostgreSQL lists but doesn't really have much to do with PostgreSQL) describes a simple Feistel network. There are, of course, plenty of variations, but in general this is the Right Approach.
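For illustration, here is a minimal sketch of a balanced Feistel network over 28-bit IDs (two 14-bit halves). It is not the code from the linked message: the round keys, the round function, and the number of rounds are made-up assumptions, and it is meant for obfuscation, not security. Every 28-bit output fits in 6 letters because 2^28 is less than 26^6.

```python
HALF_BITS = 14
HALF_MASK = (1 << HALF_BITS) - 1
ROUND_KEYS = (0x3A5C, 0x1F2B, 0x0D47, 0x2E91)  # hypothetical secret values

def round_function(half, key):
    """Mix one 14-bit half with a round key; it does not need to be invertible."""
    return ((half * 0x9E3B + key) ^ (half >> 5)) & HALF_MASK

def feistel_encrypt(n):
    """Permute an ID in the range 0 .. 2^28 - 1."""
    left, right = n >> HALF_BITS, n & HALF_MASK
    for key in ROUND_KEYS:
        left, right = right, left ^ round_function(right, key)
    return (left << HALF_BITS) | right

def feistel_decrypt(n):
    """Undo feistel_encrypt by running the rounds backwards."""
    left, right = n >> HALF_BITS, n & HALF_MASK
    for key in reversed(ROUND_KEYS):
        left, right = right ^ round_function(left, key), left
    return (left << HALF_BITS) | right

for auto_id in range(5):
    scrambled = feistel_encrypt(auto_id)
    print(auto_id, scrambled, feistel_decrypt(scrambled))
```

Because each round only XORs one half with a function of the other, the whole thing is a bijection whatever the round function does, so there are no collisions.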
26^6 is around 300 million.
Easiest just to use a random number generator, and if you have a collision (i.e. in case your randomly generated 6-letter identifier is already taken), increment until you have a free identifier.
I mean, sure, you'll get collisions fairly early (at around 17 thousand entries), but incrementing until you have a free identifier will be plenty fast, at least until your keyspace starts to be saturated (around 12 million entries), and by then, you should be switching to 7-letter identifiers anyway.
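A sketch of that approach, assuming an in-memory set stands in for whatever table actually stores the identifiers; `allocate` and `to_code` are hypothetical names:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CODE_LENGTH = 6
KEYSPACE = len(ALPHABET) ** CODE_LENGTH  # 26^6 = 308,915,776

def to_code(n):
    """Write n as exactly CODE_LENGTH base-26 letters, padded with 'a'."""
    letters = []
    for _ in range(CODE_LENGTH):
        n, r = divmod(n, len(ALPHABET))
        letters.append(ALPHABET[r])
    return "".join(reversed(letters))

def allocate(taken, rng=random):
    """Pick a random free identifier, stepping forward past any that are taken."""
    if len(taken) >= KEYSPACE:
        raise RuntimeError("keyspace exhausted")
    candidate = rng.randrange(KEYSPACE)
    while candidate in taken:
        candidate = (candidate + 1) % KEYSPACE  # wrap around at the end
    taken.add(candidate)
    return to_code(candidate)

taken = set()
print(allocate(taken), allocate(taken))
```

In a real service the check and the insert would have to happen atomically (for example with a unique constraint), which this sketch does not attempt.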
You need a Block Cipher with "Block Space" of 26^6.
Choose an arbitrary key for the cipher, and you now have a transformation that is reversible by you, yet unpredictable for everyone else.
Your block size is a bit unusual, so you probably won't find a ready-made good block cipher for your size. But as suggested by kquinn you can design one on your own that mimics other ciphers.
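One common way to deal with the odd block size is "cycle walking": take a permutation over the next power of two (2^29 here) and re-apply it until the result lands inside the 26^6 range. The sketch below shows only that wrapper. `toy_permute` is a deliberately weak stand-in (XOR with a key, then multiply by an odd constant modulo 2^29), not a real cipher, and the key and multiplier are made-up values; in practice you would swap in something stronger. `pow(x, -1, m)` needs Python 3.8 or later.

```python
DOMAIN = 26 ** 6                 # 308,915,776 six-letter identifiers
BITS = 29                        # 2^29 = 536,870,912 is the smallest power of two >= DOMAIN
MASK = (1 << BITS) - 1
KEY = 0x015A4E35                 # hypothetical secret, below 2^29
MULT = 0x0F4A7C15                # hypothetical odd multiplier, so it is invertible mod 2^29
MULT_INV = pow(MULT, -1, 1 << BITS)

def toy_permute(x):
    """Placeholder bijection on 0 .. 2^29 - 1 (NOT a real cipher)."""
    return ((x ^ KEY) * MULT) & MASK

def toy_unpermute(y):
    """Inverse of toy_permute."""
    return ((y * MULT_INV) & MASK) ^ KEY

def walk_encrypt(n):
    """Map 0 .. DOMAIN-1 onto itself by re-applying the permutation until in range."""
    x = toy_permute(n)
    while x >= DOMAIN:
        x = toy_permute(x)
    return x

def walk_decrypt(c):
    """Inverse of walk_encrypt: walk the same cycle backwards."""
    x = toy_unpermute(c)
    while x >= DOMAIN:
        x = toy_unpermute(x)
    return x

print(walk_decrypt(walk_encrypt(12345)))  # 12345
```

The loops always terminate, because following a cycle of the permutation from an in-range value must eventually return to an in-range value.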
How about an LFSR? The linear feedback shift register is used to generate pseudo-random numbers in a range - the operation is deterministic given the seed value, but it can visit every value in a range with a long cycle.
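As a small illustration, here is a 16-bit Fibonacci LFSR using the well-known maximal-length taps 16, 14, 13, 11, which visits all 65,535 non-zero states before repeating. This width is only an assumption chosen to keep the example short: a shortener with 26^6 identifiers would need a wider register with its own maximal-length taps, and the all-zero state is never produced.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_sequence(seed, count):
    """Yield the next `count` states after a non-zero 16-bit seed."""
    if not 0 < seed < (1 << 16):
        raise ValueError("seed must be a non-zero 16-bit value")
    state = seed
    for _ in range(count):
        state = lfsr16_step(state)
        yield state

# Each new URL would take the next state as its ID; the only thing to
# persist between requests is the current state.
for value in lfsr16_sequence(0xACE1, 5):
    print(value)
```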